From biopython at maubp.freeserve.co.uk Tue Jun 1 05:05:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 10:05:43 +0100
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements
In-Reply-To:
References:
Message-ID:

2010/6/1 Eric Talevich:
> On Mon, May 31, 2010 at 11:53 AM, Peter wrote:
>
>> Under this proposed scheme, what would you see as the basic record type
>> (analogous to a SeqRecord, alignment or tree in Bio.SeqIO, Bio.AlignIO
>> and Bio.Phylo)? It would be nice to say a protein chain, but there is the
>> issue of multiple models (e.g. from NMR). I presume you'd go with the
>> model as the basic unit (where each model may contain multiple chains).
>>
>
> I'd consider a structure to be the basic unit of I/O. If we're going to make
> better use of header info, that's generally associated with the whole
> structure and not individual models -- we'd have to duplicate the header
> info in each Model object emitted, which would be weird.
>
> Are there any formats that store more than one structure in a file? If not,
> then there's probably no need for a parse() function in Bio.Struct.

OK, yes - a whole structure as the unit would work, so we would
only need the read function (one file is one structure) and not the
parse function (no point in iterating over one thing).

>> > from Bio.Struct import WHATIF, Jpred
>> > # Servers each get their own module
>>
>> Hmm - perhaps we may need to have another level here, Bio.Struct.Servers
>> or Bio.Struct.WWW or something. How many of these do you expect?
>>
>
> João's project plan includes Dali and WHATIF:
> http://biopython.org/wiki/GSOC2010_Joao
>
> These servers do different things so I wouldn't expect any similarity in the
> code between them. There are lots of servers that we *could* support...
> Aesthetically, a Servers or WWW subdirectory would match
> Bio.Struct.Applications and make the whole package a little more
> self-documenting.

My thoughts exactly.
> Here's one more idea: Fetching a single PDB file from RCSB requires a
> separate import and a couple of calls. Should we make this even easier by
> mimicking the efetch function in Bio.Entrez, something like
>
> >>> handle = Bio.PDB.fetch("1MOT")
>
> or
>
> >>> from Bio.Struct.WWW import RCSB
> >>> handle = RCSB.fetch("1MOT", "pdb")
>
> ?
>

That seems nice.

Peter

From krother at rubor.de Tue Jun 1 05:59:31 2010
From: krother at rubor.de (Kristian Rother)
Date: Tue, 1 Jun 2010 11:59:31 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To:
References:
Message-ID:

Hi,

Got some comments & questions.

> 2. PDB headers seem to have become better structured in recent years, in
> ... parse_pdb_header needs some attention as well.

I haven't looked into this code for years .. I think it might be a little
messy.

> 3. Kristian asked on this list awhile ago about the proper location for
> his new code that works with RNA structures. While RCSB's PDB contains
> some RNA structures, the RNA world doesn't revolve around it. Similarly,
> João needs a place to put code for structure prediction/validation
> servers, command-line wrappers, secondary structures, etc.
>
> I propose a new sub-package called Bio.Struct for these enhancements:
>
> from Bio.Struct import RNA
> # Would this work for you, Kristian?

Yes, it would be more descriptive than the originally proposed Bio.RNA . I
am just concerned whether I could keep the 2D structure-related modules in
the same package.

> Alternatively, we could do all of this within the PDB module -- so picture
> the above examples with "PDB" in place of "Struct". This raises the chance
> of naming collisions, though, and doesn't solve issue #3 above.

I like Bio.PDB.RNA less for the same reasons plus the 2D structure issue.

> We'll leave the existing PDB module layout alone, in general. I think it
> will be necessary to add a few more attributes to the
> Bio.PDB.Structure.Structure class, but we can do this without breaking
> compatibility.
>
> Comments?

What about the modules for constructing coordinates & Loop Closure
(currently available on my Github branch)? I placed them in Bio.PDB because
they are not limited to RNA and are conceptually similar to the operations
performed by Bio.PDB.NeighborSearch and Bio.PDB.SVDSuperimposer - or would
it be better to gather such things in some other package within
Bio.PDB.Struct?

Cheers,
Kristian

From biopython at maubp.freeserve.co.uk Tue Jun 1 07:42:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 12:42:53 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To:
References:
Message-ID:

2010/6/1 Kristian Rother :
>
>> 3. Kristian asked on this list awhile ago about the proper location for
>> his new code that works with RNA structures. While RCSB's PDB
>> contains some RNA structures, the RNA world doesn't revolve around
>> it. Similarly, João needs a place to put code for structure prediction/
>> validation servers, command-line wrappers, secondary structures, etc.
>>
>> I propose a new sub-package called Bio.Struct for these enhancements:
>>
>> from Bio.Struct import RNA
>> # Would this work for you, Kristian?
>
> Yes, it would be more descriptive than the originally proposed Bio.RNA . I
> am just concerned whether I could keep the 2D structure-related modules
> in the same package.

I don't necessarily see a problem with Bio.Struct or Bio.Structure covering
both 2D and 3D structures. Does this 2D stuff include file parsers? That
would complicate plans for Bio.Struct.read() etc. Maybe Bio.RNA is better.
Peter

From biopython at maubp.freeserve.co.uk Tue Jun 1 09:10:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 14:10:05 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To:
References:
Message-ID:

On Mon, May 31, 2010 at 3:50 PM, Peter wrote:
> Hi all,
>
> With the new command line wrappers and the tutorial pushing
> users towards using subprocess we've had more queries
> about how to use it. The subprocess module itself is rather
> scary I guess, and things could be made a lot easier.
>
> I think the most typical use cases are:
>
> (1) Run the command, return the error code (integer)
> (2) Run the command, return stdout, stderr and error code
>
> In theory the function subprocess.call() would take care
> of the first example, but there is a cross platform annoyance
> here with the shell parameter. Also, if you want the output
> too things get even more tricky. It hasn't helped that there
> are a few platform specific quirks/bugs in subprocess itself
> (the different behaviour of the shell option on Windows,
> bug http://bugs.python.org/issue1124861 in old Pythons,
> the risk of deadlocks with large output files, etc).

In fact I've often found using os.system() much easier than
subprocess for the first use case - running a command and
getting the return code. I wondered about adding an example
of this to the tutorial but didn't find time before the last release
(even if the Python documentation does try and encourage
using subprocess instead).

Peter

From chapmanb at 50mail.com Tue Jun 1 09:23:55 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 1 Jun 2010 09:23:55 -0400
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To:
References:
Message-ID: <20100601132355.GU1054@sobchak.mgh.harvard.edu>

Peter;

> With the new command line wrappers and the tutorial pushing
> users towards using subprocess we've had more queries
> about how to use it. The subprocess module itself is rather
> scary I guess, and things could be made a lot easier.
[...]
> We could instead make the wrapper objects callable (define
> the magic method __call__) to offer this kind of functionality.
> This seems quite elegant to me.

This is a good idea, although I'm 50/50 on the __call__ idea.
Having a run() command or something similar might be more intuitive
than the more magical call, if the idea is to appeal to users who
find subprocess too problematic.

I'd suggest having an option to not capture stdout and stderr, which
would help users avoid those cases where a program spews a lot to
stdout and it's unwieldy to capture and stick it into a string.

Brad

From biopython at maubp.freeserve.co.uk Tue Jun 1 09:48:30 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 14:48:30 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <1275332206.4c04066ed4ec5@webmail.upv.es>
References: <1275332206.4c04066ed4ec5@webmail.upv.es>
Message-ID:

On Mon, May 31, 2010 at 7:56 PM, Blanca Postigo Jose Miguel wrote:
> Quoting Michael Sandford :
>
>> I've got a few comments as well:
>>
>> > 4) The current Blast record stores its information in attributes. If you
>> use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the
>> necessary DTDs to do so), the information is stored in dictionaries. This has
>> some advantages. For example, it allows you to use record.keys() to find out
>> what the record contains. Ideally, I think that a Blast Record class should
>> inherit from a dictionary.
>
> I've developed for my own use a dict structure that represents a blast result.
> This structure also can represent many other results, like exonerate, SSAHA or
> any other number of aligners. Having a common representation for all of them
> allows you to create common filters that work with the same interface. I don't
> know if it is very efficient, but it has proven to be very convenient for us.
> You can take a look at: > > http://github.com/JoseBlanca/franklin/blob/master/franklin/alignment_search_result.py > > Best regards, > > Jose Blanca It has some similarities to what I was imagining for a BioPerl-SearchIO-like module. I'm still not convinced that we should just be using (subclasses of) dictionaries - I would rather have important core properties like the hit co-ordinates held explicitly as properties or attributes (and always using Python counting, not whatever a given file format uses, like one-based locations in BLAST output). Peter From krother at rubor.de Tue Jun 1 10:11:51 2010 From: krother at rubor.de (Kristian Rother) Date: Tue, 1 Jun 2010 16:11:51 +0200 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: References: Message-ID: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> Hi, >>> from Bio.Struct import RNA >>> # Would this work for you, Kristian? >> >> Yes, it would be more descriptive than the originally proposed Bio.RNA . >> I >> am just concerned whether I could keep the 2D structure-related modules >> in the same package. > > I don't necessarily see a problem with Bio.Struct or Bio.Structure > covering > both 2D and 3D structures. Does this 2D stuff include file parsers? That > would complicate plans for Bio.Struct.read() etc. Maybe Bio.RNA is better. Yes, currently, RNA contains 2D stuff. It would complicate Struct.read(). On the other hand, the 2D stuff is independent from the 3D modules - could be split into two packages -- but I think keeping RNA is simpler. 
Best Regards,
Kristian

From biopython at maubp.freeserve.co.uk Tue Jun 1 11:15:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 16:15:03 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <20100601132355.GU1054@sobchak.mgh.harvard.edu>
References: <20100601132355.GU1054@sobchak.mgh.harvard.edu>
Message-ID:

On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote:
> Peter;
>
>> With the new command line wrappers and the tutorial pushing
>> users towards using subprocess we've had more queries
>> about how to use it. The subprocess module itself is rather
>> scary I guess, and things could be made a lot easier.
> [...]
>> We could instead make the wrapper objects callable (define
>> the magic method __call__) to offer this kind of functionality.
>> This seems quite elegant to me.
>
> This is a good idea, although I'm 50/50 on the __call__ idea.
> Having a run() command or something similar might be more intuitive
> than the more magical call, if the idea is to appeal to users who
> find subprocess too problematic.

Fair point. We'd have to audit all the existing wrappers to make
sure we have some suitable names free (e.g. run or execute).

> I'd suggest having an option to not capture stdout and stderr, which
> would help users avoid those cases where a program spews a lot to
> stdout and it's unwieldy to capture and stick it into a string.

We need to avoid any risk of deadlocks, so I guess the safe
implementation here would be to call subprocess with stdout and
stderr sent to dev null.
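
As a rough illustration of the "send stdout and stderr to dev null" approach, here is a minimal sketch using only the standard library. The helper name run_quietly is hypothetical and is not part of Biopython; the Python interpreter itself stands in for a noisy command line tool.

```python
import os
import subprocess
import sys


def run_quietly(cmd_args):
    """Run a command, discard its output, and return the exit code.

    Hypothetical helper, not a Biopython API. Because stdout and stderr
    go to the null device rather than to pipes, the child can never block
    on a full pipe buffer, however much it prints.
    """
    with open(os.devnull, "w") as null:
        return subprocess.call(cmd_args, stdout=null, stderr=null)


if __name__ == "__main__":
    # Use the running Python interpreter as a portable stand-in for a noisy tool.
    noisy = [sys.executable, "-c", "print('x' * 1000000)"]
    print("Return code: %i" % run_quietly(noisy))
```

Passing the arguments as a list sidesteps the shell parameter differences between platforms, at the cost of not accepting a single command line string.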
Peter From eric.talevich at gmail.com Tue Jun 1 14:25:52 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 1 Jun 2010 14:25:52 -0400 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> Message-ID: On Tue, Jun 1, 2010 at 10:11 AM, Kristian Rother wrote: > Hi, > > >>> from Bio.Struct import RNA > >>> # Would this work for you, Kristian? > >> > >> Yes, it would be more descriptive than the originally proposed Bio.RNA . > >> I > >> am just concerned whether I could keep the 2D structure-related modules > >> in the same package. > > > > I don't necessarily see a problem with Bio.Struct or Bio.Structure > > covering > > both 2D and 3D structures. Does this 2D stuff include file parsers? That > > would complicate plans for Bio.Struct.read() etc. Maybe Bio.RNA is > better. > > Yes, currently, RNA contains 2D stuff. It would complicate Struct.read(). > On the other hand, the 2D stuff is independent from the 3D modules - could > be split into two packages -- but I think keeping RNA is simpler. > > Best Regards, > Kristian > > I could be totally wrong here, but I think it's useful to lay out some assumptions and intuitions explicitly. To me, secondary structure is not really a separate dimension in its own right, the way tertiary structure corresponds to 3D space and primary structure corresponds to a linear sequence. Instead, secondary structure has meaning in 3D space, but is usually serialized as a linear sequence. That is, we want to parse something that resembles a sequence, but be able to map it onto a 3D structure. (More for proteins than for RNA, usually.) 
(For non-RNA folk, here's an example of RNA secondary structure: http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna ) For instance, the output of DSSP and Jpred describes a protein's secondary structure, but the input to DSSP is a 3D structure, while Jpred accepts a protein sequence. The representation of secondary structure isn't distinct from either of these. I'd want both of these available in Bio.Struct (eventually). This means that some interaction between Bio.Struct and SeqIO is necessary. It would be neat if secondary structure regions were represented as SeqFeature instances, and secondary-structure parsers returned some kind of subclass of SeqRecord -- or a standard SeqRecord containing a special kind of Seq. The secondary-structure parsers for RNA and proteins should be separate, too, since the annotated features are different. So the function Bio.Struct.read() can apply exclusively to 3D structures. Would it be reasonable for Bio.Struct.RNA.read() to apply exclusively to RNA secondary structures -- assuming that anything that's not a secondary structure, 3D structure, or nucleotide sequence is something special that belongs in its own module? As for protein secondary structure, it's usually associated with a sequence or a structure, so maybe we could get by with storing that information in an ordinary Structure or SeqRecord object without inventing a new subclass. 
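
To make the "secondary structure regions as features" idea a little more concrete, here is a minimal sketch that collapses a per-residue secondary structure string into regions. The function name secstruc_regions and the example string are assumptions for illustration only; nothing here is an existing Biopython API.

```python
from itertools import groupby


def secstruc_regions(secstruc, ignore="-"):
    """Collapse a per-residue string into (start, end, code) regions.

    Hypothetical helper, not a Biopython API. Coordinates are Python-style:
    zero-based start, exclusive end, so secstruc[start:end] is the region.
    """
    regions = []
    position = 0
    for code, run in groupby(secstruc):
        length = len(list(run))
        if code not in ignore:
            regions.append((position, position + length, code))
        position += length
    return regions


if __name__ == "__main__":
    # Illustrative DSSP-style codes: H = helix, E = strand, - = no assignment.
    for start, end, code in secstruc_regions("--HHHH--EEE-"):
        print(start, end, code)
```

Each tuple could in principle become one SeqFeature on an ordinary SeqRecord, while the unmodified string remains usable as a per-letter annotation.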
Best, Eric From jblanca at btc.upv.es Wed Jun 2 02:21:36 2010 From: jblanca at btc.upv.es (Jose Blanca) Date: Wed, 2 Jun 2010 08:21:36 +0200 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> Message-ID: <201006020821.36486.jblanca@btc.upv.es> On Tuesday 01 June 2010 17:15:03 Peter wrote: > On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote: > > Peter; > > > >> With the new command line wrappers and the tutorial pushing > >> users towards using subprocess we've had more queries > >> about how to use it. The subprocess module itself is rather > >> scary I guess, and things could be made a lot easier. We had the same need. We solved it with a call function. You can take a look at: http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/cmd_utils.py Regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From krother at rubor.de Wed Jun 2 04:17:01 2010 From: krother at rubor.de (Kristian Rother) Date: Wed, 2 Jun 2010 10:17:01 +0200 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> Message-ID: Hi, >> >>> from Bio.Struct import RNA .. >> > I don't necessarily see a problem with Bio.Struct or Bio.Structure >> > covering both 2D and 3D structures. Eric, I agree with you - the secondary structure of RNA maps nicely to 3D space. Generally, I think it is a little more common to work with RNA 2D structures in absence of 3D information than in proteins - 2D prediction of RNA is maybe simply a less nasty target. 
Eric wrote: > I could be totally wrong here, but I think it's useful to lay out some > assumptions and intuitions explicitly. > > To me, secondary structure is not really a separate dimension in its own > right, the way tertiary structure corresponds to 3D space and primary > structure corresponds to a linear sequence. Instead, secondary structure > has > meaning in 3D space, but is usually serialized as a linear sequence. That > is, we want to parse something that resembles a sequence, but be able to > map > it onto a 3D structure. (More for proteins than for RNA, usually.) > > (For non-RNA folk, here's an example of RNA secondary structure: > http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna > ) > > For instance, the output of DSSP and Jpred describes a protein's secondary > structure, but the input to DSSP is a 3D structure, while Jpred accepts a > protein sequence. The representation of secondary structure isn't distinct > from either of these. I'd want both of these available in Bio.Struct > (eventually). > > This means that some interaction between Bio.Struct and SeqIO is > necessary. > It would be neat if secondary structure regions were represented as > SeqFeature instances, and secondary-structure parsers returned some kind > of > subclass of SeqRecord -- or a standard SeqRecord containing a special kind > of Seq. So far the Secstruc parsers I've implemented just return (sequence,secstruc) tuples. But putting this into a SeqRecord makes sense - I understand this fits better to the BioPython architecture. Maybe instead of a Seq or SeqRecord subclass we could use the decorator pattern (decorating a class, not the Python decorator function syntax). A potential problem that I'd like to point out early is that we are working with modified RNA nucleotides a lot (up to 20% of residues in every tRNA). This would require extending the RNA Alphabet (which now just is "AGCU") - but I see this as remote from the Bio.XXXX.read() thread. 
> The secondary-structure parsers for RNA and proteins should be separate, > too, since the annotated features are different. So the function > Bio.Struct.read() can apply exclusively to 3D structures. Would it be > reasonable for Bio.Struct.RNA.read() to apply exclusively to RNA secondary > structures -- assuming that anything that's not a secondary structure, 3D > structure, or nucleotide sequence is something special that belongs in its > own module? To summarize, we could use: 1) protein 3D structures: Bio.Struct.read() --> Bio.PDB.Structure 2) RNA 3D structures: Bio.Struct.read() --> Bio.PDB.Structure 3) RNA 2D structures: Bio.Struct.RNA.read() --> Bio.SeqRecord (extended/decorated by a secstruc field) 4) protein 2D structures: uses special parser module?? 5) plain sequences: Bio.read() --> Bio.SeqRecord Eric, does this summarize your thoughts correctly? This would work for me. Any comments from the others. Best, Kristian From biopython at maubp.freeserve.co.uk Wed Jun 2 04:44:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 2 Jun 2010 09:44:54 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: <201006020821.36486.jblanca@btc.upv.es> References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> <201006020821.36486.jblanca@btc.upv.es> Message-ID: On Wed, Jun 2, 2010 at 7:21 AM, Jose Blanca wrote: > On Tuesday 01 June 2010 17:15:03 Peter wrote: >> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote: >> > Peter; >> > >> >> With the new command line wrappers and the tutorial pushing >> >> users towards using subprocess we've had more queries >> >> about how to use it. The subprocess module itself is rather >> >> scary I guess, and things could be made a lot easier. > > We had the same need. We solved it with a call function. 
You can take > a look at: > > http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/cmd_utils.py > It looks complicated (and I'm sure with good reason), but I'd guess you've never tried this on Windows? We used to have the Bio.Application.generic_run function for calling a command - but making the command line wrapper callable or having a method on the command line wrapper is much easier to use (no extra import needed). Peter From biopython at maubp.freeserve.co.uk Wed Jun 2 05:23:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 2 Jun 2010 10:23:15 +0100 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> Message-ID: On Tue, Jun 1, 2010 at 7:25 PM, Eric Talevich wrote: >> > I could be totally wrong here, but I think it's useful to lay out some > assumptions and intuitions explicitly. > > To me, secondary structure is not really a separate dimension in its own > right, the way tertiary structure corresponds to 3D space and primary > structure corresponds to a linear sequence. Instead, secondary structure has > meaning in 3D space, but is usually serialized as a linear sequence. That > is, we want to parse something that resembles a sequence, but be able to map > it onto a 3D structure. (More for proteins than for RNA, usually.) > > (For non-RNA folk, here's an example of RNA secondary structure: > http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna > ) > > For instance, the output of DSSP and Jpred describes a protein's secondary > structure, but the input to DSSP is a 3D structure, while Jpred accepts a > protein sequence. The representation of secondary structure isn't distinct > from either of these. I'd want both of these available in Bio.Struct > (eventually). 
> > This means that some interaction between Bio.Struct and SeqIO is necessary. > It would be neat if secondary structure regions were represented as > SeqFeature instances, and secondary-structure parsers returned some kind of > subclass of SeqRecord -- or a standard SeqRecord containing a special kind > of Seq. > > ... > > As for protein secondary structure, it's usually associated with a sequence > or a structure, so maybe we could get by with storing that information in an > ordinary Structure or SeqRecord object without inventing a new subclass. Maybe all/most secondary structure parsers can just go into Bio.SeqIO (for both proteins, RNA and DNA). We can store a secondary structure string as per-letter-annotation, or things like helix regions as SeqFeature objects. Peter From jblanca at btc.upv.es Wed Jun 2 05:24:24 2010 From: jblanca at btc.upv.es (Jose Blanca) Date: Wed, 2 Jun 2010 11:24:24 +0200 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: <201006020821.36486.jblanca@btc.upv.es> Message-ID: <201006021124.24499.jblanca@btc.upv.es> On Wednesday 02 June 2010 10:44:54 Peter wrote: > On Wed, Jun 2, 2010 at 7:21 AM, Jose Blanca wrote: > > On Tuesday 01 June 2010 17:15:03 Peter wrote: > >> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote: > >> > Peter; > >> > > >> >> With the new command line wrappers and the tutorial pushing > >> >> users towards using subprocess we've had more queries > >> >> about how to use it. The subprocess module itself is rather > >> >> scary I guess, and things could be made a lot easier. > > > > We had the same need. We solved it with a call function. You can take > > a look at: > > > > http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/cmd_util > >s.py > > It looks complicated (and I'm sure with good reason), but I'd guess > you've never tried this on Windows? Yes it is somewhat complicated. 
We need some functionalities like accepting stdout to be a file or just a
pipe (some programs have very long stdouts). We have added everything we have
required for our programs.
No, we haven't tested anything on Windows.

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk Wed Jun 2 05:25:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 10:25:47 +0100
Subject: [Biopython-dev] RNA alphabets; was Bio.PDB enhancements
Message-ID:

On Wed, Jun 2, 2010 at 9:17 AM, Kristian Rother wrote:
>
> A potential problem that I'd like to point out early is that we are
> working with modified RNA nucleotides a lot (up to 20% of residues in
> every tRNA). This would require extending the RNA Alphabet (which now just
> is "AGCU") - but I see this as remote from the Bio.XXXX.read() thread.
>

What letters are you missing? There is a commented out ExtendedIUPACRNA
alphabet that may be relevant in Bio/Alphabet/IUPAC.py

Peter

From biopython at maubp.freeserve.co.uk Wed Jun 2 07:36:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 12:36:46 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To:
References: <20100601132355.GU1054@sobchak.mgh.harvard.edu>
Message-ID:

On Tue, Jun 1, 2010 at 4:15 PM, Peter wrote:
> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote:
>> I'd suggest having an option to not capture stdout and stderr, which
>> would help users avoid those cases where a program spews a lot to
>> stdout and it's unwieldy to capture and stick it into a string.
>
> We need to avoid any risk of deadlocks, so I guess the safe
> implementation here would be call subprocess with stdout and
> stderr sent to dev null.

How does this look?
Tested on Mac and Windows:
http://github.com/peterjc/biopython/tree/app-exec2

Example usage without capturing the output:

    from Bio.Emboss.Applications import WaterCommandline
    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
                                 asequence="a.fasta", bsequence="b.fasta")
    print "About to run:\n%s" % water_cmd
    return_code = water_cmd()
    print "Return code: %i" % return_code

Example usage with stdout and stderr capture:

    from Bio.Emboss.Applications import WaterCommandline
    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
                                 asequence="a.fasta", bsequence="b.fasta")
    print "About to run:\n%s" % water_cmd
    stdout, stderr, return_code = water_cmd(capture=True)
    print "Return code: %i" % return_code
    print "Tool output:\n%s" % stdout

Note in this implementation it either returns an integer error level
(the default) or a tuple of stdout, stderr and the error level return
code. If we opt for adding methods rather than using __call__
these could be different methods instead.

Another potentially useful option would be to copy the
subprocess.check_call() function in Python 2.5+ which verifies
the return code (error level) is zero and raises an exception if not
(probably only sensible if not capturing the output?). Maybe this
could even be the default behaviour?

[I would prefer to keep the interface as simple as possible though,
less options is better! KISS principle.]
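
For readers without the branch to hand, the two calling modes described above could be sketched roughly as follows. This is not the code from the app-exec2 branch: the class name SketchCommandline is hypothetical, it wraps a plain argument list rather than a real Bio.Application wrapper, and the Python interpreter stands in for the external tool.

```python
import os
import subprocess
import sys


class SketchCommandline(object):
    """Toy stand-in for a command line wrapper (hypothetical, not Bio.Application)."""

    def __init__(self, *args):
        self.args = list(args)

    def __str__(self):
        return " ".join(self.args)

    def __call__(self, capture=False):
        if not capture:
            # Discard output so a chatty tool cannot fill a pipe and block.
            with open(os.devnull, "w") as null:
                return subprocess.call(self.args, stdout=null, stderr=null)
        child = subprocess.Popen(self.args, stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE,
                                 universal_newlines=True)
        # communicate() drains both pipes together, avoiding the deadlock
        # that reading stdout and stderr one after the other can cause.
        stdout, stderr = child.communicate()
        return stdout, stderr, child.returncode


if __name__ == "__main__":
    cmd = SketchCommandline(sys.executable, "-c", "print('hello')")
    print("Return code: %i" % cmd())
    stdout, stderr, return_code = cmd(capture=True)
    print("Tool output:\n%s" % stdout)
```

Returning either an integer or a tuple from the same call mirrors the behaviour described above; splitting this into two separately named methods would keep each return type fixed.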
Peter

From biopython at maubp.freeserve.co.uk Wed Jun 2 07:59:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 12:59:46 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To:
References: <20100601132355.GU1054@sobchak.mgh.harvard.edu>
Message-ID:

On Wed, Jun 2, 2010 at 12:36 PM, Peter wrote:
> On Tue, Jun 1, 2010 at 4:15 PM, Peter wrote:
>> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote:
>>> I'd suggest having an option to not capture stdout and stderr, which
>>> would help users avoid those cases where a program spews a lot to
>>> stdout and it's unwieldy to capture and stick it into a string.
>>
>> We need to avoid any risk of deadlocks, so I guess the safe
>> implementation here would be call subprocess with stdout and
>> stderr sent to dev null.
>
> How does this look? Tested on Mac and Windows:
> http://github.com/peterjc/biopython/tree/app-exec2
>
> Example usage without capturing the output:
>
>     from Bio.Emboss.Applications import WaterCommandline
>     water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
>                                  asequence="a.fasta", bsequence="b.fasta")
>     print "About to run:\n%s" % water_cmd
>     return_code = water_cmd()
>     print "Return code: %i" % return_code
>
> Example usage with stdout and stderr capture:
>
>     from Bio.Emboss.Applications import WaterCommandline
>     water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
>                                  asequence="a.fasta", bsequence="b.fasta")
>     print "About to run:\n%s" % water_cmd
>     stdout, stderr, return_code = water_cmd(capture=True)
>     print "Return code: %i" % return_code
>     print "Tool output:\n%s" % stdout
>
> Note in this implementation it either returns an integer error level
> (the default) or a tuple of stdout, stderr and the error level return
> code. If we opt for adding methods rather than using __call__
> these could be different methods instead.
>
> Another potentially useful option would be to copy the
> subprocess.check_call() function in Python 2.5+ which verifies
> the return code (error level) is zero and raises an exception if not
> (probably only sensible if not capturing the output?). Maybe this
> could even be the default behaviour?
>
> [I would prefer to keep the interface as simple as possible though,
> less options is better! KISS principle.]

With that in mind, as I mentioned yesterday maybe we should just
update the documentation to suggest using os.system() when you
just need the return code and there is no stdin to worry about:

    import os
    from Bio.Emboss.Applications import WaterCommandline
    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
                                 asequence="a.fasta", bsequence="b.fasta")
    print "About to run:\n%s" % water_cmd
    return_code = os.system(str(water_cmd))
    print "Return code: %i" % return_code

Even if the Python documentation seems to be discouraging it,
using os.system() seems simple, robust, and cross platform.
We could even update the tutorial now and post it online - it
should make some people's lives a little easier.

[Note this is actually a silly example, I should be telling water to
output to a file, not stdout which is then ignored.]

Peter

From krother at rubor.de Wed Jun 2 08:14:05 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 2 Jun 2010 14:14:05 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To:
References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID: <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>

Hi Peter,

Bio.SeqIO would be a nice place for RNA 2D parsers.
I can create a new branch for that (on Git: krother/biopython). Putting secondary structures like '((((....))))' for a hairpin into the letter_annotation field makes sense. I think it even would work for pseudoknotted RNA (which is hard to represent as a string; one possible notation would be '(((..[[[....)))..]]]'). Where should the str subclass for secondary structures that the parsers create go? Could it be Bio.Struct.RNA? Best, Kristian Putting RNA secondary structures >> As for protein secondary structure, it's usually associated with a >> sequence >> or a structure, so maybe we could get by with storing that information >> in an >> ordinary Structure or SeqRecord object without inventing a new subclass. > > Maybe all/most secondary structure parsers can just go into Bio.SeqIO (for > both proteins, RNA and DNA). We can store a secondary structure string as > per-letter-annotation, or things like helix regions as SeqFeature objects. > > Peter > > From krother at rubor.de Wed Jun 2 08:21:43 2010 From: krother at rubor.de (Kristian Rother) Date: Wed, 2 Jun 2010 14:21:43 +0200 Subject: [Biopython-dev] RNA alphabets; was Bio.PDB enhancements In-Reply-To: References: Message-ID: <837b33ddc1456279e108d21c0d12d3fb-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWQtfXA==-webmailer2@server06.webmailer.hosteurope.de> Hi Peter, I'm afraid the matter is more complicated. To date, we have 115 modified RNA bases, which means in practice that you run out of nice ASCII characters. Moreover, some people use one-letter symbols in RNA as wildcards (R for purine, Y for pyrimidine). As a consequence, several sets of abbreviations have been developed - see http://modomics.genesilico.pl/modification_list to get an impression. We've written for our own purposes a class containing different ways of nomenclature, but I think it's incompatible with Bio.Alphabet - but I'd like to change that.
Best Regards, Kristian > On Wed, Jun 2, 2010 at 9:17 AM, Kristian Rother wrote: >> >> A potential problem that I'd like to point out early is that we are >> working with modified RNA nucleotides a lot (up to 20% of residues in >> every tRNA). This would require extending the RNA Alphabet (which now >> just >> is "AGCU") - but I see this as remote from the Bio.XXXX.read() thread. >> > What letters are you missing? There is a commented out ExtendedIUPACRNA > alphabet that may be relevant in Bio/Alphabets/IUPAC.py > > Peter > > From biopython at maubp.freeserve.co.uk Wed Jun 2 09:22:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 2 Jun 2010 14:22:36 +0100 Subject: [Biopython-dev] RNA alphabets; was Bio.PDB enhancements In-Reply-To: <837b33ddc1456279e108d21c0d12d3fb-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWQtfXA==-webmailer2@server06.webmailer.hosteurope.de> References: <837b33ddc1456279e108d21c0d12d3fb-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWQtfXA==-webmailer2@server06.webmailer.hosteurope.de> Message-ID: On Wed, Jun 2, 2010 at 1:21 PM, Kristian Rother wrote: > > Hi Peter, > > I'm afraid the matter is more complicated. To date, we have 115 modified > RNA bases, which means in practice that you run out of nice ASCII > characters. Moreover, some people use one-letter symbols in RNA as > wildcards (R for purine, Y for pyrimidine). As a consequence, several sets > of abbreviations have been developed - see > http://modomics.genesilico.pl/modification_list to get an impression. > > We've written for our own purposes a class containing different ways of > nomenclature, but I think its incompatible to Bio.Alphabet - but I'd like > to change that. > > Best Regards, > ? Kristian Hmm. I wonder if the HTML entities would work nicely in Python (as unicode)? That way you could have an unambiguous string representation where each letter is one character long. 
I'm thinking a Seq subclass (with a special alphabet) might be the way to go here, allowing access to the single character entities by default but also the longer codes as well. There are similarities with modified peptide sequences where there are clear three letter codes, but not one letter codes. Tricky. Peter From biopython at maubp.freeserve.co.uk Wed Jun 2 09:24:49 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 2 Jun 2010 14:24:49 +0100 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de> References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de> Message-ID: On Wed, Jun 2, 2010 at 1:14 PM, Kristian Rother wrote: > > Hi Peter, > > Bio.SeqIO would be a nice place for RNA 2D parsers. I can create a new > branch for that (on Git: krother/biopython). > > Putting secondary structures like '((((....))))' for a hairpin into the > letter_annotation field makes sense. I think it even would work for > pseudoknotted RNA (which is hard to represent as a string, one possible > notation would be '(((..[[[....)))..]]]'. > > Where should the str subclass for secondary structures that the parsers > create go? Could it be Bio.Struct.RNA? > > Best, > ? Kristian You don't think plain strings in the SeqRecord's letter_annotation dict would be enough? Assuming you do need something then perhaps under Bio.Seq or Bio.SeqUtils might be worth considering as alternatives to Bio.Struct.RNA. 
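To illustrate the plain-string option, here is a rough sketch (not existing Biopython code; the helper name pair_table is made up for this example) showing that a dot-bracket string kept as an ordinary str can still be turned into base pairs on demand, including the simple pseudoknot notation with square brackets mentioned above. In Biopython the string itself would simply sit in the SeqRecord's per-letter annotation dictionary under some agreed key.

```python
def pair_table(structure):
    """Return sorted (i, j) base-pair index tuples from a dot-bracket string.

    Round and square brackets are matched on separate stacks, so a simple
    pseudoknot such as '(((..[[[....)))..]]]' can be represented.
    """
    closers = {")": "(", "]": "["}
    stacks = {"(": [], "[": []}
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol in stacks:
            # Opening bracket: remember where it was seen.
            stacks[symbol].append(index)
        elif symbol in closers:
            # Closing bracket: pair it with the most recent matching opener.
            stack = stacks[closers[symbol]]
            if not stack:
                raise ValueError("unmatched %r at position %d" % (symbol, index))
            pairs.append((stack.pop(), index))
        elif symbol != ".":
            raise ValueError("unexpected %r at position %d" % (symbol, index))
    for opener, stack in stacks.items():
        if stack:
            raise ValueError("unmatched %r at position %d" % (opener, stack[-1]))
    return sorted(pairs)


hairpin = pair_table("((((....))))")
pseudoknot = pair_table("(((..[[[....)))..]]]")
print(hairpin)
print(pseudoknot)
```

Because the annotation stays a plain string of the same length as the sequence, slicing a record would slice the structure string along with it; whether the sliced string is still a balanced structure is a separate question that a str subclass would not solve by itself either.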
Peter From eric.talevich at gmail.com Thu Jun 3 12:17:09 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 3 Jun 2010 12:17:09 -0400 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de> References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de> Message-ID: On Wed, Jun 2, 2010 at 8:14 AM, Kristian Rother wrote: > > Putting secondary structures like '((((....))))' for a hairpin into the > letter_annotation field makes sense. I think it even would work for > pseudoknotted RNA (which is hard to represent as a string, one possible > notation would be '(((..[[[....)))..]]]'. > > Here's another format that was designed to represent pseudoknots: http://www.uga.edu/RNA-Informatics/files/software/RNApasta.help.html#Format I'm not sure how standardized or widely used it is, but the program RNA-pasta works with it: http://www.uga.edu/RNA-Informatics/?f=software&p=RNApasta -Eric From biopython at maubp.freeserve.co.uk Thu Jun 3 12:43:47 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 3 Jun 2010 17:43:47 +0100 Subject: [Biopython-dev] More SeqRecord methods In-Reply-To: References: Message-ID: On Mon, May 31, 2010 at 3:53 PM, Peter wrote: > Hi all, > > What do people think of adding upper and lower methods to the SeqRecord? > http://bugzilla.open-bio.org/show_bug.cgi?id=3054 I checked that in with an example in the tutorial. > If that is well received, how about adding another Seq method to the > SeqRecord, the newish ungap method? 
> http://bugzilla.open-bio.org/show_bug.cgi?id=3060 This one I would like some feedback on first. I'm sure the implementation could be made much more efficient too. Peter From bugzilla-daemon at portal.open-bio.org Thu Jun 3 12:45:16 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 3 Jun 2010 12:45:16 -0400 Subject: [Biopython-dev] [Bug 3054] Add upper and lower methods to the SeqRecord In-Reply-To: Message-ID: <201006031645.o53GjGd9019264@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3054 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-03 12:45 EST ------- Checked in: http://github.com/biopython/biopython/tree/f4f11a9c4e7aca10c33cfe93c78d4972a0d736f8 With an example in the tutorial too: http://github.com/biopython/biopython/commit/3de8bbd423010eb0b480b8966041f7c6d8e9890d Marking this as fixed. See also: http://lists.open-bio.org/pipermail/biopython-dev/2010-May/007772.html http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007801.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Jun 3 13:24:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 3 Jun 2010 18:24:43 +0100 Subject: [Biopython-dev] More SeqRecord methods In-Reply-To: References: Message-ID: On Thu, Jun 3, 2010 at 5:43 PM, Peter wrote: > On Mon, May 31, 2010 at 3:53 PM, Peter wrote: > >> ..., how about adding another Seq method to the >> SeqRecord, the newish ungap method? >> http://bugzilla.open-bio.org/show_bug.cgi?id=3060 > > This one I would like some feedback on first.
I'm sure the > implementation could be made much more efficient too. Maybe I should mention that I also envisage a similar method for the alignment object, to give a new alignment with any all-gap-columns removed (perhaps with an optional argument to specify a threshold for the number of gaps required - defaulting to only removing columns which are all gaps). Again, the simplest way to implement this is to re-use the new alignment slicing and addition features - much as how I did it for the proposed SeqRecord ungap method. Peter From eric.talevich at gmail.com Thu Jun 3 15:10:51 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 3 Jun 2010 15:10:51 -0400 Subject: [Biopython-dev] Fixup branch for Bio.PDB Message-ID: Hi all, I've poked around Bugzilla, taken patches for some outstanding bugs, and applied them to a branch on GitHub: http://github.com/etal/biopython/tree/pdbfixes http://github.com/etal/biopython/commits/pdbfixes I'd like to encourage people to test this branch with their own code, and if it all still works (or nobody's interested in testing this branch), I'll push it to the Biopython trunk so it gets tested more. Time frame: if this branch lingers too long, there's a high chance it will cause conflicts for João (our GSoC student) the next time he merges. How about a week?
The branch has patches for bugs 2820, 2948, 2879, 2950 and 2951: http://bugzilla.open-bio.org/show_bug.cgi?id=2820 http://bugzilla.open-bio.org/show_bug.cgi?id=2948 http://bugzilla.open-bio.org/show_bug.cgi?id=2879 http://bugzilla.open-bio.org/show_bug.cgi?id=2950 http://bugzilla.open-bio.org/show_bug.cgi?id=2951 Thanks, Eric From biopython at maubp.freeserve.co.uk Fri Jun 4 04:44:19 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Jun 2010 09:44:19 +0100 Subject: [Biopython-dev] Fixup branch for Bio.PDB In-Reply-To: References: Message-ID: On Thu, Jun 3, 2010 at 8:10 PM, Eric Talevich wrote: > Hi all, > > I've poked around Bugzilla, taken patches for some outstanding bugs, and > applied them to a branch on GitHub: > http://github.com/etal/biopython/tree/pdbfixes > http://github.com/etal/biopython/commits/pdbfixes > > I'd like to encourage people to test this branch with their own code, and if > it all still works (or nobody's interested in testing this branch), I'll > push it to the Biopython trunk so it gets tested more. Time frame: if this > branch lingers too long, there's a high chance it will cause conflicts for > João (our GSoC student) the next time he merges. How about a week? > > The branch has patches for bugs 2820, 2948, 2879, 2950 and 2951: > http://bugzilla.open-bio.org/show_bug.cgi?id=2820 > http://bugzilla.open-bio.org/show_bug.cgi?id=2948 > http://bugzilla.open-bio.org/show_bug.cgi?id=2879 > http://bugzilla.open-bio.org/show_bug.cgi?id=2950 > http://bugzilla.open-bio.org/show_bug.cgi?id=2951 > > Thanks, > Eric That sounds like a good plan. Peter From mjldehoon at yahoo.com Fri Jun 4 11:55:27 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 4 Jun 2010 08:55:27 -0700 (PDT) Subject: [Biopython-dev] Blast parsers and records In-Reply-To: <901919.44402.qm@web62402.mail.re1.yahoo.com> Message-ID: <933074.46322.qm@web62405.mail.re1.yahoo.com> Michael, Peter, Sebastian, Laurent, Jose, and others, Thanks for your comments.
It looks like there are lots of things to discuss, so let's start with the easiest ones. About converting a record to a string (point 5): I agree that using __str__ is probably not the best choice, so let's use __format__ instead, or add a "write" method. The added advantage of these is that we can print out a record in different formats (xml, text, table) by specifying the requested format as an argument. For point 3), maybe my wording was confusing; actually what I had in mind is the case where a given Blast program can produce different output formats (xml, text, table, etc.). This was inspired by this bug report: http://bugzilla.open-bio.org/show_bug.cgi?id=2176 In my mind, the different output formats are just different intermediates, but in essence they are the same and should therefore be stored in the same class. So, if I run blastp, save the result as XML, and parse it, I'd expect the same class as when I run blastp and save and parse the output in table format. Just in the latter case, some information may be missing if it is not available in the output in table format. Does that sound acceptable? --Michiel. --- On Fri, 5/28/10, Michiel de Hoon wrote: > From: Michiel de Hoon > Subject: [Biopython-dev] Blast parsers and records > To: biopython-dev at biopython.org > Date: Friday, May 28, 2010, 11:23 PM > Hi everybody, > > With Biopython 1.54 out (thanks Peter!), and NCBI > encouraging to use its new Blast+ suite of Blast programs, > maybe this is a good time to tackle some older bugs related > to Blast output parsing in Biopython: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2176 > (inconsistencies in the output of different Blast parsers) > > http://bugzilla.open-bio.org/show_bug.cgi?id=2929 > (inconsistencies between Psi-blast parsers) > > http://bugzilla.open-bio.org/show_bug.cgi?id=2319 > (parsing Blast table output) > > and more generally think about the design of the Blast > record class and Blast parsing. 
In my opinion, these are the > major issues: > > 1) Blast parsers are located in several modules > (Bio.Blast.NCBIXML, Bio.Blast.NCBIStandalone, > Bio.Blast.ParseBlastTable). I think we should have one > read() function and one parse() function under Bio.Blast, > with arguments specifying which format the Blast output is > in. > > 2) Blast records produced by any of the parsers should be > consistent with each other. As XML output by blast and > psi-blast follow the same DTD, we should be able to > represent both by a single Record class. > > 3) Different parsers should store information in this > Record class in the same way. > > 4) The current Blast record stores its information in > attributes. If you use Bio.Entrez to parse Blast XML output > (Biopython 1.54 contains the necessary DTDs to do so), the > information is stored in dictionaries. This has some > advantages. For example, it allows you to use record.keys() > to find out what the record contains. Ideally, I think that > a Blast Record class should inherit from a dictionary. > > 5) We should be able to print a Blast record object to > generate output that is close to the plain-text output > generated by blast. This would allow us to generate and > store Blast output as XML, and to convert the output to > plain-text to make it more human-readable. > > 6) The current Blast record inherits from > Bio.Blast.Record.Header, Bio.Blast.Record.DatabaseReport, > and Bio.Blast.Record.Parameters. I don't see the rationale > for this inheritance, and I think we should remove it. > > Any comments or suggestions (in particular about my proposal > to have a Blast Record class that inherits from a > dictionary)? Btw, to avoid breaking scripts, I propose that > any changes to the Blast record and parser are implemented > separately from the existing parsers and record, and to > leave those untouched. > > --Michiel. > >
> _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Sat Jun 5 10:49:39 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 5 Jun 2010 15:49:39 +0100 Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris Message-ID: Hi all, Are any Biopython folk planning to be at the EuroSciPy conference in Paris this year (July 2010)? They are still finalising the Scientific track, but the list of tutorials is quite interesting already: http://www.euroscipy.org/conference/euroscipy2010 Peter From biopython at maubp.freeserve.co.uk Mon Jun 7 05:35:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 10:35:15 +0100 Subject: [Biopython-dev] Working directly on the main git repository Message-ID: Hi all, I thought I'd write down some notes about how I've been using git recently. This may be of interest to any of the other core developers (those of us with read-write access to the main repository), and I might get some good tips from any discussion. The key point is that I have read+write access to two repositories on github (the official repository AND my own fork), so there are different advantages/disadvantages about which I choose to work with directly as my main repository. Our official repository has just a single stable master branch, and I often need to work directly with this (e.g. committing small bug fixes or adding more documentation). Therefore, if I set up a clone of the master repository, I can work on the main branch very easily. Now, when working on a branch for new features, I could just do this locally, and when they are ready, merge them direct to the master. However, this means others cannot look at my work (and I find it a problem when working on multiple machines). Alternatively, I could push the branches to the public "master" repository.
This would be the simplest option BUT the high visibility gives any such experimental branch disproportionate status. I think this would be a good idea for important (multi-person) efforts, like Python 3 work. Instead, I have a github repository of my own (what github calls a fork), and I push branches there.

http://github.com/biopython/biopython - the official branch(es)
http://github.com/peterjc/biopython - my branches

How does this work in practice? Like this - I clone the master and add a reference to my repository (and I do the same when I want to grab a branch from another developer):

    git clone git@github.com:biopython/biopython.git
    cd biopython
    git remote add peterjc git@github.com:peterjc/biopython.git
    git fetch peterjc

Then make a new local branch as usual, and when ready to share it publicly, I push it to *my* repository on github:

    git branch new-work
    git checkout new-work
    git commit ...
    git push peterjc new-work

This would then appear as a new-work branch on my github page. Then if I (or someone else) wants to access these branches later (e.g. from another machine) just use the checkout tracked remote branch. For example,

    git clone git@github.com:biopython/biopython.git
    cd biopython
    git remote add peterjc git@github.com:peterjc/biopython.git
    git fetch peterjc
    git checkout -t peterjc/seqio-imgt

This then looks like a normal branch (called just "seqio-imgt" in this example), but git knows it is linked to the remote branch on the "peterjc" repository (not the origin which is the "official" repository). I'd have to check, but I guess that if the original git clone is done with git://github.com/biopython/biopython.git instead (read only access) the same procedure could be used by non core devs. However, I'm not sure this is clearer for them. I think the current procedure (on our wiki) where you add a remote reference to the "upstream" official repository works better in this case. Comments?
Peter Useful links from Google searches: http://www.gitready.com/intermediate/2009/01/09/checkout-remote-tracked-branch.html http://www.gitready.com/beginner/2009/03/09/remote-tracking-branches.html From biopython at maubp.freeserve.co.uk Mon Jun 7 09:40:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 14:40:54 +0100 Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris In-Reply-To: References: Message-ID: On Sat, Jun 5, 2010 at 3:49 PM, Peter wrote: > Hi all, > > Are any Biopython folk planning to be at the EuroSciPy > conference in Paris this year (July 2010)? They are still > finalising the Scientific track, but the list of tutorials is > quite interesting already: > > http://www.euroscipy.org/conference/euroscipy2010 > > Peter Hi all, The track list for the EuroSciPy 2010 Scientific track has now been announced, and I'm delighted that I will be able to present a talk on Biopython (likely 4pm Saturday 10 July). While I hope there will be some other Biopython users there, this is a nice opportunity to meet the broader scientific python community. There are still places at the moment if you want to attend: http://www.euroscipy.org/conference/euroscipy2010 Unfortunately I will not be attending BOSC or ISMB this year. However Brad Chapman will be there to present the annual "Biopython Project Update" talk (as well as helping to organise this year's BOSC and the associated CodeFest event preceding it). I'd love to have been there too, but I'm sure everyone attending will have a great time. Again, registration is still open: http://www.open-bio.org/wiki/BOSC_2010 http://www.open-bio.org/wiki/Codefest_2010 Regards, Peter P.S. 
Those of you in North America you might also be interested in the main SciPy conference in Austin, Texas (28 June to 3 July 2010): http://conference.scipy.org/scipy2010/ From biopython at maubp.freeserve.co.uk Mon Jun 7 09:50:06 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 14:50:06 +0100 Subject: [Biopython-dev] Blast parsers and records In-Reply-To: <933074.46322.qm@web62405.mail.re1.yahoo.com> References: <901919.44402.qm@web62402.mail.re1.yahoo.com> <933074.46322.qm@web62405.mail.re1.yahoo.com> Message-ID: On Fri, Jun 4, 2010 at 4:55 PM, Michiel de Hoon wrote: > Michael, Peter, Sebastian, Laurent, Jose, and others, > > Thanks for your comments. It looks like there are lots of things to discuss, > so let's start with the easiest ones. > > About converting a record to a string (point 5): I agree that using __str__ is > probably not the best choice, so let's use __format__ instead, or add a "write" > method. The added advantage of these is that we can print out a record in > different formats (xml, text, table) by specifying the requested format as an argument. The __format__ or format method sounds like a great idea (following other bits of Biopython). > For point 3), maybe my wording was confusing; actually what I had in mind > is the case where a given Blast program can produce different output formats > (xml, text, table, etc.). This was inspired by this bug report: > http://bugzilla.open-bio.org/show_bug.cgi?id=2176 > In my mind, the different output formats are just different intermediates, but > in essence they are the same and should therefore be stored in the same > class. So, if I run blastp, save the result as XML, and parse it, I'd expect the > same class as when I run blastp and save and parse the output in table format. > Just in the latter case, some information may be missing if it is not available in > the output in table format. Does that sound acceptable? 
I agree that records from all the different BLAST output formats should be represented by a common base class - but not necessarily the same class. For example, the default plain text and XML formats include the pairwise alignments, but the tabular output does not. To me having a sub-class which stores the pairwise alignments seems natural here. Peter From biopython at maubp.freeserve.co.uk Mon Jun 7 13:45:57 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 18:45:57 +0100 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite Message-ID: Hi all, Thanks for the lively discussion on the main list, http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html ... http://lists.open-bio.org/pipermail/biopython/2010-June/006580.html I've spent the afternoon updating my old branch which uses SQLite to store the record identifier to file offset mapping. Using the code on this branch, Bio.SeqIO.index() supports a new optional argument currently called "db" (other names I like including "cache", suggestions welcome): http://github.com/peterjc/biopython/tree/index-sqlite The default (False) is not to use SQLite, but continue with an in memory Python dictionary. As long as you have enough RAM and don't plan to use the index at a later date, this will be fastest. If set to True or a filename, then an SQLite index is used to hold the offsets. This means very low RAM requirements, but is a lot slower because the offsets are written to disk and the SQLite index is updated as we go. I expect this part can be optimised (e.g. try to build the index at the end, try committing in batches). I'm still testing this, but the core of the work is done I think. Once we're happy with the public API, we can concentrate on things like the SQLite schema, and optimising the code. Peter P.S. I know it will need a little work to fail gracefully on Python 2.4 when SQLite isn't installed. 
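For anyone who wants to experiment with the idea without checking out the branch, here is a minimal standalone sketch of storing identifier to offset mappings in SQLite. It is not the code on the index-sqlite branch: the table layout, the function names build_offset_index and get_raw, and the batch size are all assumptions made for illustration, and it only handles FASTA. It does try the two optimisations mentioned above, committing in batches and creating the index at the end.

```python
import os
import sqlite3
import tempfile


def build_offset_index(fasta_path, db_path=":memory:", batch_size=10000):
    """Store record-id -> file-offset pairs for a FASTA file in SQLite."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE offsets (key TEXT, offset INTEGER)")
    batch = []
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Assumes every header line carries an identifier after '>'.
                key = line[1:].split(None, 1)[0].decode("ascii")
                batch.append((key, offset))
                if len(batch) >= batch_size:
                    # Commit in batches rather than once per record.
                    con.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
                    con.commit()
                    batch = []
    if batch:
        con.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
    con.commit()
    # Build the index once at the end; duplicate identifiers fail here.
    con.execute("CREATE UNIQUE INDEX key_index ON offsets (key)")
    con.commit()
    return con


def get_raw(con, fasta_path, key):
    """Return the raw bytes of one record, located via its stored offset."""
    row = con.execute(
        "SELECT offset FROM offsets WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        lines = [handle.readline()]
        while True:
            line = handle.readline()
            if not line or line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines)


# Tiny demonstration on a temporary two-record file.
demo = tempfile.NamedTemporaryFile(suffix=".fasta", delete=False)
demo.write(b">alpha first record\nACGT\nACGT\n>beta\nGGCC\n")
demo.close()
con = build_offset_index(demo.name, batch_size=1)
alpha_raw = get_raw(con, demo.name, "alpha")
beta_raw = get_raw(con, demo.name, "beta")
con.close()
os.remove(demo.name)
```

Passing a filename instead of ":memory:" as db_path would keep the index on disk for later reuse, which is the low-RAM case described above; how much the batching and late index creation actually help would need measuring on files with millions of records.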
From biopython at maubp.freeserve.co.uk Mon Jun 7 14:23:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 19:23:05 +0100 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: Peter wrote: >... > > http://github.com/peterjc/biopython/tree/index-sqlite > > ... an SQLite index is used to hold > the offsets. This means very low RAM requirements, but is a lot > slower because the offsets are written to disk and the SQLite > index is updated as we go. I expect this part can be optimised > (e.g. try to build the index at the end, try committing in batches). Having now tried using this on some files with tens of millions of records, tuning how we use SQLite is going to be important. Peter From bioinformed at gmail.com Mon Jun 7 17:10:42 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Mon, 7 Jun 2010 17:10:42 -0400 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote: > Peter wrote: > >... > > > > http://github.com/peterjc/biopython/tree/index-sqlite > > > > ... an SQLite index is used to hold > > the offsets. This means very low RAM requirements, but is a lot > > slower because the offsets are written to disk and the SQLite > > index is updated as we go. I expect this part can be optimised > > (e.g. try to build the index at the end, try committing in batches). > > Having now tried using this on some files with tens of millions of > records, tuning how we use SQLite is going to be important. > > Wouldn't a Berkeley database be much much faster for constructing simple key to offset mappings? 
-Kevin From anaryin at gmail.com Mon Jun 7 20:45:05 2010 From: anaryin at gmail.com (João Rodrigues) Date: Mon, 7 Jun 2010 19:45:05 -0500 Subject: [Biopython-dev] [GSOC] Report - Week 1 Message-ID: Dear all, Eric suggested that I write a weekly email wrapping up my progress, any problems I encountered, new ideas, etc. So, here's week 1 :) *Proposed Tasks:* Wiki *Project's Github account:* Link *Progress:* *1. Renumbering Residues* I wrote a small function in Structure.py (link) that iterates over the residues in a chain and subtracts the original first residue number. This keeps gaps intact. Worked on my machine for a set of 75 proteins I was working on. Also allows for people to change the starting residue for whatever reason, the default being 1. I had originally thought of having a SEQRES parsing function and using this as a base for the new renumbering. However, most structures that lack residues (gaps) still count them in the numbering. Since there is no parser for SEQRES, I thought this to be the best option. *Example*

    ...
    s = p.get_structure('a', '2KSX.pdb')
    s.renumber_residues()
    s.renumber_residues(start=0)

*2. Disulphide bond search* I originally proposed to use the NeighborSearch method but I didn't know that subtracting two atom objects gave me their distance. I used this instead. I defined a threshold of 3A for a S-S since the average is 2.05A. I tried to get some paper/doc from other software where such a limit would be already defined but I didn't find any.. thus, I assigned 3 because its results agreed with the SSBOND records. The user can provide a threshold integer or float as an argument to make the search stricter or broader. The function generates first an iterator with all the pairs of cysteines possible in the protein. It then checks and yields those with distances between the SG atoms of the cysteines below the threshold. The result is also an iterator with tuples containing pairs of residue objects. *Example* ...
    s = p.get_structure('a', '2KSX.pdb')
    [i for i in s.search_ss_bonds()]
    [(<Residue CYS ...>, <Residue CYS ...>),
     (<Residue CYS ...>, <Residue CYS ...>),
     (<Residue CYS ...>, <Residue CYS het=  resseq=95 icode= >),
     (<Residue CYS ...>, <Residue CYS het=  resseq=66 icode= >),
     (<Residue CYS ...>, <Residue CYS het=  resseq=200 icode= >)]
    len([i for i in s.search_ss_bonds(threshold=100)])
    45

*Problems:* *3. Biological Unit* I added code to parse_pdb_header to extract the REMARK 350 section. It contains something like this (1IHM.pdb):

    REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
    REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
    REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000
    REMARK 350   BIOMT1   2  0.500000 -0.809017 -0.309017        0.00000
    REMARK 350   BIOMT2   2  0.809017  0.309017  0.500000        0.00000
    REMARK 350   BIOMT3   2 -0.309017 -0.500000  0.809017        0.00000
    REMARK 350   BIOMT1   3 -0.309017 -0.500000 -0.809017        0.00000

I parse out the 4th column to identify each transformation. I store a 3x3 rotation matrix and the translation vector separately. It is then easy to apply them to each atom record via the transform function. Now, the problem lies in what the output should be. We broke it down to two main options: a. Create a new structure object for each rotated/translated object, thus making the final output a list of structures. This takes quite a while actually. I tried this with a deepcopy method to copy each structure and it took over 30 seconds on my machine for that PDB file above. b. Add the new rotated objects as new chains in the original structure. This is actually a good solution because it allows people to use other methods (the SS search comes to mind) on quaternary structures. It also allows the user to write a file with all the structures in their place using PDBIO quite seamlessly. However, it might be complicated to deal with an excess of chains, or if not all chains are supposed to be rotated (dunno if the case actually exists). My personal belief is that B is the way to go. Although it adulterates the original structure with alien chains, it allows much greater flexibility. I haven't tested it though. ---- Comments? :) João [...]
Rodrigues From anaryin at gmail.com Mon Jun 7 23:42:27 2010 From: anaryin at gmail.com (João Rodrigues) Date: Mon, 7 Jun 2010 22:42:27 -0500 Subject: [Biopython-dev] [GSOC] Report - Week 1 In-Reply-To: References: Message-ID: Just my own heads up and comment. I thought of using MODEL records to hold the rotated structures. Citing the PDB format guidelines: This record is used only when more than one model appears in an entry. *Generally, > it is employed mainly for NMR structures.* The chemical connectivity > should be the same for each model. ATOM, HETATM, ANISOU, and TER records for > each model structure and are interspersed as needed between MODEL and ENDMDL > records. > Since REMARK 350 seems to be an X-ray exclusive feature and conversely MODEL an NMR one, I believe this could also be a possible solution. I'm adding the code I wrote to Git. There is a huge speed problem with that deepcopy method.. if someone has a faster/better alternative, it would be great as this takes around 2 seconds per matrix. Best! João [...] Rodrigues @ http://www.biopython.org/wiki/User:Joaor On Mon, Jun 7, 2010 at 7:45 PM, João Rodrigues wrote: > Dear all, > > Eric suggested that I write a weekly email wrapping up my progress, any > problems I encountered, new ideas, etc. So, here's week 1 :) > > *Proposed Tasks:* Wiki > *Project's Github account:* Link > *Progress:* > > *1. Renumbering Residues* > > I wrote a small function in Structure.py (link) > that iterates over the residues in a chain and subtracts the original first > residue number. This keeps gaps intact. Worked on my machine for a set of > 75 proteins I was working on. Also allows for people to change the starting > residue for whatever reason, the default being 1. > > I had originally thought of having a SEQRES parsing function and using this > as a base for the new renumbering. However, most structures that lack > residues (gaps) still count them in the numbering.
Since there is no parser > for SEQRES, I thought this to be the best option. > > *Example > * > ... > s = p.get_structure('a', '2KSX.pdb') > s.renumber_residues() > s.renumber_residues(start=0) > > > *2. Disulphide bond search* > > I originally proposed to use the NeighborSearch method but I didn't know > that subtracting two atom objects gave me their distance. I used this > instead. > > I defined a threshold of 3A for a S-S since the average is 2.05A. I tried > to get some paper/doc from other software where such a limit would be > already defined but I didn't find any.. thus, I assigned 3 because its > results agreed with the SSBOND records. The user can provide a threshold > integer or float as an argument to make the search stricter or broader. > > The function generates first an iterator with all the pairs of cysteines > possible in the protein. It then checks and yields those with distances > between the SG atoms of the cystein below the threshold. The result is also > an iterator with tuples containing pairs of residue objects. > > *Example* > > ... > s = p.get_structure('a', '2KSX.pdb') > [i for i in s.search_ss_bonds()] > [(, >), (, icode= >), (, resseq=95 icode= >), (, het= resseq=66 icode= >), (, CYS het= resseq=200 icode= >)] > len([i for i in s.search_ss_bonds(threshold=100)]) > 45 > > > > *Problems:* > > *3. Biological Unit* > > I added code to parse_pdb_header to extract the REMARK 350 section. They > contain something like this (1IHM.pdb > ): > > REMARK 350 BIOMT1 1 1.000000 0.000000 0.000000 > 0.00000 > REMARK 350 BIOMT2 1 0.000000 1.000000 0.000000 > 0.00000 > REMARK 350 BIOMT3 1 0.000000 0.000000 1.000000 > 0.00000 > REMARK 350 BIOMT1 2 0.500000 -0.809017 -0.309017 > 0.00000 > REMARK 350 BIOMT2 2 0.809017 0.309017 0.500000 > 0.00000 > REMARK 350 BIOMT3 2 -0.309017 -0.500000 0.809017 > 0.00000 > REMARK 350 BIOMT1 3 -0.309017 -0.500000 -0.809017 0.00000 > > I parse out the 4th column to identify each transformation. 
I store a 3x3 > rotation matrix and the translation vector separately. It is then easy to > apply them to each atom record via the transform function. > > Now, the problem lies in what the output should be. We broke it down to two > main options: > > a. Create a new structure object for each rotated/translated object, thus > making the final output a list of structures. This takes quite a while > actually. I tried this with a deepcopy method to copy each structure and it > took over 30 seconds on my machine for that PDB file above. > > b. Add the new rotated objects as new chains in the original structure. > This is actually a good solution because it allows people to use other > methods (the SS search comes to mind) on quartenary structures. It also > allows the user to write a file with all the structures in their place using > PDBIO quite seamlessly. However, it might be complicated to deal with an > excess of chains, or if not all chains are supposed to be rotated (dunno if > the case actually exists). > > My personal belief is that B is the way to go. Although it adulterates the > original structure with alien chains, it allows much greater flexibility. I > haven't tested it though. > > ---- > > Comments? :) > > Jo?o [...] Rodrigues > > From thomas.hamelryck at gmail.com Tue Jun 8 02:39:53 2010 From: thomas.hamelryck at gmail.com (Thomas Hamelryck) Date: Tue, 8 Jun 2010 08:39:53 +0200 Subject: [Biopython-dev] [GSOC] Report - Week 1 In-Reply-To: References: Message-ID: Hi all, I think it's great that Bio.PDB is being updated. Here are some remarks: I haven't seen much discussion about the one key feature of Bio.PDB that definitely needs to be improved: its speed. With the enormous increase of the number of structures, extracting data using Bio.PDB is too slow. Would be good to move some parts to C. 

A second issue is nicely illustrated by the following code snippet:

> s = p.get_structure('a', '2KSX.pdb')
> [i for i in s.search_ss_bonds()]

I think this is NOT the way to do it. PDB files can contain anything: RNA,
DNA, sugars, small molecules... It is thus not a good idea to directly
associate protein-specific methods to the structure class; it will lead to
a bloated Structure class and a lot of irrelevant methods (i.e.
search_ss_bonds is meaningless for a PDB file that contains RNA).

Currently, one creates Polypeptide objects from a Structure object using a
factory design pattern (via PPBuilder); the Polypeptide class implements
some protein specific methods. I believe that is a much cleaner way to do it
(though we need a Protein class that represents collections of connected
polypeptides). One can also make sure that all such derived objects
(Protein, NA, DNA,...) adhere to the same interface by providing a suitable
base class with shared functionality - in that way, the whole thing is also
extendible.

Something like:

s = p.get_structure('a', '2KSX.pdb')
pb = ProteinBuilder()
proteins = pb.build(s)
ssbridges = proteins.get_ss_bonds()

Here, "proteins" would represent a collection of polypeptide chains.

Cheers,

-Thomas

--
Thomas Hamelryck, Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://www.binf.ku.dk/research/structural_bioinformatics/

From lgautier at gmail.com Tue Jun 8 03:00:10 2010
From: lgautier at gmail.com (Laurent)
Date: Tue, 08 Jun 2010 09:00:10 +0200
Subject: [Biopython-dev] Biopython-dev Digest, Vol 89, Issue 8
In-Reply-To:
References:
Message-ID: <4C0DEA7A.1020606@gmail.com>

On 08/06/10 08:39, biopython-dev-request at lists.open-bio.org wrote:
> On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
>
>> > Peter wrote:
>>> > >...
>>> > >
>>> > > http://github.com/peterjc/biopython/tree/index-sqlite
>>> > >
>>> > > ...
an SQLite index is used to hold >>> > > the offsets. This means very low RAM requirements, but is a lot >>> > > slower because the offsets are written to disk and the SQLite >>> > > index is updated as we go. I expect this part can be optimised >>> > > (e.g. try to build the index at the end, try committing in batches). >> > >> > Having now tried using this on some files with tens of millions of >> > records, tuning how we use SQLite is going to be important. >> > >> > > Wouldn't a Berkeley database be much much faster for constructing simple key > to offset mappings? > > -Kevin > Yes. If one is only looking for a key/value associative structure, the NOSQL solutions will be faster (tokyocabinet seems to be one of the fastest, up to 100x when compared to BerkleyDB http://www.ioremap.net/node/235 ). L. From biopython at maubp.freeserve.co.uk Tue Jun 8 05:35:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 8 Jun 2010 10:35:15 +0100 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote: > On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote: >> >> Having now tried using this on some files with tens of millions of >> records, tuning how we use SQLite is going to be important. >> >> > Wouldn't a Berkeley database be much much faster for constructing > simple key to offset mappings? > Maybe - now that I've done the refactoring on Bio.SeqIO.index() to allow two back ends (python dict or SQLite) trying a third (BDB) is much easier. Did you know BDB was used in the old OBDA index files? However, Python 2.6 deprecated bsddb (the Python Interface to Berkeley DB library) and Python is pushing people to SQLite3 instead. 
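
As a rough illustration of the tuning ideas mentioned in this thread (committing in batches, and building the index only after all rows are loaded), here is a minimal sketch using only the standard sqlite3 module. It is not the code from the index-sqlite branch: the table name, column names, and both function names are made up for this example, and the batch size is an arbitrary assumption.

```python
import sqlite3


def build_offset_index(pairs, db_path=":memory:", batch_size=10000):
    """Store (record_id, file_offset) pairs in SQLite, committing in batches.

    Hypothetical helper for illustration; not part of Biopython.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE offsets (record_id TEXT, file_offset INTEGER)")
    batch = []
    for record_id, file_offset in pairs:
        batch.append((record_id, file_offset))
        if len(batch) >= batch_size:
            # One commit per batch rather than one per record.
            con.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
            con.commit()
            batch = []
    if batch:
        # Flush whatever is left over after the loop.
        con.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
        con.commit()
    # Build the lookup index once, after all rows have been inserted.
    con.execute("CREATE UNIQUE INDEX record_id_index ON offsets (record_id)")
    con.commit()
    return con


def get_offset(con, record_id):
    """Return the stored file offset for record_id, or raise KeyError."""
    row = con.execute(
        "SELECT file_offset FROM offsets WHERE record_id = ?", (record_id,)
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    return row[0]


con = build_offset_index([("seq1", 0), ("seq2", 118), ("seq3", 251)], batch_size=2)
print(get_offset(con, "seq2"))  # 118
```

Whether this ordering actually helps on files with tens of millions of records would need to be measured; the sketch only shows the shape of the approach.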
Peter From krother at rubor.de Tue Jun 8 05:59:43 2010 From: krother at rubor.de (Kristian Rother) Date: Tue, 8 Jun 2010 11:59:43 +0200 Subject: [Biopython-dev] Tested Fixup branch for Bio.PDB Message-ID: Hi Eric, I've checked out your pdbfixes branch and ran our 431 Unit Tests of ModeRNA with it. There were no changes to the master Bio.PDB branch --> for us everything OK. Details: ModeRNA (http://www.genesilico.pl/moderna) engineers RNA 3D structures and uses Bio.PDB for most of its operations: reading files, adding/copying/manipulating residues/atoms, superimposing structures, searching neighbors by KDTree, writing files. Right, the tests most probably did not depend directly on the code you changed, but as I understand you wanted to go sure the branch didnt break anything by accident. Best Regards, Kristian From bioinformed at gmail.com Tue Jun 8 07:00:44 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 8 Jun 2010 07:00:44 -0400 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Tue, Jun 8, 2010 at 5:35 AM, Peter wrote: > On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote: > > On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote: > >> > >> Having now tried using this on some files with tens of millions of > >> records, tuning how we use SQLite is going to be important. > >> > > Wouldn't a Berkeley database be much much faster for constructing > > simple key to offset mappings? > > Maybe - now that I've done the refactoring on Bio.SeqIO.index() to > allow two back ends (python dict or SQLite) trying a third (BDB) is > much easier. Did you know BDB was used in the old OBDA index > files? However, Python 2.6 deprecated bsddb (the Python Interface > to Berkeley DB library) and Python is pushing people to SQLite3 > instead. > > Hi Peter, I am aware that SQLite is taking over the job of serving as the default embedded database for Python and am in vigorous agreement with that trend. 
I use SQLite for a wide range of tasks and am extremely happy with it for
most applications. Unfortunately, for pure key-value mapping tasks, I've
found SQLite to be 4-10x slower than a well-tuned BDB tree, even with
batched updates and using the most aggressive SQLite performance pragmas. My
results may not be typical, but I thought I'd raise the issue given the
magnitude of the performance difference.

Best regards,
-Kevin

From mjldehoon at yahoo.com Tue Jun 8 08:19:28 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 8 Jun 2010 05:19:28 -0700 (PDT)
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To:
Message-ID: <14055.47665.qm@web62401.mail.re1.yahoo.com>

--- On Mon, 6/7/10, Peter wrote:
> I agree that records from all the different BLAST output
> formats should be represented by a common base class -
> but not necessarily the same class.
> For example, the default plain text and XML formats include
> the pairwise alignments, but the tabular output does not. To
> me having a sub-class which stores the pairwise alignments seems
> natural here.

Why do we need a sub-class? We don't do this in Bio.SeqIO, where GenBank
files contain much more information than Fasta files, but both are
represented by a SeqRecord.

Best,
--Michiel.

From biopython at maubp.freeserve.co.uk Tue Jun 8 08:32:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 13:32:05 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <14055.47665.qm@web62401.mail.re1.yahoo.com>
References: <14055.47665.qm@web62401.mail.re1.yahoo.com>
Message-ID:

On Tue, Jun 8, 2010 at 1:19 PM, Michiel de Hoon wrote:
> --- On Mon, 6/7/10, Peter wrote:
>> I agree that records from all the different BLAST output
>> formats should be represented by a common base class -
>> but not necessarily the same class.
>> For example, the default plain text and XML formats include
>> the pairwise alignments, but the tabular output does not. To
>> me having a sub-class which stores the pairwise alignments seems
>> natural here.
>
> Why do we need a sub-class? We don't do this in Bio.SeqIO,
> where GenBank files contain much more information than Fasta
> files, but both are represented by a SeqRecord.

OK, I guess you could have some properties which are left empty
(like the annotations dictionary or features list in a SeqRecord
from a FASTA file).

Peter

From mjldehoon at yahoo.com Tue Jun 8 09:44:01 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 8 Jun 2010 06:44:01 -0700 (PDT)
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To:
Message-ID: <756890.46421.qm@web62404.mail.re1.yahoo.com>

--- On Tue, 6/8/10, Peter wrote:
> > Why do we need a sub-class? We don't do this in
> > Bio.SeqIO, where GenBank files contain much more
> > information than Fasta files, but both are
> > represented by a SeqRecord.
>
> OK, I guess you could have some properties which are left empty
> (like the annotations dictionary or features list in a
> SeqRecord from a FASTA file).

I would prefer that, as it keeps things simple and consistent with other
parts of Biopython. But let's see how it goes. Over the weekend I'll set up
a rudimentary Blast parser and record so we can see what it would look like
in practice.

--Michiel

From bpederse at gmail.com Tue Jun 8 11:47:18 2010
From: bpederse at gmail.com (Brent Pedersen)
Date: Tue, 8 Jun 2010 08:47:18 -0700
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To:
References:
Message-ID:

On Tue, Jun 8, 2010 at 4:00 AM, Kevin Jacobs wrote:
> On Tue, Jun 8, 2010 at 5:35 AM, Peter wrote:
>
>> On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote:
>> > On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
>> >>
>> >> Having now tried using this on some files with tens of millions of
>> >> records, tuning how we use SQLite is going to be important.
>> >>
>> > Wouldn't a Berkeley database be much much faster for constructing
>> > simple key to offset mappings?
>>
>> Maybe - now that I've done the refactoring on Bio.SeqIO.index() to
>> allow two back ends (python dict or SQLite) trying a third (BDB) is
>> much easier. Did you know BDB was used in the old OBDA index
>> files? However, Python 2.6 deprecated bsddb (the Python Interface
>> to Berkeley DB library) and Python is pushing people to SQLite3
>> instead.
>>
>>
> Hi Peter,
>
> I am aware that SQLite is taking over the job of serving as the default
> embedded database for Python and am in vigorous agreement with that trend.
> I use SQLite for a wide range of tasks and am extremely happy with it for
> most applications. Unfortunately, for pure key-value mapping tasks, I've
> found SQLite to be 4-10x slower than a well-tuned BDB tree, even with
> batched updates and using the most aggressive SQLite performance pragmas. My
> results may not be typical, but I thought I'd raise the issue given the
> magnitude of the performance difference.
>
> Best regards,
> -Kevin
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

my results may not be typical either, but using an earlier version of
peter's sqlite biopython branch and comparing to screed
(http://github.com/acr/screed), and my file-index
(http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i
found that biopython's implementation is at most, a bit more than 2x
slower. and it does the fastq parsing much more rigorously.

also, i didn't see much difference between berkeleydb and
tokyocabinet--though the ctypes-based TC wrapper i was using has since
been streamlined.
here's what i saw for 15+ million records with this script: http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py /opt/src/methylcode/data/s_1_sequence.txt benchmarking fastq file with 15646356 records (62585424 lines) performing 500000 random queries screed ------ create: 704.764 search: 51.717 biopython-sqlite ---------------- create: 727.868 search: 92.947 fileindex --------- create: 294.356 search: 53.701 From biopython at maubp.freeserve.co.uk Tue Jun 8 12:35:07 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 8 Jun 2010 17:35:07 +0100 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Tue, Jun 8, 2010 at 4:47 PM, Brent Pedersen wrote: > > my results may not be typical either, but using an earlier version of > peter's sqlite biopython branch and comparing to screed > (http://github.com/acr/screed), and my file-index > (http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i > found that biopython's implementation is at most, a bit more than 2x > slower. and it does the fastq parsing much more rigorously. > > also, i didn't see much difference between berkeleydb and > tokyocabinet--though the ctypes-based TC wrapper i was using has since > been streamlined. > here's what i saw for 15+ million records with this script: > http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py > > /opt/src/methylcode/data/s_1_sequence.txt > benchmarking fastq file with 15646356 records (62585424 lines) > performing 500000 random queries > > screed > ------ > create: 704.764 > search: 51.717 > > biopython-sqlite > ---------------- > create: 727.868 > search: 92.947 > > fileindex > --------- > create: 294.356 > search: 53.701 Are you using a recent version of screed (with SQLite internally)? Which back end are your "fileindex" numbers for? BDB? 
I'd say that the slow "search" from (the old branch of) Biopython is down to our FASTQ parsing time, which includes lots of object creation. The get_raw method can be useful here depending on what you want to achieve: http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/ The version you tried didn't do anything clever with the SQLite indexes, batched inserts etc. I'm hoping the current code will be faster (although there is likely a penalty from having two switchable back ends). Brent, could you re-run this benchmark with this code: http://github.com/peterjc/biopython/tree/index-sqlite-batched You'll need to change the Biopython call in your test script from this (it was renamed before landing on the trunk): fi = SeqIO.indexed_dict(f, idx, "fastq") to this: fi = SeqIO.index(f, idx, "fastq", db=True) or give an explicit filename: fi = SeqIO.index(f, idx, "fastq", db="/tmp/filename.idx") where db is the new parameter for controlling where and if the lookup table is stored on disk. Peter From anaryin at gmail.com Tue Jun 8 13:10:48 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 8 Jun 2010 12:10:48 -0500 Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements In-Reply-To: References: Message-ID: Hello all, I'm replying here to what Thomas wrote on the GSOC Report thread because it seems a better place. PDB files can contain anything RNA, DNA, sugars, small molecules... It is > thus not a good idea to > directly associate protein-specific methods to the structure class; it will > lead to a bloated Structure class and a lot of irrelevant methods (ie. > search_ss_bonds is meaningless for a PDB file that contains RNA). Agree. Currently, one creates Polypeptide objects from a Structure object using a > factory design pattern (via PPBuilder); the Polypeptide class implements > some protein specific methods. 
I believe that is a much cleaner way to do it
> (though we need a Protein class that represents collections of connected
> polypeptides). One can also make sure that all such derived objects
> (Protein, NA, DNA,...) adhere to the same interface by providing a suitable
> base class with shared functionality - in that way, the whole thing is also
> extendible.
>

I think there has been already some discussion about this. My personal
opinion/suggestion is having a structure like:

Bio.PDB/
_______/Protein.py
_______/DNA.py
_______/RNA.py

that would translate to a usage of something like:

from Bio.PDB import Protein

structure = Protein('1ABC.pdb')
structure.search_ss_bonds()

but not

structure.calc_melting_temperature() (just an example)

Protein() would call PDBParser(). It could also include, to a certain
extent, an Alphabet-like feature to assure residue names are OK (this goes a
bit with this proposal).

I believe this goes a bit into what you said: having a class that basically
abstracts what we do now (Bio.PDB.PDBParser) and allows for
molecule-specific methods. However, it also leads to some problems:
Protein/DNA complexes come to mind.

How does this sound? I think it goes with what Eric said in the first post
of this thread and what Thomas replied in the GSOC thread. We should also
change the PDB name to Struct to better reflect the purpose of the module.
All of the other additions like Bio.Struct.WWW would still apply. And I
don't see a major problem in breaking the existing code by adding this.

João

From tiagoantao at gmail.com Tue Jun 8 15:12:00 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 8 Jun 2010 20:12:00 +0100
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To:
References:
Message-ID:

On Mon, Jun 7, 2010 at 10:35 AM, Peter wrote:
> Comments?

Maybe put this on the wiki as doc for good practice?
From biopython at maubp.freeserve.co.uk Tue Jun 8 15:41:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 20:41:03 +0100
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To:
References:
Message-ID:

2010/6/8 Tiago Antão :
> On Mon, Jun 7, 2010 at 10:35 AM, Peter wrote:
>> Comments?
>
> Maybe put this on the wiki as doc for good practice?

So this does seem like a sensible approach (for those of us with
commit access to the main repository)? We can add it to the git
usage page then...

http://www.biopython.org/wiki/GitUsage

Peter

From eric.talevich at gmail.com Tue Jun 8 17:45:42 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 8 Jun 2010 17:45:42 -0400
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To:
References:
Message-ID:

On Mon, Jun 7, 2010 at 5:35 AM, Peter wrote:

> Hi all,
>
> I thought I'd write down some notes about how I've been using git recently.
> This may be of interest to any of the other core developers (those of us
> with read-write access to the main repository), and I might get some good
> tips from any discussion. The key point is that I have read+write access
> to two repositories on github (the official repository AND my own fork),
> so there are different advantages/disadvantages about which I choose
> to work with directly as my main repository.
>
> [...]
>
> Instead, I have a github repository of my own (what github calls a
> fork), and I push branches there.
>
> http://github.com/biopython/biopython - the official branch(es)
> http://github.com/peterjc/biopython - my branches
>
> How does this work in practice? Like this - I clone the master
> and add a reference to my repository (and I do the same when I
> want to grab a branch from another developer):
>
> git clone git@github.com:biopython/biopython.git
> cd biopython
> git remote add peterjc git@github.com:peterjc/biopython.git
> git fetch peterjc
>
> Then make a new local branch as usual, and when ready to share
> it publicly, I push it to *my* repository on github:
>
> git branch new-work
> git checkout new-work
> git commit ...
> git push peterjc new-work
>
> This would then appear as a new-work branch on my github page.
> Then if I (or someone else) wants to access these branches later
> (e.g. from another machine) just use the checkout tracked remote
> branch. For example,
>
> git clone git@github.com:biopython/biopython.git
> cd biopython
> git remote add peterjc git@github.com:peterjc/biopython.git
> git fetch peterjc
> git checkout -t peterjc/seqio-imgt
>
> This then looks like a normal branch (called just "seqio-imgt" in
> this example), but git knows it is linked to the remote branch on
> the "peterjc" repository (not the origin which is the "official"
> repository).
>

This looks reasonable to me. I'd add that the procedure to delete a public
branch from your personal fork on GitHub is a little obscure:

git branch -a                 # list local and remote branches
git branch -d new-work        # delete a local branch that's been merged already
git push peterjc :new-work    # delete the public branch from GitHub

This doesn't do what you'd expect:

git branch -d -r peterjc/new-work

That only removes your local reference to the public branch; the branch is
still visible on GitHub. (It's kind of hard to find in the GitHub
documentation.)

> I'd have to check, but I guess that if the original git clone is done
> with git://github.com/biopython/biopython.git instead (read only
> access) the same procedure could be used by non core devs.
> However, I'm not sure this is clearer for them. I think the current
> procedure (on our wiki) where you add a remote reference to
> the "upstream" official repository works better in this case.
>

I still have an "upstream" reference to the main repo. I wouldn't want to
accidentally push something foolish to the main repo with a stray "git
push"... better to have the safe thing happen by default.

If the initial clone was from biopython master, and you later create a
personal fork on GitHub, then it's not too hard to switch the references
around in your local repo to make the public fork your "origin".

-Eric

From bugzilla-daemon at portal.open-bio.org Tue Jun 8 18:52:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Jun 2010 18:52:28 -0400
Subject: [Biopython-dev] [Bug 3096] New: PPBuilder build_peptides bugs
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3096

Summary: PPBuilder build_peptides bugs
Product: Biopython
Version: Not Applicable
Platform: Other
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: skong at zymeworks.com

Given a chain of backbone connected residues 'IXRGXTGL' that contains two
non-standard amino acids 'X' in between, building peptides with a builder
that accepts only standard amino acids should return two peptides, 'RG' and
'TGL'. 'I' should not be returned as a peptide since it is just one residue.

Currently biopython would return 'IXGXGL', with two bugs in between:

1. Skipping a standard amino acid (R and T) after each X, while keeping X
(it should skip X instead, not R or T). Related to
http://bugzilla.open-bio.org/show_bug.cgi?id=2910 and
http://lists.open-bio.org/pipermail/biopython/2009-September/005532.html

2. Returning one peptide even though, after filtering, the two X residues
which connect 'I', 'RG', 'TGL' are no longer present and fragment 'IRGTGL'
cannot be considered as a valid peptide without the two Xs connecting them.
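
To make the expected result concrete, here is a small self-contained sketch that works on the one-letter sequence only. It is not the Bio.PDB code and not the proposed patch: it assumes every residue is backbone-connected to its neighbour (as in the example chain), and the function name and its parameters are hypothetical.

```python
def split_standard_runs(sequence, nonstandard="X", min_length=2):
    """Split a one-letter sequence at non-standard residues.

    Runs shorter than min_length are dropped, since a single residue
    is not a peptide. Illustrative helper only; not part of Bio.PDB.
    """
    fragments = []
    current = []
    for letter in sequence:
        if letter in nonstandard:
            # A non-standard residue ends the current run.
            if len(current) >= min_length:
                fragments.append("".join(current))
            current = []
        else:
            current.append(letter)
    # Flush the final run after the loop.
    if len(current) >= min_length:
        fragments.append("".join(current))
    return fragments


print(split_standard_runs("IXRGXTGL"))  # ['RG', 'TGL']
```

The lone 'I' is discarded because it is a run of length one, and each 'X' acts as a break point, which is the behaviour the report asks for.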
The above sequence 'IXRGXTGL' are taken from 1bfe and mutated. The 'mutation' referred here is simply renaming the residue name to something that is not standard and represented as 'X'. Each solution proposed below is meant to fix respective bug above: 1. Insert (not accept(prev) or not accept(next)) after if aa_only check at line 299 of Bio/PDB/Polypeptide.py 2. Insert pp=None when either of the residues compared are filtered at line 300 or Bio/PDB/Polypeptide.py Amino acids filtering bug in method build_peptides() of class _PPBuilder ofin Bio/PDB/Polypeptide.py: Original: for chain in chain_list: chain_it=iter(chain) prev=chain_it.next() pp=None for next in chain_it: if aa_only and not accept(prev): prev=next continue if is_connected(prev, next): if pp is None: pp=Polypeptide() pp.append(prev) pp_list.append(pp) pp.append(next) else: pp=None prev=next return pp_list Fixed: for chain in chain_list: chain_it=iter(chain) prev=chain_it.next() pp=None for next in chain_it: if aa_only and (not accept(prev) or not accept(next)): prev=next; pp=None continue if is_connected(prev, next): if pp is None: pp=Polypeptide() pp.append(prev) pp_list.append(pp) pp.append(next) else: pp=None prev=next return pp_list Attached here is the code used to test the above case, with and without mutations, and with and without standard amino acid filtering. The case without mutation is just to show that the backbone atoms of the mutated version are connected: from Bio.PDB.PDBParser import PDBParser from Bio.PDB.Polypeptide import PPBuilder, is_aa class StandardAABuilder(PPBuilder): """ Polypeptide builder which accepts only standard amino acids.""" def _accept(self, residue): return is_aa(residue, standard=True) def extract_peptides(model): """Extracts the peptides from a model. 
Returns a list of Peptide object.""" output = [] for peptide in PPBuilder().build_peptides(model): seq = str(peptide.get_sequence()) output.append(seq) return output def extract_peptides_saa(model): """Extracts the peptides from a model. Returns a list of Peptide object.""" output = [] for peptide in StandardAABuilder().build_peptides(model): seq = str(peptide.get_sequence()) output.append(seq) return output if __name__ == '__main__': oripdb = open('chopped_pdb1bfe.ent') sto = PDBParser().get_structure('', oripdb) seqao = extract_peptides(sto) seqbo = extract_peptides_saa(sto) print 'ori seq all ' print seqao print 'ori seq standard only' print seqbo pdb = open('chopped_mutated_pdb1bfe.ent') st = PDBParser().get_structure('', pdb) seqa = extract_peptides(st) seqb = extract_peptides_saa(st) print 'mut seq all' print seqa print 'mut seq standard only ' print seqb Attached below are the two fragments of PDB files, pre and post mutated. chopped_pdb1bfe.ent ATOM 85 N ILE A 316 37.386 71.217 31.070 1.00 36.97 N ATOM 86 CA ILE A 316 38.311 71.290 29.949 1.00 33.71 C ATOM 87 C ILE A 316 37.634 72.103 28.862 1.00 33.93 C ATOM 88 O ILE A 316 36.415 72.216 28.839 1.00 36.46 O ATOM 89 CB ILE A 316 38.651 69.876 29.404 1.00 35.79 C ATOM 90 CG1 ILE A 316 39.331 69.049 30.501 1.00 36.78 C ATOM 91 CG2 ILE A 316 39.572 69.979 28.187 1.00 37.71 C ATOM 92 CD1 ILE A 316 39.881 67.724 30.023 1.00 39.20 C ATOM 93 N HIS A 317 38.425 72.679 27.969 1.00 35.61 N ATOM 94 CA HIS A 317 37.880 73.473 26.881 1.00 37.92 C ATOM 95 C HIS A 317 38.360 72.928 25.540 1.00 37.79 C ATOM 96 O HIS A 317 39.463 73.240 25.094 1.00 37.44 O ATOM 97 CB HIS A 317 38.303 74.930 27.052 1.00 35.19 C ATOM 98 CG HIS A 317 37.888 75.519 28.363 1.00 35.76 C ATOM 99 ND1 HIS A 317 36.611 75.981 28.602 1.00 37.74 N ATOM 100 CD2 HIS A 317 38.575 75.701 29.516 1.00 37.59 C ATOM 101 CE1 HIS A 317 36.529 76.420 29.844 1.00 38.74 C ATOM 102 NE2 HIS A 317 37.706 76.262 30.421 1.00 36.76 N ATOM 103 N ARG A 318 37.527 72.109 
24.905 1.00 38.78 N ATOM 104 CA ARG A 318 37.884 71.512 23.627 1.00 42.04 C ATOM 105 C ARG A 318 38.469 72.559 22.699 1.00 45.14 C ATOM 106 O ARG A 318 39.592 72.425 22.205 1.00 42.05 O ATOM 107 CB ARG A 318 36.657 70.880 22.967 1.00 42.93 C ATOM 108 CG ARG A 318 36.934 70.321 21.576 1.00 38.60 C ATOM 109 CD ARG A 318 35.654 70.038 20.821 1.00 35.39 C ATOM 110 NE ARG A 318 34.624 69.538 21.724 1.00 34.96 N ATOM 111 CZ ARG A 318 34.539 68.278 22.141 1.00 31.51 C ATOM 112 NH1 ARG A 318 35.419 67.373 21.736 1.00 25.19 N ATOM 113 NH2 ARG A 318 33.579 67.929 22.983 1.00 29.10 N ATOM 114 N GLY A 319 37.690 73.604 22.461 1.00 49.96 N ATOM 115 CA GLY A 319 38.138 74.668 21.592 1.00 55.53 C ATOM 116 C GLY A 319 38.459 74.219 20.180 1.00 58.85 C ATOM 117 O GLY A 319 37.583 73.766 19.440 1.00 58.98 O ATOM 118 N SER A 320 39.734 74.334 19.823 1.00 61.64 N ATOM 119 CA SER A 320 40.219 73.992 18.493 1.00 63.16 C ATOM 120 C SER A 320 40.212 72.517 18.110 1.00 65.27 C ATOM 121 O SER A 320 39.558 72.127 17.145 1.00 65.12 O ATOM 122 CB SER A 320 41.634 74.542 18.316 1.00 65.36 C ATOM 123 OG SER A 320 42.124 74.255 17.019 1.00 72.05 O ATOM 124 N THR A 321 40.955 71.702 18.853 1.00 67.43 N ATOM 125 CA THR A 321 41.049 70.274 18.562 1.00 67.73 C ATOM 126 C THR A 321 40.220 69.430 19.529 1.00 66.41 C ATOM 127 O THR A 321 39.244 69.917 20.095 1.00 70.21 O ATOM 128 CB THR A 321 42.517 69.810 18.620 1.00 70.22 C ATOM 129 OG1 THR A 321 42.613 68.453 18.169 1.00 77.03 O ATOM 130 CG2 THR A 321 43.049 69.915 20.045 1.00 72.07 C ATOM 131 N GLY A 322 40.608 68.168 19.707 1.00 61.22 N ATOM 132 CA GLY A 322 39.892 67.286 20.614 1.00 53.23 C ATOM 133 C GLY A 322 40.037 67.705 22.065 1.00 48.00 C ATOM 134 O GLY A 322 40.138 68.892 22.372 1.00 50.41 O ATOM 135 N LEU A 323 40.044 66.734 22.968 1.00 41.92 N ATOM 136 CA LEU A 323 40.190 67.033 24.385 1.00 35.58 C ATOM 137 C LEU A 323 41.613 66.738 24.874 1.00 31.41 C ATOM 138 O LEU A 323 41.932 66.921 26.046 1.00 30.47 O ATOM 139 CB LEU A 323 39.160 
66.240 25.191 1.00 35.76 C ATOM 140 CG LEU A 323 37.716 66.576 24.802 1.00 39.50 C ATOM 141 CD1 LEU A 323 36.733 65.796 25.670 1.00 38.15 C ATOM 142 CD2 LEU A 323 37.493 68.074 24.955 1.00 38.58 C PDB FILE: mutated_chopped_pdb1bfe.ent ATOM 85 N ILE A 316 37.386 71.217 31.070 1.00 36.97 N ATOM 86 CA ILE A 316 38.311 71.290 29.949 1.00 33.71 C ATOM 87 C ILE A 316 37.634 72.103 28.862 1.00 33.93 C ATOM 88 O ILE A 316 36.415 72.216 28.839 1.00 36.46 O ATOM 89 CB ILE A 316 38.651 69.876 29.404 1.00 35.79 C ATOM 90 CG1 ILE A 316 39.331 69.049 30.501 1.00 36.78 C ATOM 91 CG2 ILE A 316 39.572 69.979 28.187 1.00 37.71 C ATOM 92 CD1 ILE A 316 39.881 67.724 30.023 1.00 39.20 C ATOM 93 N HIE A 317 38.425 72.679 27.969 1.00 35.61 N ATOM 94 CA HIE A 317 37.880 73.473 26.881 1.00 37.92 C ATOM 95 C HIE A 317 38.360 72.928 25.540 1.00 37.79 C ATOM 96 O HIE A 317 39.463 73.240 25.094 1.00 37.44 O ATOM 97 CB HIE A 317 38.303 74.930 27.052 1.00 35.19 C ATOM 98 CG HIE A 317 37.888 75.519 28.363 1.00 35.76 C ATOM 99 ND1 HIE A 317 36.611 75.981 28.602 1.00 37.74 N ATOM 100 CD2 HIE A 317 38.575 75.701 29.516 1.00 37.59 C ATOM 101 CE1 HIE A 317 36.529 76.420 29.844 1.00 38.74 C ATOM 102 NE2 HIE A 317 37.706 76.262 30.421 1.00 36.76 N ATOM 103 N ARG A 318 37.527 72.109 24.905 1.00 38.78 N ATOM 104 CA ARG A 318 37.884 71.512 23.627 1.00 42.04 C ATOM 105 C ARG A 318 38.469 72.559 22.699 1.00 45.14 C ATOM 106 O ARG A 318 39.592 72.425 22.205 1.00 42.05 O ATOM 107 CB ARG A 318 36.657 70.880 22.967 1.00 42.93 C ATOM 108 CG ARG A 318 36.934 70.321 21.576 1.00 38.60 C ATOM 109 CD ARG A 318 35.654 70.038 20.821 1.00 35.39 C ATOM 110 NE ARG A 318 34.624 69.538 21.724 1.00 34.96 N ATOM 111 CZ ARG A 318 34.539 68.278 22.141 1.00 31.51 C ATOM 112 NH1 ARG A 318 35.419 67.373 21.736 1.00 25.19 N ATOM 113 NH2 ARG A 318 33.579 67.929 22.983 1.00 29.10 N ATOM 114 N GLY A 319 37.690 73.604 22.461 1.00 49.96 N ATOM 115 CA GLY A 319 38.138 74.668 21.592 1.00 55.53 C ATOM 116 C GLY A 319 38.459 74.219 20.180 
1.00 58.85 C ATOM 117 O GLY A 319 37.583 73.766 19.440 1.00 58.98 O ATOM 118 N XQQ A 320 39.734 74.334 19.823 1.00 61.64 N ATOM 119 CA XQQ A 320 40.219 73.992 18.493 1.00 63.16 C ATOM 120 C XQQ A 320 40.212 72.517 18.110 1.00 65.27 C ATOM 121 O XQQ A 320 39.558 72.127 17.145 1.00 65.12 O ATOM 122 CB XQQ A 320 41.634 74.542 18.316 1.00 65.36 C ATOM 123 OG XQQ A 320 42.124 74.255 17.019 1.00 72.05 O ATOM 124 N THR A 321 40.955 71.702 18.853 1.00 67.43 N ATOM 125 CA THR A 321 41.049 70.274 18.562 1.00 67.73 C ATOM 126 C THR A 321 40.220 69.430 19.529 1.00 66.41 C ATOM 127 O THR A 321 39.244 69.917 20.095 1.00 70.21 O ATOM 128 CB THR A 321 42.517 69.810 18.620 1.00 70.22 C ATOM 129 OG1 THR A 321 42.613 68.453 18.169 1.00 77.03 O ATOM 130 CG2 THR A 321 43.049 69.915 20.045 1.00 72.07 C ATOM 131 N GLY A 322 40.608 68.168 19.707 1.00 61.22 N ATOM 132 CA GLY A 322 39.892 67.286 20.614 1.00 53.23 C ATOM 133 C GLY A 322 40.037 67.705 22.065 1.00 48.00 C ATOM 134 O GLY A 322 40.138 68.892 22.372 1.00 50.41 O ATOM 135 N LEU A 323 40.044 66.734 22.968 1.00 41.92 N ATOM 136 CA LEU A 323 40.190 67.033 24.385 1.00 35.58 C ATOM 137 C LEU A 323 41.613 66.738 24.874 1.00 31.41 C ATOM 138 O LEU A 323 41.932 66.921 26.046 1.00 30.47 O ATOM 139 CB LEU A 323 39.160 66.240 25.191 1.00 35.76 C ATOM 140 CG LEU A 323 37.716 66.576 24.802 1.00 39.50 C ATOM 141 CD1 LEU A 323 36.733 65.796 25.670 1.00 38.15 C ATOM 142 CD2 LEU A 323 37.493 68.074 24.955 1.00 38.58 C -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
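For reference, the grouping the reporter expects from a standard-residue-only builder can be sketched in plain Python, without Bio.PDB. The helper name split_on_nonstandard and its min_length parameter are hypothetical, and this is not how PPBuilder works internally (the real builders also apply a backbone distance test); it only reproduces the expected result for the residue names in the two fragments above.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def split_on_nonstandard(resnames, min_length=2):
    """Split residue names into runs of standard residues (hypothetical helper)."""
    peptides = []
    current = []
    for name in resnames:
        if name in THREE_TO_ONE:
            current.append(THREE_TO_ONE[name])
            continue
        # A non-standard residue closes the current run.
        if len(current) >= min_length:
            peptides.append("".join(current))
        current = []
    # Close the final run, dropping it if it is too short.
    if len(current) >= min_length:
        peptides.append("".join(current))
    return peptides


# Residue names of residues 316-323 in the two fragments above.
original = ["ILE", "HIS", "ARG", "GLY", "SER", "THR", "GLY", "LEU"]
mutated = ["ILE", "HIE", "ARG", "GLY", "XQQ", "THR", "GLY", "LEU"]
original_peptides = split_on_nonstandard(original)
mutated_peptides = split_on_nonstandard(mutated)
```

With min_length=2 the lone leading ILE in the mutated fragment is dropped, which is the behaviour asked for in the report.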
From bpederse at gmail.com Wed Jun 9 00:33:12 2010 From: bpederse at gmail.com (Brent Pedersen) Date: Tue, 8 Jun 2010 21:33:12 -0700 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Tue, Jun 8, 2010 at 9:35 AM, Peter wrote: > On Tue, Jun 8, 2010 at 4:47 PM, Brent Pedersen wrote: >> >> my results may not be typical either, but using an earlier version of >> peter's sqlite biopython branch and comparing to screed >> (http://github.com/acr/screed), and my file-index >> (http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i >> found that biopython's implementation is at most, a bit more than 2x >> slower. and it does the fastq parsing much more rigorously. >> >> also, i didn't see much difference between berkeleydb and >> tokyocabinet--though the ctypes-based TC wrapper i was using has since >> been streamlined. >> here's what i saw for 15+ million records with this script: >> http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py >> >> /opt/src/methylcode/data/s_1_sequence.txt >> benchmarking fastq file with 15646356 records (62585424 lines) >> performing 500000 random queries >> >> screed >> ------ >> create: 704.764 >> search: 51.717 >> >> biopython-sqlite >> ---------------- >> create: 727.868 >> search: 92.947 >> >> fileindex >> --------- >> create: 294.356 >> search: 53.701 > > Are you using a recent version of screed (with SQLite internally)? > > Which back end are your "fileindex" numbers for? BDB? > > I'd say that the slow "search" from (the old branch of) Biopython is > down to our FASTQ parsing time, which includes lots of object > creation. The get_raw method can be useful here depending on > what you want to achieve: > http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/ > > The version you tried didn't do anything clever with the SQLite > indexes, batched inserts etc. 
I'm hoping the current code will be > faster (although there is likely a penalty from having two switchable > back ends). Brent, could you re-run this benchmark with this code: > http://github.com/peterjc/biopython/tree/index-sqlite-batched > > You'll need to change the Biopython call in your test script from > this (it was renamed before landing on the trunk): > > fi = SeqIO.indexed_dict(f, idx, "fastq") > > to this: > > fi = SeqIO.index(f, idx, "fastq", db=True) > > or give an explicit filename: > > fi = SeqIO.index(f, idx, "fastq", db="/tmp/filename.idx") > > where db is the new parameter for controlling where and if > the lookup table is stored on disk. > > Peter > done. the previous times and the current were using py-tcdb not bsddb. the author of tcdb made some improvements so it's faster this time, and your SeqIO implementation is almost 2x as fast to load as the previous one. that's a nice implementation. i didn't try get_raw. these timings are with your latest version, and the version of screed pulled from http://github.com/acr/screed master today.
/opt/src/methylcode/data/s_1_sequence.txt benchmarking fastq file with 15646356 records (62585424 lines) performing 500000 random queries screed ------ create: 699.210 search: 51.043 biopython-sqlite ---------------- create: 386.647 search: 93.391 fileindex --------- create: 184.088 search: 48.887 From bugzilla-daemon at portal.open-bio.org Wed Jun 9 04:43:02 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Jun 2010 04:43:02 -0400 Subject: [Biopython-dev] [Bug 3096] PPBuilder build_peptides bugs In-Reply-To: Message-ID: <201006090843.o598h2tx024780@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3096 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-09 04:43 EST ------- (In reply to comment #0) > Given a chain of backbone connected residues 'IXRGXTGL' that contains two > non-standard amino acids 'X' in between, building peptide with only standard > amino acid builder should return two peptides 'RG' and 'TGL'. 'I' should not > be returned as a peptide since it is just one residue. Currently biopython > would return 'IXGXGL', with two bugs in between: What is wrong with returning 'IXGXGL'? The PDB contains a peptide of six linked residues doesn't it? It looks like Bio.PDB is doing something sensible. P.S. You didn't fill in which version of Biopython you are using. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Jun 9 04:55:37 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Jun 2010 09:55:37 +0100 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Wed, Jun 9, 2010 at 5:33 AM, Brent Pedersen wrote: >> >> The version you tried didn't do anything clever with the SQLite >> indexes, batched inserts etc. 
I'm hoping the current code will be >> faster (although there is likely a penalty from having two switchable >> back ends). Brent, could you re-run this benchmark with this code: >> http://github.com/peterjc/biopython/tree/index-sqlite-batched >> ... > > done. Thank you Brent :) > the previous times and the current were using py-tcdb not bsddb. > the author of tcdb made some improvements so it's faster this time, OK, so you are using Tokyo Cabinet to store the lookup table here rather than BDB. Link, http://code.google.com/p/py-tcdb/ > and your SeqIO implementation is almost 2x as fast to load as the > previous one. that's a nice implementation. i didn't try get_raw. I've got some more re-factoring in mind which should help a little more (but mainly to make the structure clearer). > these timings are with your latest version, and the version of > screed pulled from http://github.com/acr/screed master today. Having had a quick look, they are using SQLite3 in much the same way as I was initially. They create the index before loading (rather than after loading) and they use a single insert per offset (rather than using a batch in a transaction or the executemany method). I'm pretty sure from my experiments those changes would speed up screed's loading time a lot (probably in line with the speed up I achieved). > /opt/src/methylcode/data/s_1_sequence.txt > benchmarking fastq file with 15646356 records (62585424 lines) > performing 500000 random queries > > screed > ------ > create: 699.210 > search: 51.043 > > biopython-sqlite > ---------------- > create: 386.647 > search: 93.391 > > fileindex > --------- > create: 184.088 > search: 48.887 That's got us looking more competitive. As noted above, I think screed's loading time could be much reduced by tweaking how they use SQLite3. I wonder what the breakdown for fileindex is between calling Tokyo Cabinet and the fileindex code itself? I guess we should try TC as the back end in Bio.SeqIO.index() for comparison.
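The two SQLite changes described above (batched inserts through executemany inside one transaction, and creating the index only after loading) can be sketched with the standard sqlite3 module. The load_offsets helper, the table name and the column names are assumptions made for illustration; this is neither the schema on the Biopython branch nor screed's code.

```python
import sqlite3


def load_offsets(con, pairs, batch_size=1000):
    """Store (key, offset) pairs, then index them (hypothetical helper)."""
    con.execute(
        "CREATE TABLE offset_data (record_key TEXT, file_offset INTEGER)"
    )
    batch = []
    for pair in pairs:
        batch.append(pair)
        if len(batch) >= batch_size:
            # One executemany call per batch instead of one INSERT per record.
            con.executemany("INSERT INTO offset_data VALUES (?, ?)", batch)
            batch = []
    if batch:
        con.executemany("INSERT INTO offset_data VALUES (?, ?)", batch)
    # Build the index once, after loading, rather than updating it per insert.
    con.execute("CREATE UNIQUE INDEX key_index ON offset_data (record_key)")
    con.commit()


con = sqlite3.connect(":memory:")
pairs = [("read_%i" % i, i * 100) for i in range(2500)]
load_offsets(con, pairs)
row = con.execute(
    "SELECT file_offset FROM offset_data WHERE record_key = ?", ("read_1234",)
).fetchone()
count = con.execute("SELECT COUNT(*) FROM offset_data").fetchone()[0]
```

How much this helps will depend on the data and platform, as the timings in this thread suggest.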
Peter P.S. Could you measure the database file sizes on disk? From thomas.hamelryck at gmail.com Wed Jun 9 08:18:41 2010 From: thomas.hamelryck at gmail.com (Thomas Hamelryck) Date: Wed, 9 Jun 2010 14:18:41 +0200 Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements In-Reply-To: References: Message-ID: Hi, On Tue, Jun 8, 2010 at 7:10 PM, João Rodrigues wrote: > > from Bio.PDB import Protein > structure = Protein('1ABC.pdb') > structure.search_ss_bonds() > Indeed, that would run into problems for complexes where proteins, RNA, DNA, etc. occur in the same file. It makes much more sense to have a Structure centred approach:

    proteins=Protein(structure)
    chains=proteins.get_chains()
    chain_a=chains["A"]
    polypeptides=chain_a.get_peptides()
    rnas=RNA(structure)

etc. -Thomas -- Thomas Hamelryck, Assoc. Prof. Group leader Structural Bioinformatics Bioinformatics center Department of Biology University of Copenhagen Ole Maaloes Vej 5 DK-2200 Copenhagen N Denmark http://wiki.binf.ku.dk/User:Thomas_Hamelryck http://www.binf.ku.dk/research/structural_bioinformatics/ From lgautier at gmail.com Wed Jun 9 08:28:20 2010 From: lgautier at gmail.com (Laurent) Date: Wed, 09 Jun 2010 14:28:20 +0200 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: <4C0F88E4.7070607@gmail.com> What about having a class instance instead? This would let one change the index storage system very easily. For example, to use a dictionary: Bio.SeqIO.index(keyval_map = dict() ) A minimal requirement for the instance 'keyval_map' passed would be to implement the methods __getitem__(self, key) and __setitem__(self, key, value), allowing the "duck typing" approach commonly found in Python.
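The duck-typing requirement can be sketched as an indexing routine that only ever uses item assignment and lookup on whatever mapping it is handed. The build_offset_index function and the LoggingMap class are hypothetical stand-ins, not the Bio.SeqIO.index() signature, and the scan only handles a toy FASTA-like layout.

```python
import io


def build_offset_index(handle, keyval_map):
    """Record the byte offset of each '>' header line (hypothetical helper)."""
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            key = line[1:].split()[0].decode("ascii")
            keyval_map[key] = offset  # only __setitem__ is required here
    return keyval_map


class LoggingMap(object):
    """Any object with __getitem__ and __setitem__ will do."""

    def __init__(self):
        self._data = {}
        self.writes = 0

    def __setitem__(self, key, value):
        self.writes += 1
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]


data = b">alpha first\nACGT\n>beta second\nGGCC\nTTAA\n"
plain = build_offset_index(io.BytesIO(data), dict())
custom = build_offset_index(io.BytesIO(data), LoggingMap())
```

The same routine fills a plain dict and the custom class without knowing which one it was given, so a disk-backed class could be swapped in the same way.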
An SQLite-based index would be a matter of having a class such as:

    class KeyValSQLite(object):

        def __init__(self, filename):
            # create the database into file "filename"
            pass

        def __getitem__(self, key):
            """ return the value """
            # select whatever in something where key=''...
            pass

        def __setitem__(self, key, value):
            # update...
            pass

Then this would be a call like: Bio.SeqIO.index(keyval_map = KeyValSQLite("myindex.db")) Now that you have the idea, getting a custom index based on BDB or anything should be a breeze... L. On 08/06/10 08:39, biopython-dev-request at lists.open-bio.org wrote: > Hi all, > > Thanks for the lively discussion on the main list, > > http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html > ... > http://lists.open-bio.org/pipermail/biopython/2010-June/006580.html > > I've spent the afternoon updating my old branch which uses SQLite > to store the record identifier to file offset mapping. Using the code > on this branch, Bio.SeqIO.index() supports a new optional argument > currently called "db" (other names I like include "cache", suggestions > welcome): > > http://github.com/peterjc/biopython/tree/index-sqlite > > The default (False) is not to use SQLite, but continue with an in > memory Python dictionary. As long as you have enough RAM > and don't plan to use the index at a later date, this will be fastest. > > If set to True or a filename, then an SQLite index is used to hold > the offsets. This means very low RAM requirements, but is a lot > slower because the offsets are written to disk and the SQLite > index is updated as we go. I expect this part can be optimised > (e.g. try to build the index at the end, try committing in batches). > > I'm still testing this, but the core of the work is done I think. > Once we're happy with the public API, we can concentrate > on things like the SQLite schema, and optimising the code. > > Peter > > P.S.
I know it will need a little work to fail gracefully on Python 2.4 > when SQLite isn't installed. > From biopython at maubp.freeserve.co.uk Wed Jun 9 08:53:39 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Jun 2010 13:53:39 +0100 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: <4C0F88E4.7070607@gmail.com> References: <4C0F88E4.7070607@gmail.com> Message-ID: On Wed, Jun 9, 2010 at 1:28 PM, Laurent wrote: > What about having a class instance instead? This would let one change the > index storage system very easily. That is essentially what the recent code on my branch is doing, but the back end isn't being exposed to the public API (yet). > Then this would be a call like: > > Bio.SeqIO.index(keyval_map = KeyValSQLite("myindex.db")) > > > Now that you have the idea, getting a custom index based on BDB or > anything should be a breeze... Indeed. Most DB-like back ends should offer a bulk loader we can exploit via the dict's update method. Peter From eric.talevich at gmail.com Wed Jun 9 09:31:18 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 9 Jun 2010 09:31:18 -0400 Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements In-Reply-To: References: Message-ID: On Tue, Jun 8, 2010 at 1:10 PM, João Rodrigues wrote: > Hello all, > > I'm replying here to what Thomas wrote on the GSOC Report thread because it > seems a better place. > > PDB files can contain anything: RNA, DNA, sugars, small molecules... It is >> thus not a good idea to >> directly associate protein-specific methods to the structure class; it >> will lead to a bloated Structure class and a lot of irrelevant methods (i.e. >> search_ss_bonds is meaningless for a PDB file that contains RNA). > > > Agree. > > Currently, one creates Polypeptide objects from a Structure object using a >> factory design pattern (via PPBuilder); the Polypeptide class implements >> some protein specific methods.
I believe that is a much cleaner way to do it >> (though we need a Protein class that represents collections of connected >> polypeptides). One can also make sure that all such derived objects >> (Protein, NA, DNA,...) adhere to the same interface by providing a suitable >> base class with shared functionality - in that way, the whole thing is also >> extendible. >> > > I think there has been already some discussion about this. My personal > opinion/suggestion is having a structure like: > > Bio.PDB/ > _______/Protein.py > _______/DNA.py > _______/RNA.py > > that would translate to an usage of something like: > > from Bio.PDB import Protein > structure = Protein('1ABC.pdb') > structure.search_ss_bonds() > > but not > > structure.calc_melting_temperature() (just an example) > How about: from Bio import Struct # extract the protein from a bound TF structure complex = Struct.read("3IKT.pdb") prot = complex.as_protein() # which is a wrapper for: from Bio.Struct.Protein import Protein # if Protein contains a Structure instance: prot = Protein(complex) # or, if Protein inherits from Structure: prot = Protein.from_structure(complex) The Bio.Struct.Protein module would mostly wrap Bio.PDB's protein-specific functionality, and contain a class called Protein which you construct using a Bio.PDB.Structure.Structure instance, in some way. I think the convenience methods as_protein, as_dna and as_rna are acceptable additions to the Structure class if that saves us from (a) polluting Structure with protein- and RNA-specific methods, or (b) requiring a slew of imports to reach any new functionality. You can add as_protein yourself and leave the other methods for other brave souls to implement. (Bio.Struct.RNA deserves its own directory, and I don't know of anyone working on a structural DNA branch.) Protein() would call PDBParser(). It could also include, to a certain > extent, an Alphabet-like feature to assure residue names are OK (this goes a > bit with this proposal). 
> I believe this goes a bit into what you said. Having a class that basically > abstracts what we do now (Bio.PDB.PDBParser) and allows for > molecule-specific methods. However, it also leads to some problems: > Protein/DNA complexes come to mind. > > How does this sound? I think it goes with what Eric said in the first post > of this thread and what Thomas replied in the GSOC thread. We should also > change the PDB name to Struct to better reflect the purpose of the module. > All of the other additions like Bio.Struct.WWW would still apply. And I > don't see a major problem in breaking the existing code by adding this. > To be clear, we don't need to rename anything -- Bio.Struct and Bio.PDB can live in harmony for the foreseeable future. Best, Eric From bpederse at gmail.com Wed Jun 9 10:42:29 2010 From: bpederse at gmail.com (Brent Pedersen) Date: Wed, 9 Jun 2010 07:42:29 -0700 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Wed, Jun 9, 2010 at 1:55 AM, Peter wrote: > On Wed, Jun 9, 2010 at 5:33 AM, Brent Pedersen wrote: >>> >>> The version you tried didn't do anything clever with the SQLite >>> indexes, batched inserts etc. I'm hoping the current code will be >>> faster (although there is likely a penalty from having two switchable >>> back ends). Brent, could you re-run this benchmark with this code: >>> http://github.com/peterjc/biopython/tree/index-sqlite-batched >>> ... >> >> done. > > Thank you Brent :) > >> the previous times and the current were using py-tcdb not bsddb. >> the author of tcdb made some improvements so it's faster this time, > > OK, so you are using Tokyo Cabinet to store the lookup table here > rather than BDB. Link, http://code.google.com/p/py-tcdb/ > >> and your SeqIO implementation is almost 2x as fast to load as the >> previous one. that's a nice implementation. i didn't try get_raw. 
> > I've got some more re-factoring in mind which should help a little > more (but mainly to make the structure clearer). > >> these timints are are with your latest version, and the version of >> screed pulled from http://github.com/acr/screed master today. > > Having had a quick look, they are using SQLite3 in much the > say way as I was initially. They create the index before loading > (rather than after loading) and they use a single insert per > offset (rather than using a batch in a transaction or the > executemany method). I'm pretty sure from my experiments > those changes would speed up screed's loading time a lot > (probably inline with the speed up I achieved). > >> /opt/src/methylcode/data/s_1_sequence.txt >> benchmarking fastq file with 15646356 records (62585424 lines) >> performing 500000 random queries >> >> screed >> ------ >> create: 699.210 >> search: 51.043 >> >> biopython-sqlite >> ---------------- >> create: 386.647 >> search: 93.391 >> >> fileindex >> --------- >> create: 184.088 >> search: 48.887 > > That's got us looking more competitive. As noted above, I think > sceed's loading time could be much reduced by tweaking how > they use SQLite3. I wonder what the breakdown for fileindex is > between calling Tokyo Cabinet and the fileindex code itself? > I guess we should try TK as the back end in Bio.SeqIO.index() > for comparison. > > Peter > > P.S. Could you measure the database file sizes on disk? > for raw reads, screed, fileindex(tcdb), biopython respectively: -rw-r--r-T 1 brentp users 3.3G 2009-11-17 13:32 /opt/src/methylcode/data/s_1_sequence.txt -rw-r--r-- 1 brentp brentp 3.8G 2010-06-08 16:09 /opt/src/methylcode/data/s_1_sequence.txt_screed -rw-r--r-- 1 brentp brentp 1.2G 2010-06-08 16:21 /opt/src/methylcode/data/s_1_sequence.txt.fidx -rw-r--r-- 1 brentp brentp 1.5G 2010-06-08 21:15 /opt/src/methylcode/data/s_1_sequence.txt.bidx that's not using any compression for the fileindex. 
i think the overhead of the fileindex code + tcdb code is pretty low now. i think there'd only be improvement using a cython or c version of a TC wrapper--and even then, not much. -brentp From biopython at maubp.freeserve.co.uk Wed Jun 9 10:55:23 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Jun 2010 15:55:23 +0100 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Wed, Jun 9, 2010 at 9:55 AM, Peter wrote: > > Having had a quick look, they are using SQLite3 in much the > say way as I was initially. They create the index before loading > (rather than after loading) and they use a single insert per > offset (rather than using a batch in a transaction or the > executemany method). I'm pretty sure from my experiments > those changes would speed up screed's loading time a lot > (probably inline with the speed up I achieved). > Do you fancy trying this version of screed? It seems much faster on medium sized FASTQ files:- http://github.com/peterjc/screed/tree/sqlite-tweaks I'm still running a few tests myself, but will pass this on to the screed team unless I find some regressions. Peter From bpederse at gmail.com Wed Jun 9 11:56:27 2010 From: bpederse at gmail.com (Brent Pedersen) Date: Wed, 9 Jun 2010 08:56:27 -0700 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Wed, Jun 9, 2010 at 7:55 AM, Peter wrote: > On Wed, Jun 9, 2010 at 9:55 AM, Peter wrote: >> >> Having had a quick look, they are using SQLite3 in much the >> say way as I was initially. They create the index before loading >> (rather than after loading) and they use a single insert per >> offset (rather than using a batch in a transaction or the >> executemany method). I'm pretty sure from my experiments >> those changes would speed up screed's loading time a lot >> (probably inline with the speed up I achieved). >> > > Do you fancy trying this version of screed? 
It seems much > faster on medium sized FASTQ files:- > > http://github.com/peterjc/screed/tree/sqlite-tweaks > > I'm still running a few tests myself, but will pass this on to > the screed team unless I find some regressions. > > Peter > not too much difference. screed ------ create: 666.381 search: 51.839 From biopython at maubp.freeserve.co.uk Wed Jun 9 12:19:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Jun 2010 17:19:24 +0100 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Wed, Jun 9, 2010 at 4:56 PM, Brent Pedersen wrote: > On Wed, Jun 9, 2010 at 7:55 AM, Peter wrote: >> On Wed, Jun 9, 2010 at 9:55 AM, Peter wrote: >> >> Do you fancy trying this version of screed? It seems much >> faster on medium sized FASTQ files:- >> >> http://github.com/peterjc/screed/tree/sqlite-tweaks >> >> I'm still running a few tests myself, but will pass this on to >> the screed team unless I find some regressions. >> >> Peter >> > > not too much difference. > > screed > ------ > create: 666.381 > search: 51.839 Still noticeable, but not quite as much of a speed up as I was seeing (but different example, different OS, etc). Anyway, I've sent them a "pull request" and they can merge it if they like. Peter From rodrigo_faccioli at uol.com.br Wed Jun 9 13:35:24 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Wed, 9 Jun 2010 14:35:24 -0300 Subject: [Biopython-dev] Working directly on the main git repository In-Reply-To: References: Message-ID: About your Github's problem, you may try to perform the command below, after you removed your local branch. git push git at github.com:/.git :heads/ I've found the command below in [1]. 
[1] http://originblog.wordpress.com/2008/04/28/github-tips-removing-a-remote-branch/ Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 On Tue, Jun 8, 2010 at 6:45 PM, Eric Talevich wrote: > On Mon, Jun 7, 2010 at 5:35 AM, Peter >wrote: > > > Hi all, > > > > I thought I'd write down some notes about how I've been using git > recently. > > This may be of interest to any of the other core developers (those of us > > with read-write access to the main repository), and I might get some good > > tips from any discussion. The key point is that I have read+write access > > to two repositories on github (the official repository AND my own fork), > > so there are different advantages/disadvantages about which I choose > > to work with directly as my main repository. > > > > [...] > > > > Instead, I have a github repository of my own (what github calls a > > fork), and I push branches there. > > > > http://github.com/biopython/biopython - the official branch(es) > > http://github.com/peterjc/biopython - my branches > > > > How does this work in practice? Like this - I clone the master > > and add a reference to my repository (and I do the same when I > > want to grab a branch from another developer): > > > > git clone git at github.com:biopython/biopython.git > > cd biopython > > git remote add peterjc git at github.com:peterjc/biopython.git > > git fetch peterjc > > > > Then make a new local branch as usual, and when ready to share > > it publicly, I push it to *my* repository on github: > > > > git branch new-work > > git checkout new-work > > git commit ... 
> > git push peterjc new-work > > > > This would then appear as a new-work branch on my github page. > > Then if I (or someone else) wants to access these branches later > > (e.g. from another machine) just use the checkout tracked remote > > branch. For example, > > > > git clone git at github.com:biopython/biopython.git > > cd biopython > > git remote add peterjc git at github.com:peterjc/biopython.git > > git fetch peterjc > > git checkout -t peterjc/seqio-imgt > > > > This then looks like a normal branch (called just "seqio-imgt" in > > this example), but git knows it is linked to the remote branch on > > the "peterjc" repository (not the origin which is the "official" > > repository). > > > > This looks reasonable to me. I'd add that the procedure to delete a public > branch from your personal fork on GitHub is a little obscure: > > git branch -a # list local and remote branches > git branch -d new-work # delete a local branch that's been merged already > git push peterjc :new-work # delete the public branch from GitHub > > This doesn't do what you'd expect: > git branch -d peterjc/new-work > > That only removes your local reference to the the public branch; the branch > is still visible on GitHub. > > (It's kind of hard to find in the GitHub documentation.) > > > I'd have to check, but I guess that if the original git clone is done > > with git://github.com/biopython/biopython.git instead (read only > > access) the same procedure could be used by non core devs. > > However, I'm not sure this is clearer for them. I think the current > > procedure (on our wiki) where you add a remote reference to > > the "upstream" official repository works better in this case. > > > > I still have an "upstream" reference to the main repo. I wouldn't want to > accidentally push something foolish to the main repo with a stray "git > push"... better to have the safe thing happen by default. 
> > If the initial clone was from biopython master, and you later create a > personal fork on GitHub, then it's not too hard to switch the references > around in your local repo to make the public fork your "origin". > > -Eric > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From eric.talevich at gmail.com Wed Jun 9 19:56:35 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 9 Jun 2010 19:56:35 -0400 Subject: [Biopython-dev] Tested Fixup branch for Bio.PDB In-Reply-To: References: Message-ID: On Tue, Jun 8, 2010 at 5:59 AM, Kristian Rother wrote: > > Hi Eric, > > I've checked out your pdbfixes branch and ran our 431 Unit Tests of > ModeRNA with it. There were no changes to the master Bio.PDB branch --> > for us everything OK. > > Details: > ModeRNA (http://www.genesilico.pl/moderna) engineers RNA 3D structures and > uses Bio.PDB for most of its operations: reading files, > adding/copying/manipulating residues/atoms, superimposing structures, > searching neighbors by KDTree, writing files. > > Right, the tests most probably did not depend directly on the code you > changed, but as I understand you wanted to make sure the branch didn't break > anything by accident. > Thanks, Kristian! I didn't expect the patches to break anything, but it's hard to be sure until someone else has tried it. I've pushed the pdbfixes branch to Biopython's master branch on GitHub.
Cheers, Eric From biopython at maubp.freeserve.co.uk Thu Jun 10 12:24:20 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 10 Jun 2010 17:24:20 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> Message-ID: On Wed, Jun 2, 2010 at 12:59 PM, Peter wrote: > > With that in mind, as I mentioned yesterday maybe we should just > update the documentation to suggest using os.system() when you > just need the return code and there is no stdin to worry about: > I've added a basic example to the tutorial now, but the potential trouble is that any output from the called tool will spew out at the Python prompt (if working at the terminal). This may or may not be an issue. ClustalW for example is rather verbose. Peter From bugzilla-daemon at portal.open-bio.org Thu Jun 10 14:18:41 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 10 Jun 2010 14:18:41 -0400 Subject: [Biopython-dev] [Bug 3098] New: GenBank/EMBL parser breaks for between features at origin Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3098 Summary: GenBank/EMBL parser breaks for between features at origin Product: Biopython Version: 1.54 Platform: PC OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk I was testing Bio.SeqIO with a GenBank file gbpln1.seq which includes: LOCUS AB042240 134545 bp DNA circular PLN 02-MAY-2006 ... misc_feature 134545^1 /standard_name="JLA" /note="Junction IRA-LSC" ORIGIN ... This is a "between" feature of length zero at the origin of this circular genome. This is a special case which the parser does not allow for: normally between positions "start^end" have end=start+1 (using one-based counting). The same applies to EMBL files as well, e.g.
http://www.ebi.ac.uk/cgi-bin/expasyfetch?AB042240 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 10 14:35:48 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 10 Jun 2010 14:35:48 -0400 Subject: [Biopython-dev] [Bug 3098] GenBank/EMBL parser breaks for between features at origin In-Reply-To: Message-ID: <201006101835.o5AIZm0b025094@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3098 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-10 14:35 EST ------- Fixed, http://github.com/biopython/biopython/commit/80aa43e5434316d151bca5916442a3429b8724e2 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Thu Jun 10 15:18:38 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 10 Jun 2010 15:18:38 -0400 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> Message-ID: On Wed, Jun 2, 2010 at 7:59 AM, Peter wrote: > > Even if the Python documentation seems to be discouraging it, > using os.system() seems simple, robust, and cross platform. We > could even update the tutorial now and post it online - it should > make some people's lives a little easier. 
> The Python docs claim os.system(cmd) is equivalent to subprocess.call(cmd, shell=True): http://docs.python.org/library/subprocess.html#replacing-os-system As I understood it, the reason for usually skipping the shell on Unix systems was for additional security -- the called program sees the same thing either way. Should we use this as a "teachable moment" involving the subprocess module in the tutorial? -Eric From anaryin at gmail.com Thu Jun 10 19:45:02 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 10 Jun 2010 18:45:02 -0500 Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements In-Reply-To: References: Message-ID: Hello all, I'm having some issues dealing with this :x I created a module Bio.Struct that has the following contents: __init__.py Protein.py WWW/ The __init__.py file has a read() method that calls PDBParser and returns a Structure object. So far so good I think. Then I added a method to Bio.PDB.Structure more or less like this: def as_protein(self): from Bio.Struct.Protein import Protein prot = Protein(self) return prot so when you call it you get a new object. Protein is a class that inherits from Structure and that has the search_ss_bonds function. I can make the new object get all the methods from Structure AND from Protein, but when I try to execute search_ss_bonds, it fails because child_list, a Structure method, comes empty.. In fact, the whole SMCRA object comes empty.. How do I effectively do the inheritance on the Protein class? from Bio.PDB.Structure import Structure class Protein(Structure): def __init__(self, protein): self = protein This is what I last tried and doesn't work.. I've tried Structure.__init__, and several other things but to no avail. I'm sure this is simple OOP but I really can't understand that well how to do it ... Care to give a hand to a friend in need? :) Thanks in advance! 
By the way, I assume that if I got no comments on anything else on the GSOC thread that I'm doing a perfect job :P Thanks for that too :D Best! João [...] Rodrigues From eric.talevich at gmail.com Thu Jun 10 21:49:39 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 10 Jun 2010 21:49:39 -0400 Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements In-Reply-To: References: Message-ID: On Thu, Jun 10, 2010 at 7:45 PM, João Rodrigues wrote: > Hello all, > > I'm having some issues dealing with this :x > > I created a module Bio.Struct that has the following contents: > > __init__.py > Protein.py > WWW/ > > The __init__.py file has a read() method that calls PDBParser and returns a > Structure object. So far so good I think. Then I added a method to > Bio.PDB.Structure more or less like this: > > def as_protein(self): > > from Bio.Struct.Protein import Protein > prot = Protein(self) > return prot > > so when you call it you get a new object. Protein is a class that inherits > from Structure and that has the search_ss_bonds function. > > I can make the new object get all the methods from Structure AND from > Protein, but when I try to execute search_ss_bonds, it fails because > child_list, a Structure method, comes empty.. In fact, the whole SMCRA > object comes empty.. > > How do I effectively do the inheritance on the Protein class? > > from Bio.PDB.Structure import Structure > > class Protein(Structure): > > def __init__(self, protein): > > self = protein > > This is what I last tried and doesn't work.. I've tried Structure.__init__, > and several other things but to no avail. I'm sure this is simple OOP but I > really can't understand that well how to do it ... > > Care to give a hand to a friend in need? :) > > Thanks in advance! By the way, I assume that if I got no comments on > anything else on the GSOC thread that I'm doing a perfect job :P Thanks for > that too :D > > Best! > > João [...]
Rodrigues > Hi João, You have it mostly correct, but you need to call the parent class's constructor, too. Here's the constructor for Structure: def __init__(self, id): self.level="S" Entity.__init__(self, id) And here it is for Entity: def __init__(self, id): self.id=id self.full_id=None self.parent=None self.child_list=[] self.child_dict={} # Dictionary that keeps additional properties self.xtra={} See the problem? Every subclass of Entity takes an "id" argument and sets the other attributes separately. In Bio.Phylo, I used another convention for converting an object of one type to a sub-class of the original type, as you're doing here. Rather than change the arguments to the constructor (which could have weird side-effects), I added a class method in the target class: @classmethod def from_structure(cls, struct): # Instantiate a Protein with the structure's id # Assign the other attributes individually from struct Then Structure.as_protein() becomes fairly simple. Alternatively, you could skip implementing Protein.from_structure() and do the attribute reassignment in as_protein(). Or, covering all the options, implement from_structure() but not as_protein(), and let the user figure it out. Do you think it would also be useful if as_protein() or from_structure() dropped any non-protein molecules during the conversion, and raised an error if nothing's left? Or would that cause more problems than it solves? Best, Eric From biopython at maubp.freeserve.co.uk Mon Jun 14 10:44:50 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 14 Jun 2010 15:44:50 +0100 Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com> References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com> <320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com> Message-ID: Hi all, You may recall late last year I posted about adding a reverse complement method to the SeqRecord, and addition support: http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html SeqRecord addition was included in Biopython 1.53, but not the reverse_complement() method - which is something I wanted to use again today to reverse complement an annotated GenBank file and have all the SeqFeature locations flipped for me. I've rescued my old code and its unit tests and created a new branch for it: http://github.com/peterjc/biopython/commits/seqrecord-rc As I said at the end of last year, I think the general idea of a SeqRecord reverse_complement() method is nice but the details about handling the annotation is tricky. When we discussed slicing and addition, it was agreed that we should be cautious to avoid blindly transferring annotation inappropriately. The code on this branch allows the user to choose for each annotation type if it should be dropped (False), kept (True) or set to a supplied new value. The docstring has examples of how this works (which double as doctests). Jose - I've CC'd you since I know you wrote your own SeqRecord subclass with a complement() method (but not a reverse_complement() method) for Franklin. I'm curious about this choice. 
Cedar - I've CC'd you since you asked about this kind of thing last year: http://lists.open-bio.org/pipermail/biopython/2009-June/005307.html Regards, Peter From biopython at maubp.freeserve.co.uk Mon Jun 14 10:50:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 14 Jun 2010 15:50:31 +0100 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: <20100614164348.186267pfu17v2ntw@horde.genesilico.pl> References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de> <20100614164348.186267pfu17v2ntw@horde.genesilico.pl> Message-ID: On Mon, Jun 14, 2010 at 3:43 PM, Kristian Rother wrote: > > > Hi Peter, > > just digesting BioPy mails from last week. > >>> Where should the str subclass for secondary structures that the parsers >>> create go? Could it be Bio.Struct.RNA? >> >> You don't think plain strings in the SeqRecord's letter_annotation >> dict would be enough? > > Not really - base pairing makes most normal string functions useless. > > >> Assuming you do need something then >> perhaps under Bio.Seq or Bio.SeqUtils might be worth considering >> as alternatives to Bio.Struct.RNA. > > OK, I'll try that. > > Thanks, > Kristian > > Hi Kristian, Could you explain a little more about why plain strings wouldn't be suitable here? What kind of things do you want to do with them?
Peter From krother at rubor.de Mon Jun 14 10:55:21 2010 From: krother at rubor.de (Kristian Rother) Date: Mon, 14 Jun 2010 16:55:21 +0200 Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements Message-ID: <1cf21a9224e1cd3ad4c8e2853d99100b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XXwtdXg==-webmailer2@server03.webmailer.hosteurope.de> Hi guys, I'm fine with your ideas regarding different wrappers for Bio.PDB.Structure objects discussed last week, in particular: - creating Bio.Struct.RNA or Bio.PDB.RNA with a Structure instance. - having a structure.as_rna() helper method as suggested by Eric (but this is not a must). I'd like to take what João does for proteins and add some basic equivalent for RNA structures shortly after. Best Regards, Kristian Quoting Thomas Hamelryck : > Hi, > > On Tue, Jun 8, 2010 at 7:10 PM, João Rodrigues wrote: > >> >> from Bio.PDB import Protein >> structure = Protein('1ABC.pdb') >> structure.search_ss_bonds() >> > > Indeed, that would run into problems for complexes where proteins, RNA, DNA, > etc. occur in the same file. It makes much more sense to have a Structure > centred approach: > > proteins=Protein(structure) > chains=proteins.get_chains() > chain_a=chains["A"] > polypeptides=chain_a.get_peptides() > > rnas=RNA(structure) > > etc.
> > -Thomas From krother at rubor.de Mon Jun 14 11:01:48 2010 From: krother at rubor.de (Kristian Rother) Date: Mon, 14 Jun 2010 17:01:48 +0200 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de> <20100614164348.186267pfu17v2ntw@horde.genesilico.pl> Message-ID: Hi, much of what I do with RNA secondary structures strongly depends on iterating base pairs, e.g.. >>> sec = Secstruc("(((...)).)") >>> for bp in sec.basepairs(): >>> print bp (0, 9) (1, 7) (2, 6) also: >>> sec.get_helices() >>> sec.get_bulges() >>> sec.get_hairpins() >>> sec.contains_pseudoknot() .. and a couple of similar ones. The reason why I'd prefer to have something more than a string as a sec feature is that I wouldn't want to do all the time: sec = Secstruc(my_seq['secondary_structure']) sec.get_helices() but my_seq['secondary_structure'].get_helices() instead. Best Regards, Kristian >> Hi Peter, >> >> just digesting BioPy mails from last week. >> >>>> Where should the str subclass for secondary structures that the >>>> parsers >>>> create go? Could it be Bio.Struct.RNA? >>> >>> You don't think plain strings in the SeqRecord's letter_annotation >>> dict would be enough? >> >> Not really - base pairing makes most normal string functions useless. >> >> >>> Assuming you do need something then >>> perhaps under Bio.Seq or Bio.SeqUtils might be worth considering >>> as alternatives to Bio.Struct.RNA. >> >> OK, I'll try that. >> >> Thanks, >> ? Kristian >> >> > > Hi Kristian, > > Could you explain at little more about why plain strings wouldn't be > suitable here. What kind of things do you want to do with them? 
> > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > From krother at rubor.de Mon Jun 14 11:13:19 2010 From: krother at rubor.de (Kristian Rother) Date: Mon, 14 Jun 2010 17:13:19 +0200 Subject: [Biopython-dev] creating Protein(structure) object Message-ID: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de> Hi Joao, what you are describing is the classical Decorator Pattern (see http://en.wikipedia.org/wiki/Decorator_pattern). In the books, they say that the Decorator (Protein) must implement all methods of the decorated object (Structure). Of course, for a class as big as Bio.PDB.Structure, this sucks a lot. I see two alternatives: (1) override Protein.__getattr__(self, attr) to return self.struc.attr if it exists. I tried this recently and it worked fine until the decorated class used Python properties, when it started getting ugly again. (2) have Protein inherit from Structure, and grab all the children from the structure class, e.g.: class Protein(Structure): def __init__(self, struc): """ The given Structure instance becomes a Protein. """ Structure.__init__(self, struc.id) for child in struc.child_list: # eventually check if its a protein chain. self.add_child(child) Any comments? Kristian > Hello all, > > I'm having some issues dealing with this :x > > I created a module Bio.Struct that has the following contents: > > __init__.py > Protein.py > WWW/ > > The __init__.py file has a read() method that calls PDBParser and returns a > Structure object. So far so good I think. Then I added a method to > Bio.PDB.Structure more or less like this: > > def as_protein(self): > from Bio.Struct.Protein import Protein > prot = Protein(self) > return prot > > so when you call it you get a new object. 
Protein is a class that inherits > from Structure and that has the search_ss_bonds function. > > I can make the new object get all the methods from Structure AND from > Protein, but when I try to execute search_ss_bonds, it fails because > child_list, a Structure method, comes empty.. In fact, the whole SMCRA > object comes empty.. > > How do I effectively do the inheritance on the Protein class? > > from Bio.PDB.Structure import Structure > > class Protein(Structure): > > def __init__(self, protein): > > self = protein > > This is what I last tried and doesn't work.. I've tried Structure.__init__, > and several other things but to no avail. I'm sure this is simple OOP but I > really can't understand that well how to do it ... > > Care to give a hand to a friend in need? :) > > Thanks in advance! By the way, I assume that if I got no comments on > anything else on the GSOC thread that I'm doing a perfect job :P Thanks for > that too :D > > Best! > > Jo?o [...] Rodrigues > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > > From biopython at maubp.freeserve.co.uk Mon Jun 14 11:23:25 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 14 Jun 2010 16:23:25 +0100 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de> <20100614164348.186267pfu17v2ntw@horde.genesilico.pl> Message-ID: On Mon, Jun 14, 2010 at 4:01 PM, Kristian Rother wrote: > > Hi, > > much of what I do with RNA secondary structures strongly depends on > iterating base pairs, e.g.. 
> >>>> sec = Secstruc("(((...)).)") >>>> for bp in sec.basepairs(): >>>>     print bp > (0, 9) > (1, 7) > (2, 6) > > also: >>>> sec.get_helices() >>>> sec.get_bulges() >>>> sec.get_hairpins() >>>> sec.contains_pseudoknot() > .. and a couple of similar ones. > > The reason why I'd prefer to have something more than a string as a sec > feature is that I wouldn't want to do all the time: > > sec = Secstruc(my_seq['secondary_structure']) > sec.get_helices() > > but > > my_seq['secondary_structure'].get_helices() > > instead. > > Best Regards, > Kristian That helped - thanks. Does your Secstruc object behave like a Python sequence (string/list/tuple) in that it has a length and can be sliced (as if acting on the string representation)? If so then it should be fine to store in the SeqRecord's letter_annotation dictionary. Peter From krother at rubor.de Mon Jun 14 11:41:05 2010 From: krother at rubor.de (Kristian Rother) Date: Mon, 14 Jun 2010 17:41:05 +0200 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de> <20100614164348.186267pfu17v2ntw@horde.genesilico.pl> Message-ID: <3e6714450418534d741476aa0b64b374-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1WWAhZWg==-webmailer2@server03.webmailer.hosteurope.de> Hi Peter, > That helped - thanks. Does your Secstruc object behave like a Python > sequence (string/list/tuple) in that it has a length and can be sliced Yes, it does. > If so then it should be fine to > store in the SeqRecord's letter_annotation dictionary.
Best, Kristian > On Mon, Jun 14, 2010 at 4:01 PM, Kristian Rother wrote: >> >> Hi, >> >> much of what I do with RNA secondary structures strongly depends on >> iterating base pairs, e.g.. >> >>>>> sec = Secstruc("(((...)).)") >>>>> for bp in sec.basepairs(): >>>>>     print bp >> (0, 9) >> (1, 7) >> (2, 6) >> >> also: >>>>> sec.get_helices() >>>>> sec.get_bulges() >>>>> sec.get_hairpins() >>>>> sec.contains_pseudoknot() >> .. and a couple of similar ones. >> >> The reason why I'd prefer to have something more than a string as a sec >> feature is that I wouldn't want to do all the time: >> >> sec = Secstruc(my_seq['secondary_structure']) >> sec.get_helices() >> >> but >> >> my_seq['secondary_structure'].get_helices() >> >> instead. >> >> Best Regards, >> Kristian > > That helped - thanks. Does your Secstruc object behave like a Python > sequence (string/list/tuple) in that it has a length and can be sliced (as > if acting on the string representation)? If so then it should be fine to > store in the SeqRecord's letter_annotation dictionary.
> > Peter > > From anaryin at gmail.com Mon Jun 14 13:58:56 2010 From: anaryin at gmail.com (João Rodrigues) Date: Mon, 14 Jun 2010 12:58:56 -0500 Subject: [Biopython-dev] creating Protein(structure) object In-Reply-To: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de> References: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de> Message-ID: Hello Kristian, The way I'm doing it as a workaround is: class Protein(Structure): def __init__(self, protein): Structure.__init__(self, protein.id) self.full_id = protein.full_id self.child_list = protein.child_list self.child_dict = protein.child_dict self.parent = protein.parent self.xtra = protein.xtra It works because every method I'm using deepcopies this anyway.. Adding the children one by one seems the correct way to go but it won't copy headers... should we want this? Thanks :) J From eric.talevich at gmail.com Mon Jun 14 16:27:24 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 14 Jun 2010 16:27:24 -0400 Subject: [Biopython-dev] creating Protein(structure) object In-Reply-To: References: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de> Message-ID: Hi guys, Another convention with the Decorator pattern is to ensure that all of the method arguments that existed in the original class are also present in the decorated one. This includes the constructor. Decoration simply adds another feature to whatever was already there.
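A minimal sketch of that convention, assuming simplified stand-ins for Entity and Structure rather than the real Bio.PDB classes. The subclass keeps the parent's `id` constructor argument, and the conversion lives in a separate class method; `from_structure` and the deep-copy strategy here are illustrative assumptions, not existing Biopython API:

```python
import copy


class Entity(object):
    """Simplified stand-in for Bio.PDB's Entity (an assumption, not the real class)."""

    def __init__(self, id):
        self.id = id
        self.full_id = None
        self.parent = None
        self.child_list = []
        self.child_dict = {}
        self.xtra = {}

    def add(self, child):
        # Attach a child and keep the list and dict views in sync.
        child.parent = self
        self.child_list.append(child)
        self.child_dict[child.id] = child


class Structure(Entity):
    def __init__(self, id):
        self.level = "S"
        Entity.__init__(self, id)


class Protein(Structure):
    def __init__(self, id):
        # Same signature as the parent constructor.
        Structure.__init__(self, id)

    @classmethod
    def from_structure(cls, struct):
        # Hypothetical converter: copy first, so later edits to the
        # original structure do not leak into the new Protein.
        clone = copy.deepcopy(struct)
        prot = cls(clone.id)
        prot.full_id = clone.full_id
        prot.xtra = clone.xtra
        prot.child_list = clone.child_list
        prot.child_dict = clone.child_dict
        for child in prot.child_list:
            child.parent = prot  # re-point children at the new parent
        return prot


structure = Structure("1ABC")
structure.add(Entity(0))
protein = Protein.from_structure(structure)
```

Copying the whole structure in one deepcopy call keeps child_list and child_dict pointing at the same copied children, which two separate copies would not.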
João Rodrigues wrote: > Hello Kristian, > > The way I'm doing it as a workaround is: > > class Protein(Structure): > > def __init__(self, protein): > > Structure.__init__(self, protein.id) > > self.full_id = protein.full_id > self.child_list = protein.child_list > self.child_dict = protein.child_dict > self.parent = protein.parent > self.xtra = protein.xtra > The way the constructors of Structure and other Entity subclasses work is to create a new object with the appropriate, empty attributes -- i.e. no children. Other code then attaches children to the class. To decorate a Structure with Protein-specific functionality, I would consider: 1. The Entity constructor takes an ID, and creates empty containers for child Entities. (Models, in this case.) So Protein.__init__ needs to start like: class Protein(Structure): def __init__(self, id): # take any keyword arguments? Structure.__init__(self, id) # handle any keyword arguments here 2. We need to be able to convert an existing Structure to a new Protein. That's new functionality, so it needs either a keyword argument in __init__, or a separate method or function. If we add a keyword argument to __init__, then the implementation is basically two completely different operations depending on if a Structure was passed or not. Plus, there's still that 'id' argument to deal with. 3. Instantiating a Protein directly would mean importing the Bio.Struct.Protein module manually, in addition to "from Bio import Struct". More to the point, Bio.Struct.Protein consists of lower-level functionality that a casual Struct user shouldn't have to dig into, as long as Structure.as_protein() exists. So there's no value in making Protein.__init__ "do what I mean" at the expense of clarity in the code. Better to make the code very obvious and explicit here, and focus on API prettiness from a different angle. 4. The next most convenient place for Structure-to-Protein conversion is on the Structure class.
This presents a nice API that will be sufficient for most users: from Bio import Struct prot = Struct.read('1ABC.pdb').as_protein() But, going back to OOP principles, the Structure class shouldn't need to know anything about the Protein class's internals -- though it's free to call any public method and make things nicer for the user. So, finally, we need a class method* on Protein that Structure.as_protein() can call. Hence, Protein.from_structure(). [*] A class method can be called without first instantiating the class. Since we're trying to construct a new object here, we need to be able to call this Protein method before the Protein object exists. No worries, just use the @classmethod decorator. > It works because every method I'm using deepcopies this anyway.. > If someone modifies the original Structure object after you've created a Protein this way -- e.g. renumbering residues, or with their own function -- it will also modify the Protein object, since lists and dicts are shared. Is this what you want? If you're concerned about memory usage, you can also look at implementing __deepcopy__. > The way of adding the children seems the correct way to go but it won't copy > headers... should we want this? > Your code for copying the Structure's children looks right to me, except I think it's best to be a little paranoid with Python lists and make deep copies anyway. I suppose you could also copy any header info that's relevant to proteins, using the same approach. Best, Eric From anaryin at gmail.com Mon Jun 14 23:06:03 2010 From: anaryin at gmail.com (João Rodrigues) Date: Mon, 14 Jun 2010 22:06:03 -0500 Subject: [Biopython-dev] creating Protein(structure) object In-Reply-To: References: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de> Message-ID: Ok, thanks for the long explanation!
I'll merge what you and Kristian said and come up with a better interface. As is, I call it like this: s = Struct.read("1abc.pdb") # by the way, I added a trick to avoid the mandatory name of the structure p = s.as_protein() Best J From jblanca at btc.upv.es Tue Jun 15 01:55:45 2010 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 15 Jun 2010 07:55:45 +0200 Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method? In-Reply-To: References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com> <320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com> Message-ID: <201006150755.45162.jblanca@btc.upv.es> On Monday 14 June 2010 16:44:50 Peter wrote: > Hi all, > > You may recall late last year I posted about adding a reverse > complement method to the SeqRecord, and addition support: > http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html > http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html > > SeqRecord addition was included in Biopython 1.53, but > not the reverse_complement() method - which is something > I wanted to use again today to reverse complement an > annotated GenBank file and have all the SeqFeature > locations flipped for me. I've rescued my old code and > its unit tests and created a new branch for it: > http://github.com/peterjc/biopython/commits/seqrecord-rc > > As I said at the end of last year, I think the general idea of > a SeqRecord reverse_complement() method is nice but the > details about handling the annotation is tricky. When we > discussed slicing and addition, it was agreed that we > should be cautious to avoid blindly transferring annotation > inappropriately. The code on this branch allows the user to > choose for each annotation type if it should be dropped > (False), kept (True) or set to a supplied new value. The > docstring has examples of how this works (which double > as doctests). Having a reverse_complement method would be useful for us.
But it could be quite tricky to reverse complement some features. For instance we have SNP features that include a reference nucleotide. We would have to complement that nucleotide too. Regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Jun 15 05:08:14 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Jun 2010 10:08:14 +0100 Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method? In-Reply-To: <201006150755.45162.jblanca@btc.upv.es> References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com> <320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com> <201006150755.45162.jblanca@btc.upv.es> Message-ID: On Tue, Jun 15, 2010 at 6:55 AM, Jose Blanca wrote: > > Having a reverse_complement method would be useful for us. But it could be > quite tricky to reverse complement some features. For instance we have SNP > features that include a reference nucleotide. We would have to complement that > nucleotide too. > Could you give an example? I assume you are talking about the annotation of the feature (i.e. the qualifiers dictionary of a SeqFeature object). Peter From jblanca at btc.upv.es Tue Jun 15 05:23:27 2010 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 15 Jun 2010 11:23:27 +0200 Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method? In-Reply-To: References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com> <201006150755.45162.jblanca@btc.upv.es> Message-ID: <201006151123.27158.jblanca@btc.upv.es> On Tuesday 15 June 2010 11:08:14 Peter wrote: > On Tue, Jun 15, 2010 at 6:55 AM, Jose Blanca wrote: > > Having a reverse_complement method would be useful for us.
But it could > > be quite tricky to reverse complement some features. For instance we have > > SNP features that include a reference nucleotide. We would have to > > complement that nucleotide too. > > Could you give an example? I assume you are talking about the annotation > of the feature (i.e. the qualifiers dictionary of a SeqFeature object). That is right; in some instances the qualifiers should be modified. For instance if we have an ORF with a qualifier 'forward':True, it should be changed. I don't think this change can be done automatically. -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Jun 15 05:42:47 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Jun 2010 10:42:47 +0100 Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method? In-Reply-To: <201006151123.27158.jblanca@btc.upv.es> References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com> <201006150755.45162.jblanca@btc.upv.es> <201006151123.27158.jblanca@btc.upv.es> Message-ID: On Tue, Jun 15, 2010 at 10:23 AM, Jose Blanca wrote: > On Tuesday 15 June 2010 11:08:14 Peter wrote: >> On Tue, Jun 15, 2010 at 6:55 AM, Jose Blanca wrote: >> > Having a reverse_complement method would be useful for us. But it could >> > be quite tricky to reverse complement some features. For instance we have >> > SNP features that include a reference nucleotide. We would have to >> > complement that nucleotide too. >> >> Could you give an example? I assume you are talking about the annotation >> of the feature (i.e. the qualifiers dictionary of a SeqFeature object). > > That is right; in some instances the qualifiers should be modified.
For > instance if we have an ORF with a qualifier 'forward':True, it should be > changed. I don't think this change can be done automatically . Yes, that sort of thing would be very difficult to do automatically. We come back to the question of what the default should be - blindly copy, or just drop this information. I would say for most feature annotation (and I am thinking about GenBank and EMBL style files here) there isn't anything strand specific to worry about, so in general copying is fine. Clearly this is not a safe assumption for SNP features. Peter From krother at rubor.de Tue Jun 15 10:06:52 2010 From: krother at rubor.de (Kristian Rother) Date: Tue, 15 Jun 2010 16:06:52 +0200 Subject: [Biopython-dev] RNA Alphabet: request for comments Message-ID: <485134f2f1ebae4701d6fbcdfcdee3ee-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5UWABeXQ==-webmailer2@server02.webmailer.hosteurope.de> Hi, I've commited a proof-of-concept implementation how modified RNA bases could be made compatible to Biopython Alphabets. Comments are very welcome, especially because I had to change two lines in the Seq class to make it work. The code can be viewed on: http://github.com/krother/biopython/commit/d9f942936d6165703512099a6a2d84452fea27aa (on github: krother/biopython, branch rna_alphabet). The two main classes are: RNAAlphabetEntry(str) that contains different abbreviations for one base. and ModifiedRNAString(str) that behaves like a string except that it iterates through RNAAlphabetEntry objects. Thus, you can do: >>> from Bio.Alphabet.ModifiedRNAAlphabet import modified_rna >>> from Bio.Seq import Seq >>> from Bio.RNA.ModifiedRNAString import ModifiedRNAString >>> >>> mod_seq = ModifiedRNAString('AA:"A') >>> seq = Seq(mod_seq, modified_rna) >>> for char in seq: >>> print char adenosine adenosine 2-O-methyladenosine 1-methyladenosine adenosine (see Unit test for details). 
Best Regards, Kristian From biopython at maubp.freeserve.co.uk Tue Jun 15 10:46:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Jun 2010 15:46:10 +0100 Subject: [Biopython-dev] RNA Alphabet: request for comments In-Reply-To: <485134f2f1ebae4701d6fbcdfcdee3ee-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5UWABeXQ==-webmailer2@server02.webmailer.hosteurope.de> References: <485134f2f1ebae4701d6fbcdfcdee3ee-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5UWABeXQ==-webmailer2@server02.webmailer.hosteurope.de> Message-ID: On Tue, Jun 15, 2010 at 3:06 PM, Kristian Rother wrote: > > Hi, > > I've commited a proof-of-concept implementation how modified RNA bases > could be made compatible to Biopython Alphabets. Comments are very > welcome, especially because I had to change two lines in the Seq class to > make it work. > > The code can be viewed on: > http://github.com/krother/biopython/commit/d9f942936d6165703512099a6a2d84452fea27aa > (on github: krother/biopython, branch rna_alphabet). > > The two main classes are: > RNAAlphabetEntry(str) that contains different abbreviations for one base. > and > ModifiedRNAString(str) that behaves like a string except that it iterates > through RNAAlphabetEntry objects. > Why not create a Seq subclass instead of your class ModifiedRNAString(str)? This would then implement suitable (reverse) complement etc. I would also have __iter__ and __getitem__ for a single letter return an instance of RNAAlphabetEntry (which would act like a single character string). Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 15 12:23:00 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Jun 2010 12:23:00 -0400 Subject: [Biopython-dev] [Bug 3060] Add ungap method to the SeqRecord? 
In-Reply-To: Message-ID: <201006151623.o5FGN0K6028619@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3060 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-15 12:22 EST ------- Patch applied to this branch: http://github.com/peterjc/biopython/tree/seqrecord-rc -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From krother at rubor.de Wed Jun 16 04:32:29 2010 From: krother at rubor.de (Kristian Rother) Date: Wed, 16 Jun 2010 10:32:29 +0200 Subject: [Biopython-dev] RNA Alphabet: request for comments Message-ID: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de> Hi Peter, > Why not create a Seq subclass instead of your class ModifiedRNAString(str)? This turned out to be a lot simpler. Worked right away. New commit at: http://github.com/krother/biopython/commit/b0a6071f2b08a4f9bfee33a8d675c0e21b60ba70 more comments welcome. Next steps from my side would be: 1) add all modifications to the Alphabet. 2) add some RNA-specific methods. 3) add more tests. 4) sync with latest master branch. 5) request code merge. Best regards, Kristian Quoting Peter : > On Tue, Jun 15, 2010 at 3:06 PM, Kristian Rother wrote: >> >> Hi, >> >> I've commited a proof-of-concept implementation how modified RNA bases >> could be made compatible to Biopython Alphabets. Comments are very >> welcome, especially because I had to change two lines in the Seq class to >> make it work. >> >> The code can be viewed on: >> http://github.com/krother/biopython/commit/d9f942936d6165703512099a6a2d84452fea27aa >> (on github: krother/biopython, branch rna_alphabet). >> >> The two main classes are: >> RNAAlphabetEntry(str) that contains different abbreviations for one base. 
>> and >> ModifiedRNAString(str) that behaves like a string except that it iterates >> through RNAAlphabetEntry objects. >> > > Why not create a Seq subclass instead of your class ModifiedRNAString(str)? > This would then implement suitable (reverse) complement etc. > > I would also have __iter__ and __getitem__ for a single letter return > an instance > of RNAAlphabetEntry (which would act like a single character string). > > Peter > > > > From biopython at maubp.freeserve.co.uk Wed Jun 16 04:51:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Jun 2010 09:51:03 +0100 Subject: [Biopython-dev] RNA Alphabet: request for comments In-Reply-To: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de> References: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de> Message-ID: On Wed, Jun 16, 2010 at 9:32 AM, Kristian Rother wrote: > > Hi Peter, > >> Why not create a Seq subclass instead of your class ModifiedRNAString(str)? > > This turned out to be a lot simpler. Worked right away. New commit at: > > http://github.com/krother/biopython/commit/b0a6071f2b08a4f9bfee33a8d675c0e21b60ba70 > > more comments welcome. Why do you need the _set_sequence method? Why not just put that small piece of code inside the __init__ method? > Next steps from my side would be: > > 1) add all modifications to the Alphabet. > 2) add some RNA-specific methods. > 3) add more tests. > 4) sync with latest master branch. > 5) request code merge. > > Best regards, > ? ? Kristian If this works out we should look at doing a Protein 3-letter code version for use with PDB sequences (I'm thinking about the modified amino acids). 
Peter From krother at rubor.de Wed Jun 16 05:03:37 2010 From: krother at rubor.de (Kristian Rother) Date: Wed, 16 Jun 2010 11:03:37 +0200 Subject: [Biopython-dev] RNA Alphabet: request for comments In-Reply-To: References: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de> Message-ID: Hi Peter, > Why do you need the _set_sequence method? Why not just put that > small piece of code inside the __init__ method? In _set_sequence there'll be a small parser taking care of modifications where the one-letter abbreviations do not suffice. E.g. a sequence could be "CCC022UCCC" (22U is a 5-hydroxyuridine). --> being parsed into a list of RNAAlphabetEntries ['C','C','C','22U','C','C','C'] So the code will grow a little, but the basic idea stays the same. If someone wants a one-letter representation, it could be "CCCxCCC", but this is degenerate because 'x' is used for several modifications. Best Regards, Kristian >>> Why not create a Seq subclass instead of your class >>> ModifiedRNAString(str)? >> >> This turned out to be a lot simpler. Worked right away. New commit at: >> >> http://github.com/krother/biopython/commit/b0a6071f2b08a4f9bfee33a8d675c0e21b60ba70 >> >> more comments welcome. > > Why do you need the _set_sequence method? Why not just put that > small piece of code inside the __init__ method? > >> Next steps from my side would be: >> >> 1) add all modifications to the Alphabet. >> 2) add some RNA-specific methods. >> 3) add more tests. >> 4) sync with latest master branch. >> 5) request code merge. >> >> Best regards, >> ? ? Kristian > > If this works out we should look at doing a Protein 3-letter code version > for use with PDB sequences (I'm thinking about the modified amino acids). 
> > Peter > > From biopython at maubp.freeserve.co.uk Wed Jun 16 05:41:35 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Jun 2010 10:41:35 +0100 Subject: [Biopython-dev] RNA Alphabet: request for comments In-Reply-To: References: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de> Message-ID: On Wed, Jun 16, 2010 at 10:03 AM, Kristian Rother wrote: > > Hi Peter, > >> Why do you need the ?_set_sequence method? Why not just put that >> small piece of code inside the __init__ method? > > In _set_sequence there'll be a small parser taking care of modifications > where the one-letter abbreviations do not suffice. E.g. a sequence could > be > > "CCC022UCCC" > > (22U is a 5-hydroxyuridine). > > --> being parsed into a list of RNAAlphabetEntries > ['C','C','C','22U','C','C','C'] > > So the code will grow a little, but the basic idea stays the same. > > If someone wants a one-letter representation, it could be "CCCxCCC", but > this is degenerate because 'x' is used for several modifications. > > Best Regards, > ? Kristian Thinking ahead, we are planning to make the Seq objects use string comparison instead of object identity. When that happens, I would suggest in your subclass you implement the the equality method so that if you are comparing against another instance of the modified RNA Seq compare at the more detailed "22U" level, and if not then for compatibility compare at the single letter level ("x" even though degenerate). 
Peter

From bugzilla-daemon at portal.open-bio.org Wed Jun 16 08:43:07 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 16 Jun 2010 08:43:07 -0400
Subject: [Biopython-dev] [Bug 3100] New: Bio.PDB.ResidueDepth distance calculation error
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3100

Summary: Bio.PDB.ResidueDepth distance calculation error
Product: Biopython
Version: 1.54b
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: andres.colubri at gmail.com

ResidueDepth.py in Bio.PDB contains an error at line 100:

    d2=sum(d*d, 1)

This uses the built-in sum() function, which treats the second argument as a
start value: it adds up the rows of d*d, starting from 1. It should use
numpy's sum instead, which sums along axis 1:

    d2=numpy.sum(d*d, 1)

To check the error, try the following code:

    from Bio.PDB import *
    from Bio.PDB.ResidueDepth import *

    parser = PDBParser()
    str = parser.get_structure('test', '3M38.pdb')
    surf = get_surface('3M38.pdb', PDB_TO_XYZR='./pdb_to_xyzr', MSMS='./msms')
    print min_dist(surf[10], surf)

3M38.pdb could be replaced by any other pdb file. The result printed to the
console should be zero, since we are calculating the minimum distance to the
surface of a point belonging to the surface. Instead it gives a value greater
than zero.

From lueck at ipk-gatersleben.de Wed Jun 16 09:18:00 2010
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Wed, 16 Jun 2010 15:18:00 +0200
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris
In-Reply-To:
References:
Message-ID: <001a01cb0d56$581dd610$1022a8c0@ipkgatersleben.de>

Hello!

Sorry for the late reply but I just came back from my holidays.
I was at EuroSciPy 2009 and it was really great (I also gave a talk in which
Biopython was mentioned several times ;-). Since it was problematic to go last
time, I decided to skip it this year (basically I would have to come
privately). Unfortunately I only now hear that the Biopython people will be
there, and I would be very interested to meet you, since I'm using Biopython a
lot. I have to see what I can still do.

It would be great to meet!
Stefanie

-----Original Message-----
From: biopython-dev-bounces at lists.open-bio.org [mailto:biopython-dev-bounces at lists.open-bio.org] On Behalf Of Peter
Sent: Saturday, 5 June 2010 16:50
To: Biopython-Dev Mailing List
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris

Hi all,

Are any Biopython folk planning to be at the EuroSciPy conference in Paris
this year (July 2010)? They are still finalising the Scientific track, but the
list of tutorials is quite interesting already:
http://www.euroscipy.org/conference/euroscipy2010

Peter
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev

From bugzilla-daemon at portal.open-bio.org Fri Jun 18 09:19:02 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 09:19:02 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastaq
In-Reply-To:
Message-ID: <201006181319.o5IDJ2Oj022977@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102

cjfields at bioperl.org changed:

What |Removed |Added
----------------------------------------------------------------------------
AssignedTo|bioperl-guts-l at bioperl.org |biopython-dev at biopython.org

------- Comment #3 from cjfields at bioperl.org 2010-06-18 09:18 EST -------
(In reply to comment #2)
> (In reply to comment #1)
> > I'm making a wild guess that this is Biopython and not BioPerl.
> > Yes, it's Biopython, Can you halp me, please? or can you give me a link where > to find the answer for my problem? Thank you very much. Reassigning to the Biopython devs. This should go to their list now, hopefully you'll get a response. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 18 09:45:37 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 18 Jun 2010 09:45:37 -0400 Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq In-Reply-To: Message-ID: <201006181345.o5IDjbNB023730@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3102 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Error converting sff into |Error converting sff into |fastaq |fastq ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-18 09:45 EST ------- Thanks Chris. Giorgio - Could you confirm which version of Biopython are you using? To me the error message suggests the SFF file is corrupted (damaged). Is it very large? Could you attach it to this bug (or email it to me personally) to check? Have you been able to process the SFF file with any other tools (e.g. sff_extract which should work on Windows/Linux/Mac, or the Roche tools which are Linux only)? If you copied the SFF file over your network, or over the internet from your sequencing center, perhaps there was an error there. Could you try re-downloading the SFF file? Regards, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jun 18 11:03:45 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 18 Jun 2010 11:03:45 -0400 Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq In-Reply-To: Message-ID: <201006181503.o5IF3j23025689@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3102 ------- Comment #5 from gcasaburi at tiscali.it 2010-06-18 11:03 EST ------- (In reply to comment #4) > Thanks Chris. > Giorgio - Could you confirm which version of Biopython are you using? > To me the error message suggests the SFF file is corrupted (damaged). Is it > very large? Could you attach it to this bug (or email it to me personally) to > check? > Have you been able to process the SFF file with any other tools (e.g. > sff_extract which should work on Windows/Linux/Mac, or the Roche tools which > are Linux only)? > If you copied the SFF file over your network, or over the internet from your > sequencing center, perhaps there was an error there. Could you try > re-downloading the SFF file? > Regards, > Peter Thank u for the answer. I have the last version of Biopython, The file is 1,12 giga, so i think is difficult to attach the file. The file has been taken directly from the usb port of the 454 with a pendrive and now is in a normal PC. With Biopthon i'v been able to read and open this sff file, but at the end of the reading appers the message (Value error:...). So when i try to convert the file in fasta the same message apper to be, bloking any work. So why the file is open reading, with all information (flow, lewnght) but impossible to edit, convert??? Thank u hope u can help us. Grater from ITALY -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jun 18 11:28:01 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 11:28:01 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To:
Message-ID: <201006181528.o5IFS1iY026418@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-18 11:28 EST -------
(In reply to comment #5)
> Thank u for the answer. I have the last version of Biopython,

Good.

> The file is 1,12 giga, so i think is difficult to attach the file.

Yes, too big to attach or email :(

> The file has been taken directly from the usb port of the 454 with a
> pendrive and now is in a normal PC.

I would try copying it again using a different USB memory stick / pen drive.

> With Biopthon i'v been able to read and open this sff file, but at the end
> of the reading appers the message (Value error:...). So when i try to convert
> the file in fasta the same message apper to be, bloking any work. So why the
> file is open reading, with all information (flow, lewnght) but impossible to
> edit, convert??? Thank u hope u can help us.
> Grater from ITALY

It sounds like there is an error near the end of the file. You can open the
file and read lots of reads up until the error. If you use Bio.SeqIO.parse()
or Bio.SeqIO.convert() these will fail once you get to the bad read. Perhaps
the file is truncated (only partly copied)?

Peter
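The behaviour described in comment #6 (records can be read until the parser reaches the damaged one, then a ValueError stops everything) can be sketched roughly as below. This is only an illustration in current Python, not code from the thread: `salvage_records` and `_demo_reads` are hypothetical names, and the generator merely stands in for a parser working on a truncated file.

```python
def salvage_records(records):
    """Collect items from an iterator until it raises ValueError.

    Returns (good_records, error); error is None if the iterator
    finished without problems.
    """
    good = []
    iterator = iter(records)
    while True:
        try:
            record = next(iterator)
        except StopIteration:
            # The parser reached the end of the file cleanly.
            return good, None
        except ValueError as err:
            # The parser hit a damaged record: keep what was read so far.
            return good, err
        good.append(record)


def _demo_reads():
    # Hypothetical stand-in for a parser over a truncated file.
    yield "read1"
    yield "read2"
    raise ValueError("truncated read")


reads, problem = salvage_records(_demo_reads())
print(len(reads), problem)
```

With Biopython installed, the same helper could presumably be handed `SeqIO.parse(filename, "sff")` in place of the demo generator, so that the reads before the bad one can be counted or written out.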
From bugzilla-daemon at portal.open-bio.org Fri Jun 18 13:35:00 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 18 Jun 2010 13:35:00 -0400 Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq In-Reply-To: Message-ID: <201006181735.o5IHZ0SW030183@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3102 ------- Comment #7 from gcasaburi at tiscali.it 2010-06-18 13:35 EST ------- (In reply to comment #6) > (In reply to comment #5) > > Thank u for the answer. I have the last version of Biopython, > > Good. > > > The file is 1,12 giga, so i think is difficult to attach the file. > > Yes, too big to attach or email :( > > > The file has been taken directly from the usb port of the 454 with a > > pendrive and now is in a normal PC. > > I would try copying it again using a different USB memory stick / pen drive. > > > With Biopthon i'v been able to read and open this sff file, but at the end > > of the reading appers the message (Value error:...). So when i try to convert > > the file in fasta the same message apper to be, bloking any work. So why the > > file is open reading, with all information (flow, lewnght) but impossible to > > edit, convert??? Thank u hope u can help us. > > Grater from ITALY > > It sounds like there is an error is near the end of the file. You can open the > file and read lots of reads up until the error. If you use Bio.SeqIO.parse() > or Bio.SeqIO.convert() these will fail once you get to the bad read. Perhaps > the file is truncated (only partly copied)? > > Peter > I will try to recopy the file on another pendrive. I thought like you, may be the file has a corruption at the end. 
I don't think is truncated, in fact is a .sff that represents one region of the "ptp", but the same error appers with another file .sff2 that represents the second region of the "ptp" (diveded in two regions for the same "run", totally 2 regions, each for one sample, two samples in total). So i don't know if there is a syntax command to modify the error value. Thank you Giorgio -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 22 09:11:15 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 22 Jun 2010 09:11:15 -0400 Subject: [Biopython-dev] [Bug 3060] Add ungap method to the SeqRecord? In-Reply-To: Message-ID: <201006221311.o5MDBF8o003119@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3060 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-22 09:11 EST ------- (In reply to comment #0) > My motivating example is to take an ACE file loaded with SeqIO, remove the > gaps, and output the contigs as FASTQ or QUAL files. This requires the > per-letter-annotation to be sliced to match the ungapped sequence. > > Likewise any features fully contained within ungapped regions should be > retained and their co-ordinates shifted. I'm not sure if we should do anything > about features spanning a gap - the simple option which I have implemented is > they are lost. This is done via the existing SeqRecord slicing and addition > code. I've been trying building SeqFeature objects for the reads in an ACE file, http://github.com/peterjc/biopython/tree/ace-reads In this case when I call the SeqRecord ungap method, many of my read features are lost with the current implementation (because they included gaps). This also showed the ungap code to be quite slow for features. 
I'm going to have another look at this. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 22 10:58:39 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 22 Jun 2010 10:58:39 -0400 Subject: [Biopython-dev] [Bug 3060] Add ungap method to the SeqRecord? In-Reply-To: Message-ID: <201006221458.o5MEwd0I005797@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3060 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1482 is|0 |1 obsolete| | ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-22 10:58 EST ------- (From update of attachment 1482) (In reply to comment #3) > > I've been trying building SeqFeature objects for the reads in an ACE file, > http://github.com/peterjc/biopython/tree/ace-reads > > In this case when I call the SeqRecord ungap method, many of my read features > are lost with the current implementation (because they included gaps). This > also showed the ungap code to be quite slow for features. I'm going to have > another look at this. My new code handles SeqFeature ungapping so as to preserve all the features by adjusting their end points. This is also much faster: http://github.com/peterjc/biopython/tree/ungap2 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
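The idea in comment #4, keeping a feature by adjusting its end points instead of dropping it when it spans a gap, comes down to counting the gap characters to the left of each end point. The following is a minimal sketch under that assumption, using plain strings and (start, end) tuples rather than SeqRecord and SeqFeature objects; `ungap_position` and `ungap_feature` are hypothetical names and this is not the code on the ungap2 branch.

```python
def ungap_position(gapped_seq, position, gap="-"):
    """Map an index in the gapped sequence onto the ungapped sequence."""
    # Every gap character to the left of the position shifts it down by one.
    return position - gapped_seq.count(gap, 0, position)


def ungap_feature(gapped_seq, start, end, gap="-"):
    """Shift a (start, end) location, Python slice style, to ungapped coordinates."""
    return (ungap_position(gapped_seq, start, gap),
            ungap_position(gapped_seq, end, gap))


gapped = "AC--GT-A"
start, end = ungap_feature(gapped, 1, 6)   # feature covering "C--GT"
ungapped = gapped.replace("-", "")
print(ungapped, (start, end), ungapped[start:end])
```

A feature that covers only gap characters would collapse to zero length under this scheme, so a real implementation would need to decide what to do with those.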
From anaryin at gmail.com Tue Jun 22 15:25:17 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 22 Jun 2010 14:25:17 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
Message-ID:

Hello all,

I've been using some non-standard PDB files output by some programs, and
they lack the chemical element column in each ATOM line. I was looking at
the PDBParser code and element is dealt with like this:

        if element is None:
            import warnings
            from PDBExceptions import PDBConstructionWarning
            warnings.warn("Atom object (name=%s) without element" % name,
                          PDBConstructionWarning)
            element = "?"
            print name, "--> ?"
        elif len(element)>2 or element != element.upper() or element != element.strip():
            raise ValueError(element)
        self.element=element

In my case, the element value is not "None" but just an empty string - ' ' -
which matches none of these tests and is then passed on. This would be no
problem at all, but I've added a "mass" attribute to the Atom object defined
like this:

        self.mass = IUPACData.atom_weights[element]

I've added the ? to the atom_weights list as I thought it would deal with
the empty element cases.

I'd suggest adding to the first if statement a test to check if the element
string is empty and if so, treat it as None.

        if element is None or element == '':
            import warnings
            from PDBExceptions import PDBConstructionWarning
            warnings.warn("Atom object (name=%s) without element" % name,
                          PDBConstructionWarning)
            element = "?"
            print name, "--> ?"
        elif len(element)>2 or element != element.upper() or element != element.strip():
            raise ValueError(element)
        self.element=element

What do you think?

Best!

João [...]
Rodrigues @ http://doeidoei.wordpress.org

From biopython at maubp.freeserve.co.uk Wed Jun 23 05:11:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Jun 2010 10:11:06 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To:
References:
Message-ID:

On Tue, Jun 22, 2010 at 8:25 PM, João Rodrigues wrote:
> Hello all,
>
> I've been using some non-standard pdb files outputted by some programs and
> they miss the chemical element column in each ATOM line. I was looking at
> the PDBParser code and element is dealt with like this:
>
>        if element is None:
>            import warnings
>            from PDBExceptions import PDBConstructionWarning
>            warnings.warn("Atom object (name=%s) without element" % name,
>                          PDBConstructionWarning)
>            element = "?"
>            print name, "--> ?"
>        elif len(element)>2 or element != element.upper() or element != element.strip():
>            raise ValueError(element)
>        self.element=element
>
>
> In my case, the element line is not "None" but just an empty string - ' ' -
> which fails these tests and is then passed on.

That makes sense, since element=line[76:78].strip() will give an empty
string. A change as you suggest makes sense, but I think just using
"if element:" would be nicer.

Peter

From biopython at maubp.freeserve.co.uk Wed Jun 23 06:28:22 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Jun 2010 11:28:22 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
Message-ID:

Hi all,

From some unit test output posted by Manabu Ishii via Twitter I think the
test suite is having problems checking for external tools on non-English
operating systems (e.g.
Debian in Japanese):
http://d.hatena.ne.jp/manabou/20100619
http://twitter.com/manabou

I've tried to update a few to do a better job (test_Muscle_tool.py,
test_Clustalw_tool.py and test_Emboss.py), but what I really need is someone
to run the test suite on a non-English system - ideally without all these
command line tools installed. The tests should notice when the tool is
missing, and be skipped without errors.

Could anyone with a non-English OS try running the latest code from git (or
even the latest release) to see if you get similar problems?

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org Wed Jun 23 09:21:25 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 23 Jun 2010 09:21:25 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To:
Message-ID: <201006231321.o5NDLPm0017094@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102

------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-23 09:21 EST -------
Hi Giorgio,

Did copying the file again help?

In addition to trying to read the SFF files with other tools (like
sff_extract or the Roche sffinfo) as suggested, I have some additional things
you could try.

Firstly try this private function to see how many reads there should be:

    filename = r"C:\Users\Giorgio Casaburi\Desktop\sff\GIK1EHM01.sff"
    from Bio import SeqIO
    print SeqIO.SffIO._sff_file_header(open(filename, "rb"))[3]

Then compare this to the number of reads you could extract up until the
error.

Secondly, see if the index can be loaded or not:

    filename = r"C:\Users\Giorgio Casaburi\Desktop\sff\GIK1EHM01.sff"
    from Bio import SeqIO
    d = SeqIO.index(filename, "sff")
    print len(d)

If it is just one or two bad reads, this may allow you to jump to specific
records (and so avoid getting stuck on the bad ones).
Peter

From anaryin at gmail.com Wed Jun 23 12:52:47 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 23 Jun 2010 11:52:47 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To:
References:
Message-ID:

Ok, I've changed it in my local branch to "if not element" since that covers
both None and empty strings.

Best,

João [...] Rodrigues
@ http://doeidoei.wordpress.org

On Wed, Jun 23, 2010 at 4:11 AM, Peter wrote:
> On Tue, Jun 22, 2010 at 8:25 PM, João Rodrigues wrote:
> > Hello all,
> >
> > I've been using some non-standard pdb files outputted by some programs and
> > they miss the chemical element column in each ATOM line. I was looking at
> > the PDBParser code and element is dealt with like this:
> >
> >        if element is None:
> >            import warnings
> >            from PDBExceptions import PDBConstructionWarning
> >            warnings.warn("Atom object (name=%s) without element" % name,
> >                          PDBConstructionWarning)
> >            element = "?"
> >            print name, "--> ?"
> >        elif len(element)>2 or element != element.upper() or element != element.strip():
> >            raise ValueError(element)
> >        self.element=element
> >
> >
> > In my case, the element line is not "None" but just an empty string - ' ' -
> > which fails these tests and is then passed on.
>
> That makes sense, since element=line[76:78].strip() will give an empty
> string. A change as you suggest makes sense, but I think just using
> "if element:" would be nicer.
> > Peter > From biopython at maubp.freeserve.co.uk Thu Jun 24 04:26:50 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Jun 2010 09:26:50 +0100 Subject: [Biopython-dev] Parsing "element" out of PDB file In-Reply-To: References: Message-ID: On Wed, Jun 23, 2010 at 5:52 PM, João Rodrigues wrote: > Ok, I've changed it in my local branch to "if not element" since that covers > both None and empty strings. > > Best, > > João [...] Rodrigues > @ http://doeidoei.wordpress.org If you've done that little change as a single commit, then I can use git cherry-pick to apply it to the master branch. But first you need to push this work to github.com Peter From biopython at maubp.freeserve.co.uk Thu Jun 24 04:32:46 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Jun 2010 09:32:46 +0100 Subject: [Biopython-dev] Parsing "element" out of PDB file In-Reply-To: References: Message-ID: On Tue, Jun 22, 2010 at 8:25 PM, João Rodrigues wrote: > Hello all, > > I've been using some non-standard pdb files outputted by some programs and > they miss the chemical element column in each ATOM line. ... This would be no > problem at all, but I've added a "mass" attribute to the Atom object defined like this:
>
>     self.mass = IUPACData.atom_weights[element]
>
> I've added the ? to the atom_weights list as I thought it would deal with > the empty element cases. I wonder if using None or NAN would be better than zero here? Or just an exception. This is difficult for me to say without a better idea of what you will be using the atomic weights for. On a separate point, if you have an old fashioned PDB file without the element column, you can probably work out the element anyway. For example CA in a normal amino acid residue means the alpha carbon, so the element is carbon (although in a HETATM there is a possibility it is Calcium I think). So I think it would be possible to infer the element in many cases (but not all).
However, this is going to be a reasonable amount of work to write and test. How common is this kind of PDB file for the work you are doing - do many modelling packages omit the element? Have you contacted the program authors to request they include the element column in future? Peter From anaryin at gmail.com Thu Jun 24 12:36:36 2010 From: anaryin at gmail.com (João Rodrigues) Date: Thu, 24 Jun 2010 11:36:36 -0500 Subject: [Biopython-dev] Parsing "element" out of PDB file In-Reply-To: References: Message-ID: > > I wonder if using None or NAN would be better than zero here? Or just an > exception. This is difficult for me to say without a better idea of what > you will be using the atomic weights for. > Right now I'm just using them for the center of mass calculation. > > On a separate point, if you have an old fashioned PDB file without the > element column, you can probably work out the element anyway. For example CA in > a normal amino acid residue means the alpha carbon, so the element is > carbon (although in a HETATM there is a possibility it is Calcium I think). > So I think it would be possible to infer the element in many cases (but not > all). However, this is going to be a reasonable amount of work to write and > test. For non-HETATMs it's possible from the first letter of the atom name (or it is H if the first letter is a digit). For HETATMs, names match elements IIRC. Do you think it's worth the try? It shouldn't be hard to write and the cases where it would fail would be sporadic. > How common is this kind of PDB file for the work you are doing - do > many modelling packages omit the element? > Have you contacted the program authors to request they include the > element column in future? > Well... several packages do this, especially webservers. Contacting their authors wouldn't bring that many favourable answers IMO.
I've committed it here: http://github.com/JoaoRodrigues/biopython/commit/29f48e8f97870530520884fa6b8c9b70d87ba8bc I commented out the self.mass part since we're still working on it. Best, J From biopython at maubp.freeserve.co.uk Thu Jun 24 12:54:41 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Jun 2010 17:54:41 +0100 Subject: [Biopython-dev] Parsing "element" out of PDB file In-Reply-To: References: Message-ID: On Thu, Jun 24, 2010 at 5:36 PM, João Rodrigues wrote: >> >> I wonder if using None or NAN would be better than zero here? Or just an >> exception. This is difficult for me to say without a better idea of what >> you will be using the atomic weights for. >> > > Right now I'm just using them for the center of mass calculation. > Well if you don't know an atom's mass, you can't calculate the real center of mass. Maybe this should throw an exception? >> On a separate point, if you have an old fashioned PDB file without the >> element column, you can probably work out the element anyway. ... > > For non-HETATMs it's possible from the first letter of the atom name (or it > is H if the first letter is a digit). For HETATMs, names match elements > IIRC. > > Do you think it's worth the try? It shouldn't be hard to write and the cases > where it would fail would be sporadic. Eric - what do you think? >> How common is this kind of PDB file for the work you are doing - do >> many modelling packages omit the element? > > >> Have you contacted the program authors to request they include the >> element column in future? >> > > Well... several packages do this, especially webservers. Contacting their > authors wouldn't bring that many favourable answers IMO. I'd ask politely anyway ;) > I've committed it here: > http://github.com/JoaoRodrigues/biopython/commit/29f48e8f97870530520884fa6b8c9b70d87ba8bc > > I commented out the self.mass part since we're still working on it.
I've cherry-picked that for the trunk - could you test the master branch please (just to make sure this worked as you expected)? Thanks, Peter From eric.talevich at gmail.com Thu Jun 24 14:05:11 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 24 Jun 2010 14:05:11 -0400 Subject: [Biopython-dev] Parsing "element" out of PDB file In-Reply-To: References: Message-ID: On Thu, Jun 24, 2010 at 12:54 PM, Peter wrote: > On Thu, Jun 24, 2010 at 5:36 PM, João Rodrigues wrote: > >> > >> I wonder if using None or NAN would be better than zero here? Or just an > >> exception. This is difficult for me to say without a better idea of what > >> you will be using the atomic weights for. > >> > > > > Right now I'm just using them for the center of mass calculation. > > > > Well if you don't know an atom's mass, you can't calculate the real > center of mass. Maybe this should throw an exception? > And the center of mass calculation was for coarse-graining structures, right? What would be most useful there?
(a) Give unknown atoms a weight of 0.0, so CoM essentially disregards them
(b) Give unknown atoms a weight of None, and have CoM check for this and disregard those atoms (similar effect) -- preferably issuing a warning
(c) Like (b), but CoM raises an exception
(d) Give CoM a keyword argument for how to treat this (e.g. strict=True/False), so coarse-graining can be permissive but direct use of CoM can raise an exception if desired. (However, if warnings are used then the warnings module already lets you convert specific warnings into exceptions.)
> >> On a separate point, if you have an old fashioned PDB file without the > >> element column, you can probably work out the element anyway. ... > > > > For non-HETATMs it's possible from the first letter of the atom name (or it > > is H if the first letter is a digit). For HETATMs, names match elements > > IIRC. > > > > Do you think it's worth the try?
It shouldn't be hard to write and the cases > > where it would fail would be sporadic. > > Eric - what do you think? > Sounds useful to me. Where would it fail, and how should failures be treated? Unrecognized atom names, and then issue a warning and leave the element attribute blank? (See options above...) Cheers, Eric From anaryin at gmail.com Thu Jun 24 14:25:45 2010 From: anaryin at gmail.com (João Rodrigues) Date: Thu, 24 Jun 2010 13:25:45 -0500 Subject: [Biopython-dev] Parsing "element" out of PDB file In-Reply-To: References: Message-ID: > > And the center of mass calculation was for coarse-graining structures, > right? What would be most useful there?
> (a) Give unknown atoms a weight of 0.0, so CoM essentially disregards them
CoM takes the number of atoms into account, so 0.0 will not actually work anyway.
> (b) Give unknown atoms a weight of None, and have CoM check for this and > disregard those atoms (similar effect) -- preferably issuing a warning
I'd prefer this. Exclude atoms from the calculation. But then this might affect the location of the center of mass.
> (c) Like (b), but CoM raises an exception
> (d) Give CoM a keyword argument for how to treat this (e.g. > strict=True/False), so coarse-graining can be permissive but direct use of > CoM can raise an exception if desired. (However, if warnings are used then > the warnings module already lets you convert specific warnings into > exceptions.)
My suggestion. CoM can be either geometrical or mass-weighted. The first assumes equal mass for every atom, the second does not. If there's a mass that doesn't exist, the CoM would default to geometrical and issue a warning. Having a flag in CoM can also be valuable but I guess this would be redundant with the warning/exception (permissive/strict) in the Atom class. > > > >> On a separate point, if you have an old fashioned PDB file without the >> >> element column, you can probably work out the element anyway. ...
>> > >> > For non-HETATMs it's possible from the first letter of the atom name (or it >> > is H if the first letter is a digit). For HETATMs, names match elements >> > IIRC. >> > >> > Do you think it's worth the try? It shouldn't be hard to write and the cases >> > where it would fail would be sporadic. >> >> Eric - what do you think? >> > > Sounds useful to me. Where would it fail, and how should failures be > treated? Unrecognized atom names, and then issue a warning and leave the > element attribute blank? (See options above...) > I'd implement it in the Atom class. Instead of having this check (lines 75-76):

    elif len(element)>2 or element != element.upper() or element != element.strip():
        raise ValueError(element)

there would be a check against IUPACData.atom_weights.keys(). If the element is not found, then it would try to check the atom name and issue a warning. If this fails, an exception is thrown. Sounds good? Best! J From anaryin at gmail.com Thu Jun 24 16:25:23 2010 From: anaryin at gmail.com (João Rodrigues) Date: Thu, 24 Jun 2010 15:25:23 -0500 Subject: [Biopython-dev] Parsing "element" out of PDB file In-Reply-To: References: Message-ID: Ok, I was looking at the element attribution and there's a slight problem. I thought I could easily find out whether the atom comes from an ATOM or a HETATM record, but since the "parenting" of the Atom is only done *after* the Atom is created, there is no way (as is) of knowing where it comes from. Therefore, I thought of the following workaround. *hetero_flag* is already defined when the Atom is created. It could be passed to the Atom as another of its arguments. It would then be a conditional like this inside the Atom class:

    if not element or element not in IUPACData.atom_weights:
        if hetatm:
            if atom.name in IUPACData.atom_weights:
                element = atom.name
            else:
                element = "?"
        else:  # Not HETATM
            t_element = atom.name[0] if not atom.name[0].isdigit() else atom.name[1]
            if t_element in IUPACData.atom_weights:
                element = t_element
            else:
                element = "?"
    else:  # Has element and it is in IUPACData.atom_weights
        element = element

The advantage is that if you don't give an element, or if it fails the IUPACData check, it will try to recover it from the atom name. It also makes it possible to throw an exception when the element is not found. Or a warning, since for now only the CoM function uses it and it has a failsafe against it (defaults to geometrical). Opinions? João From bugzilla-daemon at portal.open-bio.org Fri Jun 25 07:49:35 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 25 Jun 2010 07:49:35 -0400 Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing In-Reply-To: Message-ID: <201006251149.o5PBnZpA007121@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2738 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1327 is|0                           |1
           obsolete|                            |

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Jun 25 07:51:16 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 25 Jun 2010 07:51:16 -0400 Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing In-Reply-To: Message-ID: <201006251151.o5PBpGE9007286@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2738 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1329 is|0                           |1
           obsolete|                            |

------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-25 07:51 EST ------- (From update of attachment 1329) I've got a branch using regular expressions which seems to cover all the location strings I've found in testing. It is at least twice the speed of the old parser. http://github.com/peterjc/biopython/tree/location-parsing2 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Fri Jun 25 11:21:46 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Jun 2010 16:21:46 +0100 Subject: [Biopython-dev] Re-written GenBank/EMBL feature location parsing Message-ID: Hi all, I've been working on and off recently on rewriting the location parsing for GenBank/EMBL features: http://bugzilla.open-bio.org/show_bug.cgi?id=2738 I have a branch ready for public testing, http://github.com/peterjc/biopython/commits/location-parsing2 The old code is still there (and indeed right now gets used as a fallback with a warning if an unrecognised location is seen). I'd like to label it (plus Bio.Parsers and Bio.Parsers.spark) as obsolete for the next release, and then deprecate them in the subsequent release.
The old code takes each location string, parses it with SPARK and generates a set of token objects for each element (see the code in Bio.GenBank.LocationParser) and then turns that into SeqFeature location and position objects. All this object creation is probably a major reason why the old code is slow. The new code takes each location string, and parses it with a mix of regular expressions and simple Python code, and then builds the SeqFeature location and position objects. On my tests this is at least twice as fast, typically between three and four times faster. The intention is this parser change will result in no functional changes at all. As part of this work I have been extending the feature unit tests, and have also run some more extensive additional tests locally (GenBank files for plants, viruses, environmental samples etc). I'm reasonably sure this covers all the location variants... but with GenBank and EMBL files you can never be sure ;) Would anyone like to volunteer to test the new branch before I merge it to the trunk? I'm also interested in comments on the code itself. Note I have tried to avoid any refactoring until the old code is actually deprecated. 
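To give a flavour of the "regular expressions plus simple Python" approach described above, here is a small illustrative sketch. It is not the code from the location-parsing2 branch; the names are made up, it only handles plain "start..end" ranges wrapped in complement(...) and a single non-nested join(...), and it returns plain tuples rather than SeqFeature location objects:

```python
import re

# Hypothetical patterns for illustration only.
_SIMPLE = re.compile(r"^(\d+)\.\.(\d+)$")
_COMPLEMENT = re.compile(r"^complement\((.+)\)$")
_JOIN = re.compile(r"^join\((.+)\)$")


def parse_location(text, strand=1):
    """Parse a simple GenBank/EMBL style location string.

    Returns a list of (start, end, strand) tuples, where start is
    converted to 0-based counting and end is left as given, so that
    each pair can be used as a Python slice.
    """
    text = text.strip()
    match = _COMPLEMENT.match(text)
    if match:
        # Flip the strand and parse whatever is inside the brackets.
        return parse_location(match.group(1), -strand)
    match = _JOIN.match(text)
    if match:
        parts = []
        # Naive split: nested operators containing commas are not handled.
        for piece in match.group(1).split(","):
            parts.extend(parse_location(piece, strand))
        return parts
    match = _SIMPLE.match(text)
    if match:
        return [(int(match.group(1)) - 1, int(match.group(2)), strand)]
    raise ValueError("Unsupported location string: %r" % text)
```

Anything the patterns do not recognise raises ValueError here, which is roughly where a fallback to an older, slower parser could be hooked in.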
Thanks, Peter From bugzilla-daemon at portal.open-bio.org Fri Jun 25 13:46:14 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 25 Jun 2010 13:46:14 -0400 Subject: [Biopython-dev] [Bug 3103] New: Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3103

           Summary: Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
           Product: Biopython
           Version: 1.54
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: minor
          Priority: P5
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: vimalkumarvelayudhan at gmail.com

I created an RPM recently for Biopython version 1.54 and got this error from rpmlint:

    python-biopython.i586: W: unable-to-read-zip /usr/share/python-biopython/Tests/PhyloXML/ncbi_taxonomy_mollusca.xml.zip: Bad magic number for central directory

This appears for both the .tar.gz and the .zip version. I could do a manual unzip of the file though. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jun 27 11:31:11 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 27 Jun 2010 11:31:11 -0400 Subject: [Biopython-dev] [Bug 3103] Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML In-Reply-To: Message-ID: <201006271531.o5RFVBTP001043@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3103 ------- Comment #1 from eric.talevich at gmail.com 2010-06-27 11:31 EST ------- Interesting. Where did you get this release of Biopython 1.54? From PyPI, or GitHub? I downloaded this file from phyloxml.org originally, and haven't changed it.
This file is used in the unit tests, and Python's zipfile library doesn't seem to have any trouble opening it. The 'file' command on Ubuntu 10.04 identifies it as: "Zip archive data, at least v2.0 to extract" It's actually not a very important part of the unit tests anyway, so if it's causing you trouble, I could give you a patch to remove this file from the unit tests. (If you're taking patches, there's a bug in Bio.Phylo's Nexus parsing that I'd like to include a fix for, too. It's fixed in Biopython's trunk already, but slipped past our release process for v.1.54.) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jun 27 12:45:28 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 27 Jun 2010 12:45:28 -0400 Subject: [Biopython-dev] [Bug 3103] Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML In-Reply-To: Message-ID: <201006271645.o5RGjSBd019564@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3103 ------- Comment #2 from vimalkumarvelayudhan at gmail.com 2010-06-27 12:45 EST ------- The archives were downloaded from http://biopython.org/DIST/biopython-1.54.tar.gz http://biopython.org/DIST/biopython-1.54.zip I could remove the zip file during the build process and can also patch the Phylo.Nexus for the next release if you could forward it to me. (In reply to comment #1) > Interesting. Where did you get this release of Biopython 1.54? From PyPI, or > GitHub? > > I downloaded this file from phyloxml.org originally, and haven't changed it. > This file is used in the unit tests, and Python's zipfile library doesn't seem > to have any trouble opening it. 
The 'file' command on Ubuntu 10.04 identifies > it as: > "Zip archive data, at least v2.0 to extract" > > It's actually not a very important part of the unit tests anyway, so if it's > causing you trouble, I could give you a patch to remove this file from the unit > tests. > > (If you're taking patches, there's a bug in Bio.Phylo's Nexus parsing that I'd > like to include a fix for, too. It's fixed in Biopython's trunk already, but > slipped past our release process for v.1.54.) > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sun Jun 27 18:21:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 27 Jun 2010 23:21:43 +0100 Subject: [Biopython-dev] Running unit tests on non-English OS In-Reply-To: References: Message-ID: On Wed, Jun 23, 2010 at 11:28 AM, Peter wrote: > Hi all, > > From some unit test output posted by Manabu Ishii via Twitter I > think the test suite is having problems checking for external tools > on non-English operating systems (e.g. Debian in Japanese): > http://d.hatena.ne.jp/manabou/20100619 > http://twitter.com/manabou > > I've tried to update a few to do a better job (test_Muscle_tool.py, > test_Clustalw_tool.py and test_Emboss.py), but what I really need > is someone to run the test suite on a non English system - ideally > without all these command line tools installed. The tests should > notice when the tool is missing, and be skipped without errors. > > Could anyone with a non-English OS try running the latest code > from git (or even the latest release) to see if you get similar > problems? I've also included an idea from Manabu Ishii to set environment variable LANG=C to get the default of USA English. This should work on Linux etc, and is probably harmless on Windows. 
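The LANG=C idea can be sketched roughly as follows. This is only an illustration with a made-up helper name (run_with_c_locale), not the code used in the test suite: it copies the current environment, overrides LANG, and runs the external tool so that any "command not found" style messages come back in English.

```python
import os
import subprocess
import sys


def run_with_c_locale(args):
    """Run a command with LANG=C and return (returncode, stdout, stderr)."""
    env = dict(os.environ)  # copy, so the parent process is left untouched
    env["LANG"] = "C"
    child = subprocess.Popen(
        args,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # return text rather than bytes
    )
    stdout, stderr = child.communicate()
    return child.returncode, stdout, stderr


if __name__ == "__main__":
    # Ask a child Python process which LANG it was given.
    code, out, err = run_with_c_locale(
        [sys.executable, "-c", "import os; print(os.environ['LANG'])"]
    )
    print(out.strip())
```

One caveat: on POSIX systems LC_ALL (and LC_MESSAGES) take precedence over LANG, so if a tester has those set the tool may still answer in the local language.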
Again, testing would be most welcome (any non-English OS), Thanks Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 28 08:23:25 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 28 Jun 2010 08:23:25 -0400 Subject: [Biopython-dev] [Bug 3103] Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML In-Reply-To: Message-ID: <201006281223.o5SCNPog015539@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3103 ------- Comment #3 from eric.talevich at gmail.com 2010-06-28 08:23 EST ------- Created an attachment (id=1517) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1517&action=view) Patch to remove ncbi_xml_mollusca.xml.zip from the Phylo unit test This patch should fix the problem reported in Bug 3103. Created with git format-patch. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 28 08:25:20 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 28 Jun 2010 08:25:20 -0400 Subject: [Biopython-dev] [Bug 3103] Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML In-Reply-To: Message-ID: <201006281225.o5SCPKo9015639@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3103 ------- Comment #4 from eric.talevich at gmail.com 2010-06-28 08:25 EST ------- Created an attachment (id=1518) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1518&action=view) Patch to fix a bug in NexusIO This patch fixes another bug in NexusIO, parsing the support values on branches. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From k.okonechnikov at gmail.com Mon Jun 28 13:55:30 2010 From: k.okonechnikov at gmail.com (Konstantin Okonechnikov) Date: Tue, 29 Jun 2010 00:55:30 +0700 Subject: [Biopython-dev] Running unit tests on non-English OS In-Reply-To: References: Message-ID: Peter, I have built and run the latest code from git on Russian Ubuntu 10.4. Entrez tests have failed. Muscle, clustal and emboss tests have been skipped successfully. The tests have been executed from build.py script and I am not sure how to generate test report. Redirecting the script output to file didn't help. On Mon, Jun 28, 2010 at 5:21 AM, Peter wrote: > On Wed, Jun 23, 2010 at 11:28 AM, Peter > wrote: > > Hi all, > > > > From some unit test output posted by Manabu Ishii via Twitter I > > think the test suite is having problems checking for external tools > > on non-English operating systems (e.g. Debian in Japanese): > > http://d.hatena.ne.jp/manabou/20100619 > > http://twitter.com/manabou > > > > I've tried to update a few to do a better job (test_Muscle_tool.py, > > test_Clustalw_tool.py and test_Emboss.py), but what I really need > > is someone to run the test suite on a non English system - ideally > > without all these command line tools installed. The tests should > > notice when the tool is missing, and be skipped without errors. > > > > Could anyone with a non-English OS try running the latest code > > from git (or even the latest release) to see if you get similar > > problems? > > I've also included an idea from Manabu Ishii to set environment > variable LANG=C to get the default of USA English. This should > work on Linux etc, and is probably harmless on Windows. 
> > Again, testing would be most welcome (any non-English OS), > > Thanks > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Best regards, Konstantin From biopython at maubp.freeserve.co.uk Tue Jun 29 05:57:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 29 Jun 2010 10:57:27 +0100 Subject: [Biopython-dev] Running unit tests on non-English OS In-Reply-To: References: Message-ID: On Mon, Jun 28, 2010 at 6:55 PM, Konstantin Okonechnikov wrote: > Peter, > I have built and run the latest code from git on Russian Ubuntu 10.4. Thank you, > Entrez tests have failed. That can happen due to network problems. I'd like to see the error though. > Muscle, clustal and emboss tests have been skipped successfully. Good :) > The tests have been executed from build.py script and I am not sure how to > generate test report. Redirecting the script output to file didn't help. I normally just run "python setup.py test" from the source directory or "python run_tests.py" from the Tests subdirectory at the terminal, and copy and paste the interesting bits of the output. 
If you want to capture the test output to a file, you should probably redirect both stdout and stderr: python run_tests.py &> output.txt Regards, Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 29 15:08:45 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 29 Jun 2010 15:08:45 -0400 Subject: [Biopython-dev] [Bug 3103] Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML In-Reply-To: Message-ID: <201006291908.o5TJ8j66032031@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3103 vimalkumarvelayudhan at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from vimalkumarvelayudhan at gmail.com 2010-06-29 15:08 EST ------- Thank you. RPMs packaged with patches applied and can be found at http://download.opensuse.org/repositories/science:/vlinux/ -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From k.okonechnikov at gmail.com Tue Jun 29 23:27:20 2010 From: k.okonechnikov at gmail.com (Konstantin Okonechnikov) Date: Wed, 30 Jun 2010 10:27:20 +0700 Subject: [Biopython-dev] Running unit tests on non-English OS In-Reply-To: References: Message-ID: Peter, actually the problems with Entrez tools are Unicode related. I suppose, that the test failures are related with the current working dir path: it contains a non-English word in it, thus it can not be represented as an ascii string. Also there are similar problems with Genbank to Sql tests. Please, see the error-log attached. On Tue, Jun 29, 2010 at 4:57 PM, Peter wrote: > On Mon, Jun 28, 2010 at 6:55 PM, Konstantin Okonechnikov > wrote: > > Peter, > > I have built and run the latest code from git on Russian Ubuntu 10.4. > > Thank you, > > > Entrez tests have failed. 
>
> That can happen due to network problems. I'd like to see the error though.
>
> > Muscle, clustal and emboss tests have been skipped successfully.
>
> Good :)
>
> > The tests have been executed from build.py script and I am not sure how to
> > generate test report. Redirecting the script output to file didn't help.
>
> I normally just run "python setup.py test" from the source directory or
> "python run_tests.py" from the Tests subdirectory at the terminal, and
> copy and paste the interesting bits of the output.
>
> If you want to capture the test output to a file, you should probably redirect
> both stdout and stderr:
>
> python run_tests.py &> output.txt
>
> Regards,
>
> Peter
>

--
Best regards,
Konstantin
-------------- next part --------------
running test
test_Ace ... ok
test_AlignIO ... ok
test_AlignIO_convert ... ok
test_BioSQL ... FAIL
test_BioSQL_SeqIO ... /home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/Loader.py:797: UserWarning: order location operators are not fully supported
  % feature.location_operator)
/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/Loader.py:797: UserWarning: bond location operators are not fully supported
  % feature.location_operator)
ok
test_CAPS ... ok
test_Clustalw ... ok
test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you want to use Bio.Clustalw.
test_Cluster ... ok
test_CodonTable ... ok
test_CodonUsage ... ok
test_Compass ... ok
test_Crystal ... ok
test_Dialign_tool ... skipping. Install DIALIGN2-2 if you want to use the Bio.Align.Applications wrapper.
test_DocSQL ... skipping. Install MySQLdb if you want to use Bio.DocSQL.
test_Emboss ... skipping. Install EMBOSS if you want to use Bio.Emboss.
test_EmbossPhylipNew ... skipping. Install the Emboss package 'PhylipNew' if you want to use the Bio.Emboss.Applications wrappers for phylogenetic tools.
test_EmbossPrimer ... ok
test_Entrez ... FAIL
test_Enzyme ... ok
test_FSSP ... ok
test_Fasta ... ok
test_File ... ok
test_GACrossover ... ok
test_GAMutation ... ok
test_GAOrganism ... ok
test_GAQueens ... ok
test_GARepair ... ok
test_GASelection ... ok
test_GFF ... skipping. Environment is not configured for this test (not important if you do not plan to use Bio.GFF).
test_GFF2 ... skipping. Install MySQLdb if you want to use Bio.GFF.
test_GenBank ... ok
test_GenomeDiagram ... skipping. Install reportlab if you want to use Bio.Graphics.
test_GraphicsBitmaps ... skipping. Install ReportLab if you want to use Bio.Graphics.
test_GraphicsChromosome ... skipping. Install reportlab if you want to use Bio.Graphics.
test_GraphicsDistribution ... skipping. Install reportlab if you want to use Bio.Graphics.
test_GraphicsGeneral ... skipping. Install reportlab if you want to use Bio.Graphics.
test_HMMCasino ... ok
test_HMMGeneral ... ok
test_HotRand ... ok
test_IsoelectricPoint ... ok
test_KDTree ... ok
test_KEGG ... ok
test_KeyWList ... ok
test_Location ... ok
test_LocationParser ... ok
test_LogisticRegression ... ok
test_MEME ... ok
test_Mafft_tool ... skipping. Install MAFFT if you want to use the Bio.Align.Applications wrapper.
test_MarkovModel ... ok
test_Medline ... ok
test_Motif ... ok
test_Muscle_tool ... skipping. Install MUSCLE if you want to use the Bio.Align.Applications wrapper.
test_NCBIStandalone ... ok
test_NCBITextParser ... ok
test_NCBIXML ... ok
test_NCBI_BLAST_tools ... skipping. Install the NCBI BLAST+ command line tools if you want to use the Bio.Blast.Applications wrapper.
test_NCBI_qblast ... ok
test_NNExclusiveOr ... ok
test_NNGene ... ok
test_NNGeneral ... ok
test_Nexus ... ok
test_PDB ... ok
test_ParserSupport ... ok
test_Pathway ... ok
test_Phd ... ok
test_Phylo ... ok
test_PhyloXML ... ok
test_Phylo_depend ... skipping. Install NetworkX if you want to use Bio.Phylo._utils.
test_PopGen_FDist ... skipping. Install FDist if you want to use Bio.PopGen.FDist.
test_PopGen_FDist_nodepend ... ok
test_PopGen_GenePop ... skipping. Install GenePop if you want to use Bio.PopGen.GenePop.
test_PopGen_GenePop_EasyController ... skipping. Install GenePop if you want to use Bio.PopGen.GenePop.
test_PopGen_GenePop_nodepend ... ok
test_PopGen_SimCoal ... skipping. Install SIMCOAL2 if you want to use Bio.PopGen.SimCoal.
test_PopGen_SimCoal_nodepend ... ok
test_Prank_tool ... skipping. Install PRANK if you want to use the Bio.Align.Applications wrapper.
test_Probcons_tool ... skipping. Install PROBCONS if you want to use the Bio.Align.Applications wrapper.
test_ProtParam ... ok
test_Restriction ... ok
test_SCOP_Astral ... ok
test_SCOP_Cla ... ok
test_SCOP_Des ... ok
test_SCOP_Dom ... ok
test_SCOP_Hie ... ok
test_SCOP_Raf ... ok
test_SCOP_Residues ... ok
test_SCOP_Scop ... ok
test_SVDSuperimposer ... ok
test_SeqIO ... ok
test_SeqIO_FastaIO ... ok
test_SeqIO_QualityIO ... ok
test_SeqIO_convert ... ok
test_SeqIO_features ... ok
test_SeqIO_index ... ok
test_SeqIO_online ... ok
test_SeqRecord ... ok
test_SeqUtils ... ok
test_Seq_objs ... ok
test_SubsMat ... ok
test_SwissProt ... ok
test_TCoffee_tool ... skipping. Install TCOFFEE if you want to use the Bio.Align.Applications wrapper.
test_UniGene ... ok
test_UniGene_obsolete ... ok
test_Wise ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise.
test_align ... ok
test_geo ... ok
test_interpro ... ok
test_kNN ... ok
test_lowess ... ok
test_pairwise2 ... ok
test_prodoc ... ok
test_property_manager ... ok
test_prosite1 ... ok
test_prosite2 ... ok
test_prosite_patterns ... ok
test_psw ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise.
test_seq ... ok
test_translate ... ok
test_trie ... ok
test_triefind ... ok
Bio.Application docstring test ... ok
Bio.Seq docstring test ... ok
Bio.SeqFeature docstring test ... ok
Bio.SeqRecord docstring test ... ok
Bio.SeqIO docstring test ... ok
Bio.SeqIO.AceIO docstring test ... ok
Bio.SeqIO.PhdIO docstring test ... ok
Bio.SeqIO.QualityIO docstring test ... ok
Bio.SeqIO.SffIO docstring test ... ok
Bio.SeqUtils docstring test ... ok
Bio.Align docstring test ... ok
Bio.Align.Generic docstring test ... ok
Bio.AlignIO docstring test ... ok
Bio.AlignIO.StockholmIO docstring test ... ok
Bio.Blast.Applications docstring test ... ok
Bio.Clustalw docstring test ... ok
Bio.Emboss.Applications docstring test ... ok
Bio.KEGG.Compound docstring test ... ok
Bio.KEGG.Enzyme docstring test ... ok
Bio.Wise docstring test ... FAIL
Bio.Wise.psw docstring test ... ok
Bio.Motif docstring test ... ok
Bio.Statistics.lowess docstring test ... ok

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, NC_000932.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 423, in test_NC_000932
    self.loop(os.path.join(os.getcwd(), "GenBank", "NC_000932.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, NC_005816.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 419, in test_NC_005816
    self.loop(os.path.join(os.getcwd(), "GenBank", "NC_005816.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, NT_019265.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 427, in test_NT_019265
    self.loop(os.path.join(os.getcwd(), "GenBank", "NT_019265.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, arab1.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 447, in test_arab1
    self.loop(os.path.join(os.getcwd(), "GenBank", "arab1.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, cor6_6.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 443, in test_cor6_6
    self.loop(os.path.join(os.getcwd(), "GenBank", "cor6_6.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, noref.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 435, in test_no_ref
    self.loop(os.path.join(os.getcwd(), "GenBank", "noref.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, one_of.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 439, in test_one_of
    self.loop(os.path.join(os.getcwd(), "GenBank", "one_of.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, protein_refseq2.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 431, in test_protein_refseq2
    self.loop(os.path.join(os.getcwd(), "GenBank", "protein_refseq2.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, NC_000932.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 496, in test_NC_000932
    self.trans(os.path.join(os.getcwd(), "GenBank", "NC_000932.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, NC_005816.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 492, in test_NC_005816
    self.trans(os.path.join(os.getcwd(), "GenBank", "NC_005816.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, NT_019265.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 500, in test_NT_019265
    self.trans(os.path.join(os.getcwd(), "GenBank", "NT_019265.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, arab1.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 520, in test_arab1
    self.trans(os.path.join(os.getcwd(), "GenBank", "arab1.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, cor6_6.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 516, in test_cor6_6
    self.trans(os.path.join(os.getcwd(), "GenBank", "cor6_6.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, noref.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 508, in test_no_ref
    self.trans(os.path.join(os.getcwd(), "GenBank", "noref.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, one_of.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 512, in test_one_of
    self.trans(os.path.join(os.getcwd(), "GenBank", "one_of.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, protein_refseq2.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 504, in test_protein_refseq2
    self.trans(os.path.join(os.getcwd(), "GenBank", "protein_refseq2.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: Test parsing XML returned by EFetch, Journals database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3451, in test_journals
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, Nucleotide database (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3893, in test_nucleotide1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, Protein database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 4045, in test_nucleotide2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, OMIM database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3607, in test_omim
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, PubMed database (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3034, in test_pubmed1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, PubMed database (second test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3237, in test_pubmed2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, Taxonomy database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3784, in test_taxonomy
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML output returned by EGQuery (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2706, in test_egquery1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML output returned by EGQuery (second test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2858, in test_egquery2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing database list returned by EInfo
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 26, in test_list
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing database info returned by EInfo
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 72, in test_pubmed
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing cancerchromosomes links returned by ELink
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2690, in test_cancerchromosomes
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing medline indexed articles returned by ELink
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 1965, in test_medline
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing Nucleotide to Protein links returned by ELink
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 1239, in test_nucleotide
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 934, in test_pubmed1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (second test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 1253, in test_pubmed2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed link returned by ELink (third test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2404, in test_pubmed3
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (fourth test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2431, in test_pubmed4
    record
= Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing pubmed links returned by ELink (fifth test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 2499, in test_pubmed5 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing pubmed links returned by ELink (sixth test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 2669, in test_pubmed6 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File 
"/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by EPost ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 535, in test_epost record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by EPost with an invalid id (overflow tag) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 553, in test_invalid record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File 
"/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by EPost with incorrect arguments ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 545, in test_wrong self.assertRaises(RuntimeError, Entrez.read, handle) File "/usr/lib/python2.6/unittest.py", line 336, in failUnlessRaises callableObj(*args, **kwargs) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch from the Journals database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 322, in test_journals record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File 
"/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch when no items were found ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 502, in test_notfound record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch from the Nucleotide database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 444, in test_nucleotide record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = 
os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch from PubMed Central ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 366, in test_pmc record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch from the Protein database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 479, in test_protein record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 
0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch from PubMed (first test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 107, in test_pubmed1 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch from PubMed (second test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 136, in test_pubmed2 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by 
ESearch from PubMed (third test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 289, in test_pubmed3 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML output returned by ESpell ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 3013, in test_espell record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from the Journals database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 653, 
in test_journals record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from the Nucleotide database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 766, in test_nucleotide record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from the Protein database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 727, in test_protein record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in 
read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from PubMed ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 576, in test_pubmed record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from the Structure database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 805, in test_structure record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read 
self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from the Taxonomy database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 855, in test_taxonomy record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from the UniSTS database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 895, in test_unists record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler 
path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary with incorrect arguments ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 921, in test_wrong self.assertRaises(RuntimeError, Entrez.read, handle) File "/usr/lib/python2.6/unittest.py", line 336, in failUnlessRaises callableObj(*args, **kwargs) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== FAIL: Doctest: Bio.Wise._build_align_cmdline ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib/python2.6/doctest.py", line 2152, in runTest raise self.failureException(self.format_failure(new.getvalue())) AssertionError: Failed doctest test for Bio.Wise._build_align_cmdline File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Wise/__init__.py", line 23, in _build_align_cmdline ---------------------------------------------------------------------- File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Wise/__init__.py", line 26, in Bio.Wise._build_align_cmdline Failed 
example:
    _build_align_cmdline(["dnal"], ("seq1.fna", "seq2.fna"), "/tmp/output", kbyte=100000)
Expected:
    'dnal -kbyte 100000 seq1.fna seq2.fna > /tmp/output'
Got:
    'dnal -kbyte 100000 -quiet seq1.fna seq2.fna > /tmp/output'
----------------------------------------------------------------------
File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Wise/__init__.py", line 28, in Bio.Wise._build_align_cmdline
Failed example:
    _build_align_cmdline(["psw"], ("seq1.faa", "seq2.faa"), "/tmp/output_aa")
Expected:
    'psw -kbyte 300000 seq1.faa seq2.faa > /tmp/output_aa'
Got:
    'psw -kbyte 300000 -quiet seq1.faa seq2.faa > /tmp/output_aa'

----------------------------------------------------------------------
Ran 144 tests in 192.676 seconds

FAILED (failures = 3)

From biopython at maubp.freeserve.co.uk Wed Jun 30 06:19:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 11:19:19 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To:
References:
Message-ID:

On Wed, Jun 30, 2010 at 4:27 AM, Konstantin Okonechnikov wrote:
> Peter,
> actually the problems with Entrez tools are Unicode related.
> I suppose, that the test failures are related with the current working dir
> path: it contains a non-English word in it, thus it can not be represented
> as an ascii string.
> Also there are similar problems with Genbank to Sql tests.
>
> Please, see the error-log attached.

Thank you for the error log. Yes, there do seem to be problems
with having the source code under a unicode path. Could you
try moving the folder from /home/okko/??????/biopython to
/home/okko/biopython and repeat the test? That would help
confirm this hypothesis.
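
The hypothesis can be checked in isolation with a small sketch. The real directory name is lost in the log (it shows only as "??????"), so the six-letter Cyrillic name below is purely hypothetical, as is the helper name first_non_ascii. The sketch only shows that a UTF-8 encoded path of this shape cannot be decoded as ASCII, and where the decode stops. Presumably Python 2 attempts the same implicit ASCII decode when os.path.join mixes a byte string with a unicode string, which would explain the traceback.

```python
# Hypothetical source directory: the actual name is unknown, so a six-letter
# Cyrillic word (written with escapes) stands in for the "??????" in the log.
SOURCE_DIR = u"/home/okko/\u0440\u0430\u0431\u043e\u0442\u0430/biopython"


def first_non_ascii(raw):
    """Return (position, byte value) of the first byte ASCII cannot decode.

    Returns None when the whole byte string is plain ASCII.
    """
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as err:
        # err.start is the index of the offending byte.
        return err.start, bytearray(raw)[err.start]
    return None


# The path as the operating system hands it over: UTF-8 encoded bytes.
raw_path = SOURCE_DIR.encode("utf-8")
position, byte = first_non_ascii(raw_path)
print("byte 0x%02x at position %d" % (byte, position))
# prints "byte 0xd1 at position 11"
```

The prefix "/home/okko/" is 11 bytes long, so position 11 is the first byte of the directory name, which is consistent with the error message in the log.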
Peter

From biopython at maubp.freeserve.co.uk Wed Jun 30 08:47:14 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 13:47:14 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To:
References:
Message-ID:

On Wed, Jun 30, 2010 at 11:19 AM, Peter wrote:
> On Wed, Jun 30, 2010 at 4:27 AM, Konstantin Okonechnikov wrote:
>> Peter,
>> actually the problems with Entrez tools are Unicode related.
>> I suppose, that the test failures are related with the current working dir
>> path: it contains a non-English word in it, thus it can not be represented
>> as an ascii string.
>> Also there are similar problems with Genbank to Sql tests.
>>
>> Please, see the error-log attached.
>
> Thank you for the error log. Yes, there do seem to be problems
> with having the source code under a unicode path. Could you
> try moving the folder from /home/okko/??????/biopython to
> /home/okko/biopython and repeat the test? That would help
> confirm this hypothesis.

I created a similar directory name on my (English) version of
Mac OS X, and get the same Entrez failure.

Peter

From biopython at maubp.freeserve.co.uk Wed Jun 30 09:05:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 14:05:53 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To:
References:
Message-ID:

On Wed, Jun 30, 2010 at 1:47 PM, Peter wrote:
>
> I created a similar directory name on my (English) version of
> Mac OS X, and get the same Entrez failure.
>

Hi Konstantin,

Could you retest using the latest code from github? I hope that now
test_Entrez.py will work for you.
Thanks,

Peter

From biopython at maubp.freeserve.co.uk Wed Jun 30 09:31:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 14:31:58 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To:
References:
Message-ID:

On Wed, Jun 30, 2010 at 2:05 PM, Peter wrote:
> On Wed, Jun 30, 2010 at 1:47 PM, Peter wrote:
>>
>> I created a similar directory name on my (English) version of
>> Mac OS X, and get the same Entrez failure.
>>
>
> Hi Konstantin,
>
> Could you retest using the latest code from github? I hope that now
> test_Entrez.py will work for you.

The second update should also fix test_BioSQL.py as well.

Peter

From biopython at maubp.freeserve.co.uk Wed Jun 30 10:24:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 15:24:57 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To:
References:
Message-ID:

On Wed, Jun 30, 2010 at 2:59 PM, Konstantin Okonechnikov wrote:
> The fixes work!
> Only one test fails, but it doesn't look related to non-English OS
> problems. I've attached the new test log.

Great :)

I hadn't done anything about the Bio.Wise docstring test failure yet,
but it isn't linked to the non-English OS at all. I'll start a new thread...
Peter

From bugzilla-daemon at portal.open-bio.org Wed Jun 30 11:22:16 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 30 Jun 2010 11:22:16 -0400
Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing
In-Reply-To:
Message-ID: <201006301522.o5UFMGvo028548@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2738

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-30 11:22 EST -------
I've merged my github branch into the master. Marking as fixed.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Wed Jun 30 11:23:12 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 16:23:12 +0100
Subject: [Biopython-dev] Re-written GenBank/EMBL feature location parsing
In-Reply-To:
References:
Message-ID:

On Fri, Jun 25, 2010 at 4:21 PM, Peter wrote:
> Hi all,
>
> I've been working on and off recently on rewriting the location
> parsing for GenBank/EMBL features:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2738
>
> I have a branch ready for public testing, ... Would anyone like
> to volunteer to test the new branch before I merge it to the trunk?

I've just merged it - testing and feedback still welcome of course.
Peter From biopython at maubp.freeserve.co.uk Wed Jun 30 10:38:59 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 30 Jun 2010 15:38:59 +0100 Subject: [Biopython-dev] Running unit tests on non-English OS In-Reply-To: References: Message-ID: On Wed, Jun 30, 2010 at 3:24 PM, Peter wrote: > On Wed, Jun 30, 2010 at 2:59 PM, Konstantin Okonechnikov > wrote: >> The fixes work! >> Only one test fails, but it doesn't look related to non-English OS >> problems. I've attached the new test log. > > Great :) > > I hadn't done anything about the Bio.Wise docstring test failure yet, > but it isn't linked to the non-English OS at all. I'll start a new thread... > Solved. The doctest was working UNLESS the test output was being sent to a file. Peter From eric.talevich at gmail.com Tue Jun 1 03:44:11 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 31 May 2010 23:44:11 -0400 Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements In-Reply-To: References: Message-ID: On Mon, May 31, 2010 at 11:53 AM, Peter wrote: > On Mon, May 31, 2010 at 4:38 PM, Eric Talevich > wrote: > > Hi all, > > > > This summer our GSoC student João Rodrigues will be implementing a number of > > enhancements to Biopython's structural biology modules. Since Bio.PDB is one > > of the most widely used parts of Biopython, I'd like to find a way to > > let João add major new features without breaking existing code and > > documentation. > > > > There are a few issues I'd like to address: > > > > 1. The I/O conventions of parse/read/write/convert seem to work very well in > > SeqIO, AlignIO, Phylo, and other Biopython sub-packages. Bio.PDB supports > > I/O in several formats, but the API is lower-level and isn't unified in the > > same way (yet). > > Currently Bio.PDB supports the plain text PDB format, and has partial > support for mmCIF. It lacks support for the XML PDB format, PDBML - > Protein Data Bank Markup Language.
> Yeah, it would be good to implement that at some point. For now, I'd be happy to be able to read and write PDB files with a single function call each, and design the I/O wrapper for easy extension to mmCIF and PDBML. > Under this proposed scheme, what would you see as the basic record type > (analogous to a SeqRecord, alignment or tree in Bio.SeqIO, Bio.AlignIO and > Bio.Phylo)? It would be nice to say a protein chain, but there is the issue of > multiple models (e.g. from NMR). I presume you'd go with the model as the > basic unit (where each model may contain multiple chains). > I'd consider a structure to be the basic unit of I/O. If we're going to make better use of header info, that's generally associated with the whole structure and not individual models -- we'd have to duplicate the header info in each Model object emitted, which would be weird. Are there any formats that store more than one structure in a file? If not, then there's probably no need for a parse() function in Bio.Struct. > > from Bio.Struct import WHATIF, Jpred > > # Servers each get their own module > > Hmm - perhaps we may need have another level here, Bio.Struct.Servers > or Bio.Struct.WWW or something. How many of these do you expect? > João's project plan includes Dali and WHATIF: http://biopython.org/wiki/GSOC2010_Joao These servers do different things so I wouldn't expect any similarity in the code between them. There are lots of servers that we *could* support... Aesthetically, a Servers or WWW subdirectory would match Bio.Struct.Applications and make the whole package a little more self-documenting. Here's one more idea: Fetching a single PDB file from RCSB requires a separate import and a couple of calls. Should we make this even easier by mimicking the efetch function in Bio.Entrez, something like >>> handle = Bio.PDB.fetch("1MOT") or >>> from Bio.Struct.WWW import RCSB >>> handle = RCSB.fetch("1MOT", "pdb") ?
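A minimal sketch of what such an efetch-style helper might look like, written against the standard library only. Nothing here is an existing Biopython API: the names build_url and fetch, the format table, and the RCSB download URL pattern are all assumptions made for illustration.

```python
from urllib.request import urlopen

# Assumption: RCSB serves single entries under this URL pattern. It is not
# taken from the thread and may change.
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.{ext}"

# Hypothetical mapping from format names to file extensions.
_EXTENSIONS = {"pdb": "pdb", "mmcif": "cif"}


def build_url(pdb_id, file_format="pdb"):
    """Return the download URL for one PDB entry (no network access)."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError("expected a four-character PDB ID, got %r" % pdb_id)
    ext = _EXTENSIONS.get(file_format.lower())
    if ext is None:
        raise ValueError("unsupported format: %r" % file_format)
    return RCSB_URL.format(pdb_id=pdb_id, ext=ext)


def fetch(pdb_id, file_format="pdb"):
    """Return a binary file-like handle for the entry (needs network access)."""
    return urlopen(build_url(pdb_id, file_format))
```

Keeping the URL construction separate from the network call means the ID and format checks can be tested offline, and a single fetch() matches the "one file is one structure" reasoning above.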
-Eric From biopython at maubp.freeserve.co.uk Tue Jun 1 09:05:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Jun 2010 10:05:43 +0100 Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements In-Reply-To: References: Message-ID: 2010/6/1 Eric Talevich: > On Mon, May 31, 2010 at 11:53 AM, Peter wrote: > > Under this proposed scheme, what would you see as the basic record type >> (analogous to a SeqRecord, alignment or tree in Bio.SeqIO, Bio.AlignIO >> and Bio.Phylo)? It would be nice to say a protein chain, but there is the >> issue of multiple models (e.g. from NMR). I presume you'd go with the >> model as the basic unit (where each model may contain multiple chains). >> > > I'd consider a structure to be the basic unit of I/O. If we're going to make > better use of header info, that's generally associated with the whole > structure and not individual models -- we'd have to duplicate the header > info in each Model object emitted, which would be weird. > > Are there any formats that store more than one structure in a file? If not, > then there's probably no need for a parse() function in Bio.Struct. OK, yes - a whole structure as the unit would work, so we would only need the read function (one file is one structure) and not the parse function (no point in iterating over one thing). >> > from Bio.Struct import WHATIF, Jpred >> > # Servers each get their own module >> >> Hmm - perhaps we may need have another level here, Bio.Struct.Servers >> or Bio.Struct.WWW or something. How many of these do you expect? >> > > João's project plan includes Dali and WHATIF: > http://biopython.org/wiki/GSOC2010_Joao > > These servers do different things so I wouldn't expect any similarity in the > code between them. There are lots of servers that we *could* support... > Aesthetically, a Servers or WWW subdirectory would match > Bio.Struct.Applications and make the whole package a little more > self-documenting. My thoughts exactly.
> Here's one more idea: Fetching a single PDB file from RCSB requires a > separate import and a couple of calls. Should we make this even easier by > mimicking the efetch function in Bio.Entrez, something like > >>>> handle = Bio.PDB.fetch("1MOT") > > or > >>>> from Bio.Struct.WWW import RCSB >>>> handle = RCSB.fetch("1MOT", "pdb") > > ? > That seems nice. Peter From krother at rubor.de Tue Jun 1 09:59:31 2010 From: krother at rubor.de (Kristian Rother) Date: Tue, 1 Jun 2010 11:59:31 +0200 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: References: Message-ID: Hi, Got some comments & questions. > 2. PDB headers seem to have become better structured in recent years, in > ... parse_pdb_header needs some attention as well. I haven't looked into this code for years .. I think it might be a little messy. > 3. Kristian asked on this list awhile ago about the proper location for > his new code that works with RNA structures. While RCSB's PDB contains > some RNA structures, the RNA world doesn't revolve around it. Similarly, > João needs a place to put code for structure prediction/validation > servers, command-line wrappers, secondary structures, etc. > > I propose a new sub-package called Bio.Struct for these enhancements: > > from Bio.Struct import RNA > # Would this work for you, Kristian? Yes, it would be more descriptive than the originally proposed Bio.RNA . I am just concerned whether I could keep the 2D structure-related modules in the same package. > Alternatively, we could do all of this within the PDB module -- so picture > the above examples with "PDB" in place of "Struct". This raises the chance > of naming collisions, though, and doesn't solve issue #3 above. I like Bio.PDB.RNA less for the same reasons plus the 2D structure issue. > We'll leave the existing PDB module layout alone, in general.
I think it > will be necessary to add a few more attributes to the > Bio.PDB.Structure.Structure class, but we can do this without breaking > compatibility. > > Comments? What about the modules for constructing coordinates & Loop Closure (currently available on my Github branch)? I placed them in Bio.PDB because they are not limited to RNA and are conceptually similar to the operations performed by Bio.PDB.NeighborSearch and Bio.PDB.SVDSuperimposer - or would it be better to gather such things in some other package within Bio.PDB.Struct? Cheers, Kristian From biopython at maubp.freeserve.co.uk Tue Jun 1 11:42:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Jun 2010 12:42:53 +0100 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: References: Message-ID: 2010/6/1 Kristian Rother : > >> 3. Kristian asked on this list awhile ago about the proper location for >> his new code that works with RNA structures. While RCSB's PDB >> contains some RNA structures, the RNA world doesn't revolve around >> it. Similarly, João needs a place to put code for structure prediction/ >> validation servers, command-line wrappers, secondary structures, etc. >> >> I propose a new sub-package called Bio.Struct for these enhancements: >> >> from Bio.Struct import RNA >> # Would this work for you, Kristian? > > Yes, it would be more descriptive than the originally proposed Bio.RNA . I > am just concerned whether I could keep the 2D structure-related modules > in the same package. I don't necessarily see a problem with Bio.Struct or Bio.Structure covering both 2D and 3D structures. Does this 2D stuff include file parsers? That would complicate plans for Bio.Struct.read() etc. Maybe Bio.RNA is better.
Peter From biopython at maubp.freeserve.co.uk Tue Jun 1 13:10:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Jun 2010 14:10:05 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: Message-ID: On Mon, May 31, 2010 at 3:50 PM, Peter wrote: > Hi all, > > With the new command line wrappers and the tutorial pushing > users towards using subprocess we've had more queries > about how to use it. The subprocess module itself is rather > scary I guess, and things could be made a lot easier. > > I think the most typical use cases are: > > (1) Run the command, return the error code (integer) > (2) Run the command, return stdout, stderr and error code > > In theory the function subprocess.call() would take care > of the first example, but there is a cross platform annoyance > here with the shell parameter. Also, if you want the output > too things get even more tricky. It hasn't helped that there > are a few platform specific quirks/bugs in subprocess itself > (the different behaviour of the shell option on Windows, > bug http://bugs.python.org/issue1124861 in old Pythons, > the risk of deadlocks with large output files, etc). In fact I've often found using os.system() much easier than subprocess for the first use case - running a command and getting the return code. I wondered about adding an example of this to the tutorial but didn't find time before the last release (even if the Python documentation does try and encourage using subprocess instead). Peter From chapmanb at 50mail.com Tue Jun 1 13:23:55 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 1 Jun 2010 09:23:55 -0400 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: Message-ID: <20100601132355.GU1054@sobchak.mgh.harvard.edu> Peter; > With the new command line wrappers and the tutorial pushing > users towards using subprocess we've had more queries > about how to use it. 
The subprocess module itself is rather > scary I guess, and things could be made a lot easier. [...] > We could instead make the wrapper objects callable (define > the magic method __call__) to offer this kind of functionality. > This seems quite elegant to me. This is a good idea, although I'm 50/50 on the __call__ idea. Having a run() command or something similar might be more intuitive than the more magical call, if the idea is to appeal to users who find subprocess too problematic. I'd suggest having an option to not capture stdout and stderr, which would help users avoid those cases where a program spews a lot to stdout and it's unwieldy to capture and stick it into a string. Brad From biopython at maubp.freeserve.co.uk Tue Jun 1 13:48:30 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Jun 2010 14:48:30 +0100 Subject: [Biopython-dev] Blast parsers and records In-Reply-To: <1275332206.4c04066ed4ec5@webmail.upv.es> References: <1275332206.4c04066ed4ec5@webmail.upv.es> Message-ID: On Mon, May 31, 2010 at 7:56 PM, Blanca Postigo Jose Miguel wrote: > Quoting Michael Sandford : > >> I've got a few comments as well: >> > 4) The current Blast record stores its information in attributes. If you >> use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the >> necessary DTDs to do so), the information is stored in dictionaries. This has >> some advantages. For example, it allows you to use record.keys() to find out >> what the record contains. Ideally, I think that a Blast Record class should >> inherit from a dictionary. > > I've developed for my own use a dict structure that represents a blast result. > This structure also can represent many other results, like exonerate, SSAHA or > any other number of aligners. Having a common representation for all of them > allows you to create common filters that work with the same interface. I don't > know if it is very efficient, but it has proven to be very convenient for us.
> You can take a look at: > > http://github.com/JoseBlanca/franklin/blob/master/franklin/alignment_search_result.py > > Best regards, > > Jose Blanca It has some similarities to what I was imagining for a BioPerl-SearchIO-like module. I'm still not convinced that we should just be using (subclasses of) dictionaries - I would rather have important core properties like the hit co-ordinates held explicitly as properties or attributes (and always using Python counting, not whatever a given file format uses, like one-based locations in BLAST output). Peter From krother at rubor.de Tue Jun 1 14:11:51 2010 From: krother at rubor.de (Kristian Rother) Date: Tue, 1 Jun 2010 16:11:51 +0200 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: References: Message-ID: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> Hi, >>> from Bio.Struct import RNA >>> # Would this work for you, Kristian? >> >> Yes, it would be more descriptive than the originally proposed Bio.RNA . >> I >> am just concerned whether I could keep the 2D structure-related modules >> in the same package. > > I don't necessarily see a problem with Bio.Struct or Bio.Structure > covering > both 2D and 3D structures. Does this 2D stuff include file parsers? That > would complicate plans for Bio.Struct.read() etc. Maybe Bio.RNA is better. Yes, currently, RNA contains 2D stuff. It would complicate Struct.read(). On the other hand, the 2D stuff is independent from the 3D modules - could be split into two packages -- but I think keeping RNA is simpler. 
Best Regards, Kristian From biopython at maubp.freeserve.co.uk Tue Jun 1 15:15:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Jun 2010 16:15:03 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: <20100601132355.GU1054@sobchak.mgh.harvard.edu> References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> Message-ID: On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote: > Peter; > >> With the new command line wrappers and the tutorial pushing >> users towards using subprocess we've had more queries >> about how to use it. The subprocess module itself is rather >> scary I guess, and things could be made a lot easier. > [...] >> We could instead make the wrapper objects callable (define >> the magic method __call__) to offer this kind of functionality. >> This seems quite elegant to me. > > This is a good idea, although I'm 50/50 on the __call__ idea. > Having a run() command or something similar might be more intuitive > than the more magical call, if the idea is to appeal to users who > find subprocess too problematic. Fair point. We'd have to audit all the existing wrappers to make sure we have some suitable names free (e.g. run or execute). > I'd suggest having an option to not capture stdout and stderr, which > would help users avoid those cases where a program spews a lot to > stdout and it's unwieldy to capture and stick it into a string. We need to avoid any risk of deadlocks, so I guess the safe implementation here would be to call subprocess with stdout and stderr sent to dev null.
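As a rough illustration of the "send everything to dev null" idea, here is a minimal sketch in modern Python. The helper name run_quietly is hypothetical, and subprocess.DEVNULL only exists in Python 3.3 and later; with the Python versions in use at the time of this thread one would open os.devnull by hand instead.

```python
import subprocess
import sys


def run_quietly(args):
    """Run a command, discard its output, and return the exit code.

    Because no pipes are created there is no pipe buffer for the child to
    fill, which is what makes this variant safe from deadlocks.
    """
    return subprocess.call(
        args,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


if __name__ == "__main__":
    # A child process that writes about 1 MB to stdout; nothing is captured.
    noisy = [sys.executable, "-c", "print('x' * 1000000)"]
    print("Return code: %i" % run_quietly(noisy))
```

Passing the arguments as a list avoids the shell parameter altogether, which sidesteps the cross-platform quoting differences mentioned earlier in the thread.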
Peter From eric.talevich at gmail.com Tue Jun 1 18:25:52 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 1 Jun 2010 14:25:52 -0400 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> Message-ID: On Tue, Jun 1, 2010 at 10:11 AM, Kristian Rother wrote: > Hi, > > >>> from Bio.Struct import RNA > >>> # Would this work for you, Kristian? > >> > >> Yes, it would be more descriptive than the originally proposed Bio.RNA . > >> I > >> am just concerned whether I could keep the 2D structure-related modules > >> in the same package. > > > > I don't necessarily see a problem with Bio.Struct or Bio.Structure > > covering > > both 2D and 3D structures. Does this 2D stuff include file parsers? That > > would complicate plans for Bio.Struct.read() etc. Maybe Bio.RNA is > better. > > Yes, currently, RNA contains 2D stuff. It would complicate Struct.read(). > On the other hand, the 2D stuff is independent from the 3D modules - could > be split into two packages -- but I think keeping RNA is simpler. > > Best Regards, > Kristian > > I could be totally wrong here, but I think it's useful to lay out some assumptions and intuitions explicitly. To me, secondary structure is not really a separate dimension in its own right, the way tertiary structure corresponds to 3D space and primary structure corresponds to a linear sequence. Instead, secondary structure has meaning in 3D space, but is usually serialized as a linear sequence. That is, we want to parse something that resembles a sequence, but be able to map it onto a 3D structure. (More for proteins than for RNA, usually.) 
(For non-RNA folk, here's an example of RNA secondary structure: http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna ) For instance, the output of DSSP and Jpred describes a protein's secondary structure, but the input to DSSP is a 3D structure, while Jpred accepts a protein sequence. The representation of secondary structure isn't distinct from either of these. I'd want both of these available in Bio.Struct (eventually). This means that some interaction between Bio.Struct and SeqIO is necessary. It would be neat if secondary structure regions were represented as SeqFeature instances, and secondary-structure parsers returned some kind of subclass of SeqRecord -- or a standard SeqRecord containing a special kind of Seq. The secondary-structure parsers for RNA and proteins should be separate, too, since the annotated features are different. So the function Bio.Struct.read() can apply exclusively to 3D structures. Would it be reasonable for Bio.Struct.RNA.read() to apply exclusively to RNA secondary structures -- assuming that anything that's not a secondary structure, 3D structure, or nucleotide sequence is something special that belongs in its own module? As for protein secondary structure, it's usually associated with a sequence or a structure, so maybe we could get by with storing that information in an ordinary Structure or SeqRecord object without inventing a new subclass. 
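To make the "serialized as a linear sequence" point concrete, here is a small standard-library sketch that turns a secondary structure string into base pairs and paired regions. It assumes the plain dot-bracket notation associated with Vienna-style files (only the symbols "(", ")" and "."); the function names are hypothetical and the format of the linked sample file has not been checked here. Positions use Python counting, so the half-open spans could in principle be turned into feature locations on a sequence record.

```python
from itertools import groupby


def parse_dot_bracket(structure):
    """Return a sorted list of (i, j) base pairs, using 0-based indices."""
    stack = []
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError("unmatched ')' at position %i" % index)
            pairs.append((stack.pop(), index))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r at position %i" % (symbol, index))
    if stack:
        raise ValueError("unmatched '(' at position %i" % stack[-1])
    return sorted(pairs)


def paired_regions(structure):
    """Return (start, end, symbol) for each run of paired positions.

    The spans are half-open, like Python slices.
    """
    regions = []
    position = 0
    for symbol, run in groupby(structure):
        length = len(list(run))
        if symbol in ("(", ")"):
            regions.append((position, position + length, symbol))
        position += length
    return regions
```

For the hairpin "((..))" this yields the pairs (0, 5) and (1, 4), plus one opening and one closing region of two positions each.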
Best, Eric From jblanca at btc.upv.es Wed Jun 2 06:21:36 2010 From: jblanca at btc.upv.es (Jose Blanca) Date: Wed, 2 Jun 2010 08:21:36 +0200 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> Message-ID: <201006020821.36486.jblanca@btc.upv.es> On Tuesday 01 June 2010 17:15:03 Peter wrote: > On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote: > > Peter; > > > >> With the new command line wrappers and the tutorial pushing > >> users towards using subprocess we've had more queries > >> about how to use it. The subprocess module itself is rather > >> scary I guess, and things could be made a lot easier. We had the same need. We solved it with a call function. You can take a look at: http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/cmd_utils.py Regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From krother at rubor.de Wed Jun 2 08:17:01 2010 From: krother at rubor.de (Kristian Rother) Date: Wed, 2 Jun 2010 10:17:01 +0200 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> Message-ID: Hi, >> >>> from Bio.Struct import RNA .. >> > I don't necessarily see a problem with Bio.Struct or Bio.Structure >> > covering both 2D and 3D structures. Eric, I agree with you - the secondary structure of RNA maps nicely to 3D space. Generally, I think it is a little more common to work with RNA 2D structures in absence of 3D information than in proteins - 2D prediction of RNA is maybe simply a less nasty target. 
Eric wrote: > I could be totally wrong here, but I think it's useful to lay out some > assumptions and intuitions explicitly. > > To me, secondary structure is not really a separate dimension in its own > right, the way tertiary structure corresponds to 3D space and primary > structure corresponds to a linear sequence. Instead, secondary structure > has > meaning in 3D space, but is usually serialized as a linear sequence. That > is, we want to parse something that resembles a sequence, but be able to > map > it onto a 3D structure. (More for proteins than for RNA, usually.) > > (For non-RNA folk, here's an example of RNA secondary structure: > http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna > ) > > For instance, the output of DSSP and Jpred describes a protein's secondary > structure, but the input to DSSP is a 3D structure, while Jpred accepts a > protein sequence. The representation of secondary structure isn't distinct > from either of these. I'd want both of these available in Bio.Struct > (eventually). > > This means that some interaction between Bio.Struct and SeqIO is > necessary. > It would be neat if secondary structure regions were represented as > SeqFeature instances, and secondary-structure parsers returned some kind > of > subclass of SeqRecord -- or a standard SeqRecord containing a special kind > of Seq. So far the Secstruc parsers I've implemented just return (sequence,secstruc) tuples. But putting this into a SeqRecord makes sense - I understand this fits better to the BioPython architecture. Maybe instead of a Seq or SeqRecord subclass we could use the decorator pattern (decorating a class, not the Python decorator function syntax). A potential problem that I'd like to point out early is that we are working with modified RNA nucleotides a lot (up to 20% of residues in every tRNA). This would require extending the RNA Alphabet (which now just is "AGCU") - but I see this as remote from the Bio.XXXX.read() thread. 
> The secondary-structure parsers for RNA and proteins should be separate, > too, since the annotated features are different. So the function > Bio.Struct.read() can apply exclusively to 3D structures. Would it be > reasonable for Bio.Struct.RNA.read() to apply exclusively to RNA secondary > structures -- assuming that anything that's not a secondary structure, 3D > structure, or nucleotide sequence is something special that belongs in its > own module? To summarize, we could use: 1) protein 3D structures: Bio.Struct.read() --> Bio.PDB.Structure 2) RNA 3D structures: Bio.Struct.read() --> Bio.PDB.Structure 3) RNA 2D structures: Bio.Struct.RNA.read() --> Bio.SeqRecord (extended/decorated by a secstruc field) 4) protein 2D structures: uses special parser module?? 5) plain sequences: Bio.read() --> Bio.SeqRecord Eric, does this summarize your thoughts correctly? This would work for me. Any comments from the others. Best, Kristian From biopython at maubp.freeserve.co.uk Wed Jun 2 08:44:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 2 Jun 2010 09:44:54 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: <201006020821.36486.jblanca@btc.upv.es> References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> <201006020821.36486.jblanca@btc.upv.es> Message-ID: On Wed, Jun 2, 2010 at 7:21 AM, Jose Blanca wrote: > On Tuesday 01 June 2010 17:15:03 Peter wrote: >> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote: >> > Peter; >> > >> >> With the new command line wrappers and the tutorial pushing >> >> users towards using subprocess we've had more queries >> >> about how to use it. The subprocess module itself is rather >> >> scary I guess, and things could be made a lot easier. > > We had the same need. We solved it with a call function. 
You can take > a look at: > > http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/cmd_utils.py > It looks complicated (and I'm sure with good reason), but I'd guess you've never tried this on Windows? We used to have the Bio.Application.generic_run function for calling a command - but making the command line wrapper callable or having a method on the command line wrapper is much easier to use (no extra import needed). Peter From biopython at maubp.freeserve.co.uk Wed Jun 2 09:23:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 2 Jun 2010 10:23:15 +0100 Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA In-Reply-To: References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> Message-ID: On Tue, Jun 1, 2010 at 7:25 PM, Eric Talevich wrote: >> > I could be totally wrong here, but I think it's useful to lay out some > assumptions and intuitions explicitly. > > To me, secondary structure is not really a separate dimension in its own > right, the way tertiary structure corresponds to 3D space and primary > structure corresponds to a linear sequence. Instead, secondary structure has > meaning in 3D space, but is usually serialized as a linear sequence. That > is, we want to parse something that resembles a sequence, but be able to map > it onto a 3D structure. (More for proteins than for RNA, usually.) > > (For non-RNA folk, here's an example of RNA secondary structure: > http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna > ) > > For instance, the output of DSSP and Jpred describes a protein's secondary > structure, but the input to DSSP is a 3D structure, while Jpred accepts a > protein sequence. The representation of secondary structure isn't distinct > from either of these. I'd want both of these available in Bio.Struct > (eventually). 
> > This means that some interaction between Bio.Struct and SeqIO is necessary. > It would be neat if secondary structure regions were represented as > SeqFeature instances, and secondary-structure parsers returned some kind of > subclass of SeqRecord -- or a standard SeqRecord containing a special kind > of Seq. > > ... > > As for protein secondary structure, it's usually associated with a sequence > or a structure, so maybe we could get by with storing that information in an > ordinary Structure or SeqRecord object without inventing a new subclass. Maybe all/most secondary structure parsers can just go into Bio.SeqIO (for both proteins, RNA and DNA). We can store a secondary structure string as per-letter-annotation, or things like helix regions as SeqFeature objects. Peter From jblanca at btc.upv.es Wed Jun 2 09:24:24 2010 From: jblanca at btc.upv.es (Jose Blanca) Date: Wed, 2 Jun 2010 11:24:24 +0200 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: <201006020821.36486.jblanca@btc.upv.es> Message-ID: <201006021124.24499.jblanca@btc.upv.es> On Wednesday 02 June 2010 10:44:54 Peter wrote: > On Wed, Jun 2, 2010 at 7:21 AM, Jose Blanca wrote: > > On Tuesday 01 June 2010 17:15:03 Peter wrote: > >> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote: > >> > Peter; > >> > > >> >> With the new command line wrappers and the tutorial pushing > >> >> users towards using subprocess we've had more queries > >> >> about how to use it. The subprocess module itself is rather > >> >> scary I guess, and things could be made a lot easier. > > > > We had the same need. We solved it with a call function. You can take > > a look at: > > > > http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/cmd_util > >s.py > > It looks complicated (and I'm sure with good reason), but I'd guess > you've never tried this on Windows? Yes it is somewhat complicated. 
We need some functionality, such as letting stdout be either a file or just a pipe (some programs produce very long output). We have added everything our programs required. No, we haven't tested anything on Windows. -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Wed Jun 2 09:25:47 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 2 Jun 2010 10:25:47 +0100 Subject: [Biopython-dev] RNA alphabets; was Bio.PDB enhancements Message-ID: On Wed, Jun 2, 2010 at 9:17 AM, Kristian Rother wrote: > > A potential problem that I'd like to point out early is that we are > working with modified RNA nucleotides a lot (up to 20% of residues in > every tRNA). This would require extending the RNA Alphabet (which now just > is "AGCU") - but I see this as remote from the Bio.XXXX.read() thread. > What letters are you missing? There is a commented out ExtendedIUPACRNA alphabet that may be relevant in Bio/Alphabet/IUPAC.py Peter From biopython at maubp.freeserve.co.uk Wed Jun 2 11:36:46 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 2 Jun 2010 12:36:46 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> Message-ID: On Tue, Jun 1, 2010 at 4:15 PM, Peter wrote: > On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote: >> I'd suggest having an option to not capture stdout and stderr, which >> would help users avoid those cases where a program spews a lot to >> stdout and it's unwieldy to capture and stick it into a string. > > We need to avoid any risk of deadlocks, so I guess the safe > implementation here would be to call subprocess with stdout and > stderr sent to dev null. How does this look?
Tested on Mac and Windows:
http://github.com/peterjc/biopython/tree/app-exec2

Example usage without capturing the output:

    from Bio.Emboss.Applications import WaterCommandline
    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
                                 asequence="a.fasta", bsequence="b.fasta")
    print "About to run:\n%s" % water_cmd
    return_code = water_cmd()
    print "Return code: %i" % return_code

Example usage with stdout and stderr capture:

    from Bio.Emboss.Applications import WaterCommandline
    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
                                 asequence="a.fasta", bsequence="b.fasta")
    print "About to run:\n%s" % water_cmd
    stdout, stderr, return_code = water_cmd(capture=True)
    print "Return code: %i" % return_code
    print "Tool output:\n%s" % stdout

Note in this implementation it either returns an integer error level (the default) or a tuple of stdout, stderr and the error level return code. If we opt for adding methods rather than using __call__ these could be different methods instead.

Another potentially useful option would be to copy the subprocess.check_call() function in Python 2.5+ which verifies the return code (error level) is zero and raises an exception if not (probably only sensible if not capturing the output?). Maybe this could even be the default behaviour?

[I would prefer to keep the interface as simple as possible though, fewer options is better! KISS principle.]
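For readers who want to experiment with the callable-wrapper idea without the branch, here is a self-contained sketch in modern Python. CommandSketch is a hypothetical stand-in, not the Bio.Application class from the branch above, and the check keyword is an assumed name for the check_call-style behaviour under discussion.

```python
import subprocess
import sys


class CommandSketch:
    """Hypothetical stand-in for a command line wrapper object."""

    def __init__(self, *args):
        self.args = [str(arg) for arg in args]

    def __str__(self):
        return " ".join(self.args)

    def __call__(self, capture=False, check=False):
        """Run the command.

        Returns the integer return code by default, or a tuple of
        (stdout, stderr, return code) when capture is true.
        """
        if capture:
            process = subprocess.Popen(
                self.args,
                stdin=subprocess.DEVNULL,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                universal_newlines=True,
            )
            # communicate() reads both pipes together, so a chatty child
            # cannot block on a full pipe buffer.
            stdout, stderr = process.communicate()
            return_code = process.returncode
            result = (stdout, stderr, return_code)
        else:
            return_code = subprocess.call(
                self.args,
                stdin=subprocess.DEVNULL,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            result = return_code
        if check and return_code != 0:
            raise subprocess.CalledProcessError(return_code, self.args)
        return result


if __name__ == "__main__":
    hello = CommandSketch(sys.executable, "-c", "print('hello')")
    print("About to run:\n%s" % hello)
    out, err, code = hello(capture=True)
    print("Return code: %i, output: %s" % (code, out.strip()))
```

The two return shapes mirror the description above; splitting them into two methods, as suggested, would give each call a single predictable return type.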
Peter

From biopython at maubp.freeserve.co.uk Wed Jun 2 11:59:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 12:59:46 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To:
References: <20100601132355.GU1054@sobchak.mgh.harvard.edu>
Message-ID:

On Wed, Jun 2, 2010 at 12:36 PM, Peter wrote:
> On Tue, Jun 1, 2010 at 4:15 PM, Peter wrote:
>> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote:
>>> I'd suggest having an option to not capture stdout and stderr, which
>>> would help users avoid those cases where a program spews a lot to
>>> stdout and it's unwieldy to capture and stick it into a string.
>>
>> We need to avoid any risk of deadlocks, so I guess the safe
>> implementation here would be call subprocess with stdout and
>> stderr sent to dev null.
>
> How does this look? Tested on Mac and Windows:
> http://github.com/peterjc/biopython/tree/app-exec2
>
> Example usage without capturing the output:
>
>     from Bio.Emboss.Applications import WaterCommandline
>     water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
>                                  asequence="a.fasta", bsequence="b.fasta")
>     print "About to run:\n%s" % water_cmd
>     return_code = water_cmd()
>     print "Return code: %i" % return_code
>
> Example usage with stdout and stderr capture:
>
>     from Bio.Emboss.Applications import WaterCommandline
>     water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
>                                  asequence="a.fasta", bsequence="b.fasta")
>     print "About to run:\n%s" % water_cmd
>     stdout, stderr, return_code = water_cmd(capture=True)
>     print "Return code: %i" % return_code
>     print "Tool output:\n%s" % stdout
>
> Note in this implementation it either returns an integer error level
> (the default) or a tuple of stdout, stderr and the error level return
> code.
> If we opt for adding methods rather than using __call__
> these could be different methods instead.
>
> Another potentially useful option would be to copy the
> subprocess.check_call() function in Python 2.5+ which verifies
> the return code (error level) is zero and raises an exception if not
> (probably only sensible if not capturing the output?). Maybe this
> could even be the default behaviour?
>
> [I would prefer to keep the interface as simple as possible though,
> fewer options is better! KISS principle.]

With that in mind, as I mentioned yesterday maybe we should just
update the documentation to suggest using os.system() when you
just need the return code and there is no stdin to worry about:

    import os
    from Bio.Emboss.Applications import WaterCommandline
    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
                                 asequence="a.fasta", bsequence="b.fasta")
    print "About to run:\n%s" % water_cmd
    # os.system() needs a string, so convert the command line object first
    return_code = os.system(str(water_cmd))
    print "Return code: %i" % return_code

Even if the Python documentation seems to be discouraging it,
using os.system() seems simple, robust, and cross platform.
We could even update the tutorial now and post it online - it
should make some people's lives a little easier.

[Note this is actually a silly example, I should be telling water
to output to a file, not stdout which is then ignored.]

Peter

From krother at rubor.de Wed Jun 2 12:14:05 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 2 Jun 2010 14:14:05 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To:
References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID: <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>

Hi Peter,

Bio.SeqIO would be a nice place for RNA 2D parsers.
I can create a new branch for that (on Git: krother/biopython).

Putting secondary structures like '((((....))))' for a hairpin into the
letter_annotation field makes sense. I think it even would work for
pseudoknotted RNA (which is hard to represent as a string; one possible
notation would be '(((..[[[....)))..]]]').

Where should the str subclass for secondary structures that the parsers
create go? Could it be Bio.Struct.RNA?

Best,
  Kristian

Putting RNA secondary structures

>> As for protein secondary structure, it's usually associated with a
>> sequence or a structure, so maybe we could get by with storing that
>> information in an ordinary Structure or SeqRecord object without
>> inventing a new subclass.
>
> Maybe all/most secondary structure parsers can just go into Bio.SeqIO (for
> both proteins, RNA and DNA). We can store a secondary structure string as
> per-letter-annotation, or things like helix regions as SeqFeature objects.
>
> Peter
>
>

From krother at rubor.de Wed Jun 2 12:21:43 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 2 Jun 2010 14:21:43 +0200
Subject: [Biopython-dev] RNA alphabets; was Bio.PDB enhancements
In-Reply-To:
References:
Message-ID: <837b33ddc1456279e108d21c0d12d3fb-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWQtfXA==-webmailer2@server06.webmailer.hosteurope.de>

Hi Peter,

I'm afraid the matter is more complicated. To date, we have 115 modified
RNA bases, which means in practice that you run out of nice ASCII
characters. Moreover, some people use one-letter symbols in RNA as
wildcards (R for purine, Y for pyrimidine). As a consequence, several sets
of abbreviations have been developed - see
http://modomics.genesilico.pl/modification_list to get an impression.

We've written for our own purposes a class containing different ways of
nomenclature, but I think it's incompatible with Bio.Alphabet - but I'd like
to change that.

Best Regards,
  Kristian

> On Wed, Jun 2, 2010 at 9:17 AM, Kristian Rother wrote:
>>
>> A potential problem that I'd like to point out early is that we are
>> working with modified RNA nucleotides a lot (up to 20% of residues in
>> every tRNA). This would require extending the RNA Alphabet (which now
>> just is "AGCU") - but I see this as remote from the Bio.XXXX.read()
>> thread.
>>
> What letters are you missing? There is a commented out ExtendedIUPACRNA
> alphabet that may be relevant in Bio/Alphabet/IUPAC.py
>
> Peter
>
>

From biopython at maubp.freeserve.co.uk Wed Jun 2 13:22:36 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 14:22:36 +0100
Subject: [Biopython-dev] RNA alphabets; was Bio.PDB enhancements
In-Reply-To: <837b33ddc1456279e108d21c0d12d3fb-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWQtfXA==-webmailer2@server06.webmailer.hosteurope.de>
References: <837b33ddc1456279e108d21c0d12d3fb-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWQtfXA==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID:

On Wed, Jun 2, 2010 at 1:21 PM, Kristian Rother wrote:
>
> Hi Peter,
>
> I'm afraid the matter is more complicated. To date, we have 115 modified
> RNA bases, which means in practice that you run out of nice ASCII
> characters. Moreover, some people use one-letter symbols in RNA as
> wildcards (R for purine, Y for pyrimidine). As a consequence, several sets
> of abbreviations have been developed - see
> http://modomics.genesilico.pl/modification_list to get an impression.
>
> We've written for our own purposes a class containing different ways of
> nomenclature, but I think it's incompatible with Bio.Alphabet - but I'd like
> to change that.
>
> Best Regards,
>   Kristian

Hmm. I wonder if the HTML entities would work nicely in Python
(as unicode)? That way you could have an unambiguous string
representation where each letter is one character long.
I'm thinking a Seq subclass (with a special alphabet) might be the
way to go here, allowing access to the single character entities
by default but also the longer codes as well.

There are similarities with modified peptide sequences where
there are clear three letter codes, but not one letter codes.

Tricky.

Peter

From biopython at maubp.freeserve.co.uk Wed Jun 2 13:24:49 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 14:24:49 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID:

On Wed, Jun 2, 2010 at 1:14 PM, Kristian Rother wrote:
>
> Hi Peter,
>
> Bio.SeqIO would be a nice place for RNA 2D parsers. I can create a new
> branch for that (on Git: krother/biopython).
>
> Putting secondary structures like '((((....))))' for a hairpin into the
> letter_annotation field makes sense. I think it even would work for
> pseudoknotted RNA (which is hard to represent as a string, one possible
> notation would be '(((..[[[....)))..]]]'.
>
> Where should the str subclass for secondary structures that the parsers
> create go? Could it be Bio.Struct.RNA?
>
> Best,
>   Kristian

You don't think plain strings in the SeqRecord's letter_annotation
dict would be enough? Assuming you do need something then
perhaps under Bio.Seq or Bio.SeqUtils might be worth considering
as alternatives to Bio.Struct.RNA.
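
To make the plain-string idea concrete, here is a small sketch that keeps the
secondary structure as an ordinary string and recovers the base pairs from it,
including the square-bracket pseudoknot notation quoted above. The function
name pair_table is hypothetical; nothing like it is implied to exist in
Biopython, and the sketch is written in present-day Python 3.

```python
def pair_table(structure):
    """Map each paired position to its partner in a dot-bracket string.

    Each bracket type has its own stack, so '(' ')' and '[' ']' pairs
    may cross each other, which is how the pseudoknot is encoded.
    """
    closers = {")": "(", "]": "["}
    stacks = {"(": [], "[": []}
    pairs = {}
    for index, symbol in enumerate(structure):
        if symbol in stacks:
            # Opening bracket: remember where it was.
            stacks[symbol].append(index)
        elif symbol in closers:
            stack = stacks[closers[symbol]]
            if not stack:
                raise ValueError("unmatched %r at position %d" % (symbol, index))
            partner = stack.pop()
            pairs[partner] = index
            pairs[index] = partner
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    for opener, stack in stacks.items():
        if stack:
            raise ValueError("unmatched %r at position %d" % (opener, stack[-1]))
    return pairs


hairpin = pair_table("((((....))))")
pseudoknot = pair_table("(((..[[[....)))..]]]")
```

Because the structure string has exactly one symbol per residue, it fits the
per-letter annotation model discussed in this thread; a helper like this could
live anywhere and would not need a str subclass.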

Peter

From eric.talevich at gmail.com Thu Jun 3 16:17:09 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 3 Jun 2010 12:17:09 -0400
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de> <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID:

On Wed, Jun 2, 2010 at 8:14 AM, Kristian Rother wrote:
>
> Putting secondary structures like '((((....))))' for a hairpin into the
> letter_annotation field makes sense. I think it even would work for
> pseudoknotted RNA (which is hard to represent as a string, one possible
> notation would be '(((..[[[....)))..]]]'.
>
>

Here's another format that was designed to represent pseudoknots:
http://www.uga.edu/RNA-Informatics/files/software/RNApasta.help.html#Format

I'm not sure how standardized or widely used it is, but the program
RNA-pasta works with it:
http://www.uga.edu/RNA-Informatics/?f=software&p=RNApasta

-Eric

From biopython at maubp.freeserve.co.uk Thu Jun 3 16:43:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Jun 2010 17:43:47 +0100
Subject: [Biopython-dev] More SeqRecord methods
In-Reply-To:
References:
Message-ID:

On Mon, May 31, 2010 at 3:53 PM, Peter wrote:
> Hi all,
>
> What do people think of adding upper and lower methods to the SeqRecord?
> http://bugzilla.open-bio.org/show_bug.cgi?id=3054

I checked that in with an example in the tutorial.

> If that is well received, how about adding another Seq method to the
> SeqRecord, the newish ungap method?
> http://bugzilla.open-bio.org/show_bug.cgi?id=3060

This one I would like some feedback on first. I'm sure the
implementation could be made much more efficient too.

Peter

From bugzilla-daemon at portal.open-bio.org Thu Jun 3 16:45:16 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 3 Jun 2010 12:45:16 -0400
Subject: [Biopython-dev] [Bug 3054] Add upper and lower methods to the SeqRecord
In-Reply-To:
Message-ID: <201006031645.o53GjGd9019264@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3054

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-03 12:45 EST -------
Checked in:
http://github.com/biopython/biopython/tree/f4f11a9c4e7aca10c33cfe93c78d4972a0d736f8

With an example in the tutorial too:
http://github.com/biopython/biopython/commit/3de8bbd423010eb0b480b8966041f7c6d8e9890d

Marking this as fixed.

See also:
http://lists.open-bio.org/pipermail/biopython-dev/2010-May/007772.html
http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007801.html

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Thu Jun 3 17:24:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Jun 2010 18:24:43 +0100
Subject: [Biopython-dev] More SeqRecord methods
In-Reply-To:
References:
Message-ID:

On Thu, Jun 3, 2010 at 5:43 PM, Peter wrote:
> On Mon, May 31, 2010 at 3:53 PM, Peter wrote:
>
>> ..., how about adding another Seq method to the
>> SeqRecord, the newish ungap method?
>> http://bugzilla.open-bio.org/show_bug.cgi?id=3060
>
> This one I would like some feedback on first.
> I'm sure the
> implementation could be made much more efficient too.

Maybe I should mention that I also envisage a similar method for the
alignment object, to give a new alignment with any all-gap-columns
removed (perhaps with an optional argument to specify a threshold
for the number of gaps required - defaulting to only removing columns
which are all gaps).

Again, the simplest way to implement this is to re-use the new alignment
slicing and addition features - much as how I did it for the proposed
SeqRecord ungap method.

Peter

From eric.talevich at gmail.com Thu Jun 3 19:10:51 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 3 Jun 2010 15:10:51 -0400
Subject: [Biopython-dev] Fixup branch for Bio.PDB
Message-ID:

Hi all,

I've poked around Bugzilla, taken patches for some outstanding bugs, and
applied them to a branch on GitHub:
http://github.com/etal/biopython/tree/pdbfixes
http://github.com/etal/biopython/commits/pdbfixes

I'd like to encourage people to test this branch with their own code, and if
it all still works (or nobody's interested in testing this branch), I'll
push it to the Biopython trunk so it gets tested more. Time frame: if this
branch lingers too long, there's a high chance it will cause conflicts for
João (our GSoC student) the next time he merges. How about a week?

The branch has patches for bugs 2820, 2948, 2879, 2950 and 2951:
http://bugzilla.open-bio.org/show_bug.cgi?id=2820
http://bugzilla.open-bio.org/show_bug.cgi?id=2948
http://bugzilla.open-bio.org/show_bug.cgi?id=2879
http://bugzilla.open-bio.org/show_bug.cgi?id=2950
http://bugzilla.open-bio.org/show_bug.cgi?id=2951

Thanks,
Eric

From biopython at maubp.freeserve.co.uk Fri Jun 4 08:44:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 4 Jun 2010 09:44:19 +0100
Subject: [Biopython-dev] Fixup branch for Bio.PDB
In-Reply-To:
References:
Message-ID:

On Thu, Jun 3, 2010 at 8:10 PM, Eric Talevich wrote:
> Hi all,
>
> I've poked around Bugzilla, taken patches for some outstanding bugs, and
> applied them to a branch on GitHub:
> http://github.com/etal/biopython/tree/pdbfixes
> http://github.com/etal/biopython/commits/pdbfixes
>
> I'd like to encourage people to test this branch with their own code, and if
> it all still works (or nobody's interested in testing this branch), I'll
> push it to the Biopython trunk so it gets tested more. Time frame: if this
> branch lingers too long, there's a high chance it will cause conflicts for
> João (our GSoC student) the next time he merges. How about a week?
>
> The branch has patches for bugs 2820, 2948, 2879, 2950 and 2951:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2820
> http://bugzilla.open-bio.org/show_bug.cgi?id=2948
> http://bugzilla.open-bio.org/show_bug.cgi?id=2879
> http://bugzilla.open-bio.org/show_bug.cgi?id=2950
> http://bugzilla.open-bio.org/show_bug.cgi?id=2951
>
> Thanks,
> Eric

That sounds like a good plan.

Peter

From mjldehoon at yahoo.com Fri Jun 4 15:55:27 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 4 Jun 2010 08:55:27 -0700 (PDT)
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <901919.44402.qm@web62402.mail.re1.yahoo.com>
Message-ID: <933074.46322.qm@web62405.mail.re1.yahoo.com>

Michael, Peter, Sebastian, Laurent, Jose, and others,

Thanks for your comments.
It looks like there are lots of things to discuss, so let's start with
the easiest ones.

About converting a record to a string (point 5): I agree that using __str__
is probably not the best choice, so let's use __format__ instead, or add a
"write" method. The added advantage of these is that we can print out a
record in different formats (xml, text, table) by specifying the requested
format as an argument.

For point 3), maybe my wording was confusing; actually what I had in mind
is the case where a given Blast program can produce different output formats
(xml, text, table, etc.). This was inspired by this bug report:
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
In my mind, the different output formats are just different intermediates,
but in essence they are the same and should therefore be stored in the same
class. So, if I run blastp, save the result as XML, and parse it, I'd expect
the same class as when I run blastp and save and parse the output in table
format. Just in the latter case, some information may be missing if it is
not available in the output in table format. Does that sound acceptable?

--Michiel.

--- On Fri, 5/28/10, Michiel de Hoon wrote:

> From: Michiel de Hoon
> Subject: [Biopython-dev] Blast parsers and records
> To: biopython-dev at biopython.org
> Date: Friday, May 28, 2010, 11:23 PM
> Hi everybody,
>
> With Biopython 1.54 out (thanks Peter!), and NCBI
> encouraging to use its new Blast+ suite of Blast programs,
> maybe this is a good time to tackle some older bugs related
> to Blast output parsing in Biopython:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2176
> (inconsistencies in the output of different Blast parsers)
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2929
> (inconsistencies between Psi-blast parsers)
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2319
> (parsing Blast table output)
>
> and more generally think about the design of the Blast
> record class and Blast parsing.
> In my opinion, these are the
> major issues:
>
> 1) Blast parsers are located in several modules
> (Bio.Blast.NCBIXML, Bio.Blast.NCBIStandalone,
> Bio.Blast.ParseBlastTable). I think we should have one
> read() function and one parse() function under Bio.Blast,
> with arguments specifying which format the Blast output is
> in.
>
> 2) Blast records produced by any of the parsers should be
> consistent with each other. As XML output by blast and
> psi-blast follow the same DTD, we should be able to
> represent both by a single Record class.
>
> 3) Different parsers should store information in this
> Record class in the same way.
>
> 4) The current Blast record stores its information in
> attributes. If you use Bio.Entrez to parse Blast XML output
> (Biopython 1.54 contains the necessary DTDs to do so), the
> information is stored in dictionaries. This has some
> advantages. For example, it allows you to use record.keys()
> to find out what the record contains. Ideally, I think that
> a Blast Record class should inherit from a dictionary.
>
> 5) We should be able to print a Blast record object to
> generate output that is close to the plain-text output
> generated by blast. This would allow us to generate and
> store Blast output as XML, and to convert the output to
> plain-text to make it more human-readable.
>
> 6) The current Blast record inherits from
> Bio.Blast.Record.Header, Bio.Blast.Record.DatabaseReport,
> and Bio.Blast.Record.Parameters. I don't see the rationale
> for this inheritance, and I think we should remove it.
>
> Any comments, suggestions (in particular about my proposal
> to have a Blast Record class that inherits from a
> dictionary)? Btw, to avoid breaking scripts, I propose that
> any changes to the Blast record and parser are implemented
> separately from the existing parsers and record, and to
> leave those untouched.
>
> --Michiel.
>
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From biopython at maubp.freeserve.co.uk Sat Jun 5 14:49:39 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 5 Jun 2010 15:49:39 +0100
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris
Message-ID:

Hi all,

Are any Biopython folk planning to be at the EuroSciPy
conference in Paris this year (July 2010)? They are still
finalising the Scientific track, but the list of tutorials is
quite interesting already:

http://www.euroscipy.org/conference/euroscipy2010

Peter

From biopython at maubp.freeserve.co.uk Mon Jun 7 09:35:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 10:35:15 +0100
Subject: [Biopython-dev] Working directly on the main git repository
Message-ID:

Hi all,

I thought I'd write down some notes about how I've been using git
recently. This may be of interest to any of the other core developers
(those of us with read-write access to the main repository), and I
might get some good tips from any discussion.

The key point is that I have read+write access to two repositories
on github (the official repository AND my own fork), so there are
different advantages/disadvantages about which I choose to work
with directly as my main repository.

Our official repository has just a single stable master branch, and
I often need to work directly with this (e.g. committing small bug
fixes or adding more documentation). Therefore, if I set up a clone
of the master repository, I can work on the main branch very easily.

Now, when working on a branch for new features, I could just do
this locally, and when they are ready, merge them direct to the
master. However, this means others cannot look at my work (and
I find it a problem when working on multiple machines).

Alternatively, I could push the branches to the public "master"
repository.
This would be the simplest option BUT the high visibility
gives any such experimental branch disproportionate status.
I think this would be a good idea for important (multi-person)
efforts, like Python 3 work.

Instead, I have a github repository of my own (what github calls
a fork), and I push branches there.

http://github.com/biopython/biopython - the official branch(es)
http://github.com/peterjc/biopython - my branches

How does this work in practice? Like this - I clone the master and
add a reference to my repository (and I do the same when I want to
grab a branch from another developer):

    git clone git@github.com:biopython/biopython.git
    cd biopython
    git remote add peterjc git@github.com:peterjc/biopython.git
    git fetch peterjc

Then make a new local branch as usual, and when ready to share it
publicly, I push it to *my* repository on github:

    git branch new-work
    git checkout new-work
    git commit ...
    git push peterjc new-work

This would then appear as a new-work branch on my github page.

Then if I (or someone else) wants to access these branches later
(e.g. from another machine) just check out the tracked remote
branch. For example,

    git clone git@github.com:biopython/biopython.git
    cd biopython
    git remote add peterjc git@github.com:peterjc/biopython.git
    git fetch peterjc
    git checkout -t peterjc/seqio-imgt

This then looks like a normal branch (called just "seqio-imgt" in this
example), but git knows it is linked to the remote branch on the
"peterjc" repository (not the origin which is the "official" repository).

I'd have to check, but I guess that if the original git clone is done
with git://github.com/biopython/biopython.git instead (read only
access) the same procedure could be used by non core devs.
However, I'm not sure this is clearer for them. I think the current
procedure (on our wiki) where you add a remote reference to
the "upstream" official repository works better in this case.

Comments?

Peter

Useful links from Google searches:
http://www.gitready.com/intermediate/2009/01/09/checkout-remote-tracked-branch.html
http://www.gitready.com/beginner/2009/03/09/remote-tracking-branches.html

From biopython at maubp.freeserve.co.uk Mon Jun 7 13:40:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 14:40:54 +0100
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris
In-Reply-To:
References:
Message-ID:

On Sat, Jun 5, 2010 at 3:49 PM, Peter wrote:
> Hi all,
>
> Are any Biopython folk planning to be at the EuroSciPy
> conference in Paris this year (July 2010)? They are still
> finalising the Scientific track, but the list of tutorials is
> quite interesting already:
>
> http://www.euroscipy.org/conference/euroscipy2010
>
> Peter

Hi all,

The track list for the EuroSciPy 2010 Scientific track has now
been announced, and I'm delighted that I will be able to present
a talk on Biopython (likely 4pm Saturday 10 July). While I hope
there will be some other Biopython users there, this is a nice
opportunity to meet the broader scientific python community.
There are still places at the moment if you want to attend:
http://www.euroscipy.org/conference/euroscipy2010

Unfortunately I will not be attending BOSC or ISMB this year.
However Brad Chapman will be there to present the annual
"Biopython Project Update" talk (as well as helping to organise
this year's BOSC and the associated CodeFest event preceding
it). I'd love to have been there too, but I'm sure everyone
attending will have a great time. Again, registration is still open:
http://www.open-bio.org/wiki/BOSC_2010
http://www.open-bio.org/wiki/Codefest_2010

Regards,

Peter

P.S.
Those of you in North America might also be interested
in the main SciPy conference in Austin, Texas (28 June to
3 July 2010): http://conference.scipy.org/scipy2010/

From biopython at maubp.freeserve.co.uk Mon Jun 7 13:50:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 14:50:06 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <933074.46322.qm@web62405.mail.re1.yahoo.com>
References: <901919.44402.qm@web62402.mail.re1.yahoo.com> <933074.46322.qm@web62405.mail.re1.yahoo.com>
Message-ID:

On Fri, Jun 4, 2010 at 4:55 PM, Michiel de Hoon wrote:
> Michael, Peter, Sebastian, Laurent, Jose, and others,
>
> Thanks for your comments. It looks like there are lots of things to discuss,
> so let's start with the easiest ones.
>
> About converting a record to a string (point 5): I agree that using __str__ is
> probably not the best choice, so let's use __format__ instead, or add a "write"
> method. The added advantage of these is that we can print out a record in
> different formats (xml, text, table) by specifying the requested format as an
> argument.

The __format__ or format method sounds like a great idea
(following other bits of Biopython).

> For point 3), maybe my wording was confusing; actually what I had in mind
> is the case where a given Blast program can produce different output formats
> (xml, text, table, etc.). This was inspired by this bug report:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2176
> In my mind, the different output formats are just different intermediates, but
> in essence they are the same and should therefore be stored in the same
> class. So, if I run blastp, save the result as XML, and parse it, I'd expect the
> same class as when I run blastp and save and parse the output in table format.
> Just in the latter case, some information may be missing if it is not available in
> the output in table format. Does that sound acceptable?

I agree that records from all the different BLAST output
formats should be represented by a common base class -
but not necessarily the same class. For example, the
default plain text and XML formats include the pairwise
alignments, but the tabular output does not. To me having
a sub-class which stores the pairwise alignments seems
natural here.

Peter

From biopython at maubp.freeserve.co.uk Mon Jun 7 17:45:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 18:45:57 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
Message-ID:

Hi all,

Thanks for the lively discussion on the main list,
http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html
...
http://lists.open-bio.org/pipermail/biopython/2010-June/006580.html

I've spent the afternoon updating my old branch which uses SQLite
to store the record identifier to file offset mapping. Using the code
on this branch, Bio.SeqIO.index() supports a new optional argument
currently called "db" (other names I like include "cache";
suggestions welcome):

http://github.com/peterjc/biopython/tree/index-sqlite

The default (False) is not to use SQLite, but continue with an in
memory Python dictionary. As long as you have enough RAM
and don't plan to use the index at a later date, this will be fastest.

If set to True or a filename, then an SQLite index is used to hold
the offsets. This means very low RAM requirements, but is a lot
slower because the offsets are written to disk and the SQLite
index is updated as we go. I expect this part can be optimised
(e.g. try to build the index at the end, try committing in batches).

I'm still testing this, but the core of the work is done I think. Once
we're happy with the public API, we can concentrate on things
like the SQLite schema, and optimising the code.

Peter

P.S. I know it will need a little work to fail gracefully on Python 2.4
when SQLite isn't installed.
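
The two optimisations suggested above (commit in batches, build the lookup
index only at the end) can be sketched with the standard sqlite3 module. This
is only an illustration in present-day Python 3: the table name, column names,
and the helpers build_offset_index and get_offset are assumptions, not the
schema or code on the index-sqlite branch.

```python
import sqlite3


def build_offset_index(offsets, db_path=":memory:", batch_size=10000):
    """Store (record_id, file_offset) pairs in SQLite. Hypothetical schema.

    Rows are inserted in batches with one commit per batch, and the
    lookup index on the key column is created only after all rows are
    loaded, so it is not updated on every insert.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE offset_data (key TEXT, file_offset INTEGER)")
    batch = []
    for key, file_offset in offsets:
        batch.append((key, file_offset))
        if len(batch) >= batch_size:
            con.executemany("INSERT INTO offset_data VALUES (?, ?)", batch)
            con.commit()
            batch = []
    if batch:
        # Flush whatever is left over after the last full batch.
        con.executemany("INSERT INTO offset_data VALUES (?, ?)", batch)
        con.commit()
    # Building the index last also rejects duplicate record identifiers.
    con.execute("CREATE UNIQUE INDEX key_index ON offset_data (key)")
    con.commit()
    return con


def get_offset(con, key):
    """Return the stored file offset for key, or raise KeyError."""
    row = con.execute("SELECT file_offset FROM offset_data WHERE key = ?",
                      (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]


demo = build_offset_index([("seq1", 0), ("seq2", 812), ("seq3", 1650)],
                          batch_size=2)
```

Passing a filename instead of ":memory:" would keep the index on disk for
later reuse, which is the low-RAM case described above.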

From biopython at maubp.freeserve.co.uk Mon Jun 7 18:23:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 19:23:05 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To:
References:
Message-ID:

Peter wrote:
>...
>
> http://github.com/peterjc/biopython/tree/index-sqlite
>
> ... an SQLite index is used to hold
> the offsets. This means very low RAM requirements, but is a lot
> slower because the offsets are written to disk and the SQLite
> index is updated as we go. I expect this part can be optimised
> (e.g. try to build the index at the end, try committing in batches).

Having now tried using this on some files with tens of millions of
records, tuning how we use SQLite is going to be important.

Peter

From bioinformed at gmail.com Mon Jun 7 21:10:42 2010
From: bioinformed at gmail.com (Kevin Jacobs)
Date: Mon, 7 Jun 2010 17:10:42 -0400
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To:
References:
Message-ID:

On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
> Peter wrote:
> >...
> >
> > http://github.com/peterjc/biopython/tree/index-sqlite
> >
> > ... an SQLite index is used to hold
> > the offsets. This means very low RAM requirements, but is a lot
> > slower because the offsets are written to disk and the SQLite
> > index is updated as we go. I expect this part can be optimised
> > (e.g. try to build the index at the end, try committing in batches).
>
> Having now tried using this on some files with tens of millions of
> records, tuning how we use SQLite is going to be important.
>
>

Wouldn't a Berkeley database be much much faster for constructing simple key
to offset mappings?

-Kevin

From anaryin at gmail.com Tue Jun 8 00:45:05 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Mon, 7 Jun 2010 19:45:05 -0500
Subject: [Biopython-dev] [GSOC] Report - Week 1
Message-ID:

Dear all,

Eric suggested that I write a weekly email wrapping up my progress, any
problems I encountered, new ideas, etc. So, here's week 1 :)

*Proposed Tasks:* Wiki
*Project's Github account:* Link

*Progress:*

*1. Renumbering Residues*

I wrote a small function in Structure.py (link) that iterates over the
residues in a chain and subtracts the original first residue number. This
keeps gaps intact. It worked on my machine for a set of 75 proteins I was
working on. It also allows people to change the starting residue for
whatever reason, the default being 1.

I had originally thought of having a SEQRES parsing function and using this
as a base for the new renumbering. However, most structures that lack
residues (gaps) still count them in the numbering. Since there is no parser
for SEQRES, I thought this to be the best option.

*Example*

...
s = p.get_structure('a', '2KSX.pdb')
s.renumber_residues()
s.renumber_residues(start=0)

*2. Disulphide bond search*

I originally proposed to use the NeighborSearch method but I didn't know
that subtracting two atom objects gave me their distance. I used this
instead. I defined a threshold of 3 Å for an S-S bond since the average is
2.05 Å. I tried to get some paper/doc from other software where such a limit
would be already defined but I didn't find any, so I assigned 3 because its
results agreed with the SSBOND records. The user can provide a threshold
integer or float as an argument to make the search stricter or broader.

The function first generates an iterator over all the possible pairs of
cysteines in the protein. It then checks and yields those whose SG atoms
are closer than the threshold. The result is also an iterator with tuples
containing pairs of residue objects.

*Example*

...
s = p.get_structure('a', '2KSX.pdb')
[i for i in s.search_ss_bonds()]
[(, ), (, ), (, ), (, ), (, )]
len([i for i in s.search_ss_bonds(threshold=100)])
45

*Problems:*

*3. Biological Unit*

I added code to parse_pdb_header to extract the REMARK 350 section. It contains something like this (1IHM.pdb):

REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000
REMARK 350   BIOMT1   2  0.500000 -0.809017 -0.309017        0.00000
REMARK 350   BIOMT2   2  0.809017  0.309017  0.500000        0.00000
REMARK 350   BIOMT3   2 -0.309017 -0.500000  0.809017        0.00000
REMARK 350   BIOMT1   3 -0.309017 -0.500000 -0.809017        0.00000

I parse out the 4th column to identify each transformation. I store a 3x3 rotation matrix and the translation vector separately. It is then easy to apply them to each atom record via the transform function.

Now, the problem lies in what the output should be. We broke it down to two main options:

a. Create a new structure object for each rotated/translated object, thus making the final output a list of structures. This actually takes quite a while. I tried this with a deepcopy method to copy each structure and it took over 30 seconds on my machine for the PDB file above.

b. Add the new rotated objects as new chains in the original structure. This is actually a good solution because it allows people to use other methods (the SS search comes to mind) on quaternary structures. It also allows the user to write a file with all the structures in their place using PDBIO quite seamlessly. However, it might be complicated to deal with an excess of chains, or with cases where not all chains are supposed to be rotated (dunno if the case actually exists).

My personal belief is that B is the way to go. Although it adulterates the original structure with alien chains, it allows much greater flexibility. I haven't tested it though.

----

Comments? :)

João [...]
Rodrigues

From anaryin at gmail.com Tue Jun 8 03:42:27 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Mon, 7 Jun 2010 22:42:27 -0500
Subject: [Biopython-dev] [GSOC] Report - Week 1
In-Reply-To:
References:
Message-ID:

Just my own heads up and comment. I thought of using MODEL records to hold the rotated structures. Citing the PDB format guidelines:

> This record is used only when more than one model appears in an entry.
> *Generally, it is employed mainly for NMR structures.* The chemical
> connectivity should be the same for each model. ATOM, HETATM, ANISOU, and
> TER records for each model structure and are interspersed as needed
> between MODEL and ENDMDL records.

Since REMARK 350 seems to be an X-ray-exclusive feature and, conversely, MODEL an NMR one, I believe this could also be a possible solution.

I'm adding the code I wrote to Git. There is a huge speed problem with that deepcopy method. If someone has a faster/better alternative, it would be great, as this takes around 2 seconds per matrix.

Best!

João [...] Rodrigues
@ http://www.biopython.org/wiki/User:Joaor

From thomas.hamelryck at gmail.com Tue Jun 8 06:39:53 2010
From: thomas.hamelryck at gmail.com (Thomas Hamelryck)
Date: Tue, 8 Jun 2010 08:39:53 +0200
Subject: [Biopython-dev] [GSOC] Report - Week 1
In-Reply-To:
References:
Message-ID:

Hi all,

I think it's great that Bio.PDB is being updated. Here are some remarks:

I haven't seen much discussion about the one key feature of Bio.PDB that definitely needs to be improved: its speed. With the enormous increase in the number of structures, extracting data using Bio.PDB is too slow. It would be good to move some parts to C.
A second issue is nicely illustrated by the following code snippet:

> s = p.get_structure('a', '2KSX.pdb')
> [i for i in s.search_ss_bonds()]

I think this is NOT the way to do it. PDB files can contain anything: RNA, DNA, sugars, small molecules... It is thus not a good idea to directly associate protein-specific methods with the Structure class; it will lead to a bloated Structure class and a lot of irrelevant methods (e.g. search_ss_bonds is meaningless for a PDB file that contains RNA).

Currently, one creates Polypeptide objects from a Structure object using a factory design pattern (via PPBuilder); the Polypeptide class implements some protein-specific methods. I believe that is a much cleaner way to do it (though we need a Protein class that represents collections of connected polypeptides). One can also make sure that all such derived objects (Protein, NA, DNA,...) adhere to the same interface by providing a suitable base class with shared functionality - in that way, the whole thing is also extensible.

Something like:

s = p.get_structure('a', '2KSX.pdb')
pb = ProteinBuilder()
proteins = pb.build(s)
ssbridges = proteins.get_ss_bonds()

Here, "proteins" would represent a collection of polypeptide chains.

Cheers,

-Thomas

--
Thomas Hamelryck, Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://www.binf.ku.dk/research/structural_bioinformatics/

From lgautier at gmail.com Tue Jun 8 07:00:10 2010
From: lgautier at gmail.com (Laurent)
Date: Tue, 08 Jun 2010 09:00:10 +0200
Subject: [Biopython-dev] Biopython-dev Digest, Vol 89, Issue 8
In-Reply-To:
References:
Message-ID: <4C0DEA7A.1020606@gmail.com>

On 08/06/10 08:39, biopython-dev-request at lists.open-bio.org wrote:
> On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
>
>> > Peter wrote:
>>> > >...
>>> > >
>>> > > http://github.com/peterjc/biopython/tree/index-sqlite
>>> > >
>>> > > ...
an SQLite index is used to hold
>>> > > the offsets. This means very low RAM requirements, but is a lot
>>> > > slower because the offsets are written to disk and the SQLite
>>> > > index is updated as we go. I expect this part can be optimised
>>> > > (e.g. try to build the index at the end, try committing in batches).
>> >
>> > Having now tried using this on some files with tens of millions of
>> > records, tuning how we use SQLite is going to be important.
>> >
>
> Wouldn't a Berkeley database be much much faster for constructing simple key
> to offset mappings?
>
> -Kevin
>

Yes. If one is only looking for a key/value associative structure, the NoSQL solutions will be faster (tokyocabinet seems to be one of the fastest, up to 100x when compared to BerkeleyDB http://www.ioremap.net/node/235 ).

L.

From biopython at maubp.freeserve.co.uk Tue Jun 8 09:35:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 10:35:15 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To:
References:
Message-ID:

On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote:
> On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
>>
>> Having now tried using this on some files with tens of millions of
>> records, tuning how we use SQLite is going to be important.
>>
>
> Wouldn't a Berkeley database be much much faster for constructing
> simple key to offset mappings?
>

Maybe - now that I've done the refactoring on Bio.SeqIO.index() to allow two back ends (python dict or SQLite) trying a third (BDB) is much easier. Did you know BDB was used in the old OBDA index files? However, Python 2.6 deprecated bsddb (the Python Interface to Berkeley DB library) and Python is pushing people to SQLite3 instead.
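
For reference, here is a minimal sketch of the two tuning ideas floated earlier in this thread (committing in batches, and building the index at the end), using only the standard-library sqlite3 module. This is an illustration under stated assumptions, not what the index-sqlite branch does: the table name offset_data, the column names, the batch size and both function names are made up for the example.

```python
import sqlite3


def build_offset_index(db_path, key_offset_pairs, batch_size=100000):
    """Store (key, file offset) pairs in SQLite, committing in batches.

    Hypothetical helper: schema and names are assumptions for illustration.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE offset_data (key TEXT, file_offset INTEGER)")
    insert_sql = "INSERT INTO offset_data (key, file_offset) VALUES (?, ?)"
    batch = []
    for pair in key_offset_pairs:
        batch.append(pair)
        if len(batch) >= batch_size:
            # One transaction per batch instead of one per record.
            con.executemany(insert_sql, batch)
            con.commit()
            batch = []
    if batch:
        con.executemany(insert_sql, batch)
        con.commit()
    # Build the lookup index once, after all rows have been inserted.
    con.execute("CREATE UNIQUE INDEX key_index ON offset_data (key)")
    con.commit()
    return con


def lookup_offset(con, key):
    """Return the stored offset for key, or raise KeyError like a dict."""
    row = con.execute(
        "SELECT file_offset FROM offset_data WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]
```

Whether this closes the gap to BDB would have to be measured on real files; the sketch only shows the shape of the approach.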
Peter

From krother at rubor.de Tue Jun 8 09:59:43 2010
From: krother at rubor.de (Kristian Rother)
Date: Tue, 8 Jun 2010 11:59:43 +0200
Subject: [Biopython-dev] Tested Fixup branch for Bio.PDB
Message-ID:

Hi Eric,

I've checked out your pdbfixes branch and ran our 431 unit tests of ModeRNA with it. The results did not change compared with the master Bio.PDB branch --> for us everything is OK.

Details: ModeRNA (http://www.genesilico.pl/moderna) engineers RNA 3D structures and uses Bio.PDB for most of its operations: reading files, adding/copying/manipulating residues/atoms, superimposing structures, searching neighbors by KDTree, writing files. Admittedly, the tests most probably did not depend directly on the code you changed, but as I understand it you wanted to make sure the branch didn't break anything by accident.

Best Regards,
Kristian

From bioinformed at gmail.com Tue Jun 8 11:00:44 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 8 Jun 2010 07:00:44 -0400
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To:
References:
Message-ID:

On Tue, Jun 8, 2010 at 5:35 AM, Peter wrote:
> On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote:
> > On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
> >>
> >> Having now tried using this on some files with tens of millions of
> >> records, tuning how we use SQLite is going to be important.
> >>
> > Wouldn't a Berkeley database be much much faster for constructing
> > simple key to offset mappings?
>
> Maybe - now that I've done the refactoring on Bio.SeqIO.index() to
> allow two back ends (python dict or SQLite) trying a third (BDB) is
> much easier. Did you know BDB was used in the old OBDA index
> files? However, Python 2.6 deprecated bsddb (the Python Interface
> to Berkeley DB library) and Python is pushing people to SQLite3
> instead.
>

Hi Peter,

I am aware that SQLite is taking over the job of serving as the default embedded database for Python and am in vigorous agreement with that trend.
I use SQLite for a wide range of tasks and am extremely happy with it for most applications. Unfortunately, for pure key-value mapping tasks, I've found SQLite to be 4-10x slower than a well-tuned BDB tree, even with batched updates and using the most aggressive SQLite performance pragmas. My results may not be typical, but I thought I'd raise the issue given the magnitude of the performance difference. Best regards, -Kevin From mjldehoon at yahoo.com Tue Jun 8 12:19:28 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 8 Jun 2010 05:19:28 -0700 (PDT) Subject: [Biopython-dev] Blast parsers and records In-Reply-To: Message-ID: <14055.47665.qm@web62401.mail.re1.yahoo.com> --- On Mon, 6/7/10, Peter wrote: > I agree that records from all the different BLAST output > formats should be represented by a common base class - > but not necessarily the same class. > For example, the default plain text and XML formats include > the pairwise alignments, but the tabular output does not. To > me having a sub-class which stores the pairwise alignments seems > natural here. Why do we need a sub-class? We don't do this in Bio.SeqIO, where GenBank files contain much more information than Fasta files, but both are represented by a SeqRecord. Best, --Michiel. From biopython at maubp.freeserve.co.uk Tue Jun 8 12:32:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 8 Jun 2010 13:32:05 +0100 Subject: [Biopython-dev] Blast parsers and records In-Reply-To: <14055.47665.qm@web62401.mail.re1.yahoo.com> References: <14055.47665.qm@web62401.mail.re1.yahoo.com> Message-ID: On Tue, Jun 8, 2010 at 1:19 PM, Michiel de Hoon wrote: > --- On Mon, 6/7/10, Peter wrote: >> I agree that records from all the different BLAST output >> formats should be represented by a common base class - >> but not necessarily the same class. >> For example, the default plain text and XML formats include >> the pairwise alignments, but the tabular output does not. 
To >> me having a sub-class which stores the pairwise alignments seems >> natural here. > > Why do we need a sub-class? We don't do this in Bio.SeqIO, > where GenBank files contain much more information than Fasta > files, but both are represented by a SeqRecord. OK, I guess you could have some properties which are left empty (like the annotations dictionary or features list in a SeqRecord from a FASTA file). Peter From mjldehoon at yahoo.com Tue Jun 8 13:44:01 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 8 Jun 2010 06:44:01 -0700 (PDT) Subject: [Biopython-dev] Blast parsers and records In-Reply-To: Message-ID: <756890.46421.qm@web62404.mail.re1.yahoo.com> --- On Tue, 6/8/10, Peter wrote: > > Why do we need a sub-class? We don't do this in > > Bio.SeqIO, where GenBank files contain much more > > information than Fasta files, but both are > > represented by a SeqRecord. > > OK, I guess you could have some properties which are left > empty > (like the annotations dictionary or features list in a > SeqRecord from a FASTA file). I would prefer that, as it keeps things simple and consistent with other parts of Biopython. But let's see how it goes. Over the weekend I'll set up a rudimentary Blast parser and record so we can see what it would look like in practice. --Michiel From bpederse at gmail.com Tue Jun 8 15:47:18 2010 From: bpederse at gmail.com (Brent Pedersen) Date: Tue, 8 Jun 2010 08:47:18 -0700 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Tue, Jun 8, 2010 at 4:00 AM, Kevin Jacobs wrote: > On Tue, Jun 8, 2010 at 5:35 AM, Peter wrote: > >> On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote: >> > On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote: >> >> >> >> Having now tried using this on some files with tens of millions of >> >> records, tuning how we use SQLite is going to be important. 
>> >> >> > Wouldn't a Berkeley database be much much faster for constructing >> > simple key to offset mappings? >> >> Maybe - now that I've done the refactoring on Bio.SeqIO.index() to >> allow two back ends (python dict or SQLite) trying a third (BDB) is >> much easier. Did you know BDB was used in the old OBDA index >> files? However, Python 2.6 deprecated bsddb (the Python Interface >> to Berkeley DB library) and Python is pushing people to SQLite3 >> instead. >> >> > Hi Peter, > > I am aware that SQLite is taking over the job of serving as the default > embedded database for Python and am in vigorous agreement with that trend. > ?I use SQLite for a wide range of tasks and am extremely happy with it for > most applications. ?Unfortunately, for pure key-value mapping tasks, I've > found ?SQLite to be 4-10x slower than a well-tuned BDB tree, even with > batched updates and using the most aggressive SQLite performance pragmas. My > results may not be typical, but I thought I'd raise the issue given the > magnitude of the performance difference. > > Best regards, > -Kevin > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > my results may not be typical either, but using an earlier version of peter's sqlite biopython branch and comparing to screed (http://github.com/acr/screed), and my file-index (http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i found that biopython's implementation is at most, a bit more than 2x slower. and it does the fastq parsing much more rigorously. also, i didn't see much difference between berkeleydb and tokyocabinet--though the ctypes-based TC wrapper i was using has since been streamlined. 
here's what i saw for 15+ million records with this script:
http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py

/opt/src/methylcode/data/s_1_sequence.txt
benchmarking fastq file with 15646356 records (62585424 lines)
performing 500000 random queries

screed
------
create: 704.764
search: 51.717

biopython-sqlite
----------------
create: 727.868
search: 92.947

fileindex
---------
create: 294.356
search: 53.701

From biopython at maubp.freeserve.co.uk Tue Jun 8 16:35:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 17:35:07 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To:
References:
Message-ID:

On Tue, Jun 8, 2010 at 4:47 PM, Brent Pedersen wrote:
>
> my results may not be typical either, but using an earlier version of
> peter's sqlite biopython branch and comparing to screed
> (http://github.com/acr/screed), and my file-index
> (http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i
> found that biopython's implementation is at most, a bit more than 2x
> slower. and it does the fastq parsing much more rigorously.
>
> also, i didn't see much difference between berkeleydb and
> tokyocabinet--though the ctypes-based TC wrapper i was using has since
> been streamlined.
> here's what i saw for 15+ million records with this script:
> http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py
>
> /opt/src/methylcode/data/s_1_sequence.txt
> benchmarking fastq file with 15646356 records (62585424 lines)
> performing 500000 random queries
>
> screed
> ------
> create: 704.764
> search: 51.717
>
> biopython-sqlite
> ----------------
> create: 727.868
> search: 92.947
>
> fileindex
> ---------
> create: 294.356
> search: 53.701

Are you using a recent version of screed (with SQLite internally)? Which back end are your "fileindex" numbers for? BDB?
I'd say that the slow "search" from (the old branch of) Biopython is down to our FASTQ parsing time, which includes lots of object creation. The get_raw method can be useful here depending on what you want to achieve: http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/ The version you tried didn't do anything clever with the SQLite indexes, batched inserts etc. I'm hoping the current code will be faster (although there is likely a penalty from having two switchable back ends). Brent, could you re-run this benchmark with this code: http://github.com/peterjc/biopython/tree/index-sqlite-batched You'll need to change the Biopython call in your test script from this (it was renamed before landing on the trunk): fi = SeqIO.indexed_dict(f, idx, "fastq") to this: fi = SeqIO.index(f, idx, "fastq", db=True) or give an explicit filename: fi = SeqIO.index(f, idx, "fastq", db="/tmp/filename.idx") where db is the new parameter for controlling where and if the lookup table is stored on disk. Peter From anaryin at gmail.com Tue Jun 8 17:10:48 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 8 Jun 2010 12:10:48 -0500 Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements In-Reply-To: References: Message-ID: Hello all, I'm replying here to what Thomas wrote on the GSOC Report thread because it seems a better place. PDB files can contain anything RNA, DNA, sugars, small molecules... It is > thus not a good idea to > directly associate protein-specific methods to the structure class; it will > lead to a bloated Structure class and a lot of irrelevant methods (ie. > search_ss_bonds is meaningless for a PDB file that contains RNA). Agree. Currently, one creates Polypeptide objects from a Structure object using a > factory design pattern (via PPBuilder); the Polypeptide class implements > some protein specific methods. 
I believe that is a much cleaner way to do it
> (though we need a Protein class that represents collections of connected
> polypeptides). One can also make sure that all such derived objects
> (Protein, NA, DNA,...) adhere to the same interface by providing a suitable
> base class with shared functionality - in that way, the whole thing is also
> extendible.
>

I think there has already been some discussion about this. My personal opinion/suggestion is to have a layout like:

Bio.PDB/
_______/Protein.py
_______/DNA.py
_______/RNA.py

that would translate to a usage of something like:

from Bio.PDB import Protein
structure = Protein('1ABC.pdb')
structure.search_ss_bonds()

but not

structure.calc_melting_temperature() (just an example)

Protein() would call PDBParser(). It could also include, to a certain extent, an Alphabet-like feature to ensure residue names are OK (this goes a bit with this proposal).

I believe this goes a bit into what you said: having a class that basically abstracts what we do now (Bio.PDB.PDBParser) and allows for molecule-specific methods. However, it also leads to some problems: Protein/DNA complexes come to mind.

How does this sound? I think it goes with what Eric said in the first post of this thread and what Thomas replied in the GSOC thread. We should also change the PDB name to Struct to better reflect the purpose of the module. All of the other additions like Bio.Struct.WWW would still apply. And I don't see a major problem in breaking the existing code by adding this.

João

From tiagoantao at gmail.com Tue Jun 8 19:12:00 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 8 Jun 2010 20:12:00 +0100
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To:
References:
Message-ID:

On Mon, Jun 7, 2010 at 10:35 AM, Peter wrote:
> Comments?

Maybe put this on the wiki as doc for good practice?
From biopython at maubp.freeserve.co.uk Tue Jun 8 19:41:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 20:41:03 +0100
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To:
References:
Message-ID:

2010/6/8 Tiago Antão:
> On Mon, Jun 7, 2010 at 10:35 AM, Peter wrote:
>> Comments?
>
> Maybe put this on the wiki as doc for good practice?

So this does seem like a sensible approach (for those of us with commit access to the main repository)? We can add it to the git usage page then...

http://www.biopython.org/wiki/GitUsage

Peter

From eric.talevich at gmail.com Tue Jun 8 21:45:42 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 8 Jun 2010 17:45:42 -0400
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To:
References:
Message-ID:

On Mon, Jun 7, 2010 at 5:35 AM, Peter wrote:
> Hi all,
>
> I thought I'd write down some notes about how I've been using git recently.
> This may be of interest to any of the other core developers (those of us
> with read-write access to the main repository), and I might get some good
> tips from any discussion. The key point is that I have read+write access
> to two repositories on github (the official repository AND my own fork),
> so there are different advantages/disadvantages about which I choose
> to work with directly as my main repository.
>
> [...]
>
> Instead, I have a github repository of my own (what github calls a
> fork), and I push branches there.
>
> http://github.com/biopython/biopython - the official branch(es)
> http://github.com/peterjc/biopython - my branches
>
> How does this work in practice?
Like this - I clone the master
> and add a reference to my repository (and I do the same when I
> want to grab a branch from another developer):
>
> git clone git@github.com:biopython/biopython.git
> cd biopython
> git remote add peterjc git@github.com:peterjc/biopython.git
> git fetch peterjc
>
> Then make a new local branch as usual, and when ready to share
> it publicly, I push it to *my* repository on github:
>
> git branch new-work
> git checkout new-work
> git commit ...
> git push peterjc new-work
>
> This would then appear as a new-work branch on my github page.
> Then if I (or someone else) wants to access these branches later
> (e.g. from another machine) just use the checkout tracked remote
> branch. For example,
>
> git clone git@github.com:biopython/biopython.git
> cd biopython
> git remote add peterjc git@github.com:peterjc/biopython.git
> git fetch peterjc
> git checkout -t peterjc/seqio-imgt
>
> This then looks like a normal branch (called just "seqio-imgt" in
> this example), but git knows it is linked to the remote branch on
> the "peterjc" repository (not the origin which is the "official"
> repository).
>

This looks reasonable to me. I'd add that the procedure to delete a public branch from your personal fork on GitHub is a little obscure:

git branch -a # list local and remote branches
git branch -d new-work # delete a local branch that's been merged already
git push peterjc :new-work # delete the public branch from GitHub

This doesn't do what you'd expect:

git branch -d peterjc/new-work

That only removes your local reference to the public branch; the branch is still visible on GitHub. (It's kind of hard to find in the GitHub documentation.)

> I'd have to check, but I guess that if the original git clone is done
> with git://github.com/biopython/biopython.git instead (read only
> access) the same procedure could be used by non core devs.
> However, I'm not sure this is clearer for them.
I think the current
> procedure (on our wiki) where you add a remote reference to
> the "upstream" official repository works better in this case.
>

I still have an "upstream" reference to the main repo. I wouldn't want to accidentally push something foolish to the main repo with a stray "git push"... better to have the safe thing happen by default.

If the initial clone was from biopython master, and you later create a personal fork on GitHub, then it's not too hard to switch the references around in your local repo to make the public fork your "origin".

-Eric

From bugzilla-daemon at portal.open-bio.org Tue Jun 8 22:52:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Jun 2010 18:52:28 -0400
Subject: [Biopython-dev] [Bug 3096] New: PPBuilder build_peptides bugs
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3096

Summary: PPBuilder build_peptides bugs
Product: Biopython
Version: Not Applicable
Platform: Other
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: skong at zymeworks.com

Given a chain of backbone-connected residues 'IXRGXTGL' that contains two non-standard amino acids 'X', building peptides with a builder that accepts only standard amino acids should return two peptides, 'RG' and 'TGL'. 'I' should not be returned as a peptide since it is just one residue. Currently Biopython returns 'IXGXGL', which shows two bugs:

1. It skips the standard amino acids R and T after each X, while keeping X (it should skip X, not R or T). Related to http://bugzilla.open-bio.org/show_bug.cgi?id=2910 and http://lists.open-bio.org/pipermail/biopython/2009-September/005532.html

2. It returns one peptide even though, after filtering, the two X residues which connect 'I', 'RG' and 'TGL' are no longer present, and the fragment 'IRGTGL' cannot be considered a valid peptide without the two Xs connecting them.
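
As a sanity check of the behaviour described above, here is a minimal pure-Python sketch. It is a toy model, not the Bio.PDB code: residues are single letters, every neighbouring pair is assumed to be connected, accept() simply rejects 'X', and the function names are made up for illustration. The first function follows the loop currently in build_peptides(); the second applies the two changes proposed in this report.

```python
def accept(residue):
    """Toy stand-in for the builder's filter: 'X' is non-standard."""
    return residue != "X"


def build_peptides_current(chain, aa_only=True):
    """Mimic the current loop; all neighbours are assumed connected."""
    pp_list = []
    chain_it = iter(chain)
    prev = next(chain_it)
    pp = None
    for nxt in chain_it:
        if aa_only and not accept(prev):
            prev = nxt
            continue
        if pp is None:
            pp = [prev]
            pp_list.append(pp)
        pp.append(nxt)
        prev = nxt
    return ["".join(pp) for pp in pp_list]


def build_peptides_proposed(chain, aa_only=True):
    """Same loop with both proposed changes applied."""
    pp_list = []
    chain_it = iter(chain)
    prev = next(chain_it)
    pp = None
    for nxt in chain_it:
        # Change 1: reject the pair if either residue is filtered.
        if aa_only and (not accept(prev) or not accept(nxt)):
            prev = nxt
            pp = None  # Change 2: a filtered residue ends the peptide.
            continue
        if pp is None:
            pp = [prev]
            pp_list.append(pp)
        pp.append(nxt)
        prev = nxt
    return ["".join(pp) for pp in pp_list]


print(build_peptides_current("IXRGXTGL"))   # ['IXGXGL']
print(build_peptides_proposed("IXRGXTGL"))  # ['RG', 'TGL']
```

In this toy model the current loop reproduces the reported 'IXGXGL', and the proposed loop gives 'RG' and 'TGL' while dropping the lone 'I'.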
The above sequence 'IXRGXTGL' are taken from 1bfe and mutated. The 'mutation' referred here is simply renaming the residue name to something that is not standard and represented as 'X'. Each solution proposed below is meant to fix respective bug above: 1. Insert (not accept(prev) or not accept(next)) after if aa_only check at line 299 of Bio/PDB/Polypeptide.py 2. Insert pp=None when either of the residues compared are filtered at line 300 or Bio/PDB/Polypeptide.py Amino acids filtering bug in method build_peptides() of class _PPBuilder ofin Bio/PDB/Polypeptide.py: Original: for chain in chain_list: chain_it=iter(chain) prev=chain_it.next() pp=None for next in chain_it: if aa_only and not accept(prev): prev=next continue if is_connected(prev, next): if pp is None: pp=Polypeptide() pp.append(prev) pp_list.append(pp) pp.append(next) else: pp=None prev=next return pp_list Fixed: for chain in chain_list: chain_it=iter(chain) prev=chain_it.next() pp=None for next in chain_it: if aa_only and (not accept(prev) or not accept(next)): prev=next; pp=None continue if is_connected(prev, next): if pp is None: pp=Polypeptide() pp.append(prev) pp_list.append(pp) pp.append(next) else: pp=None prev=next return pp_list Attached here is the code used to test the above case, with and without mutations, and with and without standard amino acid filtering. The case without mutation is just to show that the backbone atoms of the mutated version are connected: from Bio.PDB.PDBParser import PDBParser from Bio.PDB.Polypeptide import PPBuilder, is_aa class StandardAABuilder(PPBuilder): """ Polypeptide builder which accepts only standard amino acids.""" def _accept(self, residue): return is_aa(residue, standard=True) def extract_peptides(model): """Extracts the peptides from a model. 
    Returns a list of Peptide object."""
    output = []
    for peptide in PPBuilder().build_peptides(model):
        seq = str(peptide.get_sequence())
        output.append(seq)
    return output

def extract_peptides_saa(model):
    """Extracts the peptides from a model.
    Returns a list of Peptide object."""
    output = []
    for peptide in StandardAABuilder().build_peptides(model):
        seq = str(peptide.get_sequence())
        output.append(seq)
    return output

if __name__ == '__main__':
    oripdb = open('chopped_pdb1bfe.ent')
    sto = PDBParser().get_structure('', oripdb)
    seqao = extract_peptides(sto)
    seqbo = extract_peptides_saa(sto)
    print 'ori seq all '
    print seqao
    print 'ori seq standard only'
    print seqbo
    pdb = open('chopped_mutated_pdb1bfe.ent')
    st = PDBParser().get_structure('', pdb)
    seqa = extract_peptides(st)
    seqb = extract_peptides_saa(st)
    print 'mut seq all'
    print seqa
    print 'mut seq standard only '
    print seqb

Attached below are the two fragments of PDB files, pre and post mutated.

chopped_pdb1bfe.ent
ATOM 85 N ILE A 316 37.386 71.217 31.070 1.00 36.97 N
ATOM 86 CA ILE A 316 38.311 71.290 29.949 1.00 33.71 C
ATOM 87 C ILE A 316 37.634 72.103 28.862 1.00 33.93 C
ATOM 88 O ILE A 316 36.415 72.216 28.839 1.00 36.46 O
ATOM 89 CB ILE A 316 38.651 69.876 29.404 1.00 35.79 C
ATOM 90 CG1 ILE A 316 39.331 69.049 30.501 1.00 36.78 C
ATOM 91 CG2 ILE A 316 39.572 69.979 28.187 1.00 37.71 C
ATOM 92 CD1 ILE A 316 39.881 67.724 30.023 1.00 39.20 C
ATOM 93 N HIS A 317 38.425 72.679 27.969 1.00 35.61 N
ATOM 94 CA HIS A 317 37.880 73.473 26.881 1.00 37.92 C
ATOM 95 C HIS A 317 38.360 72.928 25.540 1.00 37.79 C
ATOM 96 O HIS A 317 39.463 73.240 25.094 1.00 37.44 O
ATOM 97 CB HIS A 317 38.303 74.930 27.052 1.00 35.19 C
ATOM 98 CG HIS A 317 37.888 75.519 28.363 1.00 35.76 C
ATOM 99 ND1 HIS A 317 36.611 75.981 28.602 1.00 37.74 N
ATOM 100 CD2 HIS A 317 38.575 75.701 29.516 1.00 37.59 C
ATOM 101 CE1 HIS A 317 36.529 76.420 29.844 1.00 38.74 C
ATOM 102 NE2 HIS A 317 37.706 76.262 30.421 1.00 36.76 N
ATOM 103 N ARG A 318 37.527 72.109
24.905 1.00 38.78 N ATOM 104 CA ARG A 318 37.884 71.512 23.627 1.00 42.04 C ATOM 105 C ARG A 318 38.469 72.559 22.699 1.00 45.14 C ATOM 106 O ARG A 318 39.592 72.425 22.205 1.00 42.05 O ATOM 107 CB ARG A 318 36.657 70.880 22.967 1.00 42.93 C ATOM 108 CG ARG A 318 36.934 70.321 21.576 1.00 38.60 C ATOM 109 CD ARG A 318 35.654 70.038 20.821 1.00 35.39 C ATOM 110 NE ARG A 318 34.624 69.538 21.724 1.00 34.96 N ATOM 111 CZ ARG A 318 34.539 68.278 22.141 1.00 31.51 C ATOM 112 NH1 ARG A 318 35.419 67.373 21.736 1.00 25.19 N ATOM 113 NH2 ARG A 318 33.579 67.929 22.983 1.00 29.10 N ATOM 114 N GLY A 319 37.690 73.604 22.461 1.00 49.96 N ATOM 115 CA GLY A 319 38.138 74.668 21.592 1.00 55.53 C ATOM 116 C GLY A 319 38.459 74.219 20.180 1.00 58.85 C ATOM 117 O GLY A 319 37.583 73.766 19.440 1.00 58.98 O ATOM 118 N SER A 320 39.734 74.334 19.823 1.00 61.64 N ATOM 119 CA SER A 320 40.219 73.992 18.493 1.00 63.16 C ATOM 120 C SER A 320 40.212 72.517 18.110 1.00 65.27 C ATOM 121 O SER A 320 39.558 72.127 17.145 1.00 65.12 O ATOM 122 CB SER A 320 41.634 74.542 18.316 1.00 65.36 C ATOM 123 OG SER A 320 42.124 74.255 17.019 1.00 72.05 O ATOM 124 N THR A 321 40.955 71.702 18.853 1.00 67.43 N ATOM 125 CA THR A 321 41.049 70.274 18.562 1.00 67.73 C ATOM 126 C THR A 321 40.220 69.430 19.529 1.00 66.41 C ATOM 127 O THR A 321 39.244 69.917 20.095 1.00 70.21 O ATOM 128 CB THR A 321 42.517 69.810 18.620 1.00 70.22 C ATOM 129 OG1 THR A 321 42.613 68.453 18.169 1.00 77.03 O ATOM 130 CG2 THR A 321 43.049 69.915 20.045 1.00 72.07 C ATOM 131 N GLY A 322 40.608 68.168 19.707 1.00 61.22 N ATOM 132 CA GLY A 322 39.892 67.286 20.614 1.00 53.23 C ATOM 133 C GLY A 322 40.037 67.705 22.065 1.00 48.00 C ATOM 134 O GLY A 322 40.138 68.892 22.372 1.00 50.41 O ATOM 135 N LEU A 323 40.044 66.734 22.968 1.00 41.92 N ATOM 136 CA LEU A 323 40.190 67.033 24.385 1.00 35.58 C ATOM 137 C LEU A 323 41.613 66.738 24.874 1.00 31.41 C ATOM 138 O LEU A 323 41.932 66.921 26.046 1.00 30.47 O ATOM 139 CB LEU A 323 39.160 
66.240 25.191 1.00 35.76 C ATOM 140 CG LEU A 323 37.716 66.576 24.802 1.00 39.50 C ATOM 141 CD1 LEU A 323 36.733 65.796 25.670 1.00 38.15 C ATOM 142 CD2 LEU A 323 37.493 68.074 24.955 1.00 38.58 C PDB FILE: mutated_chopped_pdb1bfe.ent ATOM 85 N ILE A 316 37.386 71.217 31.070 1.00 36.97 N ATOM 86 CA ILE A 316 38.311 71.290 29.949 1.00 33.71 C ATOM 87 C ILE A 316 37.634 72.103 28.862 1.00 33.93 C ATOM 88 O ILE A 316 36.415 72.216 28.839 1.00 36.46 O ATOM 89 CB ILE A 316 38.651 69.876 29.404 1.00 35.79 C ATOM 90 CG1 ILE A 316 39.331 69.049 30.501 1.00 36.78 C ATOM 91 CG2 ILE A 316 39.572 69.979 28.187 1.00 37.71 C ATOM 92 CD1 ILE A 316 39.881 67.724 30.023 1.00 39.20 C ATOM 93 N HIE A 317 38.425 72.679 27.969 1.00 35.61 N ATOM 94 CA HIE A 317 37.880 73.473 26.881 1.00 37.92 C ATOM 95 C HIE A 317 38.360 72.928 25.540 1.00 37.79 C ATOM 96 O HIE A 317 39.463 73.240 25.094 1.00 37.44 O ATOM 97 CB HIE A 317 38.303 74.930 27.052 1.00 35.19 C ATOM 98 CG HIE A 317 37.888 75.519 28.363 1.00 35.76 C ATOM 99 ND1 HIE A 317 36.611 75.981 28.602 1.00 37.74 N ATOM 100 CD2 HIE A 317 38.575 75.701 29.516 1.00 37.59 C ATOM 101 CE1 HIE A 317 36.529 76.420 29.844 1.00 38.74 C ATOM 102 NE2 HIE A 317 37.706 76.262 30.421 1.00 36.76 N ATOM 103 N ARG A 318 37.527 72.109 24.905 1.00 38.78 N ATOM 104 CA ARG A 318 37.884 71.512 23.627 1.00 42.04 C ATOM 105 C ARG A 318 38.469 72.559 22.699 1.00 45.14 C ATOM 106 O ARG A 318 39.592 72.425 22.205 1.00 42.05 O ATOM 107 CB ARG A 318 36.657 70.880 22.967 1.00 42.93 C ATOM 108 CG ARG A 318 36.934 70.321 21.576 1.00 38.60 C ATOM 109 CD ARG A 318 35.654 70.038 20.821 1.00 35.39 C ATOM 110 NE ARG A 318 34.624 69.538 21.724 1.00 34.96 N ATOM 111 CZ ARG A 318 34.539 68.278 22.141 1.00 31.51 C ATOM 112 NH1 ARG A 318 35.419 67.373 21.736 1.00 25.19 N ATOM 113 NH2 ARG A 318 33.579 67.929 22.983 1.00 29.10 N ATOM 114 N GLY A 319 37.690 73.604 22.461 1.00 49.96 N ATOM 115 CA GLY A 319 38.138 74.668 21.592 1.00 55.53 C ATOM 116 C GLY A 319 38.459 74.219 20.180 
1.00 58.85 C ATOM 117 O GLY A 319 37.583 73.766 19.440 1.00 58.98 O ATOM 118 N XQQ A 320 39.734 74.334 19.823 1.00 61.64 N ATOM 119 CA XQQ A 320 40.219 73.992 18.493 1.00 63.16 C ATOM 120 C XQQ A 320 40.212 72.517 18.110 1.00 65.27 C ATOM 121 O XQQ A 320 39.558 72.127 17.145 1.00 65.12 O ATOM 122 CB XQQ A 320 41.634 74.542 18.316 1.00 65.36 C ATOM 123 OG XQQ A 320 42.124 74.255 17.019 1.00 72.05 O ATOM 124 N THR A 321 40.955 71.702 18.853 1.00 67.43 N ATOM 125 CA THR A 321 41.049 70.274 18.562 1.00 67.73 C ATOM 126 C THR A 321 40.220 69.430 19.529 1.00 66.41 C ATOM 127 O THR A 321 39.244 69.917 20.095 1.00 70.21 O ATOM 128 CB THR A 321 42.517 69.810 18.620 1.00 70.22 C ATOM 129 OG1 THR A 321 42.613 68.453 18.169 1.00 77.03 O ATOM 130 CG2 THR A 321 43.049 69.915 20.045 1.00 72.07 C ATOM 131 N GLY A 322 40.608 68.168 19.707 1.00 61.22 N ATOM 132 CA GLY A 322 39.892 67.286 20.614 1.00 53.23 C ATOM 133 C GLY A 322 40.037 67.705 22.065 1.00 48.00 C ATOM 134 O GLY A 322 40.138 68.892 22.372 1.00 50.41 O ATOM 135 N LEU A 323 40.044 66.734 22.968 1.00 41.92 N ATOM 136 CA LEU A 323 40.190 67.033 24.385 1.00 35.58 C ATOM 137 C LEU A 323 41.613 66.738 24.874 1.00 31.41 C ATOM 138 O LEU A 323 41.932 66.921 26.046 1.00 30.47 O ATOM 139 CB LEU A 323 39.160 66.240 25.191 1.00 35.76 C ATOM 140 CG LEU A 323 37.716 66.576 24.802 1.00 39.50 C ATOM 141 CD1 LEU A 323 36.733 65.796 25.670 1.00 38.15 C ATOM 142 CD2 LEU A 323 37.493 68.074 24.955 1.00 38.58 C -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bpederse at gmail.com Wed Jun 9 04:33:12 2010 From: bpederse at gmail.com (Brent Pedersen) Date: Tue, 8 Jun 2010 21:33:12 -0700 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Tue, Jun 8, 2010 at 9:35 AM, Peter wrote: > On Tue, Jun 8, 2010 at 4:47 PM, Brent Pedersen wrote: >> >> my results may not be typical either, but using an earlier version of >> peter's sqlite biopython branch and comparing to screed >> (http://github.com/acr/screed), and my file-index >> (http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i >> found that biopython's implementation is at most, a bit more than 2x >> slower. and it does the fastq parsing much more rigorously. >> >> also, i didn't see much difference between berkeleydb and >> tokyocabinet--though the ctypes-based TC wrapper i was using has since >> been streamlined. >> here's what i saw for 15+ million records with this script: >> http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py >> >> /opt/src/methylcode/data/s_1_sequence.txt >> benchmarking fastq file with 15646356 records (62585424 lines) >> performing 500000 random queries >> >> screed >> ------ >> create: 704.764 >> search: 51.717 >> >> biopython-sqlite >> ---------------- >> create: 727.868 >> search: 92.947 >> >> fileindex >> --------- >> create: 294.356 >> search: 53.701 > > Are you using a recent version of screed (with SQLite internally)? > > Which back end are your "fileindex" numbers for? BDB? > > I'd say that the slow "search" from (the old branch of) Biopython is > down to our FASTQ parsing time, which includes lots of object > creation. The get_raw method can be useful here depending on > what you want to achieve: > http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/ > > The version you tried didn't do anything clever with the SQLite > indexes, batched inserts etc. 
I'm hoping the current code will be > faster (although there is likely a penalty from having two switchable > back ends). Brent, could you re-run this benchmark with this code: > http://github.com/peterjc/biopython/tree/index-sqlite-batched > > You'll need to change the Biopython call in your test script from > this (it was renamed before landing on the trunk): > > fi = SeqIO.indexed_dict(f, idx, "fastq") > > to this: > > fi = SeqIO.index(f, idx, "fastq", db=True) > > or give an explicit filename: > > fi = SeqIO.index(f, idx, "fastq", db="/tmp/filename.idx") > > where db is the new parameter for controlling where and if > the lookup table is stored on disk. > > Peter > done. the previous times and the current were using py-tcdb not bsddb. the author of tcdb made some improvements so it's faster this time, and your SeqIO implementation is almost 2x as fast to load as the previous one. that's a nice implementation. i didn't try get_raw. these timings are with your latest version, and the version of screed pulled from http://github.com/acr/screed master today.
/opt/src/methylcode/data/s_1_sequence.txt benchmarking fastq file with 15646356 records (62585424 lines) performing 500000 random queries screed ------ create: 699.210 search: 51.043 biopython-sqlite ---------------- create: 386.647 search: 93.391 fileindex --------- create: 184.088 search: 48.887 From bugzilla-daemon at portal.open-bio.org Wed Jun 9 08:43:02 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Jun 2010 04:43:02 -0400 Subject: [Biopython-dev] [Bug 3096] PPBuilder build_peptides bugs In-Reply-To: Message-ID: <201006090843.o598h2tx024780@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3096 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-09 04:43 EST ------- (In reply to comment #0) > Given a chain of backbone connected residues 'IXRGXTGL' that contains two > non-standard amino acids 'X' in between, building peptide with only standard > amino acid builder should return two peptides 'RG' and 'TGL'. 'I' should not > be returned as a peptide since it is just one residue. Currently biopython > would return 'IXGXGL', with two bugs in between: What is wrong with returning 'IXGXGL'? The PDB contains a peptide of six linked residues doesn't it? It looks like Bio.PDB is doing something sensible. P.S. You didn't fill in which version of Biopython you are using. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Jun 9 08:55:37 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Jun 2010 09:55:37 +0100 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Wed, Jun 9, 2010 at 5:33 AM, Brent Pedersen wrote: >> >> The version you tried didn't do anything clever with the SQLite >> indexes, batched inserts etc. 
I'm hoping the current code will be >> faster (although there is likely a penalty from having two switchable >> back ends). Brent, could you re-run this benchmark with this code: >> http://github.com/peterjc/biopython/tree/index-sqlite-batched >> ... > > done. Thank you Brent :) > the previous times and the current were using py-tcdb not bsddb. > the author of tcdb made some improvements so it's faster this time, OK, so you are using Tokyo Cabinet to store the lookup table here rather than BDB. Link, http://code.google.com/p/py-tcdb/ > and your SeqIO implementation is almost 2x as fast to load as the > previous one. that's a nice implementation. i didn't try get_raw. I've got some more re-factoring in mind which should help a little more (but mainly to make the structure clearer). > these timints are are with your latest version, and the version of > screed pulled from http://github.com/acr/screed master today. Having had a quick look, they are using SQLite3 in much the same way as I was initially. They create the index before loading (rather than after loading) and they use a single insert per offset (rather than using a batch in a transaction or the executemany method). I'm pretty sure from my experiments those changes would speed up screed's loading time a lot (probably in line with the speed up I achieved). > /opt/src/methylcode/data/s_1_sequence.txt > benchmarking fastq file with 15646356 records (62585424 lines) > performing 500000 random queries > > screed > ------ > create: 699.210 > search: 51.043 > > biopython-sqlite > ---------------- > create: 386.647 > search: 93.391 > > fileindex > --------- > create: 184.088 > search: 48.887 That's got us looking more competitive. As noted above, I think screed's loading time could be much reduced by tweaking how they use SQLite3. I wonder what the breakdown for fileindex is between calling Tokyo Cabinet and the fileindex code itself? I guess we should try Tokyo Cabinet as the back end in Bio.SeqIO.index() for comparison.
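The two SQLite changes mentioned above for screed (inserting in batches with executemany rather than one insert per offset, and creating the index after loading rather than before) can be sketched with the standard sqlite3 module. This is only an illustration: the table layout, the column names and the function names are assumptions, not the actual schema used by screed or by the Biopython branch.

```python
import sqlite3


def store_offsets(con, offsets, batch_size=10000):
    """Bulk-load (key, file offset) pairs, then index them.

    Hypothetical schema, for illustration only.
    """
    con.execute("CREATE TABLE offsets (key TEXT, file_offset INTEGER)")
    batch = []
    for pair in offsets:
        batch.append(pair)
        if len(batch) >= batch_size:
            # One executemany call per batch instead of one INSERT per record
            con.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
            batch = []
    if batch:
        con.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
    # Build the index once, after all the rows are in place
    con.execute("CREATE UNIQUE INDEX key_index ON offsets (key)")
    con.commit()


def get_offset(con, key):
    """Look up the stored file offset for one record identifier."""
    row = con.execute(
        "SELECT file_offset FROM offsets WHERE key=?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]


con = sqlite3.connect(":memory:")
store_offsets(con, (("read%i" % i, i * 100) for i in range(25)), batch_size=10)
print(get_offset(con, "read7"))  # 700
```

Whether this gives the same speed up on a given file is something only a benchmark like the one above can show.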
Peter

P.S. Could you measure the database file sizes on disk?

From thomas.hamelryck at gmail.com Wed Jun 9 12:18:41 2010
From: thomas.hamelryck at gmail.com (Thomas Hamelryck)
Date: Wed, 9 Jun 2010 14:18:41 +0200
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements
In-Reply-To:
References:
Message-ID:

Hi,

On Tue, Jun 8, 2010 at 7:10 PM, João Rodrigues wrote:
>
> from Bio.PDB import Protein
> structure = Protein('1ABC.pdb')
> structure.search_ss_bonds()
>

Indeed, that would run into problems for complexes where proteins, RNA, DNA, etc. occur in the same file. It makes much more sense to have a Structure centred approach:

proteins=Protein(structure)
chains=proteins.get_chains()
chain_a=chains["A"]
polypeptides=chain_a.get_peptides()
rnas=RNA(structure)

etc.

-Thomas

--
Thomas Hamelryck, Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://wiki.binf.ku.dk/User:Thomas_Hamelryck
http://www.binf.ku.dk/research/structural_bioinformatics/

From lgautier at gmail.com Wed Jun 9 12:28:20 2010
From: lgautier at gmail.com (Laurent)
Date: Wed, 09 Jun 2010 14:28:20 +0200
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To:
References:
Message-ID: <4C0F88E4.7070607@gmail.com>

What about having a class instance instead ? This would let one change the index storage system very easily. For example, to use a dictionary:

Bio.SeqIO.index(keyval_map = dict() )

A minimal requirement for the instance 'keyval_map' passed would be to implement the methods __getitem__(self, key) and __setitem__(self, key, value), allowing the "duck typing" approach commonly found in Python.
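The duck-typing idea can be sketched as follows. Everything here is hypothetical: index_fasta_offsets and SQLiteOffsets (with its table layout) are made-up names for illustration, and keyval_map is the argument name proposed above, not an existing Biopython parameter. The indexer only ever calls __setitem__ on whatever mapping it is given, so a plain dict and a small SQLite-backed class (a runnable variant of the KeyValSQLite outline that follows) are interchangeable.

```python
import io
import sqlite3


class SQLiteOffsets(object):
    """Hypothetical mapping that keeps key -> offset pairs in SQLite."""

    def __init__(self, filename):
        self._con = sqlite3.connect(filename)
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS offsets (key TEXT PRIMARY KEY, pos INTEGER)"
        )

    def __getitem__(self, key):
        row = self._con.execute(
            "SELECT pos FROM offsets WHERE key=?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        self._con.execute(
            "INSERT OR REPLACE INTO offsets VALUES (?, ?)", (key, value)
        )
        self._con.commit()


def index_fasta_offsets(handle, keyval_map):
    """Record the byte offset of every FASTA header line in keyval_map."""
    offset = 0
    for line in handle:
        if line.startswith(b">"):
            # Use the first word of the header line as the record identifier
            key = line[1:].split()[0].decode("ascii")
            keyval_map[key] = offset
        offset += len(line)
    return keyval_map


data = b">alpha first\nACGT\n>beta\nGG\nTT\n"
in_memory = index_fasta_offsets(io.BytesIO(data), dict())
in_sqlite = index_fasta_offsets(io.BytesIO(data), SQLiteOffsets(":memory:"))
print(in_memory["beta"], in_sqlite["beta"])  # 18 18
```

Committing on every __setitem__ keeps the class short but is slow for large files; a real back end would want some form of bulk loading.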
An SQLite-based index would be a matter of having a class such as:

class KeyValSQLite(object):

    def __init__(self, filename):
        # create the database into file "filename"
        pass

    def __getitem__(self, key):
        """ return the value """
        # select whatever in something where key=''...
        pass

    def __setitem__(self, key, value):
        # update...
        pass

Then this would be a call like:

Bio.SeqIO.index(keyval_map = KeyValSQLite("myindex.db"))

Now that you have the idea, getting a custom index based on BDB or anything should be a breeze...

L.

On 08/06/10 08:39, biopython-dev-request at lists.open-bio.org wrote: > Hi all, > > Thanks for the lively discussion on the main list, > > http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html > ... > http://lists.open-bio.org/pipermail/biopython/2010-June/006580.html > > I've spent the afternoon updating my old branch which uses SQLite > to store the record identifier to file offset mapping. Using the code > on this branch, Bio.SeqIO.index() supports a new optional argument > currently called "db" (other names I like including "cache", suggestions > welcome): > > http://github.com/peterjc/biopython/tree/index-sqlite > > The default (False) is not to use SQLite, but continue with an in > memory Python dictionary. As long as you have enough RAM > and don't plan to use the index at a later date, this will be fastest. > > If set to True or a filename, then an SQLite index is used to hold > the offsets. This means very low RAM requirements, but is a lot > slower because the offsets are written to disk and the SQLite > index is updated as we go. I expect this part can be optimised > (e.g. try to build the index at the end, try committing in batches). > > I'm still testing this, but the core of the work is done I think. > Once we're happy with the public API, we can concentrate > on things like the SQLite schema, and optimising the code. > > Peter > > P.S.
I know it will need a little work to fail gracefully on Python 2.4 > when SQLite isn't installed. > From biopython at maubp.freeserve.co.uk Wed Jun 9 12:53:39 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Jun 2010 13:53:39 +0100 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: <4C0F88E4.7070607@gmail.com> References: <4C0F88E4.7070607@gmail.com> Message-ID: On Wed, Jun 9, 2010 at 1:28 PM, Laurent wrote: > What about having a class instance instead ? This would let one change the > index storage system very easily. That is essentially what the recent code on my branch is doing, but the back end isn't being exposed to the public API (yet). > The this would be a call like: > > Bio.SeqIO.index(keyval_map = KeyValSQLite("myindex.db")) > > > Now that you have the idea, getting a custom index based on BDB or > anything should be a breeze... Indeed. Most DB-like back ends should offer a bulk loader we can exploit via the dict's update method. Peter From eric.talevich at gmail.com Wed Jun 9 13:31:18 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 9 Jun 2010 09:31:18 -0400 Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements In-Reply-To: References: Message-ID: On Tue, Jun 8, 2010 at 1:10 PM, João Rodrigues wrote: > Hello all, > > I'm replying here to what Thomas wrote on the GSOC Report thread because it > seems a better place. > > PDB files can contain anything RNA, DNA, sugars, small molecules... It is >> thus not a good idea to >> directly associate protein-specific methods to the structure class; it >> will lead to a bloated Structure class and a lot of irrelevant methods (ie. >> search_ss_bonds is meaningless for a PDB file that contains RNA). > > > Agree. > > Currently, one creates Polypeptide objects from a Structure object using a >> factory design pattern (via PPBuilder); the Polypeptide class implements >> some protein specific methods.
I believe that is a much cleaner way to do it >> (though we need a Protein class that represents collections of connected >> polypeptides). One can also make sure that all such derived objects >> (Protein, NA, DNA,...) adhere to the same interface by providing a suitable >> base class with shared functionality - in that way, the whole thing is also >> extendible. >> > > I think there has been already some discussion about this. My personal > opinion/suggestion is having a structure like: > > Bio.PDB/ > _______/Protein.py > _______/DNA.py > _______/RNA.py > > that would translate to an usage of something like: > > from Bio.PDB import Protein > structure = Protein('1ABC.pdb') > structure.search_ss_bonds() > > but not > > structure.calc_melting_temperature() (just an example) >

How about:

from Bio import Struct
# extract the protein from a bound TF structure
complex = Struct.read("3IKT.pdb")
prot = complex.as_protein()

# which is a wrapper for:
from Bio.Struct.Protein import Protein
# if Protein contains a Structure instance:
prot = Protein(complex)
# or, if Protein inherits from Structure:
prot = Protein.from_structure(complex)

The Bio.Struct.Protein module would mostly wrap Bio.PDB's protein-specific functionality, and contain a class called Protein which you construct using a Bio.PDB.Structure.Structure instance, in some way.

I think the convenience methods as_protein, as_dna and as_rna are acceptable additions to the Structure class if that saves us from (a) polluting Structure with protein- and RNA-specific methods, or (b) requiring a slew of imports to reach any new functionality. You can add as_protein yourself and leave the other methods for other brave souls to implement. (Bio.Struct.RNA deserves its own directory, and I don't know of anyone working on a structural DNA branch.)

Protein() would call PDBParser(). It could also include, to a certain > extent, an Alphabet-like feature to assure residue names are OK (this goes a > bit with this proposal).
> I believe this goes a bit into what you said. Having a class that basically > abstracts what we do now (Bio.PDB.PDBParser) and allows for > molecule-specific methods. However, it also leads to some problems: > Protein/DNA complexes come to mind. > > How does this sound? I think it goes with what Eric said in the first post > of this thread and what Thomas replied in the GSOC thread. We should also > change the PDB name to Struct to better reflect the purpose of the module. > All of the other additions like Bio.Struct.WWW would still apply. And I > don't see a major problem in breaking the existing code by adding this. > To be clear, we don't need to rename anything -- Bio.Struct and Bio.PDB can live in harmony for the foreseeable future. Best, Eric From bpederse at gmail.com Wed Jun 9 14:42:29 2010 From: bpederse at gmail.com (Brent Pedersen) Date: Wed, 9 Jun 2010 07:42:29 -0700 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Wed, Jun 9, 2010 at 1:55 AM, Peter wrote: > On Wed, Jun 9, 2010 at 5:33 AM, Brent Pedersen wrote: >>> >>> The version you tried didn't do anything clever with the SQLite >>> indexes, batched inserts etc. I'm hoping the current code will be >>> faster (although there is likely a penalty from having two switchable >>> back ends). Brent, could you re-run this benchmark with this code: >>> http://github.com/peterjc/biopython/tree/index-sqlite-batched >>> ... >> >> done. > > Thank you Brent :) > >> the previous times and the current were using py-tcdb not bsddb. >> the author of tcdb made some improvements so it's faster this time, > > OK, so you are using Tokyo Cabinet to store the lookup table here > rather than BDB. Link, http://code.google.com/p/py-tcdb/ > >> and your SeqIO implementation is almost 2x as fast to load as the >> previous one. that's a nice implementation. i didn't try get_raw. 
> > I've got some more re-factoring in mind which should help a little > more (but mainly to make the structure clearer). > >> these timints are are with your latest version, and the version of >> screed pulled from http://github.com/acr/screed master today. > > Having had a quick look, they are using SQLite3 in much the > say way as I was initially. They create the index before loading > (rather than after loading) and they use a single insert per > offset (rather than using a batch in a transaction or the > executemany method). I'm pretty sure from my experiments > those changes would speed up screed's loading time a lot > (probably inline with the speed up I achieved). > >> /opt/src/methylcode/data/s_1_sequence.txt >> benchmarking fastq file with 15646356 records (62585424 lines) >> performing 500000 random queries >> >> screed >> ------ >> create: 699.210 >> search: 51.043 >> >> biopython-sqlite >> ---------------- >> create: 386.647 >> search: 93.391 >> >> fileindex >> --------- >> create: 184.088 >> search: 48.887 > > That's got us looking more competitive. As noted above, I think > sceed's loading time could be much reduced by tweaking how > they use SQLite3. I wonder what the breakdown for fileindex is > between calling Tokyo Cabinet and the fileindex code itself? > I guess we should try TK as the back end in Bio.SeqIO.index() > for comparison. > > Peter > > P.S. Could you measure the database file sizes on disk? > for raw reads, screed, fileindex(tcdb), biopython respectively: -rw-r--r-T 1 brentp users 3.3G 2009-11-17 13:32 /opt/src/methylcode/data/s_1_sequence.txt -rw-r--r-- 1 brentp brentp 3.8G 2010-06-08 16:09 /opt/src/methylcode/data/s_1_sequence.txt_screed -rw-r--r-- 1 brentp brentp 1.2G 2010-06-08 16:21 /opt/src/methylcode/data/s_1_sequence.txt.fidx -rw-r--r-- 1 brentp brentp 1.5G 2010-06-08 21:15 /opt/src/methylcode/data/s_1_sequence.txt.bidx that's not using any compression for the fileindex. 
i think the overhead of the fileindex code + tcdb code is pretty low now. i think there'd only be improvement using a cython or c version of a TC wrapper--and even then, not much. -brentp From biopython at maubp.freeserve.co.uk Wed Jun 9 14:55:23 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Jun 2010 15:55:23 +0100 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Wed, Jun 9, 2010 at 9:55 AM, Peter wrote: > > Having had a quick look, they are using SQLite3 in much the > say way as I was initially. They create the index before loading > (rather than after loading) and they use a single insert per > offset (rather than using a batch in a transaction or the > executemany method). I'm pretty sure from my experiments > those changes would speed up screed's loading time a lot > (probably inline with the speed up I achieved). > Do you fancy trying this version of screed? It seems much faster on medium sized FASTQ files:- http://github.com/peterjc/screed/tree/sqlite-tweaks I'm still running a few tests myself, but will pass this on to the screed team unless I find some regressions. Peter From bpederse at gmail.com Wed Jun 9 15:56:27 2010 From: bpederse at gmail.com (Brent Pedersen) Date: Wed, 9 Jun 2010 08:56:27 -0700 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Wed, Jun 9, 2010 at 7:55 AM, Peter wrote: > On Wed, Jun 9, 2010 at 9:55 AM, Peter wrote: >> >> Having had a quick look, they are using SQLite3 in much the >> say way as I was initially. They create the index before loading >> (rather than after loading) and they use a single insert per >> offset (rather than using a batch in a transaction or the >> executemany method). I'm pretty sure from my experiments >> those changes would speed up screed's loading time a lot >> (probably inline with the speed up I achieved). >> > > Do you fancy trying this version of screed? 
It seems much > faster on medium sized FASTQ files:- > > http://github.com/peterjc/screed/tree/sqlite-tweaks > > I'm still running a few tests myself, but will pass this on to > the screed team unless I find some regressions. > > Peter > not too much difference. screed ------ create: 666.381 search: 51.839 From biopython at maubp.freeserve.co.uk Wed Jun 9 16:19:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Jun 2010 17:19:24 +0100 Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite In-Reply-To: References: Message-ID: On Wed, Jun 9, 2010 at 4:56 PM, Brent Pedersen wrote: > On Wed, Jun 9, 2010 at 7:55 AM, Peter wrote: >> On Wed, Jun 9, 2010 at 9:55 AM, Peter wrote: >> >> Do you fancy trying this version of screed? It seems much >> faster on medium sized FASTQ files:- >> >> http://github.com/peterjc/screed/tree/sqlite-tweaks >> >> I'm still running a few tests myself, but will pass this on to >> the screed team unless I find some regressions. >> >> Peter >> > > not too much difference. > > screed > ------ > create: 666.381 > search: 51.839 Still noticeable, but not quite as much of a speed up as I was seeing (but different example, different OS, etc). Anyway, I've sent them a "pull request" and they can merge it if they like. Peter From rodrigo_faccioli at uol.com.br Wed Jun 9 17:35:24 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Wed, 9 Jun 2010 14:35:24 -0300 Subject: [Biopython-dev] Working directly on the main git repository In-Reply-To: References: Message-ID: About your GitHub problem, you could try running the command below after removing your local branch. git push git at github.com:/.git :heads/ I found the command above in [1].
[1] http://originblog.wordpress.com/2008/04/28/github-tips-removing-a-remote-branch/ Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 On Tue, Jun 8, 2010 at 6:45 PM, Eric Talevich wrote: > On Mon, Jun 7, 2010 at 5:35 AM, Peter >wrote: > > > Hi all, > > > > I thought I'd write down some notes about how I've been using git > recently. > > This may be of interest to any of the other core developers (those of us > > with read-write access to the main repository), and I might get some good > > tips from any discussion. The key point is that I have read+write access > > to two repositories on github (the official repository AND my own fork), > > so there are different advantages/disadvantages about which I choose > > to work with directly as my main repository. > > > > [...] > > > > Instead, I have a github repository of my own (what github calls a > > fork), and I push branches there. > > > > http://github.com/biopython/biopython - the official branch(es) > > http://github.com/peterjc/biopython - my branches > > > > How does this work in practice? Like this - I clone the master > > and add a reference to my repository (and I do the same when I > > want to grab a branch from another developer): > > > > git clone git at github.com:biopython/biopython.git > > cd biopython > > git remote add peterjc git at github.com:peterjc/biopython.git > > git fetch peterjc > > > > Then make a new local branch as usual, and when ready to share > > it publicly, I push it to *my* repository on github: > > > > git branch new-work > > git checkout new-work > > git commit ... 
> > git push peterjc new-work > > > > This would then appear as a new-work branch on my github page. > > Then if I (or someone else) wants to access these branches later > > (e.g. from another machine) just use the checkout tracked remote > > branch. For example, > > > > git clone git at github.com:biopython/biopython.git > > cd biopython > > git remote add peterjc git at github.com:peterjc/biopython.git > > git fetch peterjc > > git checkout -t peterjc/seqio-imgt > > > > This then looks like a normal branch (called just "seqio-imgt" in > > this example), but git knows it is linked to the remote branch on > > the "peterjc" repository (not the origin which is the "official" > > repository). > > > > This looks reasonable to me. I'd add that the procedure to delete a public > branch from your personal fork on GitHub is a little obscure: > > git branch -a # list local and remote branches > git branch -d new-work # delete a local branch that's been merged already > git push peterjc :new-work # delete the public branch from GitHub > > This doesn't do what you'd expect: > git branch -d peterjc/new-work > > That only removes your local reference to the the public branch; the branch > is still visible on GitHub. > > (It's kind of hard to find in the GitHub documentation.) > > > I'd have to check, but I guess that if the original git clone is done > > with git://github.com/biopython/biopython.git instead (read only > > access) the same procedure could be used by non core devs. > > However, I'm not sure this is clearer for them. I think the current > > procedure (on our wiki) where you add a remote reference to > > the "upstream" official repository works better in this case. > > > > I still have an "upstream" reference to the main repo. I wouldn't want to > accidentally push something foolish to the main repo with a stray "git > push"... better to have the safe thing happen by default. 
>
> If the initial clone was from biopython master, and you later create a
> personal fork on GitHub, then it's not too hard to switch the references
> around in your local repo to make the public fork your "origin".
>
> -Eric
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From eric.talevich at gmail.com Wed Jun 9 23:56:35 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 9 Jun 2010 19:56:35 -0400
Subject: [Biopython-dev] Tested Fixup branch for Bio.PDB
In-Reply-To:
References:
Message-ID:

On Tue, Jun 8, 2010 at 5:59 AM, Kristian Rother wrote:

>
> Hi Eric,
>
> I've checked out your pdbfixes branch and ran our 431 Unit Tests of
> ModeRNA with it. There were no changes relative to the master Bio.PDB
> branch --> for us everything is OK.
>
> Details:
> ModeRNA (http://www.genesilico.pl/moderna) engineers RNA 3D structures and
> uses Bio.PDB for most of its operations: reading files,
> adding/copying/manipulating residues/atoms, superimposing structures,
> searching neighbors by KDTree, writing files.
>
> Right, the tests most probably did not depend directly on the code you
> changed, but as I understand you wanted to make sure the branch didn't
> break anything by accident.
>

Thanks, Kristian! I didn't expect the patches to break anything, but it's
hard to be sure until someone else has tried it.

I've pushed the pdbfixes branch to Biopython's master branch on GitHub.
Cheers,
Eric

From biopython at maubp.freeserve.co.uk Thu Jun 10 16:24:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 10 Jun 2010 17:24:20 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To:
References: <20100601132355.GU1054@sobchak.mgh.harvard.edu>
Message-ID:

On Wed, Jun 2, 2010 at 12:59 PM, Peter wrote:
>
> With that in mind, as I mentioned yesterday maybe we should just
> update the documentation to suggest using os.system() when you
> just need the return code and there is no stdin to worry about:
>

I've added a basic example to the tutorial now, but the potential
trouble is that any output from the called tool will spew out at the python
prompt (if working at the terminal). This may or may not be an
issue. ClustalW for example is rather verbose.

Peter

From bugzilla-daemon at portal.open-bio.org Thu Jun 10 18:18:41 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Jun 2010 14:18:41 -0400
Subject: [Biopython-dev] [Bug 3098] New: GenBank/EMBL parser breaks for between features at origin
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3098

           Summary: GenBank/EMBL parser breaks for between features at origin
           Product: Biopython
           Version: 1.54
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

I was testing Bio.SeqIO with a GenBank file gbpln1.seq which includes:

LOCUS       AB042240    134545 bp    DNA    circular PLN 02-MAY-2006
...
     misc_feature    134545^1
                     /standard_name="JLA"
                     /note="Junction IRA-LSC"
ORIGIN
...

This is a "between" feature of length zero at the origin of this circular
genome. It is a special case, which the parser does not allow for, since
normally between positions "start^end" have end=start+1 (using one-based
counting).

The same applies to EMBL files as well, e.g.
http://www.ebi.ac.uk/cgi-bin/expasyfetch?AB042240

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Thu Jun 10 18:35:48 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Jun 2010 14:35:48 -0400
Subject: [Biopython-dev] [Bug 3098] GenBank/EMBL parser breaks for between features at origin
In-Reply-To:
Message-ID: <201006101835.o5AIZm0b025094@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3098

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-10 14:35 EST -------
Fixed,
http://github.com/biopython/biopython/commit/80aa43e5434316d151bca5916442a3429b8724e2

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From eric.talevich at gmail.com Thu Jun 10 19:18:38 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 10 Jun 2010 15:18:38 -0400
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To:
References: <20100601132355.GU1054@sobchak.mgh.harvard.edu>
Message-ID:

On Wed, Jun 2, 2010 at 7:59 AM, Peter wrote:
>
> Even if the Python documentation seems to be discouraging it,
> using os.system() seems simple, robust, and cross platform. We
> could even update the tutorial now and post it online - it should
> make some people's lives a little easier.
>

The Python docs claim os.system(cmd) is equivalent to subprocess.call(cmd,
shell=True):
http://docs.python.org/library/subprocess.html#replacing-os-system

As I understood it, the reason for usually skipping the shell on Unix
systems was for additional security -- the called program sees the same
thing either way.

Should we use this as a "teachable moment" involving the subprocess module
in the tutorial?

-Eric

From anaryin at gmail.com Thu Jun 10 23:45:02 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 10 Jun 2010 18:45:02 -0500
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements
In-Reply-To:
References:
Message-ID:

Hello all,

I'm having some issues dealing with this :x

I created a module Bio.Struct that has the following contents:

__init__.py
Protein.py
WWW/

The __init__.py file has a read() method that calls PDBParser and returns a
Structure object. So far so good I think. Then I added a method to
Bio.PDB.Structure more or less like this:

def as_protein(self):

    from Bio.Struct.Protein import Protein
    prot = Protein(self)
    return prot

so when you call it you get a new object. Protein is a class that inherits
from Structure and that has the search_ss_bonds function.

I can make the new object get all the methods from Structure AND from
Protein, but when I try to execute search_ss_bonds, it fails because
child_list, a Structure method, comes empty.. In fact, the whole SMCRA
object comes empty..

How do I effectively do the inheritance on the Protein class?

from Bio.PDB.Structure import Structure

class Protein(Structure):

    def __init__(self, protein):

        self = protein

This is what I last tried and doesn't work.. I've tried Structure.__init__,
and several other things but to no avail. I'm sure this is simple OOP but I
really can't understand that well how to do it ...

Care to give a hand to a friend in need? :)

Thanks in advance!
By the way, I assume that if I got no comments on anything else on the GSOC
thread that I'm doing a perfect job :P Thanks for that too :D

Best!

João [...] Rodrigues

From eric.talevich at gmail.com Fri Jun 11 01:49:39 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 10 Jun 2010 21:49:39 -0400
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements
In-Reply-To:
References:
Message-ID:

On Thu, Jun 10, 2010 at 7:45 PM, João Rodrigues wrote:

> Hello all,
>
> I'm having some issues dealing with this :x
>
> I created a module Bio.Struct that has the following contents:
>
> __init__.py
> Protein.py
> WWW/
>
> The __init__.py file has a read() method that calls PDBParser and returns a
> Structure object. So far so good I think. Then I added a method to
> Bio.PDB.Structure more or less like this:
>
> def as_protein(self):
>
>     from Bio.Struct.Protein import Protein
>     prot = Protein(self)
>     return prot
>
> so when you call it you get a new object. Protein is a class that inherits
> from Structure and that has the search_ss_bonds function.
>
> I can make the new object get all the methods from Structure AND from
> Protein, but when I try to execute search_ss_bonds, it fails because
> child_list, a Structure method, comes empty.. In fact, the whole SMCRA
> object comes empty..
>
> How do I effectively do the inheritance on the Protein class?
>
> from Bio.PDB.Structure import Structure
>
> class Protein(Structure):
>
>     def __init__(self, protein):
>
>         self = protein
>
> This is what I last tried and doesn't work.. I've tried Structure.__init__,
> and several other things but to no avail. I'm sure this is simple OOP but I
> really can't understand that well how to do it ...
>
> Care to give a hand to a friend in need? :)
>
> Thanks in advance! By the way, I assume that if I got no comments on
> anything else on the GSOC thread that I'm doing a perfect job :P Thanks for
> that too :D
>
> Best!
>
> João [...]
> Rodrigues
>

Hi João,

You have it mostly correct, but you need to call the parent class's
constructor, too. Here's the constructor for Structure:

    def __init__(self, id):
        self.level="S"
        Entity.__init__(self, id)

And here it is for Entity:

    def __init__(self, id):
        self.id=id
        self.full_id=None
        self.parent=None
        self.child_list=[]
        self.child_dict={}
        # Dictionary that keeps additional properties
        self.xtra={}

See the problem? Every subclass of Entity takes an "id" argument and sets
the other attributes separately.

In Bio.Phylo, I used another convention for converting an object of one type
to a sub-class of the original type, as you're doing here. Rather than
change the arguments to the constructor (which could have weird
side-effects), I added a class method in the target class:

    @classmethod
    def from_structure(cls, struct):
        # Instantiate a Protein with the structure's id
        # Assign the other attributes individually from struct

Then Structure.as_protein() becomes fairly simple. Alternatively, you could
skip implementing Protein.from_structure() and do the attribute reassignment
in as_protein(). Or, covering all the options, implement from_structure()
but not as_protein(), and let the user figure it out.

Do you think it would also be useful if as_protein() or from_structure()
dropped any non-protein molecules during the conversion, and raised an error
if nothing's left? Or would that cause more problems than it solves?

Best,
Eric

From biopython at maubp.freeserve.co.uk Mon Jun 14 14:44:50 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 14 Jun 2010 15:44:50 +0100
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>
Message-ID:

Hi all,

You may recall late last year I posted about adding a reverse
complement method to the SeqRecord, and addition support:
http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html

SeqRecord addition was included in Biopython 1.53, but
not the reverse_complement() method - which is something
I wanted to use again today to reverse complement an
annotated GenBank file and have all the SeqFeature
locations flipped for me. I've rescued my old code and
its unit tests and created a new branch for it:
http://github.com/peterjc/biopython/commits/seqrecord-rc

As I said at the end of last year, I think the general idea of
a SeqRecord reverse_complement() method is nice but the
details about handling the annotation are tricky. When we
discussed slicing and addition, it was agreed that we
should be cautious to avoid blindly transferring annotation
inappropriately. The code on this branch allows the user to
choose for each annotation type if it should be dropped
(False), kept (True) or set to a supplied new value. The
docstring has examples of how this works (which double
as doctests).

Jose - I've CC'd you since I know you wrote your own
SeqRecord subclass with a complement() method (but
not a reverse_complement() method) for Franklin. I'm
curious about this choice.
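The two ideas in this message (flipping feature locations, and the dropped / kept / replaced choice per annotation type) could be sketched roughly as follows. This is only an illustration on plain strings and tuples, not the code on the seqrecord-rc branch; the helper names flip_location and carry_over are made up for the example, and 0-based half-open coordinates are assumed.

```python
def reverse_complement(seq):
    """Reverse complement a plain DNA string (unambiguous bases only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def flip_location(start, end, strand, seq_len):
    """Map a 0-based, half-open [start, end) location onto the reverse strand."""
    return seq_len - end, seq_len - start, -strand


def carry_over(option, old_value):
    """Hypothetical helper for the True / False / new-value convention."""
    if option is True:
        return old_value   # keep the existing annotation
    if option is False:
        return None        # drop it
    return option          # otherwise use the supplied replacement


seq = "AACGT"
start, end, strand = 1, 3, +1   # the feature covers "AC"
rc = reverse_complement(seq)
new_start, new_end, new_strand = flip_location(start, end, strand, len(seq))
```

After the flip, slicing the reverse-complemented sequence with the new coordinates gives the reverse complement of the original feature, which is the property one would want from any real implementation.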
Cedar - I've CC'd you since you asked about this kind of thing last year:
http://lists.open-bio.org/pipermail/biopython/2009-June/005307.html

Regards,

Peter

From biopython at maubp.freeserve.co.uk Mon Jun 14 14:50:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 14 Jun 2010 15:50:31 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
	<20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
Message-ID:

On Mon, Jun 14, 2010 at 3:43 PM, Kristian Rother wrote:
>
>
> Hi Peter,
>
> just digesting BioPy mails from last week.
>
>>> Where should the str subclass for secondary structures that the parsers
>>> create go? Could it be Bio.Struct.RNA?
>>
>> You don't think plain strings in the SeqRecord's letter_annotations
>> dict would be enough?
>
> Not really - base pairing makes most normal string functions useless.
>
>
>> Assuming you do need something then
>> perhaps under Bio.Seq or Bio.SeqUtils might be worth considering
>> as alternatives to Bio.Struct.RNA.
>
> OK, I'll try that.
>
> Thanks,
>    Kristian
>
>

Hi Kristian,

Could you explain a little more about why plain strings wouldn't be
suitable here? What kind of things do you want to do with them?
Peter

From krother at rubor.de Mon Jun 14 14:55:21 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 14 Jun 2010 16:55:21 +0200
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements
Message-ID: <1cf21a9224e1cd3ad4c8e2853d99100b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XXwtdXg==-webmailer2@server03.webmailer.hosteurope.de>

Hi guys,

I'm fine with your ideas regarding different wrappers for
Bio.PDB.Structure objects discussed last week, in particular:

- creating Bio.Struct.RNA or Bio.PDB.RNA with a Structure instance.
- having a structure.as_rna() helper method as suggested by Eric (but this
  is not a must).

I'd like to take what Joao does for proteins and add some basic equivalent
for RNA structures shortly after.

Best Regards,
Kristian

Quoting Thomas Hamelryck :

> Hi,
>
> On Tue, Jun 8, 2010 at 7:10 PM, João Rodrigues wrote:
>
>>
>> from Bio.PDB import Protein
>> structure = Protein('1ABC.pdb')
>> structure.search_ss_bonds()
>>
>
> Indeed, that would run into problems for complexes where proteins, RNA, DNA,
> etc. occur in the same file. It makes much more sense to have a Structure
> centred approach:
>
> proteins=Protein(structure)
> chains=proteins.get_chains()
> chain_a=chains["A"]
> polypeptides=chain_a.get_peptides()
>
> rnas=RNA(structure)
>
> etc.
>
> -Thomas

From krother at rubor.de Mon Jun 14 15:01:48 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 14 Jun 2010 17:01:48 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To:
References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
	<20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
Message-ID:

Hi,

much of what I do with RNA secondary structures strongly depends on
iterating base pairs, e.g.

>>> sec = Secstruc("(((...)).)")
>>> for bp in sec.basepairs():
...     print bp
(0, 9)
(1, 7)
(2, 6)

also:
>>> sec.get_helices()
>>> sec.get_bulges()
>>> sec.get_hairpins()
>>> sec.contains_pseudoknot()
.. and a couple of similar ones.

The reason why I'd prefer to have something more than a string as a sec
feature is that I wouldn't want to do all the time:

sec = Secstruc(my_seq['secondary_structure'])
sec.get_helices()

but

my_seq['secondary_structure'].get_helices()

instead.

Best Regards,
Kristian

>> Hi Peter,
>>
>> just digesting BioPy mails from last week.
>>
>>>> Where should the str subclass for secondary structures that the
>>>> parsers
>>>> create go? Could it be Bio.Struct.RNA?
>>>
>>> You don't think plain strings in the SeqRecord's letter_annotations
>>> dict would be enough?
>>
>> Not really - base pairing makes most normal string functions useless.
>>
>>
>>> Assuming you do need something then
>>> perhaps under Bio.Seq or Bio.SeqUtils might be worth considering
>>> as alternatives to Bio.Struct.RNA.
>>
>> OK, I'll try that.
>>
>> Thanks,
>>    Kristian
>>
>>
>
> Hi Kristian,
>
> Could you explain a little more about why plain strings wouldn't be
> suitable here? What kind of things do you want to do with them?
>
> Peter
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
>

From krother at rubor.de Mon Jun 14 15:13:19 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 14 Jun 2010 17:13:19 +0200
Subject: [Biopython-dev] creating Protein(structure) object
Message-ID: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>

Hi Joao,

what you are describing is the classical Decorator Pattern (see
http://en.wikipedia.org/wiki/Decorator_pattern). In the books, they say
that the Decorator (Protein) must implement all methods of the decorated
object (Structure). Of course, for a class as big as Bio.PDB.Structure,
this sucks a lot.

I see two alternatives:

(1) override Protein.__getattr__(self, attr) to return self.struc.attr if
it exists. I tried this recently and it worked fine until the decorated
class used Python properties, when it started getting ugly again.

(2) have Protein inherit from Structure, and grab all the children from
the structure class, e.g.:

class Protein(Structure):
    def __init__(self, struc):
        """
        The given Structure instance becomes a Protein.
        """
        Structure.__init__(self, struc.id)
        for child in struc.child_list:
            # possibly check whether it's a protein chain.
            self.add(child)

Any comments?

Kristian

> Hello all,
>
> I'm having some issues dealing with this :x
>
> I created a module Bio.Struct that has the following contents:
>
> __init__.py
> Protein.py
> WWW/
>
> The __init__.py file has a read() method that calls PDBParser and returns a
> Structure object. So far so good I think. Then I added a method to
> Bio.PDB.Structure more or less like this:
>
> def as_protein(self):
>     from Bio.Struct.Protein import Protein
>     prot = Protein(self)
>     return prot
>
> so when you call it you get a new object.
> Protein is a class that inherits
> from Structure and that has the search_ss_bonds function.
>
> I can make the new object get all the methods from Structure AND from
> Protein, but when I try to execute search_ss_bonds, it fails because
> child_list, a Structure method, comes empty.. In fact, the whole SMCRA
> object comes empty..
>
> How do I effectively do the inheritance on the Protein class?
>
> from Bio.PDB.Structure import Structure
>
> class Protein(Structure):
>
>     def __init__(self, protein):
>
>         self = protein
>
> This is what I last tried and doesn't work.. I've tried Structure.__init__,
> and several other things but to no avail. I'm sure this is simple OOP but I
> really can't understand that well how to do it ...
>
> Care to give a hand to a friend in need? :)
>
> Thanks in advance! By the way, I assume that if I got no comments on
> anything else on the GSOC thread that I'm doing a perfect job :P Thanks for
> that too :D
>
> Best!
>
> João [...] Rodrigues
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
>
>
>

From biopython at maubp.freeserve.co.uk Mon Jun 14 15:23:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 14 Jun 2010 16:23:25 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To:
References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
	<20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
Message-ID:

On Mon, Jun 14, 2010 at 4:01 PM, Kristian Rother wrote:
>
> Hi,
>
> much of what I do with RNA secondary structures strongly depends on
> iterating base pairs, e.g..
>
>>>> sec = Secstruc("(((...)).)")
>>>> for bp in sec.basepairs():
>>>>     print bp
> (0, 9)
> (1, 7)
> (2, 6)
>
> also:
>>>> sec.get_helices()
>>>> sec.get_bulges()
>>>> sec.get_hairpins()
>>>> sec.contains_pseudoknot()
> .. and a couple of similar ones.
>
> The reason why I'd prefer to have something more than a string as a sec
> feature is that I wouldn't want to do all the time:
>
> sec = Secstruc(my_seq['secondary_structure'])
> sec.get_helices()
>
> but
>
> my_seq['secondary_structure'].get_helices()
>
> instead.
>
> Best Regards,
>    Kristian

That helped - thanks. Does your Secstruc object behave like a Python
sequence (string/list/tuple) in that it has a length and can be sliced (as
if acting on the string representation)? If so then it should be fine to
store in the SeqRecord's letter_annotations dictionary.

Peter

From krother at rubor.de Mon Jun 14 15:41:05 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 14 Jun 2010 17:41:05 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To:
References: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
	<20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
Message-ID: <3e6714450418534d741476aa0b64b374-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1WWAhZWg==-webmailer2@server03.webmailer.hosteurope.de>

Hi Peter,

> That helped - thanks. Does your Secstruc object behave like a Python
> sequence (string/list/tuple) in that it has a length and can be sliced

Yes, it does.

> If so then it should be fine to
> store in the SeqRecord's letter_annotations dictionary.
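For readers following along, a str subclass with this behaviour might look roughly like the sketch below. It is not the ModeRNA Secstruc class: it assumes plain dot-bracket notation without pseudoknots, implements only basepairs(), and the slicing override is an assumption about how "can be sliced" would be satisfied.

```python
class Secstruc(str):
    """Dot-bracket secondary structure that still behaves like a string."""

    def basepairs(self):
        """Return sorted (i, j) index pairs for every matching '(' and ')'."""
        stack = []
        pairs = []
        for i, char in enumerate(self):
            if char == "(":
                stack.append(i)
            elif char == ")":
                if not stack:
                    raise ValueError("unbalanced ')' at position %d" % i)
                pairs.append((stack.pop(), i))
        if stack:
            raise ValueError("unbalanced '(' at position %d" % stack[-1])
        return sorted(pairs)

    def __getitem__(self, index):
        # Indexing and slicing return Secstruc rather than a plain str.
        return Secstruc(str.__getitem__(self, index))


sec = Secstruc("(((...)).)")
pairs = sec.basepairs()
inner = sec[1:8]   # still a Secstruc, so basepairs() keeps working
```

Because the class subclasses str, len() works unchanged, which is the sequence-like behaviour asked about above.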
Best,
Kristian

> On Mon, Jun 14, 2010 at 4:01 PM, Kristian Rother wrote:
>>
>> Hi,
>>
>> much of what I do with RNA secondary structures strongly depends on
>> iterating base pairs, e.g..
>>
>>>>> sec = Secstruc("(((...)).)")
>>>>> for bp in sec.basepairs():
>>>>>     print bp
>> (0, 9)
>> (1, 7)
>> (2, 6)
>>
>> also:
>>>>> sec.get_helices()
>>>>> sec.get_bulges()
>>>>> sec.get_hairpins()
>>>>> sec.contains_pseudoknot()
>> .. and a couple of similar ones.
>>
>> The reason why I'd prefer to have something more than a string as a sec
>> feature is that I wouldn't want to do all the time:
>>
>> sec = Secstruc(my_seq['secondary_structure'])
>> sec.get_helices()
>>
>> but
>>
>> my_seq['secondary_structure'].get_helices()
>>
>> instead.
>>
>> Best Regards,
>>    Kristian
>
> That helped - thanks. Does your Secstruc object behave like a Python
> sequence (string/list/tuple) in that it has a length and can be sliced (as
> if acting on the string representation)? If so then it should be fine to
> store in the SeqRecord's letter_annotations dictionary.
>
> Peter
>
>

From anaryin at gmail.com Mon Jun 14 17:58:56 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Mon, 14 Jun 2010 12:58:56 -0500
Subject: [Biopython-dev] creating Protein(structure) object
In-Reply-To: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>
References: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>
Message-ID:

Hello Kristian,

The way I'm doing it as a workaround is:

class Protein(Structure):

    def __init__(self, protein):

        Structure.__init__(self, protein.id)

        self.full_id = protein.full_id
        self.child_list = protein.child_list
        self.child_dict = protein.child_dict
        self.parent = protein.parent
        self.xtra = protein.xtra

It works because every method I'm using deepcopies this anyway..

The way of adding the children seems the correct way to go but it won't copy
headers... should we want this?

Thanks :)

J

From eric.talevich at gmail.com Mon Jun 14 20:27:24 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 14 Jun 2010 16:27:24 -0400
Subject: [Biopython-dev] creating Protein(structure) object
In-Reply-To:
References: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>
Message-ID:

Hi guys,

Another convention with the Decorator pattern is to ensure that all of the
method arguments that existed in the original class are also present in the
decorated one. This includes the constructor. Decoration simply adds another
feature to whatever was already there.
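One possible reading of this convention, combined with the from_structure() idea from earlier in the thread, is sketched below. Entity and Structure here are small stand-ins that only mirror the constructors quoted in the thread (they are not the real Bio.PDB classes), and the copy() helper is an assumption added so that the new Protein does not share lists with the original.

```python
import copy


class Entity:
    """Stand-in for Bio.PDB.Entity, using the constructor quoted in the thread."""

    def __init__(self, id):
        self.id = id
        self.full_id = None
        self.parent = None
        self.child_list = []
        self.child_dict = {}
        self.xtra = {}

    def add(self, child):
        """Attach a child entity and record this object as its parent."""
        child.parent = self
        self.child_list.append(child)
        self.child_dict[child.id] = child

    def copy(self):
        """Copy this entity and its children, detached from any parent."""
        clone = copy.copy(self)
        clone.parent = None
        clone.child_list = []
        clone.child_dict = {}
        clone.xtra = copy.deepcopy(self.xtra)
        for child in self.child_list:
            clone.add(child.copy())
        return clone


class Structure(Entity):
    def __init__(self, id):
        self.level = "S"
        Entity.__init__(self, id)

    def as_protein(self):
        return Protein.from_structure(self)


class Protein(Structure):
    """Keeps the Structure constructor; conversion lives in a class method."""

    @classmethod
    def from_structure(cls, struct):
        prot = cls(struct.id)
        prot.xtra = copy.deepcopy(struct.xtra)
        for child in struct.child_list:
            prot.add(child.copy())   # copies, so the two trees stay independent
        return prot


structure = Structure("1ABC")
structure.add(Entity(0))             # one model, with Entity as a placeholder
protein = structure.as_protein()
structure.add(Entity(1))             # later changes do not leak into the Protein
```

The point of the design is that Protein("1ABC") still works exactly like Structure("1ABC"), while the Structure-to-Protein conversion has one obvious, explicit entry point.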
João Rodrigues wrote:

> Hello Kristian,
>
> The way I'm doing it as a workaround is:
>
> class Protein(Structure):
>
>     def __init__(self, protein):
>
>         Structure.__init__(self, protein.id)
>
>         self.full_id = protein.full_id
>         self.child_list = protein.child_list
>         self.child_dict = protein.child_dict
>         self.parent = protein.parent
>         self.xtra = protein.xtra
>

The way the constructors of Structure and other Entity subclasses work is to
create a new object with the appropriate, empty attributes -- i.e. no
children. Other code then attaches children to the object.

To decorate a Structure with Protein-specific functionality, I would
consider:

1. The Entity constructor takes an ID, and creates empty containers for
child Entities. (Models, in this case.) So Protein.__init__ needs to start
like:

class Protein(Structure):
    def __init__(self, id):  # take any keyword arguments?
        Structure.__init__(self, id)
        # handle any keyword arguments here

2. We need to be able to convert an existing Structure to a new Protein.
That's new functionality, so it needs either a keyword argument in __init__,
or a separate method or function. If we add a keyword argument to __init__,
then the implementation is basically two completely different operations
depending on if a Structure was passed or not. Plus, there's still that 'id'
argument to deal with.

3. Instantiating a Protein directly would mean importing the
Bio.Struct.Protein module manually, in addition to "from Bio import Struct".
More to the point, Bio.Struct.Protein consists of lower-level functionality
that a casual Struct user shouldn't have to dig into, as long as
Structure.as_protein() exists. So there's no value in making
Protein.__init__ "do what I mean" at the expense of clarity in the code.
Better to make the code very obvious and explicit here, and focus on API
prettiness from a different angle.

4. The next most convenient place for Structure-to-Protein conversion is on
the Structure class.
This presents a nice API that will be sufficient for most users:

from Bio import Struct
prot = Struct.read('1ABC.pdb').as_protein()

But, going back to OOP principles, the Structure class shouldn't need to
know anything about the Protein class's internals -- though it's free to
call any public method and make things nicer for the user. So, finally, we
need a class method* on Protein that Structure.as_protein() can call. Hence,
Protein.from_structure().

[*] A class method can be called without first instantiating the class.
Since we're trying to construct a new object here, we need to be able to
call this Protein method before the Protein object exists. No worries, just
use the @classmethod decorator.

> It works because every method I'm using deepcopies this anyway..
>

If someone modifies the original Structure object after you've created a
Protein this way -- e.g. renumbering residues, or with their own function --
it will also modify the Protein object, since lists and dicts are shared. Is
this what you want? If you're concerned about memory usage, you can also
look at implementing __deepcopy__.

> The way of adding the children seems the correct way to go but it won't copy
> headers... should we want this?
>

Your code for copying the Structure's children looks right to me, except I
think it's best to be a little paranoid with Python lists and make deep
copies anyway. I suppose you could also copy any header info that's relevant
to proteins, using the same approach.

Best,
Eric

From anaryin at gmail.com Tue Jun 15 03:06:03 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Mon, 14 Jun 2010 22:06:03 -0500
Subject: [Biopython-dev] creating Protein(structure) object
In-Reply-To:
References: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>
Message-ID:

Ok, thanks for the long explanation!
I'll merge what you and Kristian said and come up with a better interface.

As is, I call it like this:

s = Struct.read("1abc.pdb")  # by the way, I added a trick to avoid the mandatory name of the structure
p = s.as_protein()

Best
J

From jblanca at btc.upv.es Tue Jun 15 05:55:45 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 15 Jun 2010 07:55:45 +0200
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To:
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>
Message-ID: <201006150755.45162.jblanca@btc.upv.es>

On Monday 14 June 2010 16:44:50 Peter wrote:
> Hi all,
>
> You may recall late last year I posted about adding a reverse
> complement method to the SeqRecord, and addition support:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html
> http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html
>
> SeqRecord addition was included in Biopython 1.53, but
> not the reverse_complement() method - which is something
> I wanted to use again today to reverse complement an
> annotated GenBank file and have all the SeqFeature
> locations flipped for me. I've rescued my old code and
> its unit tests and created a new branch for it:
> http://github.com/peterjc/biopython/commits/seqrecord-rc
>
> As I said at the end of last year, I think the general idea of
> a SeqRecord reverse_complement() method is nice but the
> details about handling the annotation are tricky. When we
> discussed slicing and addition, it was agreed that we
> should be cautious to avoid blindly transferring annotation
> inappropriately. The code on this branch allows the user to
> choose for each annotation type if it should be dropped
> (False), kept (True) or set to a supplied new value. The
> docstring has examples of how this works (which double
> as doctests).

Having a reverse_complement method would be useful for us.
But it could be quite tricky to reverse complement some features. For
instance we have SNP features that include a reference nucleotide. We would
have to complement that nucleotide too.
Regards,

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk Tue Jun 15 09:08:14 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Jun 2010 10:08:14 +0100
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <201006150755.45162.jblanca@btc.upv.es>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>
	<201006150755.45162.jblanca@btc.upv.es>
Message-ID:

On Tue, Jun 15, 2010 at 6:55 AM, Jose Blanca wrote:
>
> Having a reverse_complement method would be useful for us. But it could be
> quite tricky to reverse complement some features. For instance we have SNP
> features that include a reference nucleotide. We would have to complement
> that nucleotide too.
>

Could you give an example? I assume you are talking about the annotation
of the feature (i.e. the qualifiers dictionary of a SeqFeature object).

Peter

From jblanca at btc.upv.es Tue Jun 15 09:23:27 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 15 Jun 2010 11:23:27 +0200
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To:
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<201006150755.45162.jblanca@btc.upv.es>
Message-ID: <201006151123.27158.jblanca@btc.upv.es>

On Tuesday 15 June 2010 11:08:14 Peter wrote:
> On Tue, Jun 15, 2010 at 6:55 AM, Jose Blanca wrote:
> > Having a reverse_complement method would be useful for us.
But it could > > be quite tricky to reverse complement some features. For instance we have > > SNP features that include a reference nucleotide. We would had to > > complement that nucleotide too. > > Could you give an example? I assume you are talking about the annotation > of the feature (i.e. the qualifiers dictionary of a SeqFeature object). That is right in some instances the qualifiers should be modified. For instance if we have an ORF with a qualifier 'forward':True, it should be changed. I don't think this change can be done automatically . -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Jun 15 09:42:47 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Jun 2010 10:42:47 +0100 Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method? In-Reply-To: <201006151123.27158.jblanca@btc.upv.es> References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com> <201006150755.45162.jblanca@btc.upv.es> <201006151123.27158.jblanca@btc.upv.es> Message-ID: On Tue, Jun 15, 2010 at 10:23 AM, Jose Blanca wrote: > On Tuesday 15 June 2010 11:08:14 Peter wrote: >> On Tue, Jun 15, 2010 at 6:55 AM, Jose Blanca wrote: >> > Having a reverse_complement method would be useful for us. But it could >> > be quite tricky to reverse complement some features. For instance we have >> > SNP features that include a reference nucleotide. We would had to >> > complement that nucleotide too. >> >> Could you give an example? I assume you are talking about the annotation >> of the feature (i.e. the qualifiers dictionary of a SeqFeature object). > > That is right in some instances the qualifiers should be modified. 
For
> instance if we have an ORF with a qualifier 'forward':True, it should be
> changed. I don't think this change can be done automatically .

Yes, that sort of thing would be very difficult to do automatically.
We come back to the question of what the default should be - blindly
copy, or just drop this information.

I would say for most feature annotation (and I am thinking about
GenBank and EMBL style files here) there isn't anything strand
specific to worry about, so in general copying is fine. Clearly this
is not a safe assumption for SNP features.

Peter

From krother at rubor.de Tue Jun 15 14:06:52 2010
From: krother at rubor.de (Kristian Rother)
Date: Tue, 15 Jun 2010 16:06:52 +0200
Subject: [Biopython-dev] RNA Alphabet: request for comments
Message-ID: <485134f2f1ebae4701d6fbcdfcdee3ee-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5UWABeXQ==-webmailer2@server02.webmailer.hosteurope.de>

Hi,

I've committed a proof-of-concept implementation of how modified RNA bases
could be made compatible with Biopython Alphabets. Comments are very
welcome, especially because I had to change two lines in the Seq class to
make it work.

The code can be viewed on:
http://github.com/krother/biopython/commit/d9f942936d6165703512099a6a2d84452fea27aa
(on github: krother/biopython, branch rna_alphabet).

The two main classes are:
RNAAlphabetEntry(str) that contains different abbreviations for one base.
and
ModifiedRNAString(str) that behaves like a string except that it iterates
through RNAAlphabetEntry objects.

Thus, you can do:

>>> from Bio.Alphabet.ModifiedRNAAlphabet import modified_rna
>>> from Bio.Seq import Seq
>>> from Bio.RNA.ModifiedRNAString import ModifiedRNAString
>>>
>>> mod_seq = ModifiedRNAString('AA:"A')
>>> seq = Seq(mod_seq, modified_rna)
>>> for char in seq:
...     print char
...
adenosine
adenosine
2-O-methyladenosine
1-methyladenosine
adenosine

(see Unit test for details).
Best Regards, Kristian From biopython at maubp.freeserve.co.uk Tue Jun 15 14:46:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Jun 2010 15:46:10 +0100 Subject: [Biopython-dev] RNA Alphabet: request for comments In-Reply-To: <485134f2f1ebae4701d6fbcdfcdee3ee-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5UWABeXQ==-webmailer2@server02.webmailer.hosteurope.de> References: <485134f2f1ebae4701d6fbcdfcdee3ee-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5UWABeXQ==-webmailer2@server02.webmailer.hosteurope.de> Message-ID: On Tue, Jun 15, 2010 at 3:06 PM, Kristian Rother wrote: > > Hi, > > I've commited a proof-of-concept implementation how modified RNA bases > could be made compatible to Biopython Alphabets. Comments are very > welcome, especially because I had to change two lines in the Seq class to > make it work. > > The code can be viewed on: > http://github.com/krother/biopython/commit/d9f942936d6165703512099a6a2d84452fea27aa > (on github: krother/biopython, branch rna_alphabet). > > The two main classes are: > RNAAlphabetEntry(str) that contains different abbreviations for one base. > and > ModifiedRNAString(str) that behaves like a string except that it iterates > through RNAAlphabetEntry objects. > Why not create a Seq subclass instead of your class ModifiedRNAString(str)? This would then implement suitable (reverse) complement etc. I would also have __iter__ and __getitem__ for a single letter return an instance of RNAAlphabetEntry (which would act like a single character string). Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 15 16:23:00 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Jun 2010 12:23:00 -0400 Subject: [Biopython-dev] [Bug 3060] Add ungap method to the SeqRecord? 
In-Reply-To: Message-ID: <201006151623.o5FGN0K6028619@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3060 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-15 12:22 EST ------- Patch applied to this branch: http://github.com/peterjc/biopython/tree/seqrecord-rc -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From krother at rubor.de Wed Jun 16 08:32:29 2010 From: krother at rubor.de (Kristian Rother) Date: Wed, 16 Jun 2010 10:32:29 +0200 Subject: [Biopython-dev] RNA Alphabet: request for comments Message-ID: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de> Hi Peter, > Why not create a Seq subclass instead of your class ModifiedRNAString(str)? This turned out to be a lot simpler. Worked right away. New commit at: http://github.com/krother/biopython/commit/b0a6071f2b08a4f9bfee33a8d675c0e21b60ba70 more comments welcome. Next steps from my side would be: 1) add all modifications to the Alphabet. 2) add some RNA-specific methods. 3) add more tests. 4) sync with latest master branch. 5) request code merge. Best regards, Kristian Quoting Peter : > On Tue, Jun 15, 2010 at 3:06 PM, Kristian Rother wrote: >> >> Hi, >> >> I've commited a proof-of-concept implementation how modified RNA bases >> could be made compatible to Biopython Alphabets. Comments are very >> welcome, especially because I had to change two lines in the Seq class to >> make it work. >> >> The code can be viewed on: >> http://github.com/krother/biopython/commit/d9f942936d6165703512099a6a2d84452fea27aa >> (on github: krother/biopython, branch rna_alphabet). >> >> The two main classes are: >> RNAAlphabetEntry(str) that contains different abbreviations for one base. 
>> and >> ModifiedRNAString(str) that behaves like a string except that it iterates >> through RNAAlphabetEntry objects. >> > > Why not create a Seq subclass instead of your class ModifiedRNAString(str)? > This would then implement suitable (reverse) complement etc. > > I would also have __iter__ and __getitem__ for a single letter return > an instance > of RNAAlphabetEntry (which would act like a single character string). > > Peter > > > > From biopython at maubp.freeserve.co.uk Wed Jun 16 08:51:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Jun 2010 09:51:03 +0100 Subject: [Biopython-dev] RNA Alphabet: request for comments In-Reply-To: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de> References: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de> Message-ID: On Wed, Jun 16, 2010 at 9:32 AM, Kristian Rother wrote: > > Hi Peter, > >> Why not create a Seq subclass instead of your class ModifiedRNAString(str)? > > This turned out to be a lot simpler. Worked right away. New commit at: > > http://github.com/krother/biopython/commit/b0a6071f2b08a4f9bfee33a8d675c0e21b60ba70 > > more comments welcome. Why do you need the _set_sequence method? Why not just put that small piece of code inside the __init__ method? > Next steps from my side would be: > > 1) add all modifications to the Alphabet. > 2) add some RNA-specific methods. > 3) add more tests. > 4) sync with latest master branch. > 5) request code merge. > > Best regards, > ? ? Kristian If this works out we should look at doing a Protein 3-letter code version for use with PDB sequences (I'm thinking about the modified amino acids). 
Peter From krother at rubor.de Wed Jun 16 09:03:37 2010 From: krother at rubor.de (Kristian Rother) Date: Wed, 16 Jun 2010 11:03:37 +0200 Subject: [Biopython-dev] RNA Alphabet: request for comments In-Reply-To: References: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de> Message-ID: Hi Peter, > Why do you need the _set_sequence method? Why not just put that > small piece of code inside the __init__ method? In _set_sequence there'll be a small parser taking care of modifications where the one-letter abbreviations do not suffice. E.g. a sequence could be "CCC022UCCC" (22U is a 5-hydroxyuridine). --> being parsed into a list of RNAAlphabetEntries ['C','C','C','22U','C','C','C'] So the code will grow a little, but the basic idea stays the same. If someone wants a one-letter representation, it could be "CCCxCCC", but this is degenerate because 'x' is used for several modifications. Best Regards, Kristian >>> Why not create a Seq subclass instead of your class >>> ModifiedRNAString(str)? >> >> This turned out to be a lot simpler. Worked right away. New commit at: >> >> http://github.com/krother/biopython/commit/b0a6071f2b08a4f9bfee33a8d675c0e21b60ba70 >> >> more comments welcome. > > Why do you need the _set_sequence method? Why not just put that > small piece of code inside the __init__ method? > >> Next steps from my side would be: >> >> 1) add all modifications to the Alphabet. >> 2) add some RNA-specific methods. >> 3) add more tests. >> 4) sync with latest master branch. >> 5) request code merge. >> >> Best regards, >> ? ? Kristian > > If this works out we should look at doing a Protein 3-letter code version > for use with PDB sequences (I'm thinking about the modified amino acids). 
>
> Peter
>
>

From biopython at maubp.freeserve.co.uk Wed Jun 16 09:41:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Jun 2010 10:41:35 +0100
Subject: [Biopython-dev] RNA Alphabet: request for comments
In-Reply-To:
References: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de>
Message-ID:

On Wed, Jun 16, 2010 at 10:03 AM, Kristian Rother wrote:
>
> Hi Peter,
>
>> Why do you need the _set_sequence method? Why not just put that
>> small piece of code inside the __init__ method?
>
> In _set_sequence there'll be a small parser taking care of modifications
> where the one-letter abbreviations do not suffice. E.g. a sequence could
> be
>
> "CCC022UCCC"
>
> (22U is a 5-hydroxyuridine).
>
> --> being parsed into a list of RNAAlphabetEntries
> ['C','C','C','22U','C','C','C']
>
> So the code will grow a little, but the basic idea stays the same.
>
> If someone wants a one-letter representation, it could be "CCCxCCC", but
> this is degenerate because 'x' is used for several modifications.
>
> Best Regards,
>   Kristian

Thinking ahead, we are planning to make the Seq objects use string
comparison instead of object identity. When that happens, I would
suggest that your subclass implement the equality method so that a
comparison against another instance of the modified RNA Seq is done
at the more detailed "22U" level, and otherwise, for compatibility,
at the single-letter level ("x", even though degenerate).
Peter

From bugzilla-daemon at portal.open-bio.org Wed Jun 16 12:43:07 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 16 Jun 2010 08:43:07 -0400
Subject: [Biopython-dev] [Bug 3100] New: Bio.PDB.ResidueDepth distance calculation error
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3100

           Summary: Bio.PDB.ResidueDepth distance calculation error
           Product: Biopython
           Version: 1.54b
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: andres.colubri at gmail.com

ResidueDepth.py in Bio.PDB contains an error at line 100:

d2=sum(d*d, 1)

This uses the built-in sum() function, which just sums all the elements of
d*d, starting at 1. But it should use numpy's sum instead:

d2=numpy.sum(d*d, 1)

To check the error, try the following code:

from Bio.PDB import *
from Bio.PDB.ResidueDepth import *

parser = PDBParser()
str = parser.get_structure('test', '3M38.pdb')
surf = get_surface('3M38.pdb', PDB_TO_XYZR='./pdb_to_xyzr', MSMS='./msms')
print min_dist(surf[10], surf)

3M38.pdb could be replaced by any other pdb file. The result of this
calculation printed to the console should be zero, since we are calculating
the minimum distance to the surface of a point belonging to the surface. But
this gives a value greater than zero.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From lueck at ipk-gatersleben.de Wed Jun 16 13:18:00 2010
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Wed, 16 Jun 2010 15:18:00 +0200
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris
In-Reply-To:
References:
Message-ID: <001a01cb0d56$581dd610$1022a8c0@ipkgatersleben.de>

Hello!

Sorry for the late reply but I just came back from my holidays.
I went to EuroSciPy 2009 and it was really great (I also gave a talk in
which Biopython was mentioned several times ;-). Since it was problematic
to go last time, I decided to skip it this year (in principle I would have
to come privately). Unfortunately I have only now heard that the Biopython
people will be there, and I would be very interested in meeting you, since
I use Biopython a lot.

I will have to see what I can still do. It would be great to see you!

Stefanie

-----Original Message-----
From: biopython-dev-bounces at lists.open-bio.org [mailto:biopython-dev-bounces at lists.open-bio.org] On behalf of Peter
Sent: Saturday, 5 June 2010 16:50
To: Biopython-Dev Mailing List
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris

Hi all,

Are any Biopython folk planning to be at the EuroSciPy conference in
Paris this year (July 2010)? They are still finalising the Scientific
track, but the list of tutorials is quite interesting already:

http://www.euroscipy.org/conference/euroscipy2010

Peter
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev

From bugzilla-daemon at portal.open-bio.org Fri Jun 18 13:19:02 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 09:19:02 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastaq
In-Reply-To:
Message-ID: <201006181319.o5IDJ2Oj022977@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102

cjfields at bioperl.org changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
         AssignedTo|bioperl-guts-l at bioperl.org |biopython-dev at biopython.org

------- Comment #3 from cjfields at bioperl.org 2010-06-18 09:18 EST -------
(In reply to comment #2)
> (In reply to comment #1)
> > I'm making a wild guess that this is Biopython and not BioPerl.
> > Yes, it's Biopython, Can you halp me, please? or can you give me a link where > to find the answer for my problem? Thank you very much. Reassigning to the Biopython devs. This should go to their list now, hopefully you'll get a response. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 18 13:45:37 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 18 Jun 2010 09:45:37 -0400 Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq In-Reply-To: Message-ID: <201006181345.o5IDjbNB023730@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3102 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Error converting sff into |Error converting sff into |fastaq |fastq ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-18 09:45 EST ------- Thanks Chris. Giorgio - Could you confirm which version of Biopython are you using? To me the error message suggests the SFF file is corrupted (damaged). Is it very large? Could you attach it to this bug (or email it to me personally) to check? Have you been able to process the SFF file with any other tools (e.g. sff_extract which should work on Windows/Linux/Mac, or the Roche tools which are Linux only)? If you copied the SFF file over your network, or over the internet from your sequencing center, perhaps there was an error there. Could you try re-downloading the SFF file? Regards, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jun 18 15:03:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 11:03:45 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To:
Message-ID: <201006181503.o5IF3j23025689@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102

------- Comment #5 from gcasaburi at tiscali.it 2010-06-18 11:03 EST -------
(In reply to comment #4)
> Thanks Chris.
> Giorgio - Could you confirm which version of Biopython are you using?
> To me the error message suggests the SFF file is corrupted (damaged). Is it
> very large? Could you attach it to this bug (or email it to me personally) to
> check?
> Have you been able to process the SFF file with any other tools (e.g.
> sff_extract which should work on Windows/Linux/Mac, or the Roche tools which
> are Linux only)?
> If you copied the SFF file over your network, or over the internet from your
> sequencing center, perhaps there was an error there. Could you try
> re-downloading the SFF file?
> Regards,
> Peter

Thank you for the answer. I have the latest version of Biopython. The file is
1.12 GB, so I think it is difficult to attach. The file was copied directly
from the USB port of the 454 onto a pen drive and is now on a normal PC.
With Biopython I have been able to open and read this SFF file, but at the end
of the reading the message (Value error:...) appears. When I try to convert
the file to FASTA the same message appears, blocking any work. So why can the
file be opened and read, with all its information (flow, length), but not
edited or converted? Thank you, I hope you can help us.
Greetings from Italy

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Jun 18 15:28:01 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 18 Jun 2010 11:28:01 -0400 Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq In-Reply-To: Message-ID: <201006181528.o5IFS1iY026418@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3102 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-18 11:28 EST ------- (In reply to comment #5) > Thank u for the answer. I have the last version of Biopython, Good. > The file is 1,12 giga, so i think is difficult to attach the file. Yes, too big to attach or email :( > The file has been taken directly from the usb port of the 454 with a > pendrive and now is in a normal PC. I would try copying it again using a different USB memory stick / pen drive. > With Biopthon i'v been able to read and open this sff file, but at the end > of the reading appers the message (Value error:...). So when i try to convert > the file in fasta the same message apper to be, bloking any work. So why the > file is open reading, with all information (flow, lewnght) but impossible to > edit, convert??? Thank u hope u can help us. > Grater from ITALY It sounds like there is an error is near the end of the file. You can open the file and read lots of reads up until the error. If you use Bio.SeqIO.parse() or Bio.SeqIO.convert() these will fail once you get to the bad read. Perhaps the file is truncated (only partly copied)? Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
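A minimal sketch of the check suggested in comment #6: count how many
records can be read before the parser raises. The helper names
count_until_error and fake_reads are hypothetical and not part of
Biopython; fake_reads only simulates a file whose last read is damaged.
With a real file one would presumably pass the iterator returned by
Bio.SeqIO.parse(handle, "sff") instead.

```python
def count_until_error(records):
    """Consume an iterator; return (records read, exception or None)."""
    count = 0
    try:
        for _ in records:
            count += 1
    except ValueError as err:
        # Stop at the first bad record and report how far we got.
        return count, err
    return count, None


def fake_reads(good, fail=True):
    """Hypothetical stand-in for a parser that hits a damaged read."""
    for i in range(good):
        yield "read%i" % i
    if fail:
        raise ValueError("damaged read")


# Assumed real-world use (not run here):
#   from Bio import SeqIO
#   count, err = count_until_error(SeqIO.parse(open(filename, "rb"), "sff"))
count, err = count_until_error(fake_reads(3))
print(count)
print(err)
```

If the count comes out close to the number of reads the run should
contain, the damage is probably confined to the end of the file, which
would be consistent with an incomplete copy.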
From bugzilla-daemon at portal.open-bio.org Fri Jun 18 17:35:00 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 18 Jun 2010 13:35:00 -0400 Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq In-Reply-To: Message-ID: <201006181735.o5IHZ0SW030183@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3102 ------- Comment #7 from gcasaburi at tiscali.it 2010-06-18 13:35 EST ------- (In reply to comment #6) > (In reply to comment #5) > > Thank u for the answer. I have the last version of Biopython, > > Good. > > > The file is 1,12 giga, so i think is difficult to attach the file. > > Yes, too big to attach or email :( > > > The file has been taken directly from the usb port of the 454 with a > > pendrive and now is in a normal PC. > > I would try copying it again using a different USB memory stick / pen drive. > > > With Biopthon i'v been able to read and open this sff file, but at the end > > of the reading appers the message (Value error:...). So when i try to convert > > the file in fasta the same message apper to be, bloking any work. So why the > > file is open reading, with all information (flow, lewnght) but impossible to > > edit, convert??? Thank u hope u can help us. > > Grater from ITALY > > It sounds like there is an error is near the end of the file. You can open the > file and read lots of reads up until the error. If you use Bio.SeqIO.parse() > or Bio.SeqIO.convert() these will fail once you get to the bad read. Perhaps > the file is truncated (only partly copied)? > > Peter > I will try to recopy the file on another pendrive. I thought like you, may be the file has a corruption at the end. 
I don't think is truncated, in fact is a .sff that represents one region of the "ptp", but the same error appers with another file .sff2 that represents the second region of the "ptp" (diveded in two regions for the same "run", totally 2 regions, each for one sample, two samples in total). So i don't know if there is a syntax command to modify the error value. Thank you Giorgio -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 22 13:11:15 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 22 Jun 2010 09:11:15 -0400 Subject: [Biopython-dev] [Bug 3060] Add ungap method to the SeqRecord? In-Reply-To: Message-ID: <201006221311.o5MDBF8o003119@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3060 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-22 09:11 EST ------- (In reply to comment #0) > My motivating example is to take an ACE file loaded with SeqIO, remove the > gaps, and output the contigs as FASTQ or QUAL files. This requires the > per-letter-annotation to be sliced to match the ungapped sequence. > > Likewise any features fully contained within ungapped regions should be > retained and their co-ordinates shifted. I'm not sure if we should do anything > about features spanning a gap - the simple option which I have implemented is > they are lost. This is done via the existing SeqRecord slicing and addition > code. I've been trying building SeqFeature objects for the reads in an ACE file, http://github.com/peterjc/biopython/tree/ace-reads In this case when I call the SeqRecord ungap method, many of my read features are lost with the current implementation (because they included gaps). This also showed the ungap code to be quite slow for features. 
I'm going to have another look at this. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 22 14:58:39 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 22 Jun 2010 10:58:39 -0400 Subject: [Biopython-dev] [Bug 3060] Add ungap method to the SeqRecord? In-Reply-To: Message-ID: <201006221458.o5MEwd0I005797@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3060 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1482 is|0 |1 obsolete| | ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-22 10:58 EST ------- (From update of attachment 1482) (In reply to comment #3) > > I've been trying building SeqFeature objects for the reads in an ACE file, > http://github.com/peterjc/biopython/tree/ace-reads > > In this case when I call the SeqRecord ungap method, many of my read features > are lost with the current implementation (because they included gaps). This > also showed the ungap code to be quite slow for features. I'm going to have > another look at this. My new code handles SeqFeature ungapping so as to preserve all the features by adjusting their end points. This is also much faster: http://github.com/peterjc/biopython/tree/ungap2 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
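One way to picture the end-point adjustment described in comment #4 is
sketched below on plain strings and (start, end) tuples rather than on
SeqRecord and SeqFeature objects. This is an illustration only, not the
code on the ungap2 branch; the function name ungap_with_spans is made up
for the example, and half-open Python-style coordinates are assumed.

```python
def ungap_with_spans(gapped, spans, gap="-"):
    """Remove gaps and shift (start, end) spans to ungapped coordinates."""
    # offsets[i] is the number of non-gap letters in gapped[:i], which is
    # exactly where position i ends up once the gaps are removed.
    offsets = [0]
    for letter in gapped:
        offsets.append(offsets[-1] + (letter != gap))
    ungapped = gapped.replace(gap, "")
    new_spans = [(offsets[start], offsets[end]) for start, end in spans]
    return ungapped, new_spans


seq, spans = ungap_with_spans("AC--GT-A", [(0, 2), (1, 5), (4, 8)])
print(seq)
print(spans)
```

Because the offsets are computed once for the whole sequence, each span
is adjusted with two lookups, and spans that cross a gap are shortened
rather than dropped.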
From anaryin at gmail.com Tue Jun 22 19:25:17 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 22 Jun 2010 14:25:17 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
Message-ID:

Hello all,

I've been using some non-standard pdb files outputted by some programs and
they miss the chemical element column in each ATOM line. I was looking at
the PDBParser code and element is dealt with like this:

        if element is None:
            import warnings
            from PDBExceptions import PDBConstructionWarning
            warnings.warn("Atom object (name=%s) without element" % name,
                          PDBConstructionWarning)
            element = "?"
            print name, "--> ?"
        elif len(element)>2 or element != element.upper() or element != element.strip():
            raise ValueError(element)
        self.element=element

In my case, the element line is not "None" but just an empty string - ' ' -
which fails these tests and is then passed on. This would be no problem at
all, but I've added a "mass" attribute to the Atom object defined like this:

        self.mass = IUPACData.atom_weights[element]

I've added the ? to the atom_weights list as I thought it would deal with
the empty element cases.

I'd suggest adding to the first if statement a test to check if the element
string is empty and if so, treat it as None.

        if element is None or element == '':
            import warnings
            from PDBExceptions import PDBConstructionWarning
            warnings.warn("Atom object (name=%s) without element" % name,
                          PDBConstructionWarning)
            element = "?"
            print name, "--> ?"
        elif len(element)>2 or element != element.upper() or element != element.strip():
            raise ValueError(element)
        self.element=element

What do you think?

Best!

João [...]
Rodrigues
@ http://doeidoei.wordpress.org

From biopython at maubp.freeserve.co.uk Wed Jun 23 09:11:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Jun 2010 10:11:06 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To:
References:
Message-ID:

On Tue, Jun 22, 2010 at 8:25 PM, João Rodrigues wrote:
> Hello all,
>
> I've been using some non-standard pdb files outputted by some programs and
> they miss the chemical element column in each ATOM line. I was looking at
> the PDBParser code and element is dealt with like this:
>
>        if element is None:
>            import warnings
>            from PDBExceptions import PDBConstructionWarning
>            warnings.warn("Atom object (name=%s) without element" % name,
>                          PDBConstructionWarning)
>            element = "?"
>            print name, "--> ?"
>        elif len(element)>2 or element != element.upper() or element !=
> element.strip():
>            raise ValueError(element)
>        self.element=element
>
>
> In my case, the element line is not "None" but just an empty string - ' ' -
> which fails these tests and is then passed on.

That makes sense, since element=line[76:78].strip() will give an empty
string. A change as you suggest makes sense, but I think just using
"if element:" would be nicer.

Peter

From biopython at maubp.freeserve.co.uk Wed Jun 23 10:28:22 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Jun 2010 11:28:22 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
Message-ID:

Hi all,

From some unit test output posted by Manabu Ishii via Twitter I think
the test suite is having problems checking for external tools on
non-English operating systems (e.g.
Debian in Japanese):

http://d.hatena.ne.jp/manabou/20100619
http://twitter.com/manabou

I've tried to update a few to do a better job (test_Muscle_tool.py,
test_Clustalw_tool.py and test_Emboss.py), but what I really need is
someone to run the test suite on a non-English system - ideally
without all these command line tools installed. The tests should
notice when the tool is missing, and be skipped without errors.

Could anyone with a non-English OS try running the latest code from
git (or even the latest release) to see if you get similar problems?

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org Wed Jun 23 13:21:25 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 23 Jun 2010 09:21:25 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To:
Message-ID: <201006231321.o5NDLPm0017094@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102

------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-23 09:21 EST -------
Hi Giorgio,

Did copying the file again help?

In addition to trying to read the SFF files with other tools (like sff_extract
or the Roche sffinfo) as suggested, I have some additional things you could
try.

Firstly try this private function to see how many reads there should be:

filename = r"C:\Users\Giorgio Casaburi\Desktop\sff\GIK1EHM01.sff"
from Bio import SeqIO
print SeqIO.SffIO._sff_file_header(open(filename, "rb"))[3]

Then compare this to the number of reads you could extract up until the error.

Secondly, see if the index can be loaded or not:

filename = r"C:\Users\Giorgio Casaburi\Desktop\sff\GIK1EHM01.sff"
from Bio import SeqIO
d = SeqIO.index(filename, "sff")
print len(d)

If it is just one or two bad reads, this may allow you to jump to specific
records (and so avoid getting stuck on the bad ones).
Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From anaryin at gmail.com Wed Jun 23 16:52:47 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 23 Jun 2010 11:52:47 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To:
References:
Message-ID:

Ok, I've changed it in my local branch to "if not element" since that covers
both None and empty strings.

Best,

João [...] Rodrigues
@ http://doeidoei.wordpress.org

On Wed, Jun 23, 2010 at 4:11 AM, Peter wrote:

> On Tue, Jun 22, 2010 at 8:25 PM, João Rodrigues wrote:
> > Hello all,
> >
> > I've been using some non-standard pdb files outputted by some programs and
> > they miss the chemical element column in each ATOM line. I was looking at
> > the PDBParser code and element is dealt with like this:
> >
> >        if element is None:
> >            import warnings
> >            from PDBExceptions import PDBConstructionWarning
> >            warnings.warn("Atom object (name=%s) without element" % name,
> >                          PDBConstructionWarning)
> >            element = "?"
> >            print name, "--> ?"
> >        elif len(element)>2 or element != element.upper() or element != element.strip():
> >            raise ValueError(element)
> >        self.element=element
> >
> >
> > In my case, the element line is not "None" but just an empty string - ' ' -
> > which fails these tests and is then passed on.
>
> That makes sense, since element=line[76:78].strip() will give an empty
> string. A change as you suggest makes sense, but I think just using
> "if element:" would be nicer.
>
> Peter
>

From biopython at maubp.freeserve.co.uk Thu Jun 24 08:26:50 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Jun 2010 09:26:50 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To:
References:
Message-ID:

On Wed, Jun 23, 2010 at 5:52 PM, João Rodrigues wrote:
> Ok, I've changed it in my local branch to if not element since that covers
> both None and empty strings.
>
> Best,
>
> João [...] Rodrigues
> @ http://doeidoei.wordpress.org

If you've done that little change as a single commit, then I can
use git cherry-pick to apply it to the master branch. But first
you need to push this work to github.com

Peter

From biopython at maubp.freeserve.co.uk Thu Jun 24 08:32:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Jun 2010 09:32:46 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To:
References:
Message-ID:

On Tue, Jun 22, 2010 at 8:25 PM, João Rodrigues wrote:
> Hello all,
>
> I've been using some non-standard pdb files outputted by some programs and
> they miss the chemical element column in each ATOM line. ... This would be no
> problem at all, but I've added a "mass" attribute to the Atom object defined like this:
>
>        self.mass = IUPACData.atom_weights[element]
>
> I've added the ? to the atom_weights list as I thought it would deal with
> the empty element cases.

I wonder if using None or NAN would be better than zero here? Or just an
exception. This is difficult for me to say without a better idea of what you
will be using the atomic weights for.

On a separate point, if you have an old fashioned PDB file without the element
column, you can probably work out the element anyway. For example CA in
a normal amino acid residue means the alpha carbon, so the element is
carbon (although in a HETATM there is a possibility it is Calcium I think).
So I think it would be possible to infer the element in many cases (but not all).
However, this is going to be a reasonable amount of work to write and test.

How common is this kind of PDB file for the work you are doing - do
many modelling packages omit the element?

Have you contacted the program authors to request they include the
element column in future?

Peter

From anaryin at gmail.com Thu Jun 24 16:36:36 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 24 Jun 2010 11:36:36 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To:
References:
Message-ID:

>
> I wonder if using None or NAN would be better than zero here? Or just an
> exception. This is difficult for me to say without a better idea of what
> you will be using the atomic weights for.
>

Right now I'm just using them for the center of mass calculation.

>
> On a separate point, if you have an old fashioned PDB file without the
> element column, you can probably work out the element anyway. For example CA in
> a normal amino acid residue means the alpha carbon, so the element is
> carbon (although in a HETATM there is a possibility it is Calcium I think).
> So I think it would be possible to infer the element in many cases (but not
> all). However, this is going to be a reasonable amount of work to write and
> test.

For non-HETATMs it's possible from the first letter of the atom name (or it
is H if the first letter is a digit). For HETATMs, names match elements
IIRC.

Do you think it's worth the try? It shouldn't be hard to write and the cases
where it would fail would be sporadic.

> How common is this kind of PDB file for the work you are doing - do
> many modelling packages omit the element?
>

> Have you contacted the program authors to request they include the
> element column in future?
>

Well... several packages do this, especially webservers. Contacting the
authors wouldn't bring that many favourable answers IMO.

I've committed it here:
http://github.com/JoaoRodrigues/biopython/commit/29f48e8f97870530520884fa6b8c9b70d87ba8bc

I commented out the self.mass part since we're still working on it.

Best,
J

From biopython at maubp.freeserve.co.uk Thu Jun 24 16:54:41 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Jun 2010 17:54:41 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To:
References:
Message-ID:

On Thu, Jun 24, 2010 at 5:36 PM, João Rodrigues wrote:
>>
>> I wonder if using None or NAN would be better than zero here? Or just an
>> exception. This is difficult for me to say without a better idea of what
>> you will be using the atomic weights for.
>>
>
> Right now I'm just using them for the center of mass calculation.
>

Well if you don't know an atom's mass, you can't calculate the real
center of mass. Maybe this should throw an exception?

>> On a separate point, if you have an old fashioned PDB file without the
>> element column, you can probably work out the element anyway. ...
>
> From non HETATMs it's possible from the first letter of the atom name (or it
> is H if the first letter is a digit). For HETATMs, names match elements
> IIRC.
>
> Do you think it's worth the try? It shouldn't be hard to write and the cases
> where it would fail would be sporadic.

Eric - what do you think?

>> How common are this kind of PDB file for the work you are doing - do
>> many modelling packages omit the element?
>
>
>> Have you contacted the program authors to request they include the
>> element column in future?
>>
>
> Well... several packages do this, especially webservers. Contacting the
> authors wouldn't bring that many favourable answers IMO.

I'd ask politely anyway ;)

> I've committed it here:
> http://github.com/JoaoRodrigues/biopython/commit/29f48e8f97870530520884fa6b8c9b70d87ba8bc
>
> I commented out the self.mass part since we're still working on it.

I've cherry-picked that for the trunk - could you test the master branch
please (just to make sure this worked as you expected)?

Thanks,

Peter

From eric.talevich at gmail.com Thu Jun 24 18:05:11 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 24 Jun 2010 14:05:11 -0400
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To:
References:
Message-ID:

On Thu, Jun 24, 2010 at 12:54 PM, Peter wrote:

> On Thu, Jun 24, 2010 at 5:36 PM, João Rodrigues wrote:
> >>
> >> I wonder if using None or NAN would be better than zero here? Or just an
> >> exception. This is difficult for me to say without a better idea of what
> >> you will be using the atomic weights for.
> >>
> >
> > Right now I'm just using them for the center of mass calculation.
> >
>
> Well if you don't know an atom's mass, you can't calculate the real
> center of mass. Maybe this should throw an exception?
>

And the center of mass calculation was for coarse-graining structures,
right? What would be most useful there?

(a) Give unknown atoms a weight of 0.0, so CoM essentially disregards them
(b) Give unknown atoms a weight of None, and have CoM check for this and
    disregard those atoms (similar effect) -- preferably issuing a warning
(c) Like (b), but CoM raises an exception
(d) Give CoM a keyword argument for how to treat this (e.g.
    strict=True/False), so coarse-graining can be permissive but direct use
    of CoM can raise an exception if desired. (However, if warnings are used
    then the warnings module already lets you convert specific warnings into
    exceptions.)

> >> On a separate point, if you have an old fashioned PDB file without the
> >> element column, you can probably work out the element anyway. ...
> >
> > From non HETATMs its possible from the first letter of the atom name (or it
> > is H if the first letter is a digit). For HETATMs, names match elements
> > IIRC.
> >
> > Do you think it's worth the try?
> > It shouldn't be hard to write and the cases
> > where it would fail would be sporadic.
>
> Eric - what do you think?
>

Sounds useful to me. Where would it fail, and how should failures be
treated? Unrecognized atom names, and then issue a warning and leave the
element attribute blank? (See options above...)

Cheers,
Eric

From anaryin at gmail.com Thu Jun 24 18:25:45 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 24 Jun 2010 13:25:45 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To:
References:
Message-ID:

>
> And the center of mass calculation was for coarse-graining structures,
> right? What would be most useful there?
>
> (a) Give unknown atoms a weight of 0.0, so CoM essentially disregards them
>

CoM takes the number of atoms into account, so 0.0 will not work anyway.

> (b) Give unknown atoms a weight of None, and have CoM check for this and
> disregard those atoms (similar effect) -- preferably issuing a warning
>

I'd prefer this. Exclude atoms from the calculation. But then this might
have an impact on the location of the center of mass.

> (c) Like (b), but CoM raises an exception
> (d) Give CoM a keyword argument for how to treat this (e.g.
> strict=True/False), so coarse-graining can be permissive but direct use of
> CoM can raise an exception if desired. (However, if warnings are used then
> the warnings module already lets you convert specific warnings into
> exceptions.)
>

My suggestion: CoM can be either geometrical or gravitational. The first
assumes equal mass for every atom, the second does not. If there's a mass
that doesn't exist, the CoM would default to geometrical and issue a warning.
Having a flag in CoM can also be valuable but I guess this would be redundant
with the warning/exception (permissive/strict) in the Atom class.

>
>
> >> On a separate point, if you have an old fashioned PDB file without the
>> >> element column, you can probably work out the element anyway. ...
>> >
>> > From non HETATMs it's possible from the first letter of the atom name (or it
>> > is H if the first letter is a digit). For HETATMs, names match elements
>> > IIRC.
>> >
>> > Do you think it's worth the try? It shouldn't be hard to write and the cases
>> > where it would fail would be sporadic.
>>
>> Eric - what do you think?
>>
>
> Sounds useful to me. Where would it fail, and how should failures be
> treated? Unrecognized atom names, and then issue a warning and leave the
> element attribute blank? (See options above...)
>

I'd implement it in the Atom class. Instead of having this check (lines
75-76):

        elif len(element)>2 or element != element.upper() or element != element.strip():
            raise ValueError(element)

there would be a check against IUPACData.atom_weights.keys(). If the element
is not found, then it would try to check the atom name and issue a warning.
If this fails, an exception is thrown.

Sounds good?

Best!
J

From anaryin at gmail.com Thu Jun 24 20:25:23 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 24 Jun 2010 15:25:23 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To:
References:
Message-ID:

Ok, I was looking at the element attribution and there's a slight problem.

I thought I could easily fetch whether the atom is from an ATOM or HETATM
record, but since the "parenting" of the Atom is only done *after* the Atom
is created, there is no way (as is) of knowing where it comes from.

Therefore, I thought of the following workaround. *hetero_flag* is already
defined when the Atom is created. It could be passed to the Atom as another
of its arguments. It would then be a conditional like this inside the Atom
class:

if not element or element not in IUPACData.atom_weights:
    if hetatm:
        if atom.name in IUPACData.atom_weights:
            element = atom.name
        else:
            element = "?"
    else: # Not HETATM
        t_element = atom.name[0] if not atom.name[0].isdigit() else atom.name[1]
        if t_element in IUPACData.atom_weights:
            element = t_element
        else:
            element = "?"
else: # Has element and it is in IUPACData
    element = element

The advantage is that if you don't give an element, or if it fails the
IUPACData check, it will try to recover it from the atom name. It also makes
it possible to throw an exception when the element is not found. Or a
warning, since for now only the CoM function uses it and it has a failsafe
against it (defaults to geometrical).

Opinions?

João

From bugzilla-daemon at portal.open-bio.org Fri Jun 25 11:49:35 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Jun 2010 07:49:35 -0400
Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing
In-Reply-To:
Message-ID: <201006251149.o5PBnZpA007121@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2738

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1327 is|0                           |1
           obsolete|                            |

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
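
For readers following the "Parsing element" thread above, the fallback João describes (trust the element column if it is valid, otherwise guess from the atom name, with different rules for ATOM and HETATM records) can be sketched as a stand-alone function. This is only an illustration under stated assumptions, not the code that went into Bio.PDB: infer_element is a hypothetical name, KNOWN_ELEMENTS is a small upper-case stand-in for the keys of IUPACData.atom_weights, and returning None for "unknown" is a choice made here rather than something the thread settled on.

```python
# Hypothetical, deliberately small stand-in for the element symbols that the
# thread proposes to look up in IUPACData.atom_weights (upper-cased here
# because PDB atom names are upper case).
KNOWN_ELEMENTS = {
    "H", "C", "N", "O", "S", "P", "SE",
    "NA", "MG", "K", "CA", "MN", "FE", "CU", "ZN", "CL",
}


def infer_element(atom_name, hetero=False, element=None):
    """Guess an element symbol, following the rules discussed in the thread.

    - A non-empty element that is a known symbol is returned as-is.
    - HETATM: the whole atom name is assumed to be the element symbol.
    - ATOM: the first character is used, or the second one when the name
      starts with a digit (e.g. "1HB" gives "H").
    Returns None when nothing plausible is found.
    """
    element = (element or "").strip().upper()
    if element in KNOWN_ELEMENTS:
        return element
    name = atom_name.strip().upper()
    if hetero:
        candidate = name
    elif name[:1].isdigit():
        candidate = name[1:2]
    else:
        candidate = name[:1]
    return candidate if candidate in KNOWN_ELEMENTS else None
```

With these rules "CA" resolves to carbon in an ATOM record and to calcium in a HETATM record, which is exactly the ambiguity Peter raised. A caller could turn a None result into a warning or an exception, whichever the list agrees on.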
From bugzilla-daemon at portal.open-bio.org Fri Jun 25 11:51:16 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Jun 2010 07:51:16 -0400
Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing
In-Reply-To:
Message-ID: <201006251151.o5PBpGE9007286@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2738

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1329 is|0                           |1
           obsolete|                            |

------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-25 07:51 EST -------
(From update of attachment 1329)
I've got a branch using regular expressions which seems to cover all the
location strings I've found in testing. It is at least twice the speed of
the old parser.
http://github.com/peterjc/biopython/tree/location-parsing2

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Fri Jun 25 15:21:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Jun 2010 16:21:46 +0100
Subject: [Biopython-dev] Re-written GenBank/EMBL feature location parsing
Message-ID:

Hi all,

I've been working on and off recently on rewriting the location
parsing for GenBank/EMBL features:
http://bugzilla.open-bio.org/show_bug.cgi?id=2738

I have a branch ready for public testing,
http://github.com/peterjc/biopython/commits/location-parsing2

The old code is still there (and indeed right now gets used as
a fall back with a warning if an unrecognised location is seen).
I'd like to label it (plus Bio.Parsers and Bio.Parsers.spark) as
obsolete for the next release, and then deprecate them in the
subsequent release.

The old code takes each location string, parses it with SPARK and generates a set of token objects for each element (see the code in Bio.GenBank.LocationParser) and then turns that into SeqFeature location and position objects. All this object creation is probably a major reason why the old code is slow. The new code takes each location string, and parses it with a mix of regular expressions and simple Python code, and then builds the SeqFeature location and position objects. On my tests this is at least twice as fast, typically between three and four times faster. The intention is this parser change will result in no functional changes at all. As part of this work I have been extending the feature unit tests, and have also run some more extensive additional tests locally (GenBank files for plants, viruses, environmental samples etc). I'm reasonably sure this covers all the location variants... but with GenBank and EMBL files you can never be sure ;) Would anyone like to volunteer to test the new branch before I merge it to the trunk? I'm also interested in comments on the code itself. Note I have tried to avoid any refactoring until the old code is actually deprecated. 
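
To give testers a feel for the "regular expressions plus simple Python" approach described above, here is a minimal sketch. It is not the code on the location-parsing2 branch: parse_location is a hypothetical name, it returns plain (start, end, strand) tuples instead of SeqFeature objects, and it only handles single positions, start..end ranges, complement(...) and join(...). Fuzzy positions such as <1..>10, between positions such as 12^13, and the reordering implied by complement(join(...)) are deliberately left out.

```python
import re

# Regular expressions for the two simplest location forms.
_RANGE = re.compile(r"^(\d+)\.\.(\d+)$")   # e.g. 123..456
_SINGLE = re.compile(r"^(\d+)$")           # e.g. 42


def parse_location(text, strand=1):
    """Parse a simple GenBank/EMBL location string.

    Returns a list of (start, end, strand) tuples using Python-style
    zero-based, end-exclusive coordinates. Illustrative sketch only.
    """
    text = text.strip()
    if text.startswith("complement(") and text.endswith(")"):
        # Strip the wrapper and flip the strand for everything inside.
        return parse_location(text[len("complement("):-1], -strand)
    if text.startswith("join(") and text.endswith(")"):
        parts = []
        for piece in text[len("join("):-1].split(","):
            parts.extend(parse_location(piece, strand))
        return parts
    match = _RANGE.match(text)
    if match:
        return [(int(match.group(1)) - 1, int(match.group(2)), strand)]
    match = _SINGLE.match(text)
    if match:
        pos = int(match.group(1))
        return [(pos - 1, pos, strand)]
    raise ValueError("Unsupported location string: %r" % text)
```

Anything the expressions do not recognise raises ValueError, which is where a real parser could fall back to the old SPARK-based code with a warning, as described above.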

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org Fri Jun 25 17:46:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Jun 2010 13:46:14 -0400
Subject: [Biopython-dev] [Bug 3103] New: Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3103

           Summary: Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
           Product: Biopython
           Version: 1.54
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: minor
          Priority: P5
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: vimalkumarvelayudhan at gmail.com

I created an RPM recently for Biopython version 1.54 and got this error from
rpmlint

python-biopython.i586: W: unable-to-read-zip /usr/share/python-biopython/Tests/PhyloXML/ncbi_taxonomy_mollusca.xml.zip: Bad magic number for central directory

This appears for both the .tar.gz and the .zip version. I could do a manual
unzip of the file though.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Sun Jun 27 15:31:11 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 27 Jun 2010 11:31:11 -0400
Subject: [Biopython-dev] [Bug 3103] Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
In-Reply-To:
Message-ID: <201006271531.o5RFVBTP001043@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103

------- Comment #1 from eric.talevich at gmail.com 2010-06-27 11:31 EST -------
Interesting. Where did you get this release of Biopython 1.54? From PyPI, or
GitHub?

I downloaded this file from phyloxml.org originally, and haven't changed it.
This file is used in the unit tests, and Python's zipfile library doesn't seem to have any trouble opening it. The 'file' command on Ubuntu 10.04 identifies it as: "Zip archive data, at least v2.0 to extract" It's actually not a very important part of the unit tests anyway, so if it's causing you trouble, I could give you a patch to remove this file from the unit tests. (If you're taking patches, there's a bug in Bio.Phylo's Nexus parsing that I'd like to include a fix for, too. It's fixed in Biopython's trunk already, but slipped past our release process for v.1.54.) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jun 27 16:45:28 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 27 Jun 2010 12:45:28 -0400 Subject: [Biopython-dev] [Bug 3103] Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML In-Reply-To: Message-ID: <201006271645.o5RGjSBd019564@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3103 ------- Comment #2 from vimalkumarvelayudhan at gmail.com 2010-06-27 12:45 EST ------- The archives were downloaded from http://biopython.org/DIST/biopython-1.54.tar.gz http://biopython.org/DIST/biopython-1.54.zip I could remove the zip file during the build process and can also patch the Phylo.Nexus for the next release if you could forward it to me. (In reply to comment #1) > Interesting. Where did you get this release of Biopython 1.54? From PyPI, or > GitHub? > > I downloaded this file from phyloxml.org originally, and haven't changed it. > This file is used in the unit tests, and Python's zipfile library doesn't seem > to have any trouble opening it. 
The 'file' command on Ubuntu 10.04 identifies > it as: > "Zip archive data, at least v2.0 to extract" > > It's actually not a very important part of the unit tests anyway, so if it's > causing you trouble, I could give you a patch to remove this file from the unit > tests. > > (If you're taking patches, there's a bug in Bio.Phylo's Nexus parsing that I'd > like to include a fix for, too. It's fixed in Biopython's trunk already, but > slipped past our release process for v.1.54.) > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sun Jun 27 22:21:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 27 Jun 2010 23:21:43 +0100 Subject: [Biopython-dev] Running unit tests on non-English OS In-Reply-To: References: Message-ID: On Wed, Jun 23, 2010 at 11:28 AM, Peter wrote: > Hi all, > > From some unit test output posted by Manabu Ishii via Twitter I > think the test suite is having problems checking for external tools > on non-English operating systems (e.g. Debian in Japanese): > http://d.hatena.ne.jp/manabou/20100619 > http://twitter.com/manabou > > I've tried to update a few to do a better job (test_Muscle_tool.py, > test_Clustalw_tool.py and test_Emboss.py), but what I really need > is someone to run the test suite on a non English system - ideally > without all these command line tools installed. The tests should > notice when the tool is missing, and be skipped without errors. > > Could anyone with a non-English OS try running the latest code > from git (or even the latest release) to see if you get similar > problems? I've also included an idea from Manabu Ishii to set environment variable LANG=C to get the default of USA English. This should work on Linux etc, and is probably harmless on Windows. 
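
As a rough illustration of the LANG=C idea, a test helper can run an external tool with the locale forced to "C" so that any messages it prints (and which the tests look for) come out in the default English. This is a sketch, not the code in the test suite: run_with_c_locale is a hypothetical name, and setting LC_ALL as well is an extra assumption made here, since on POSIX systems LC_ALL takes precedence over LANG when it is set.

```python
import os
import subprocess
import sys


def run_with_c_locale(cmd):
    """Run `cmd` (a list of arguments) with the locale forced to "C".

    Returns (return_code, combined_output). Hypothetical helper for
    illustration only.
    """
    env = dict(os.environ)   # copy, so the parent environment is untouched
    env["LANG"] = "C"
    env["LC_ALL"] = "C"      # assumption: override LC_ALL too, see above
    proc = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge stderr into stdout
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    # Show what the child process sees for LANG.
    code, out = run_with_c_locale(
        [sys.executable, "-c", "import os; print(os.environ['LANG'])"]
    )
    print(code, out.strip())
```

A missing tool would still need to be detected separately (for example by catching the OSError raised when the executable is not found) so the test can be skipped rather than fail.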
Again, testing would be most welcome (any non-English OS), Thanks Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 28 12:23:25 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 28 Jun 2010 08:23:25 -0400 Subject: [Biopython-dev] [Bug 3103] Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML In-Reply-To: Message-ID: <201006281223.o5SCNPog015539@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3103 ------- Comment #3 from eric.talevich at gmail.com 2010-06-28 08:23 EST ------- Created an attachment (id=1517) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1517&action=view) Patch to remove ncbi_xml_mollusca.xml.zip from the Phylo unit test This patch should fix the problem reported in Bug 3103. Created with git format-patch. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 28 12:25:20 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 28 Jun 2010 08:25:20 -0400 Subject: [Biopython-dev] [Bug 3103] Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML In-Reply-To: Message-ID: <201006281225.o5SCPKo9015639@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3103 ------- Comment #4 from eric.talevich at gmail.com 2010-06-28 08:25 EST ------- Created an attachment (id=1518) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1518&action=view) Patch to fix a bug in NexusIO This patch fixes another bug in NexusIO, parsing the support values on branches. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From k.okonechnikov at gmail.com Mon Jun 28 17:55:30 2010 From: k.okonechnikov at gmail.com (Konstantin Okonechnikov) Date: Tue, 29 Jun 2010 00:55:30 +0700 Subject: [Biopython-dev] Running unit tests on non-English OS In-Reply-To: References: Message-ID: Peter, I have built and run the latest code from git on Russian Ubuntu 10.4. Entrez tests have failed. Muscle, clustal and emboss tests have been skipped successfully. The tests have been executed from build.py script and I am not sure how to generate test report. Redirecting the script output to file didn't help. On Mon, Jun 28, 2010 at 5:21 AM, Peter wrote: > On Wed, Jun 23, 2010 at 11:28 AM, Peter > wrote: > > Hi all, > > > > From some unit test output posted by Manabu Ishii via Twitter I > > think the test suite is having problems checking for external tools > > on non-English operating systems (e.g. Debian in Japanese): > > http://d.hatena.ne.jp/manabou/20100619 > > http://twitter.com/manabou > > > > I've tried to update a few to do a better job (test_Muscle_tool.py, > > test_Clustalw_tool.py and test_Emboss.py), but what I really need > > is someone to run the test suite on a non English system - ideally > > without all these command line tools installed. The tests should > > notice when the tool is missing, and be skipped without errors. > > > > Could anyone with a non-English OS try running the latest code > > from git (or even the latest release) to see if you get similar > > problems? > > I've also included an idea from Manabu Ishii to set environment > variable LANG=C to get the default of USA English. This should > work on Linux etc, and is probably harmless on Windows. 
> > Again, testing would be most welcome (any non-English OS), > > Thanks > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Best regards, Konstantin From biopython at maubp.freeserve.co.uk Tue Jun 29 09:57:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 29 Jun 2010 10:57:27 +0100 Subject: [Biopython-dev] Running unit tests on non-English OS In-Reply-To: References: Message-ID: On Mon, Jun 28, 2010 at 6:55 PM, Konstantin Okonechnikov wrote: > Peter, > I have built and run the latest code from git on Russian Ubuntu 10.4. Thank you, > Entrez tests have failed. That can happen due to network problems. I'd like to see the error though. > Muscle, clustal and emboss tests have been skipped successfully. Good :) > The tests have been executed from build.py script and I am not sure how to > generate test report. Redirecting the script output to file didn't help. I normally just run "python setup.py test" from the source directory or "python run_tests.py" from the Tests subdirectory at the terminal, and copy and paste the interesting bits of the output. 
If you want to capture the test output to a file, you should probably redirect both stdout and stderr: python run_tests.py &> output.txt Regards, Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 29 19:08:45 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 29 Jun 2010 15:08:45 -0400 Subject: [Biopython-dev] [Bug 3103] Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML In-Reply-To: Message-ID: <201006291908.o5TJ8j66032031@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3103 vimalkumarvelayudhan at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from vimalkumarvelayudhan at gmail.com 2010-06-29 15:08 EST ------- Thank you. RPMs packaged with patches applied and can be found at http://download.opensuse.org/repositories/science:/vlinux/ -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From k.okonechnikov at gmail.com Wed Jun 30 03:27:20 2010 From: k.okonechnikov at gmail.com (Konstantin Okonechnikov) Date: Wed, 30 Jun 2010 10:27:20 +0700 Subject: [Biopython-dev] Running unit tests on non-English OS In-Reply-To: References: Message-ID: Peter, actually the problems with Entrez tools are Unicode related. I suppose, that the test failures are related with the current working dir path: it contains a non-English word in it, thus it can not be represented as an ascii string. Also there are similar problems with Genbank to Sql tests. Please, see the error-log attached. On Tue, Jun 29, 2010 at 4:57 PM, Peter wrote: > On Mon, Jun 28, 2010 at 6:55 PM, Konstantin Okonechnikov > wrote: > > Peter, > > I have built and run the latest code from git on Russian Ubuntu 10.4. > > Thank you, > > > Entrez tests have failed. 
> > That can happen due to network problems. I'd like to see the error though. > > > Muscle, clustal and emboss tests have been skipped successfully. > > Good :) > > > The tests have been executed from build.py script and I am not sure how > to > > generate test report. Redirecting the script output to file didn't help. > > I normally just run "python setup.py test" from the source directory or > "python run_tests.py" from the Tests subdirectory at the terminal, and > copy and paste the interesting bits of the output. > > If you want to capture the test output to a file, you should probably > redirect > both stdout and stderr: > > python run_tests.py &> output.txt > > Regards, > > Peter > -- Best regards, Konstantin -------------- next part -------------- running test test_Ace ... ok test_AlignIO ... ok test_AlignIO_convert ... ok test_BioSQL ... FAIL test_BioSQL_SeqIO ... /home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/Loader.py:797: UserWarning: order location operators are not fully supported % feature.location_operator) /home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/Loader.py:797: UserWarning: bond location operators are not fully supported % feature.location_operator) ok test_CAPS ... ok test_Clustalw ... ok test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you want to use Bio.Clustalw. test_Cluster ... ok test_CodonTable ... ok test_CodonUsage ... ok test_Compass ... ok test_Crystal ... ok test_Dialign_tool ... skipping. Install DIALIGN2-2 if you want to use the Bio.Align.Applications wrapper. test_DocSQL ... skipping. Install MySQLdb if you want to use Bio.DocSQL. test_Emboss ... skipping. Install EMBOSS if you want to use Bio.Emboss. test_EmbossPhylipNew ... skipping. Install the Emboss package 'PhylipNew' if you want to use the Bio.Emboss.Applications wrappers for phylogenetic tools. test_EmbossPrimer ... ok test_Entrez ... FAIL test_Enzyme ... ok test_FSSP ... ok test_Fasta ... ok test_File ... 
ok test_GACrossover ... ok test_GAMutation ... ok test_GAOrganism ... ok test_GAQueens ... ok test_GARepair ... ok test_GASelection ... ok test_GFF ... skipping. Environment is not configured for this test (not important if you do not plan to use Bio.GFF). test_GFF2 ... skipping. Install MySQLdb if you want to use Bio.GFF. test_GenBank ... ok test_GenomeDiagram ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsBitmaps ... skipping. Install ReportLab if you want to use Bio.Graphics. test_GraphicsChromosome ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsDistribution ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsGeneral ... skipping. Install reportlab if you want to use Bio.Graphics. test_HMMCasino ... ok test_HMMGeneral ... ok test_HotRand ... ok test_IsoelectricPoint ... ok test_KDTree ... ok test_KEGG ... ok test_KeyWList ... ok test_Location ... ok test_LocationParser ... ok test_LogisticRegression ... ok test_MEME ... ok test_Mafft_tool ... skipping. Install MAFFT if you want to use the Bio.Align.Applications wrapper. test_MarkovModel ... ok test_Medline ... ok test_Motif ... ok test_Muscle_tool ... skipping. Install MUSCLE if you want to use the Bio.Align.Applications wrapper. test_NCBIStandalone ... ok test_NCBITextParser ... ok test_NCBIXML ... ok test_NCBI_BLAST_tools ... skipping. Install the NCBI BLAST+ command line tools if you want to use the Bio.Blast.Applications wrapper. test_NCBI_qblast ... ok test_NNExclusiveOr ... ok test_NNGene ... ok test_NNGeneral ... ok test_Nexus ... ok test_PDB ... ok test_ParserSupport ... ok test_Pathway ... ok test_Phd ... ok test_Phylo ... ok test_PhyloXML ... ok test_Phylo_depend ... skipping. Install NetworkX if you want to use Bio.Phylo._utils. test_PopGen_FDist ... skipping. Install FDist if you want to use Bio.PopGen.FDist. test_PopGen_FDist_nodepend ... ok test_PopGen_GenePop ... skipping. 
Install GenePop if you want to use Bio.PopGen.GenePop. test_PopGen_GenePop_EasyController ... skipping. Install GenePop if you want to use Bio.PopGen.GenePop. test_PopGen_GenePop_nodepend ... ok test_PopGen_SimCoal ... skipping. Install SIMCOAL2 if you want to use Bio.PopGen.SimCoal. test_PopGen_SimCoal_nodepend ... ok test_Prank_tool ... skipping. Install PRANK if you want to use the Bio.Align.Applications wrapper. test_Probcons_tool ... skipping. Install PROBCONS if you want to use the Bio.Align.Applications wrapper. test_ProtParam ... ok test_Restriction ... ok test_SCOP_Astral ... ok test_SCOP_Cla ... ok test_SCOP_Des ... ok test_SCOP_Dom ... ok test_SCOP_Hie ... ok test_SCOP_Raf ... ok test_SCOP_Residues ... ok test_SCOP_Scop ... ok test_SVDSuperimposer ... ok test_SeqIO ... ok test_SeqIO_FastaIO ... ok test_SeqIO_QualityIO ... ok test_SeqIO_convert ... ok test_SeqIO_features ... ok test_SeqIO_index ... ok test_SeqIO_online ... ok test_SeqRecord ... ok test_SeqUtils ... ok test_Seq_objs ... ok test_SubsMat ... ok test_SwissProt ... ok test_TCoffee_tool ... skipping. Install TCOFFEE if you want to use the Bio.Align.Applications wrapper. test_UniGene ... ok test_UniGene_obsolete ... ok test_Wise ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_align ... ok test_geo ... ok test_interpro ... ok test_kNN ... ok test_lowess ... ok test_pairwise2 ... ok test_prodoc ... ok test_property_manager ... ok test_prosite1 ... ok test_prosite2 ... ok test_prosite_patterns ... ok test_psw ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_seq ... ok test_translate ... ok test_trie ... ok test_triefind ... ok Bio.Application docstring test ... ok Bio.Seq docstring test ... ok Bio.SeqFeature docstring test ... ok Bio.SeqRecord docstring test ... ok Bio.SeqIO docstring test ... ok Bio.SeqIO.AceIO docstring test ... ok Bio.SeqIO.PhdIO docstring test ... ok Bio.SeqIO.QualityIO docstring test ... ok Bio.SeqIO.SffIO docstring test ... 
ok Bio.SeqUtils docstring test ... ok Bio.Align docstring test ... ok Bio.Align.Generic docstring test ... ok Bio.AlignIO docstring test ... ok Bio.AlignIO.StockholmIO docstring test ... ok Bio.Blast.Applications docstring test ... ok Bio.Clustalw docstring test ... ok Bio.Emboss.Applications docstring test ... ok Bio.KEGG.Compound docstring test ... ok Bio.KEGG.Enzyme docstring test ... ok Bio.Wise docstring test ... FAIL Bio.Wise.psw docstring test ... ok Bio.Motif docstring test ... ok Bio.Statistics.lowess docstring test ... ok ====================================================================== ERROR: GenBank file to BioSQL and back to a GenBank file, NC_000932. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 423, in test_NC_000932 self.loop(os.path.join(os.getcwd(), "GenBank", "NC_000932.gb"), "gb") File "test_BioSQL.py", line 456, in loop db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL and back to a GenBank file, NC_005816. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 419, in test_NC_005816 self.loop(os.path.join(os.getcwd(), "GenBank", "NC_005816.gb"), "gb") File "test_BioSQL.py", line 456, in loop db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL and back to a GenBank file, NT_019265. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 427, in test_NT_019265 self.loop(os.path.join(os.getcwd(), "GenBank", "NT_019265.gb"), "gb") File "test_BioSQL.py", line 456, in loop db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL and back to a GenBank file, arab1. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 447, in test_arab1 self.loop(os.path.join(os.getcwd(), "GenBank", "arab1.gb"), "gb") File "test_BioSQL.py", line 456, in loop db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL and back to a GenBank file, cor6_6. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 443, in test_cor6_6 self.loop(os.path.join(os.getcwd(), "GenBank", "cor6_6.gb"), "gb") File "test_BioSQL.py", line 456, in loop db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL and back to a GenBank file, noref. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 435, in test_no_ref self.loop(os.path.join(os.getcwd(), "GenBank", "noref.gb"), "gb") File "test_BioSQL.py", line 456, in loop db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL and back to a GenBank file, one_of. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 439, in test_one_of self.loop(os.path.join(os.getcwd(), "GenBank", "one_of.gb"), "gb") File "test_BioSQL.py", line 456, in loop db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL and back to a GenBank file, protein_refseq2. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 431, in test_protein_refseq2 self.loop(os.path.join(os.getcwd(), "GenBank", "protein_refseq2.gb"), "gb") File "test_BioSQL.py", line 456, in loop db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL, then again to a new namespace, NC_000932. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 496, in test_NC_000932 self.trans(os.path.join(os.getcwd(), "GenBank", "NC_000932.gb"), "gb") File "test_BioSQL.py", line 529, in trans db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL, then again to a new namespace, NC_005816. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 492, in test_NC_005816 self.trans(os.path.join(os.getcwd(), "GenBank", "NC_005816.gb"), "gb") File "test_BioSQL.py", line 529, in trans db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL, then again to a new namespace, NT_019265. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 500, in test_NT_019265 self.trans(os.path.join(os.getcwd(), "GenBank", "NT_019265.gb"), "gb") File "test_BioSQL.py", line 529, in trans db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL, then again to a new namespace, arab1. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 520, in test_arab1 self.trans(os.path.join(os.getcwd(), "GenBank", "arab1.gb"), "gb") File "test_BioSQL.py", line 529, in trans db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL, then again to a new namespace, cor6_6. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 516, in test_cor6_6 self.trans(os.path.join(os.getcwd(), "GenBank", "cor6_6.gb"), "gb") File "test_BioSQL.py", line 529, in trans db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL, then again to a new namespace, noref. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 508, in test_no_ref self.trans(os.path.join(os.getcwd(), "GenBank", "noref.gb"), "gb") File "test_BioSQL.py", line 529, in trans db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL, then again to a new namespace, one_of. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 512, in test_one_of self.trans(os.path.join(os.getcwd(), "GenBank", "one_of.gb"), "gb") File "test_BioSQL.py", line 529, in trans db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. ====================================================================== ERROR: GenBank file to BioSQL, then again to a new namespace, protein_refseq2. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 504, in test_protein_refseq2 self.trans(os.path.join(os.getcwd(), "GenBank", "protein_refseq2.gb"), "gb") File "test_BioSQL.py", line 529, in trans db = server.new_database(db_name) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database self.adaptor.execute(sql, (db_name,authority, description)) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute self.dbutils.execute(self.cursor, sql, args) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute cursor.execute(sql.replace("%s", "?"), args or ()) ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. 
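[Note on the Entrez errors that follow] The UnicodeDecodeError entries below all report byte 0xd1 at position 11, which is the first byte after "/home/okko/" in the build path. The sketch below is only an illustration of that mechanism, not the code in Bio/Entrez/Parser.py. It is written for Python 3, so the implicit ASCII decoding that Python 2 applies when a byte string is joined with a unicode string is imitated with an explicit decode("ascii"). The directory name "тест" and the helper decode_like_python2 are hypothetical; the real directory name appears only as "??????" in this log. UTF-8 is assumed as the filesystem encoding.

```python
# Hypothetical path: the real directory name is not recoverable from the log.
raw_path = "/home/okko/тест/biopython".encode("utf-8")


def decode_like_python2(path_bytes):
    """Imitate Python 2's implicit str -> unicode coercion, which uses ASCII."""
    return path_bytes.decode("ascii")


try:
    decode_like_python2(raw_path)
except UnicodeDecodeError as err:
    # Record which codec failed, where, and on which byte.
    failure = (err.encoding, err.start, raw_path[err.start])
    print(err)

# Decoding explicitly with the filesystem encoding (assumed to be UTF-8 here)
# avoids the implicit ASCII step and succeeds.
decoded = raw_path.decode("utf-8")
print(decoded)
```

If this reading is right, the failure depends only on where the source tree is checked out, which would explain why the same tests pass when run from an ASCII-only path.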
====================================================================== ERROR: Test parsing XML returned by EFetch, Journals database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 3451, in test_journals record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by EFetch, Nucleotide database (first test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 3893, in test_nucleotide1 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by EFetch, Protein database 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 4045, in test_nucleotide2 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by EFetch, OMIM database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 3607, in test_omim record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by EFetch, PubMed database (first test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 3034, in test_pubmed1 
record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by EFetch, PubMed database (second test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 3237, in test_pubmed2 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by EFetch, Taxonomy database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 3784, in test_taxonomy record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = 
handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML output returned by EGQuery (first test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 2706, in test_egquery1 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML output returned by EGQuery (second test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 2858, in test_egquery2 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File 
"/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing database list returned by EInfo ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 26, in test_list record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing database info returned by EInfo ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 72, in test_pubmed record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File 
"/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing cancerchromosomes links returned by ELink ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 2690, in test_cancerchromosomes record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing medline indexed articles returned by ELink ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 1965, in test_medline record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in 
range(128) ====================================================================== ERROR: Test parsing Nucleotide to Protein links returned by ELink ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 1239, in test_nucleotide record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing pubmed links returned by ELink (first test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 934, in test_pubmed1 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing pubmed links returned by ELink (second test) 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 1253, in test_pubmed2 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing pubmed link returned by ELink (third test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 2404, in test_pubmed3 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing pubmed links returned by ELink (fourth test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 2431, in test_pubmed4 record 
= Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing pubmed links returned by ELink (fifth test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 2499, in test_pubmed5 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing pubmed links returned by ELink (sixth test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 2669, in test_pubmed6 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File 
"/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by EPost ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 535, in test_epost record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by EPost with an invalid id (overflow tag) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 553, in test_invalid record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File 
"/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by EPost with incorrect arguments ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 545, in test_wrong self.assertRaises(RuntimeError, Entrez.read, handle) File "/usr/lib/python2.6/unittest.py", line 336, in failUnlessRaises callableObj(*args, **kwargs) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch from the Journals database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 322, in test_journals record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File 
"/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch when no items were found ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 502, in test_notfound record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch from the Nucleotide database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 444, in test_nucleotide record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = 
os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch from PubMed Central ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 366, in test_pmc record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch from the Protein database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 479, in test_protein record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 
0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch from PubMed (first test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 107, in test_pubmed1 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESearch from PubMed (second test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 136, in test_pubmed2 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by 
ESearch from PubMed (third test) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 289, in test_pubmed3 record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML output returned by ESpell ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 3013, in test_espell record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from the Journals database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 653, 
in test_journals record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from the Nucleotide database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 766, in test_nucleotide record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from the Protein database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 727, in test_protein record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in 
read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from PubMed ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 576, in test_pubmed record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from the Structure database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 805, in test_structure record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read 
self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from the Taxonomy database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 855, in test_taxonomy record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary from the UniSTS database ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 895, in test_unists record = Entrez.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler 
path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== ERROR: Test parsing XML returned by ESummary with incorrect arguments ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 921, in test_wrong self.assertRaises(RuntimeError, Entrez.read, handle) File "/usr/lib/python2.6/unittest.py", line 336, in failUnlessRaises callableObj(*args, **kwargs) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read record = handler.read(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read self.parser.ParseFile(handle) File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler path = os.path.join(self.dtd_dir, filename) File "/usr/lib/python2.6/posixpath.py", line 70, in join path += '/' + b UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128) ====================================================================== FAIL: Doctest: Bio.Wise._build_align_cmdline ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib/python2.6/doctest.py", line 2152, in runTest raise self.failureException(self.format_failure(new.getvalue())) AssertionError: Failed doctest test for Bio.Wise._build_align_cmdline File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Wise/__init__.py", line 23, in _build_align_cmdline ---------------------------------------------------------------------- File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Wise/__init__.py", line 26, in Bio.Wise._build_align_cmdline Failed 
example:
    _build_align_cmdline(["dnal"], ("seq1.fna", "seq2.fna"), "/tmp/output", kbyte=100000)
Expected:
    'dnal -kbyte 100000 seq1.fna seq2.fna > /tmp/output'
Got:
    'dnal -kbyte 100000 -quiet seq1.fna seq2.fna > /tmp/output'
----------------------------------------------------------------------
File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Wise/__init__.py", line 28, in Bio.Wise._build_align_cmdline
Failed example:
    _build_align_cmdline(["psw"], ("seq1.faa", "seq2.faa"), "/tmp/output_aa")
Expected:
    'psw -kbyte 300000 seq1.faa seq2.faa > /tmp/output_aa'
Got:
    'psw -kbyte 300000 -quiet seq1.faa seq2.faa > /tmp/output_aa'
----------------------------------------------------------------------
Ran 144 tests in 192.676 seconds

FAILED (failures = 3)

From biopython at maubp.freeserve.co.uk Wed Jun 30 10:19:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 11:19:19 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To:
References:
Message-ID:

On Wed, Jun 30, 2010 at 4:27 AM, Konstantin Okonechnikov wrote:
> Peter,
> actually the problems with Entrez tools are Unicode related.
> I suppose, that the test failures are related with the current working dir
> path: it contains a non-English word in it, thus it can not be represented
> as an ascii string.
> Also there are similar problems with Genbank to Sql tests.
>
> Please, see the error-log attached.

Thank you for the error log. Yes, there do seem to be problems
with having the source code under a unicode path. Could you
try moving the folder from /home/okko/??????/biopython to
/home/okko/biopython and repeat the test? That would help
confirm this hypothesis.
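
The traceback in the attached log fails inside os.path.join under Python 2.6: a byte-string directory containing non-ASCII bytes is concatenated with a text filename, so Python implicitly decodes the bytes as ASCII. Below is a minimal sketch of that mechanism, written for Python 3, where the decode has to be spelled out. The directory name, the file name, and both helper functions are hypothetical (the real directory is masked as ?????? in the log), and the second helper shows only one possible remedy, not necessarily the change committed to the repository.

```python
import posixpath


def join_like_python2(byte_dir, text_name):
    """Mimic Python 2's implicit ASCII decode of a byte path (hypothetical helper)."""
    # Python 2 does this decode silently when a str is added to a unicode object.
    return posixpath.join(byte_dir.decode("ascii"), text_name)


def join_with_known_encoding(byte_dir, text_name, encoding="utf-8"):
    """Decode the byte path explicitly before joining (one possible remedy)."""
    return posixpath.join(byte_dir.decode(encoding), text_name)


# Hypothetical Cyrillic directory name, stored as UTF-8 bytes like a Linux path.
byte_dir = "/home/okko/тест/DTDs".encode("utf-8")

try:
    join_like_python2(byte_dir, "egquery.dtd")
    error = None
except UnicodeDecodeError as exc:
    error = exc

fixed = join_with_known_encoding(byte_dir, "egquery.dtd")

if error is not None:
    # The offending byte sits right after the 11 ASCII bytes of "/home/okko/".
    print("decode failed at byte offset", error.start)
print(fixed)
```

The failure offset matches the log because the first non-ASCII byte follows the 11-byte ASCII prefix of the path.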
Peter

From biopython at maubp.freeserve.co.uk Wed Jun 30 12:47:14 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 13:47:14 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To:
References:
Message-ID:

On Wed, Jun 30, 2010 at 11:19 AM, Peter wrote:
> On Wed, Jun 30, 2010 at 4:27 AM, Konstantin Okonechnikov
> wrote:
>> Peter,
>> actually the problems with Entrez tools are Unicode related.
>> I suppose, that the test failures are related with the current working dir
>> path: it contains a non-English word in it, thus it can not be represented
>> as an ascii string.
>> Also there are similar problems with Genbank to Sql tests.
>>
>> Please, see the error-log attached.
>
> Thank you for the error log. Yes, there do seem to be problems
> with having the source code under a unicode path. Could you
> try moving the folder from /home/okko/??????/biopython to
> /home/okko/biopython and repeat the test? That would help
> confirm this hypothesis.

I created a similar directory name on my (English) version of
Mac OS X, and get the same Entrez failure.

Peter

From biopython at maubp.freeserve.co.uk Wed Jun 30 13:05:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 14:05:53 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To:
References:
Message-ID:

On Wed, Jun 30, 2010 at 1:47 PM, Peter wrote:
>
> I created a similar directory name on my (English) version of
> Mac OS X, and get the same Entrez failure.
>

Hi Konstantin,

Could you retest using the latest code from github? I hope that now
test_Entrez.py will work for you.
Thanks,

Peter

From biopython at maubp.freeserve.co.uk Wed Jun 30 13:31:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 14:31:58 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To:
References:
Message-ID:

On Wed, Jun 30, 2010 at 2:05 PM, Peter wrote:
> On Wed, Jun 30, 2010 at 1:47 PM, Peter wrote:
>>
>> I created a similar directory name on my (English) version of
>> Mac OS X, and get the same Entrez failure.
>>
>
> Hi Konstantin,
>
> Could you retest using the latest code from github? I hope that now
> test_Entrez.py will work for you.

The second update should also fix test_BioSQL.py as well.

Peter

From biopython at maubp.freeserve.co.uk Wed Jun 30 14:24:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 15:24:57 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To:
References:
Message-ID:

On Wed, Jun 30, 2010 at 2:59 PM, Konstantin Okonechnikov wrote:
> The fixes work!
> Only one test fails, but it doesn't look related to non-English OS
> problems. I've attached the new test log.

Great :)

I hadn't done anything about the Bio.Wise docstring test failure yet,
but it isn't linked to the non-English OS at all. I'll start a new thread...
Peter From bugzilla-daemon at portal.open-bio.org Wed Jun 30 15:22:16 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 30 Jun 2010 11:22:16 -0400 Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing In-Reply-To: Message-ID: <201006301522.o5UFMGvo028548@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2738 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2010-06-30 11:22 EST ------- I've merged my github branch into the master. Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Jun 30 15:23:12 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 30 Jun 2010 16:23:12 +0100 Subject: [Biopython-dev] Re-written GenBank/EMBL feature location parsing In-Reply-To: References: Message-ID: On Fri, Jun 25, 2010 at 4:21 PM, Peter wrote: > Hi all, > > I've been working on and off recently on rewriting the location > parsing for GenBank/EMBL features: > http://bugzilla.open-bio.org/show_bug.cgi?id=2738 > > I have a branch ready for public testing, ... Would anyone like > to volunteer to test the new branch before I merge it to the trunk? I've just merged it - testing and feedback still welcome of course.
Peter From biopython at maubp.freeserve.co.uk Wed Jun 30 14:38:59 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 30 Jun 2010 15:38:59 +0100 Subject: [Biopython-dev] Running unit tests on non-English OS In-Reply-To: References: Message-ID: On Wed, Jun 30, 2010 at 3:24 PM, Peter wrote: > On Wed, Jun 30, 2010 at 2:59 PM, Konstantin Okonechnikov > wrote: >> The fixes work! >> Only one test fails, but it doesn't look related to non-English OS >> problems. I've attached the new test log. > > Great :) > > I hadn't done anything about the Bio.Wise docstring test failure yet, > but it isn't linked to the non-English OS at all. I'll start a new thread... > Solved. The doctest was working UNLESS the test output was being sent to a file. Peter