From krother at rubor.de Wed Dec 1 05:18:22 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 1 Dec 2010 11:18:22 +0100
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References:
Message-ID: <33168a5db375d7697c34337062e0d2b5-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQVlcUA9fXQ==-webmailer1@server01.webmailer.hosteurope.de>

Hi Joao,

Do you have a separate GIT branch for these three features?

I would volunteer to pull them, try a local merge, and run auto & manual
tests.

Best regards,
Kristian

> Hello all,
>
> I've been looking at the code I wrote for the GSOC to see what is ready to
> be merged in the main branch. I have to thank Kristian and whoever
> participated in the Python & Friends for the input.
>
> From what I gathered, and from my own tests, I believe the following
> functions are solid enough:
>
>    1. Bio/PDB/Atom.py: automatically guessing atom element from atom name
>    2. Bio/PDB/Structure.py
>       1. Building biological unit from REMARK 350 in the header (link)
>       2. Renumbering residues (link)
>
> Let me know what you all think.
>
> Best,
>
> João [...] Rodrigues
> http://doeidoei.wordpress.com
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From biopython at maubp.freeserve.co.uk Wed Dec 1 05:33:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Dec 2010 10:33:46 +0000
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To: <33168a5db375d7697c34337062e0d2b5-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQVlcUA9fXQ==-webmailer1@server01.webmailer.hosteurope.de>
References: <33168a5db375d7697c34337062e0d2b5-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQVlcUA9fXQ==-webmailer1@server01.webmailer.hosteurope.de>
Message-ID:

On Wed, Dec 1, 2010 at 10:18 AM, Kristian Rother wrote:
>
> Hi Joao,
>
> Do you have a separate GIT branch for these three features?
>
> I would volunteer to pull them, try a local merge, and run auto & manual
> tests.
>
> Best regards,
>   Kristian

I think Joao just has the one branch at the moment. If it
would be feasible to split out the functionality it would be
easier to merge incrementally.

For example, a new branch (from the master) just for
the atom element stuff in Bio.PDB shouldn't be too hard.
If while working on the GSoC changes you didn't mix
up changes in single commits then you (Joao) might
find "git cherry-pick" useful. Otherwise doing a "git diff"
between the GSoC branch and the master for the
Bio.PDB files only could give you a useful patch to
start from. Does any of that make sense?

Peter

From anaryin at gmail.com Wed Dec 1 06:13:29 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 1 Dec 2010 12:13:29 +0100
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References: <33168a5db375d7697c34337062e0d2b5-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQVlcUA9fXQ==-webmailer1@server01.webmailer.hosteurope.de>
Message-ID:

Sorry Peter, your email got completely hidden in my mailbox.. gmail bug.

I told Kristian I wouldn't mind at all creating a new branch just for these
features but I really don't know how to do it. I'll look into that git
cherry-pick command and see what I can do :)

Thanks!

João [...] Rodrigues
http://doeidoei.wordpress.com

On Wed, Dec 1, 2010 at 11:33 AM, Peter wrote:
> On Wed, Dec 1, 2010 at 10:18 AM, Kristian Rother wrote:
> >
> > Hi Joao,
> >
> > Do you have a separate GIT branch for these three features?
> >
> > I would volunteer to pull them, try a local merge, and run auto & manual
> > tests.
> >
> > Best regards,
> > Kristian
>
> I think Joao just has the one branch at the moment. If it
> would be feasible to split out the functionality it would be
> easier to merge incrementally.
>
> For example, a new branch (from the master) just for
> the atom element stuff in Bio.PDB shouldn't be too hard.
> If while working on the GSoC changes you didn't mix
> up changes in single commits then you (Joao) might
> find "git cherry-pick" useful. Otherwise doing a "git diff"
> between the GSoC branch and the master for the
> Bio.PDB files only could give you a useful patch to
> start from. Does any of that make sense?
>
> Peter
>

From anaryin at gmail.com Wed Dec 1 07:12:53 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 1 Dec 2010 13:12:53 +0100
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References: <33168a5db375d7697c34337062e0d2b5-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQVlcUA9fXQ==-webmailer1@server01.webmailer.hosteurope.de>
Message-ID:

Ok, I managed to branch it. There were some other files needing attention
other than Atom.py and IUPACData.py so it took a while to pinpoint them
all.. lesson learned to be careful with commits :)

If you want to test it yourselves, here it is:

https://github.com/JoaoRodrigues/biopython/tree/atom-element/

Best! And thanks for the help :)

João

From anaryin at gmail.com Wed Dec 1 12:01:12 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 1 Dec 2010 18:01:12 +0100
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References: <33168a5db375d7697c34337062e0d2b5-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQVlcUA9fXQ==-webmailer1@server01.webmailer.hosteurope.de>
Message-ID:

Following Peter's comments I changed some stuff.

I also noticed one thing: metal ions like CA and CL have their names
starting one character before regular C and N atoms. That allows some
discrimination between CA (alpha carbon) and CA (calcium) for example. I'd
never noticed this before, thus relying on the hetero_flag to try and
exclude metal ions (HETATM) because they would likely be wrong if such an
ambiguous case existed.

I thus removed the hetero_flag I'd added to Atom objects and expanded the
element guessing logic to all atoms. I also changed the tests in
test_PDB.py to reflect this.

Best! And thanks Peter for the comments!

From biopython at maubp.freeserve.co.uk Wed Dec 1 12:15:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Dec 2010 17:15:20 +0000
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References: <33168a5db375d7697c34337062e0d2b5-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQVlcUA9fXQ==-webmailer1@server01.webmailer.hosteurope.de>
Message-ID:

On Wed, Dec 1, 2010 at 5:01 PM, João Rodrigues wrote:
> I also noticed one thing: metal ions like CA and CL have their names
> starting one character before regular C and N atoms. That allows some
> discrimination between CA (alpha carbon) and CA (calcium) for example. I'd
> never noticed this before, ...

Is this documented in the PDB format definition? More importantly,
do third party tools follow this rule? They are the only reason we
need the code to guess the element in the first place, right? (Since
the PDB provided files should all have the element column).

Peter

From eric.talevich at gmail.com Wed Dec 1 12:29:35 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 1 Dec 2010 12:29:35 -0500
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References: <33168a5db375d7697c34337062e0d2b5-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQVlcUA9fXQ==-webmailer1@server01.webmailer.hosteurope.de>
Message-ID:

On Wed, Dec 1, 2010 at 12:15 PM, Peter wrote:
> On Wed, Dec 1, 2010 at 5:01 PM, João Rodrigues wrote:
> > I also noticed one thing: metal ions like CA and CL have their names
> > starting one character before regular C and N atoms. That allows some
> > discrimination between CA (alpha carbon) and CA (calcium) for example. I'd
> > never noticed this before, ...
>
> Is this documented in the PDB format definition? More importantly,
> do third party tools follow this rule? They are the only reason we
> need the code to guess the element in the first place, right? (Since
> the PDB provided files should all have the element column).
>

I think we can rely on this convention. I'd read this somewhere else (maybe
on one of Andrew Dalke's pages) but didn't think to apply it to João's
problem.

Here's a reference:
http://bmerc-www.bu.edu/needle-doc/latest/atom-format.html#pdb-atom-name-anomalies

-Eric

From anaryin at gmail.com Wed Dec 1 13:34:04 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 1 Dec 2010 18:34:04 +0000
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References: <33168a5db375d7697c34337062e0d2b5-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQVlcUA9fXQ==-webmailer1@server01.webmailer.hosteurope.de>
Message-ID:

http://www.wwpdb.org/documentation/format32/sect9.html

Well, there doesn't seem to be a written rule, but it is shown in the
documentation of the format.

Also, do you think it's worth including a sanity check for those elements
that have been assigned? For example, when parsing a file, checking whether
the assigned element truly corresponds to what it should be and issuing a
warning or even an exception otherwise?

From n.j.loman at bham.ac.uk Thu Dec 2 05:50:51 2010
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Thu, 02 Dec 2010 10:50:51 +0000
Subject: [Biopython-dev] Bio.AlignIO, Bio.Nexus, MrBayes, polymorphic sites, maximum line length
Message-ID: <4CF77A0B.9050204@bham.ac.uk>

Hi there

Two questions for the developers.

1) I wanted to extract polymorphic sites from a multiple alignment and
ended up with some code like this:

    alignment = AlignIO.read(fn, "nexus")
    rows = len(alignment)
    new_alignment = None
    for n in xrange(alignment.get_alignment_length()):
        aln = alignment[:,n]
        if aln[0] * rows != aln:
            if new_alignment:
                new_alignment += alignment[:,n:n+1]
            else:
                new_alignment = alignment[:,n:n+1]
    if new_alignment:
        AlignIO.write([new_alignment], open(fn + ".ply", "w"), "nexus")

Is this the best way of doing it? Would a method call in AlignIO to do the
same thing be useful to others?

2) When outputting long alignments in Nexus format, MrBayes refuses to read
the resulting files saying that the maximum line length is 19900
characters. I'm assuming that is not the maximum input to MrBayes and that
it can handle longer alignments if they are split in some way. Would it be
possible for Bio.Nexus to split alignments in the appropriate format?

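The column filter in question 1 can be sketched in plain Python to show the idea without Biopython. This is only an illustration under stated assumptions: the alignment is a hypothetical list of equal-length strings rather than a MultipleSeqAlignment, and the helper name polymorphic_columns is invented for this sketch, not an AlignIO method.

```python
def polymorphic_columns(rows):
    """Return the rows reduced to the columns where not all rows agree.

    `rows` is a hypothetical stand-in for an alignment: a list of
    equal-length strings, one per sequence.
    """
    if not rows:
        return []
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same length")
    # A column is polymorphic when it holds more than one distinct character.
    keep = [i for i in range(length) if len({row[i] for row in rows}) > 1]
    return ["".join(row[i] for i in keep) for row in rows]


example = ["ACGTA", "ACCTA", "ATGTA"]
print(polymorphic_columns(example))
```

Collecting the column indices first and joining once per row avoids growing a new alignment one column at a time, which is the part of the loop above that gets slow for long alignments.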
Cheers

Nick

From biopython at maubp.freeserve.co.uk Thu Dec 2 06:43:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 2 Dec 2010 11:43:28 +0000
Subject: [Biopython-dev] Bio.AlignIO, Bio.Nexus, MrBayes, polymorphic sites, maximum line length
In-Reply-To: <4CF77A0B.9050204@bham.ac.uk>
References: <4CF77A0B.9050204@bham.ac.uk>
Message-ID:

On Thu, Dec 2, 2010 at 10:50 AM, Nick Loman wrote:
> Hi there
>
> Two questions for the developers.
>
> 1) I wanted to extract polymorphic sites from a multiple alignment and ended
> up with some code like this:
>
>   alignment = AlignIO.read(fn, "nexus")
>   rows = len(alignment)
>   new_alignment = None
>   for n in xrange(alignment.get_alignment_length()):
>       aln = alignment[:,n]
>       if aln[0] * rows != aln:
>           if new_alignment:
>               new_alignment += alignment[:,n:n+1]
>           else:
>               new_alignment = alignment[:,n:n+1]
>   if new_alignment:
>       AlignIO.write([new_alignment], open(fn + ".ply", "w"), "nexus")
>
> Is this the best way of doing it? Would a method call in AlignIO to
> do the same thing be useful to others?

I've got some code somewhere for iterating over the columns of
the alignment, and think I filed an enhancement bug for this.
Would that do what you want?

> 2) When outputting long alignments in Nexus format, MrBayes refuses to read
> the resulting files saying that the maximum line length is 19900 characters.
> I'm assuming that is not the maximum input to MrBayes and that it can handle
> longer alignments if they are split in some way. Would it be possible for
> Bio.Nexus to split alignments in the appropriate format?

Are you outputting the large alignment using Bio.AlignIO or using
Bio.Nexus directly?

The file format details are not fresh in my mind, but I think that long
sequences can be split over multiple lines - so if the problem is
just with how MrBayes parses the file, that might be fixable.
Can you give me a test case for this (maybe generate a simple but
large alignment in code) with the MrBayes call that fails?

Peter

From cy at cymon.org Thu Dec 2 07:03:55 2010
From: cy at cymon.org (Cymon Cox)
Date: Thu, 2 Dec 2010 12:03:55 +0000
Subject: [Biopython-dev] Bio.AlignIO, Bio.Nexus, MrBayes, polymorphic sites, maximum line length
In-Reply-To:
References: <4CF77A0B.9050204@bham.ac.uk>
Message-ID:

On 2 December 2010 11:43, Peter wrote:
> On Thu, Dec 2, 2010 at 10:50 AM, Nick Loman wrote:
> > Hi there
> [...]
> > 2) When outputting long alignments in Nexus format, MrBayes refuses to read
> > the resulting files saying that the maximum line length is 19900 characters.
> > I'm assuming that is not the maximum input to MrBayes and that it can handle
> > longer alignments if they are split in some way. Would it be possible for
> > Bio.Nexus to split alignments in the appropriate format?
>
> The file format details are not fresh in my mind, but I think that long
> sequences can be split over multiple lines

This is valid interleaved Nexus format:

"""
#NEXUS
begin data;
   Dimensions ntax=4 nchar=3;
   Format interleave datatype=dna gap=-;
   Matrix
taxon1 AA
taxon2 GG
taxon3 CC
taxon4 TT

taxon1 A
taxon2 G
taxon3 C
taxon4 T
;
end;
"""

Note, "interleave" on the format line. Also beware that some Nexus parsers
don't check that taxa in additional blocks are in the same order as the
first block - they just assume they are.

You can write interleaved Nexus formatted data with
Nexus.write_nexus_data(interleave_by_partition=True) provided you have a
character partition set.

Cheers, C.

> - so if the problem is
> just with how MrBayes parses the file, that might be fixable. Can
> you give me a test case for this (maybe generate a simple but
> large alignment in code) with the MrBayes call that fails?
>
> Peter
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

--
____________________________________________________________________

Cymon J. Cox
Auxiliary Investigator

Plant Systematics and Bioinformatics Research Group (PSB)
Centro de Ciencias do Mar (CCMAR) - CIMAR-Lab. Assoc.

Mailing address:
Rm. 2.77
Faculdade de Ciências e Tecnologia (FCT), Ed.7,
Universidade do Algarve
Campus de Gambelas
8005-139 Faro
Portugal

Phone: +0351 289800909 ext 7380
Fax: +0351 289800051
Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com
HomePage : http://www.ccmar.ualg.pt/home/index.php?id=202
-8.63/-6.77

From n.j.loman at bham.ac.uk Thu Dec 2 10:25:06 2010
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Thu, 02 Dec 2010 15:25:06 +0000
Subject: [Biopython-dev] Bio.AlignIO, Bio.Nexus, MrBayes, polymorphic sites, maximum line length
In-Reply-To:
References: <4CF77A0B.9050204@bham.ac.uk>
Message-ID: <4CF7BA52.10601@bham.ac.uk>

Peter wrote:
>> Is this the best way of doing it? Would a method call in AlignIO to
>> do the same thing be useful to others?
>>
> I've got some code somewhere for iterating over the columns of
> the alignment, and think I filed an enhancement bug for this.
> Would that do what you want?
>

Hi Peter,

Yes, that would make the code more readable, definitely. Not sure whether
you think a function to return an alignment containing just the polymorphic
sites would also be helpful to others.

>> 2) When outputting long alignments in Nexus format, MrBayes refuses to read
>> the resulting files saying that the maximum line length is 19900 characters.
>> I'm assuming that is not the maximum input to MrBayes and that it can handle
>> longer alignments if they are split in some way. Would it be possible for
>> Bio.Nexus to split alignments in the appropriate format?
>>
>
> Are you outputting the large alignment using Bio.AlignIO or using
> Bio.Nexus directly?
>

In this case I was using Bio.Nexus but it would be the same with
Bio.AlignIO.

> The file format details are not fresh in my mind, but I think that long
> sequences can be split over multiple lines - so if the problem is
> just with how MrBayes parses the file, that might be fixable. Can
> you give me a test case for this (maybe generate a simple but
> large alignment in code) with the MrBayes call that fails?
>

Sure thing:

    from Bio import AlignIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.Align import MultipleSeqAlignment
    from Bio.Alphabet import generic_dna
    import subprocess

    align1 = MultipleSeqAlignment([
        SeqRecord(Seq("A" * 20000, generic_dna), id="Alpha"),
        SeqRecord(Seq("A" * 20000, generic_dna), id="Beta"),
    ])
    AlignIO.write([align1], "out.nex", "nexus")
    p = subprocess.Popen(["mb"], stdin=subprocess.PIPE)
    p.communicate("execute out.nex")

This gives the error:

    MrBayes > execute out.nex

    Executing file "out.nex"
    UNIX line termination
    Longest line length = 20006
    A maximum of 19900 characters is allowed on a single line
    in a file. The longest line of the file out.nex
    contains at least one line with 20056 characters.
    Error in command "Execute"

Cheers

Nick

From biopython at maubp.freeserve.co.uk Thu Dec 2 10:55:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 2 Dec 2010 15:55:20 +0000
Subject: [Biopython-dev] Bio.AlignIO, Bio.Nexus, MrBayes, polymorphic sites, maximum line length
In-Reply-To: <4CF7BA52.10601@bham.ac.uk>
References: <4CF77A0B.9050204@bham.ac.uk> <4CF7BA52.10601@bham.ac.uk>
Message-ID:

On Thu, Dec 2, 2010 at 3:25 PM, Nick Loman wrote:
> Peter wrote:
>>>
>>> Is this the best way of doing it? Would a method call in AlignIO to
>>> do the same thing be useful to others?
>>>
>>
>> I've got some code somewhere for iterating over the columns of
>> the alignment, and think I filed an enhancement bug for this.
>> Would that do what you want?
>>
>
> Hi Peter,
>
> Yes, that would make the code more readable, definitely. Not sure whether
> you think a function to return an alignment containing just the polymorphic
> sites would also be helpful to others.
>

I suspect it wouldn't be of general interest.

>>> 2) When outputting long alignments in Nexus format, MrBayes refuses
>>> to read the resulting files saying that the maximum line length is 19900
>>> characters.
>>> I'm assuming that is not the maximum input to MrBayes and that it can
>>> handle longer alignments if they are split in some way. Would it be
>>> possible for Bio.Nexus to split alignments in the appropriate format?
>>>
>>
>> Are you outputting the large alignment using Bio.AlignIO or using
>> Bio.Nexus directly?
>>
>
> In this case I was using Bio.Nexus but it would be the same with
> Bio.AlignIO.
>

Did you ask Bio.Nexus to write interleaved output?
I've got MrBayes 3.1.2, and this seems to fix your example:

diff --git a/Bio/AlignIO/NexusIO.py b/Bio/AlignIO/NexusIO.py
index 72550b1..c3b1649 100644
--- a/Bio/AlignIO/NexusIO.py
+++ b/Bio/AlignIO/NexusIO.py
@@ -107,7 +107,7 @@ class NexusWriter(AlignmentWriter):
         n.alphabet = alignment._alphabet
         for record in alignment:
             n.add_sequence(record.id, record.seq.tostring())
-        n.write_nexus_data(self.handle)
+        n.write_nexus_data(self.handle, interleave=True)

     def _classify_alphabet_for_nexus(self, alphabet):
         """Returns 'protein', 'dna', 'rna' based on the alphabet (PRIVATE).

Does that work for you?

Peter

From n.j.loman at bham.ac.uk Thu Dec 2 10:27:06 2010
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Thu, 02 Dec 2010 15:27:06 +0000
Subject: [Biopython-dev] Bio.AlignIO, Bio.Nexus, MrBayes, polymorphic sites, maximum line length
In-Reply-To:
References: <4CF77A0B.9050204@bham.ac.uk>
Message-ID: <4CF7BACA.8050206@bham.ac.uk>

Cymon Cox wrote:
> Note, "interleave" on the format line. Also beware that some Nexus parsers
> don't check that taxa in additional blocks are in the same order as the
> first block - they just assume they are.
>
> You can write interleaved Nexus formatted data with
> Nexus.write_nexus_data(interleave_by_partition=True) provide you have a
> character partition set.
>

Hi Cymon

Thanks for that - this would be a useful workaround, however unfortunately
I am combining a bunch of alignments a la:

    from Bio.Nexus import Nexus
    import sys

    handles = [open(fh) for fh in sys.argv[2:]]
    nexi = [(handle.name, Nexus.Nexus(handle)) for handle in handles]
    combined = Nexus.combine(nexi)
    combined.write_nexus_data(filename=sys.argv[1])

I was hoping perhaps this might set up the partitions for me for each
alignment which is merged. However, if I use:

    combined.write_nexus_data(filename=sys.argv[1], interleave_by_partition=True)

I get the following error:

    Traceback (most recent call last):
      File "combine_alignments.py", line 11, in <module>
        combined.write_nexus_data(filename=sys.argv[1], interleave_by_partition=True)
      File "Bio/Nexus/Nexus.py", line 1275, in write_nexus_data
        raise NexusError('Unknown partition: '+interleave_by_partition)
    TypeError: cannot concatenate 'str' and 'bool' objects

Which suggests that combine does not add partitions for each alignment. I
could of course work around this with extra code.

Regards,

Nick.

From biopython at maubp.freeserve.co.uk Thu Dec 2 11:22:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 2 Dec 2010 16:22:27 +0000
Subject: [Biopython-dev] Bio.AlignIO, Bio.Nexus, MrBayes, polymorphic sites, maximum line length
In-Reply-To: <4CF7BACA.8050206@bham.ac.uk>
References: <4CF77A0B.9050204@bham.ac.uk>
 <4CF7BACA.8050206@bham.ac.uk>
Message-ID:

On Thu, Dec 2, 2010 at 3:27 PM, Nick Loman wrote:
>
> ...
>   raise NexusError('Unknown partition: '+interleave_by_partition)
> TypeError: cannot concatenate 'str' and 'bool' objects
>

That should probably be something like this to avoid the TypeError
in the exception:

    raise NexusError('Unknown partition: %r' % interleave_by_partition)

>
> Which suggests that combine does not add partitions for each
> alignment. I could of course work around this with extra code.
>

Or that the code isn't expecting True for interleave_by_partition?
At first glance the expected argument type isn't obvious to me...

Peter

From biopython at maubp.freeserve.co.uk Thu Dec 2 11:25:49 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 2 Dec 2010 16:25:49 +0000
Subject: [Biopython-dev] Bio.AlignIO, Bio.Nexus, MrBayes, polymorphic sites, maximum line length
In-Reply-To:
References: <4CF77A0B.9050204@bham.ac.uk>
 <4CF7BACA.8050206@bham.ac.uk>
Message-ID:

On Thu, Dec 2, 2010 at 4:22 PM, Peter wrote:
> On Thu, Dec 2, 2010 at 3:27 PM, Nick Loman wrote:
>>
>> ...
>>   raise NexusError('Unknown partition: '+interleave_by_partition)
>> TypeError: cannot concatenate 'str' and 'bool' objects
>>
>
> That should probably be something like this to avoid the TypeError
> in the exception:
>
> raise NexusError('Unknown partition: %r' % interleave_by_partition)
>

TypeError now fixed on the trunk:
https://github.com/biopython/biopython/commit/0d8189865ef674662fc240cf1e684df1d7f9a4c4

Peter

From n.j.loman at bham.ac.uk Thu Dec 2 12:11:13 2010
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Thu, 02 Dec 2010 17:11:13 +0000
Subject: [Biopython-dev] Bio.AlignIO, Bio.Nexus, MrBayes, polymorphic sites, maximum line length
In-Reply-To:
References: <4CF77A0B.9050204@bham.ac.uk> <4CF7BA52.10601@bham.ac.uk>
Message-ID: <4CF7D331.6040604@bham.ac.uk>

Peter wrote:
> Did you ask Bio.Nexus to write interleaved output?
> I've got MrBayes 3.1.2, and this seems to fix your example:
>
> diff --git a/Bio/AlignIO/NexusIO.py b/Bio/AlignIO/NexusIO.py
> index 72550b1..c3b1649 100644
> --- a/Bio/AlignIO/NexusIO.py
> +++ b/Bio/AlignIO/NexusIO.py
> @@ -107,7 +107,7 @@ class NexusWriter(AlignmentWriter):
>          n.alphabet = alignment._alphabet
>          for record in alignment:
>              n.add_sequence(record.id, record.seq.tostring())
> -        n.write_nexus_data(self.handle)
> +        n.write_nexus_data(self.handle, interleave=True)
>
>      def _classify_alphabet_for_nexus(self, alphabet):
>          """Returns 'protein', 'dna', 'rna' based on the alphabet (PRIVATE).
>
>
> Does that work for you?
>

Hi Peter,

Yes, that does the trick! I wonder if perhaps that's what Cymon meant to
say rather than interleave_by_partition (hence the boolean problem) ?

Excellent.

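Why interleaving gets the file under the MrBayes line limit can be sketched in plain Python. This is a minimal illustration, not the Bio.Nexus writer: the function name interleave_blocks and the block width of 70 columns are assumptions made up for this example, and the records mirror the two 20000-column sequences from the test case.

```python
def interleave_blocks(records, width):
    """Split (name, sequence) pairs into interleaved blocks of `width` columns."""
    if width < 1:
        raise ValueError("width must be at least 1")
    length = len(records[0][1]) if records else 0
    blocks = []
    for start in range(0, length, width):
        # Every block repeats the taxa in the same order as the first block.
        blocks.append([(name, seq[start:start + width]) for name, seq in records])
    return blocks


records = [("Alpha", "A" * 20000), ("Beta", "A" * 20000)]
blocks = interleave_blocks(records, 70)
# Longest line if each row is written as "name<space>chunk".
longest = max(len(name) + 1 + len(chunk) for block in blocks for name, chunk in block)
print(len(blocks), longest)
```

With any modest width the longest line depends on the block width and the taxon name, not on the total alignment length, which is what the 19900 character limit cares about.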
Cheers

Nick

From biopython at maubp.freeserve.co.uk Thu Dec 2 12:19:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 2 Dec 2010 17:19:20 +0000
Subject: [Biopython-dev] Bio.AlignIO, Bio.Nexus, MrBayes, polymorphic sites, maximum line length
In-Reply-To: <4CF7D331.6040604@bham.ac.uk>
References: <4CF77A0B.9050204@bham.ac.uk> <4CF7BA52.10601@bham.ac.uk> <4CF7D331.6040604@bham.ac.uk>
Message-ID:

On Thu, Dec 2, 2010 at 5:11 PM, Nick Loman wrote:
> Peter wrote:
>>
>> Did you ask Bio.Nexus to write interleaved output?
>> I've got MrBayes 3.1.2, and this seems to fix your example:
>> ...
>> Does that work for you?
>>
>
> Hi Peter,
>
> Yes, that does the trick! I wonder if perhaps that's what Cymon meant
> to say rather than interleave_by_partition (hence the boolean problem) ?
>
> Excellent.

Could be - Cymon?

Are there any downsides to making Bio.AlignIO use interleaved
by default? I guess some tools may prefer non-interleaved output...
as a compromise we could switch depending on the number of
columns in the alignment?

Peter

From krother at rubor.de Fri Dec 3 04:58:59 2010
From: krother at rubor.de (Kristian Rother)
Date: Fri, 3 Dec 2010 10:58:59 +0100
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References:
Message-ID:

Hi Joao,

I've tested the Atom-element feature, added one more test function, and did
a small refactoring of Atom.__init__. In my opinion, this is safe to merge.

latest commit:
https://github.com/krother/biopython/tree/JoaoRodrigues/atom-element

As I understood, the other two features you mentioned are not in this
branch.

Best,
Kristian

> Hello all,
>
> I've been looking at the code I wrote for the GSOC to see what is ready to
> be merged in the main branch. I have to thank Kristian and whoever
> participated in the Python & Friends for the input.
>
> From what I gathered, and from my own tests, I believe the following
> functions are solid enough:
>
>    1. Bio/PDB/Atom.py: automatically guessing atom element from atom name
>    2. Bio/PDB/Structure.py
>       1. Building biological unit from REMARK 350 in the header (link)
>       2. Renumbering residues (link)
>
> Let me know what you all think.
>
> Best,
>
> João [...] Rodrigues
> http://doeidoei.wordpress.com
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From anaryin at gmail.com Fri Dec 3 05:42:29 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 3 Dec 2010 11:42:29 +0100
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References:
Message-ID:

Hello Kristian,

Thanks for the refactoring, it's a bit better organized indeed :) I merged
your changes into my repo and edited the comments a bit.

Regarding the other two features, nope, they are not in this branch. I
think I'll create a gsoc_solid branch to keep all the features that seem
solid enough to be merged into the master. It's better than creating one
branch for each new feature I guess.

João

From biopython at maubp.freeserve.co.uk Fri Dec 3 06:37:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 3 Dec 2010 11:37:10 +0000
Subject: [Biopython-dev] Git Branch Organisation?
In-Reply-To:
References:
Message-ID:

On Fri, Dec 3, 2010 at 11:07 AM, João Rodrigues wrote:
> Hello Peter,
>
> I have a question regarding git branch organisation. Would it be
> better to have one branch with all the GSOC features that are
> stable (e.g. gsoc_stable) that grows as more features are tested
> or having one branch for each new feature is the best option?
>
> Best,
>
> João [...] Rodrigues

I was going to reply to this issue on the list anyway.

At the end of the summer you basically had one branch with lots
of new stuff and changes to Bio.PDB - *some* of this is "stable" and
potentially ready for merging to the trunk, but other parts still need
work. Is that right?

For smaller self contained changes (like the PDB element stuff), to
me a feature branch which we can test on its own and then merge
makes sense.

The bulk of your work is a whole new module (Bio.Struct), so
another branch just for the stable stuff makes sense, if you think
it can easily be broken into stages and that doing this would make
it easier for us to evaluate, test, and merge. However, it could go
in as one big merge -- it just is more work to review in one lump ;)

So there is a balance here - is that any clearer?

Peter

From anaryin at gmail.com Fri Dec 3 06:41:56 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 3 Dec 2010 12:41:56 +0100
Subject: [Biopython-dev] Git Branch Organisation?
In-Reply-To:
References:
Message-ID:

Crystal clear now, thanks!

João [...] Rodrigues
http://doeidoei.wordpress.com

On Fri, Dec 3, 2010 at 12:37 PM, Peter wrote:
> On Fri, Dec 3, 2010 at 11:07 AM, João Rodrigues wrote:
> > Hello Peter,
> >
> > I have a question regarding git branch organisation. Would it be
> > better to have one branch with all the GSOC features that are
> > stable (e.g. gsoc_stable) that grows as more features are tested
> > or having one branch for each new feature is the best option?
> >
> > Best,
> >
> > João [...] Rodrigues
>
> I was going to reply to this issue on the list anyway.
>
> At the end of the summer you basically had one branch with lots
> new stuff and changes to Bio.PDB - *some* of this is "stable" and
> potentially ready for merging to the trunk, but other parts still need
> work. Is that right?
>
> For smaller self contained changes (like the PDB element stuff), to
> me a feature branch which we can test on its own and then merge
> makes sense.
>
> The bulk of your work is a whole new module (Bio.Struct), so
> another branch just for the stable stuff makes sense. If you think
> it can easily be broken into stages, and that doing this would make
> it easier for us to evaluate, test, and merge. However, it could go
> in as one big merge -- it just is more work to review in one lump ;)
>
> So there is a balance here - is that any clearer?
>
> Peter
>

From biopython at maubp.freeserve.co.uk Fri Dec 3 09:18:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 3 Dec 2010 14:18:54 +0000
Subject: [Biopython-dev] Genepop application wrapper class
Message-ID:

Hi Tiago,

You may have noticed from the commits recently that I'm looking
at the application wrappers. There is some "spring cleaning" that
we can do now that the old ApplicationResult class etc is gone.

As part of this I may need to make internal changes to the wrapper
_GenePopCommandline regarding how it sets up its parameters.

In looking at this I noticed that you seem to be using the _Argument
class for all the options, where in most cases I would have used
_Option.

For example, to get "BatchNumber=5" at the command line via a
parameter named BatchNumber (as the argument name and property
name in the wrapper), you have:

    _Argument(["BatchNumber"], ["input"], None, False,
              "Number of MCMC batches"),

This means you have to set the value of the BatchNumber parameter
to "BatchNumber=5" which is repetitive. What I would like to change
this to is:

    _Option(["BatchNumber"], ["input"], None, False,
            "Number of MCMC batches"),

This way you set the value of this parameter to 5 (string or int).

Since you made this whole class private I think we can change
these details without affecting the public API (your controller
class). Is that OK with you? Or would you like to go through
these settings since you know the tool better than me?

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Fri Dec 3 11:04:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 3 Dec 2010 16:04:07 +0000
Subject: [Biopython-dev] Changing the details of the app wrapper private API
Message-ID:

Hi all,

For those of you that have used the Bio.Application wrapper
classes, I'd like comments on this proposed change:

https://github.com/peterjc/biopython/commit/a1f160dea719add408eade99026f84a2af4447c2

The idea is that typically each command line tool parameter
(switch, option or argument) must have a name (which in the
current implementation is a list of strings and this covers both
the name used in Python and for switches and options the
text used for the command), and should have a description
(used for the docstring).

What the patch does is swap the __init__ arg order round to
make name and description mandatory. As a consequence,
I had to go and change every single usage - which all used
the old order. I made them use the keyword approach for other
arguments (helpful for searching for argument usage).

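The behaviour described above (a name list and a description as the only mandatory arguments, with the option rendering itself as name=value) can be sketched as follows. This is a hypothetical, simplified stand-in written for illustration; the class Option and its equate and required keywords are assumptions of this sketch and are not the Bio.Application _Option implementation.

```python
class Option:
    """Hypothetical, simplified stand-in for a name=value command line option."""

    def __init__(self, names, description, equate=True, required=False):
        # names[0] is the text used on the command line. Name and
        # description come first because every parameter needs them;
        # everything else is an optional keyword with a sensible default.
        self.names = names
        self.description = description
        self.equate = equate
        self.required = required
        self.value = None

    def __str__(self):
        # An unset option contributes nothing to the command line.
        if self.value is None:
            return ""
        sep = "=" if self.equate else " "
        return "%s%s%s" % (self.names[0], sep, self.value)


batches = Option(["BatchNumber"], "Number of MCMC batches")
batches.value = 5
print(str(batches))  # BatchNumber=5
```

The caller only supplies the value 5; the option object adds its own name, which is the repetition the _Argument based setup forces onto the user.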
The good news is that for a typical option (where the defaults make sense, e.g. use the equals sign) we would have: _Option(["-sformat","sformat"], "Input sequence(s) format (e.g. fasta, genbank)") rather than: _Option(["-sformat","sformat"], [], None, 0, "Input sequence(s) format (e.g. fasta, genbank)") I'm also tempted to shorten the is_required option to just required, and perhaps checker_function to validator? It's a small thing, but given I want to change the (private) API anyway this is a good time to do it (if ever). Related to this, I'd like to replace the types argument, which is now just used as [] (default) or ["file"], with a boolean to control automatic quoting of filename arguments. Previously this list could also include "input" and "output" which were used by the now removed ApplicationResult class. Assuming the proposed patch is merged, we can do this with a simple search and replace of types=["file"] with the new name, e.g. auto_quote=True or similar. Alternatively, perhaps this should be handled with a new subclass of _Option, say _FileOption? [In the medium/long term, I'm wondering if we can drop the set_parameter method, and then do everything via conventional Python properties, perhaps with decorators for flagging things like mandatory arguments.] Regards, Peter From chapmanb at 50mail.com Fri Dec 3 13:36:02 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 3 Dec 2010 13:36:02 -0500 Subject: [Biopython-dev] Changing the details of the app wrapper private API In-Reply-To: References: Message-ID: <20101203183602.GK23468@sobchak.mgh.harvard.edu> Peter; > For those of you that have used the Bio.Application wrapper > classes, I'd like comments on this proposed change: > > https://github.com/peterjc/biopython/commit/a1f160dea719add408eade99026f84a2af4447c2 Thanks much for fixing this up. Sorry I mucked up the API in the first place.
No complaints from me and happy to see it being improved, Brad From biopython at maubp.freeserve.co.uk Fri Dec 3 14:10:52 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 3 Dec 2010 19:10:52 +0000 Subject: [Biopython-dev] Changing the details of the app wrapper private API In-Reply-To: <20101203183602.GK23468@sobchak.mgh.harvard.edu> References: <20101203183602.GK23468@sobchak.mgh.harvard.edu> Message-ID: On Fri, Dec 3, 2010 at 6:36 PM, Brad Chapman wrote: > Peter; >> For those of you that have used the Bio.Application wrapper >> classes, I'd like comments on this proposed change: >> >> https://github.com/peterjc/biopython/commit/a1f160dea719add408eade99026f84a2af4447c2 > > Thanks much for fixing this up. Sorry I mucked up the API in the > first place. No complaints from me and happy to see it being > improved, > > Brad Hi Brad, I wouldn't say you mucked it up - you've got to start from somewhere (right?): the basic design was sound and this proposed change is just a little clean up operation really. Thanks for looking it over - I've taken that as an approval and committed it: https://github.com/biopython/biopython/commit/6957dd51be97e3cc258a36ca0904d5cbfd0de328 What are your thoughts regarding the types=["file"] stuff? Should we leave it, replace it with a boolean, or look at the subclass route? Other useful subclasses as well as a _FileOption include things like _IntegerOption and _FloatOption (although the latter is interesting with things like "1e-10" which BLAST accepts for example, but not all tools taking float arguments would like that). Peter From biopython at maubp.freeserve.co.uk Fri Dec 3 15:22:40 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 3 Dec 2010 20:22:40 +0000 Subject: [Biopython-dev] Genepop application wrapper class In-Reply-To: References: Message-ID: On Fri, Dec 3, 2010 at 2:18 PM, Peter wrote: > Hi Tiago, > > You may have noticed from the commits recently that I'm looking > at the application wrappers.
There is some "spring cleaning" that > we can do now that the old ApplicationResult class etc is gone. > > As part of this I may need to make internal changes to the > wrapper _GenePopCommandline regarding how it sets up its > parameters. > > In looking at this I noticed that you seem to be using the > _Argument class for all the options, where in most cases I > would have used _Option. For example, > > To get e.g. "BatchNumber=5" at the command line via a > parameter named BatchNumber (as the argument name and > property name in the wrapper), you have: > > _Argument(["BatchNumber"], > ["input"], > None, > False, > "Number of MCMC batches"), > > This means you have to set the value of the BatchNumber > parameter to "BatchNumber=5" which is repetitive. What I > would like to change this to is: > > _Option(["BatchNumber"], > ["input"], > None, > False, > "Number of MCMC batches"), > > This way you set the value of this parameter to 5 (string or int). > Since you made this whole class private I think we can change > these details without affecting the public API (your controller > class). > > Is that OK with you? Or would you like to go through these > settings since you know the tool better than me? Hi Tiago, Since that email I switched the argument order about a bit (and changed *all* the wrappers to match), so the example above is a little out of date now. I've made the _Argument to _Option switch on this branch: https://github.com/peterjc/biopython/tree/genepop-wrapper The unit tests pass (including the newly added simple doctest on the wrapper class), so I'm pretty sure it is OK to use on the trunk, but I'd like you to check it please.
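For anyone skimming the thread, the practical difference between the _Argument and _Option styles described above can be sketched with two toy classes. These are simplified stand-ins written only for this illustration (ToyArgument and ToyOption are made-up names, not the real Bio.Application classes), and they assume the option is rendered with an equals sign:

```python
class ToyArgument:
    """Renders only its value, so the caller must supply the full text."""

    def __init__(self, name):
        self.name = name
        self.value = None

    def __str__(self):
        return str(self.value)


class ToyOption:
    """Renders as name=value, so the caller supplies just the value."""

    def __init__(self, name):
        self.name = name
        self.value = None

    def __str__(self):
        return "%s=%s" % (self.name, self.value)


# Argument style: the parameter name has to be repeated by hand.
as_argument = ToyArgument("BatchNumber")
as_argument.value = "BatchNumber=5"

# Option style: only the value is given (an int or a string both work).
as_option = ToyOption("BatchNumber")
as_option.value = 5

print(str(as_argument))
print(str(as_option))
```

Both objects produce the same command line text; the option style just avoids repeating the name in the value.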
Thanks, Peter From biopython at maubp.freeserve.co.uk Fri Dec 3 16:18:51 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 3 Dec 2010 21:18:51 +0000 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References:

Message-ID: Hi Joao, I've put the atomic masses into the master, with the variable name fixed and the standard capitalisation: https://github.com/biopython/biopython/tree/54417611d88ef92ae6c80dfaa99d80e5ee463260 https://github.com/biopython/biopython/commit/8d5d0203049d5b75605fc1fc8c591b80304875be I did cherry-pick your commit, but when tweaking it must have lost the authorship info. Don't worry - we'll thank you in the NEWS file. Peter From anaryin at gmail.com Fri Dec 3 16:26:46 2010 From: anaryin at gmail.com (João Rodrigues) Date: Fri, 3 Dec 2010 22:26:46 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References:

Message-ID: Great! :) Thanks! From chapmanb at 50mail.com Fri Dec 3 17:01:22 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 3 Dec 2010 17:01:22 -0500 Subject: [Biopython-dev] Changing the details of the app wrapper private API In-Reply-To: References: <20101203183602.GK23468@sobchak.mgh.harvard.edu> Message-ID: <20101203220122.GP23468@sobchak.mgh.harvard.edu> Peter; [New application wrapper internals] > Thanks for looking it over - I've take that as an approval and committed it: > https://github.com/biopython/biopython/commit/6957dd51be97e3cc258a36ca0904d5cbfd0de328 Great. Thanks again for taking this on. > What are your thoughts regarding the types=["file"] stuff? Should we > leave it, replace it with a boolean, or look at the subclass route? > Other useful subclasses as well as a _FileOption include things like > _IntegerOption and _FloatOption (although the later is interesting > with things like "1e-10" which BLAST accepts for example, but not > all tools taking float arguments would like that). It might be easiest just to dump that. My initial idea behind this was that if you label things with their output type, they could be processed specifically downstream in a pipeline. This was probably way too ambitious, and it might be simpler to keep it lightweight. The application wrappers are working better as a simple way to specify a command line. Brad From biopython at maubp.freeserve.co.uk Fri Dec 3 18:13:47 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 3 Dec 2010 23:13:47 +0000 Subject: [Biopython-dev] Changing the details of the app wrapper private API In-Reply-To: <20101203220122.GP23468@sobchak.mgh.harvard.edu> References: <20101203183602.GK23468@sobchak.mgh.harvard.edu> <20101203220122.GP23468@sobchak.mgh.harvard.edu> Message-ID: On Fri, Dec 3, 2010 at 10:01 PM, Brad Chapman wrote: > >> What are your thoughts regarding the types=["file"] stuff? Should we >> leave it, replace it with a boolean, or look at the subclass route? 
>> Other useful subclasses as well as a _FileOption include things like >> _IntegerOption and _FloatOption (although the later is interesting >> with things like "1e-10" which BLAST accepts for example, but not >> all tools taking float arguments would like that). > > It might be easiest just to dump that. My initial idea behind this > was that if you label things with their output type, they could be > processed specifically downstream in a pipeline. This was probably > way too ambitious, and it might be simpler to keep it lightweight. > The application wrappers are working better as a simple way to > specify a command line. Hey Brad, I think we may be talking a little bit at cross purposes. I'll try to clarify (this may be interesting to the others too). As you just suggested ("just dump that..."), I did (earlier today) remove the "input" and "output" labels given to the parameters via the types argument. These were only used in the old ApplicationResult object (deprecated and just removed after the release of Biopython 1.56). In addition to these two now useless tags (input and output), there was one other tag "file", and that is still present and used in most if not all the wrappers. It is being used for some important functionality - supporting nasty filenames, in particular those with spaces in them. This is more an issue on Windows where even the user's home directory has spaces in it. Consider a silly example, $ tool -input filename.fasta If the filename has spaces you must quote it, $ tool -input "filename with spaces.fasta" In general that works on Windows, Mac, Linux etc. It means that users can do this: cline = WrapperClass(input="filename with space.fasta") or cline = WrapperClass(input='filename with space.fasta') or cline.input = "filename with space.fasta" etc and the wrapper will know to add the quotes for them. This all works as things stand (or at least, we have unit tests to check some examples like this and I'm not aware of any open issues). 
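To make the quoting idea concrete, here is a rough standalone sketch. It is only an illustration and not the actual Bio.Application code: the helper name quote_if_needed and the toy build_command function are made up for this example, and the real escaping logic may differ in detail.

```python
def quote_if_needed(filename):
    """Wrap a filename in double quotes if it contains a space.

    Hypothetical helper for illustration only; names that are
    already quoted are returned unchanged.
    """
    if " " not in filename:
        return filename
    if filename.startswith('"') and filename.endswith('"'):
        return filename
    return '"%s"' % filename


def build_command(tool, input_filename):
    """Toy stand-in for a wrapper class: returns the command line string."""
    return "%s -input %s" % (tool, quote_if_needed(input_filename))


print(build_command("tool", "filename.fasta"))
print(build_command("tool", "filename with spaces.fasta"))
```

The point is that the caller passes the plain filename and the quoting happens in one place, rather than every user remembering to add quotes.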
In order for that to happen the wrapper input parameter would be defined with the following: _Option(["-input", "input"], "input filename", types=["file"]) (That's with the new ordering of name list, description, then optional args) That special "file" entry in the otherwise unused types argument triggers the automatic quoting/escaping done by function _escape_filename in Bio.Application. Looking over the history, this functionality via the "file" tag was added in early 2009, as part of Bug 2815, http://bugzilla.open-bio.org/show_bug.cgi?id=2815 I want to keep this functionality, but change the current interface - which is to use types=["file"] or the default of types=[]. The simplest option is to replace it with a boolean (e.g. filename=True, or auto_quote=True). Also, thinking ahead to Python 3, there may be issues with converting filenames given as unicode strings into byte strings suitable for use in command line strings. In that case maybe filename=True is more future proof than auto_quote=True. Since this is a private API, we don't have to worry much about breaking backwards compatibility - that was a good design choice back then Brad. Regards, Peter Hopefully that wasn't too long and boring ;-) From Andrew.Gallant at tufts.edu Sun Dec 5 14:42:08 2010 From: Andrew.Gallant at tufts.edu (Andrew Gallant) Date: Sun, 5 Dec 2010 14:42:08 -0500 Subject: [Biopython-dev] Ortholog module (InParanoid and RoundUp) Message-ID: Hello, I am a graduate student, and for a course project, I wrote an Ortholog module for Biopython. It currently provides two orthology database wrappers (InParanoid and RoundUp) along with a class hierarchy to contain the data. It completely implements InParanoid's "gene search," and RoundUp's "browse" (gene search) and "retrieve" (clustering) functions, with some rudimentary error detection. RoundUp's clustering makes finding all orthologs between a set of species very easy. 
I haven't contributed to Biopython before, but assuming this module is desirable, how might I start that process? I have the changes in a forked git repository (which is updated with upstream changes) here [1]. I followed the style guide and included doc strings for all functions/modules/classes. However, I have *not* written any unit tests yet, but certainly will. Please let me know if I've missed anything! Thanks! - Andrew Gallant [1] - https://github.com/BurntSushi/biopython/tree/me From biopython at maubp.freeserve.co.uk Mon Dec 6 06:00:18 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Dec 2010 11:00:18 +0000 Subject: [Biopython-dev] Ortholog module (InParanoid and RoundUp) In-Reply-To: References: Message-ID: On Sun, Dec 5, 2010 at 7:42 PM, Andrew Gallant wrote: > Hello, > > I am a graduate student, and for a course project, I wrote an Ortholog > module for Biopython. It currently provides two orthology database wrappers > (InParanoid and RoundUp) along with a class hierarchy to contain the data. > > It completely implements InParanoid's "gene search," and RoundUp's "browse" > (gene search) and "retrieve" (clustering) functions, with some rudimentary > error detection. RoundUp's clustering makes finding all orthologs between a > set of species very easy. > > I haven't contributed to Biopython before, but assuming this module is > desirable, how might I start that process? I have the changes in a forked > git repository (which is updated with upstream changes) here [1]. I followed > the style guide and included doc strings for all functions/modules/classes. > However, I have *not* written any unit tests yet, but certainly will. > > Please let me know if I've missed anything! > > Thanks! 
> - Andrew Gallant > > [1] - https://github.com/BurntSushi/biopython/tree/me Hi Andrew, I would suggest adding some very high level introductory text, perhaps in Bio/Ortholog/__init__.py about what the Bio.Ortholog module does - offers access to a number of websites to do X. Something that should make sense to someone like me who is unfamiliar with InParanoid and RoundUp ;) Do all these services encourage/condone programmatic access? If they offer XML then I guess they do, but worth checking. If they have any usage guidelines, this should also be highlighted in your documentation. From a quick look at your code I don't think this applies, but from past experience HTML scrapers are a bad idea (a long term maintenance headache for one thing). Unit tests would be a very good idea. Try to make the tests general enough to cope with changes in the online datasets (e.g. addition of more search results). Use the requires internet hook as in test_SeqIO_online.py so they can be skipped gracefully if the user is offline, or has requested to run the tests offline. Very easy: import requires_internet requires_internet.check() Peter From chapmanb at 50mail.com Tue Dec 7 08:29:19 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 7 Dec 2010 08:29:19 -0500 Subject: [Biopython-dev] Changing the details of the app wrapper private API In-Reply-To: References: <20101203183602.GK23468@sobchak.mgh.harvard.edu> <20101203220122.GP23468@sobchak.mgh.harvard.edu> Message-ID: <20101207132919.GE4621@sobchak.mgh.harvard.edu> Peter; [Internal Application API] > As you just suggested ("just dump that..."), I did (earlier today) > remove the "input" and "output" labels given to the parameters > via the types argument. These were only used in the old > ApplicationResult object (deprecated and just removed after > the release of Biopython 1.56). 
In addition to these two now > useless tags (input and output), there was one other tag "file", > and that is still present and used in most if not all the wrappers. > > It is being used for some important functionality - supporting > nasty filenames, in particular those with spaces in them. Cool, sorry I totally missed this use of the types argument. Glad it's actually being useful for something. > I want to keep this functionality, but change the current > interface - which is to use types=["file"] or the default of > types=[]. The simplest option is to replace it with a > boolean (e.g. filename=True, or auto_quote=True). Definitely a good idea +1 for the filename=True argument to turn on quoting. This makes it more clear what is happening, and does give you room in case we need other adjustments in the future. I like this better than subclassing, since hopefully it's a limited case and we won't have to do too much manual adjustment of input parameters. Thanks again for tackling this, Brad From chapmanb at 50mail.com Tue Dec 7 08:59:41 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 7 Dec 2010 08:59:41 -0500 Subject: [Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many In-Reply-To: References: Message-ID: <20101207135941.GF4621@sobchak.mgh.harvard.edu> Peter; > You may recall some previous discussion about extending the > Bio.SeqIO.index functionality. I'm particularly interested in > keeping the index on disk to reduce the memory overhead > and thus support NGS files with many millions of reads. e.g. [...] > I've been working on the follow idea on branches in github, > and have something workable using SQLite3 to store a > table of record identifiers, file offset, and file number > (for where we have multiple files indexed together). [...] > https://github.com/peterjc/biopython/tree/index-many This is great and definitely needed. 
The implementation looks nice and fits with the current index functionality, and SQLite definitely seems like the right choice. So a big +1 on all of this. My only suggestion would be the naming: index_file makes it a little clearer about the intentions, instead of index_many (the best naming would be 'index' for this functionality and 'index_memory' for the in-memory indexing, but the ship has probably sailed on that).
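For readers following the design being discussed, here is a minimal sketch of the idea of keeping record identifiers, file offsets and file numbers in an SQLite table. It is an illustration only, not the code on the index-many branch: the table layout, the function names build_index and fetch_record, and the restriction to simple FASTA files are all assumptions made for this example.

```python
import os
import sqlite3
import tempfile


def build_index(db_path, filenames):
    """Store (record id, file number, byte offset) for each FASTA record."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE offsets "
        "(key TEXT PRIMARY KEY, file_number INTEGER, offset INTEGER)"
    )
    for file_number, filename in enumerate(filenames):
        with open(filename, "rb") as handle:
            while True:
                offset = handle.tell()
                line = handle.readline()
                if not line:
                    break
                if line.startswith(b">"):
                    parts = line[1:].split()
                    # Assumes every header has an identifier after ">".
                    key = parts[0].decode("ascii") if parts else ""
                    con.execute(
                        "INSERT INTO offsets VALUES (?, ?, ?)",
                        (key, file_number, offset),
                    )
    con.commit()
    return con


def fetch_record(con, filenames, key):
    """Look up one record by id and return its raw FASTA text."""
    row = con.execute(
        "SELECT file_number, offset FROM offsets WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    file_number, offset = row
    with open(filenames[file_number], "rb") as handle:
        handle.seek(offset)
        lines = [handle.readline()]
        for line in handle:
            if line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")


# Small demonstration with two temporary FASTA files.
tmpdir = tempfile.mkdtemp()
names = [os.path.join(tmpdir, "a.fasta"), os.path.join(tmpdir, "b.fasta")]
with open(names[0], "w", newline="\n") as out:
    out.write(">alpha first record\nACGT\nGGCC\n>beta\nTTTT\n")
with open(names[1], "w", newline="\n") as out:
    out.write(">gamma\nCCCC\n")

con = build_index(os.path.join(tmpdir, "index.sqlite"), names)
print(fetch_record(con, names, "beta"))
print(fetch_record(con, names, "gamma"))
```

Only the small lookup table is kept, and it lives on disk, so memory use stays flat however many reads are indexed; each lookup costs one query plus one seek.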
Thanks much for taking this on, Brad From biopython at maubp.freeserve.co.uk Tue Dec 7 10:11:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Dec 2010 15:11:56 +0000 Subject: [Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many In-Reply-To: <20101207135941.GF4621@sobchak.mgh.harvard.edu> References: <20101207135941.GF4621@sobchak.mgh.harvard.edu> Message-ID: On Tue, Dec 7, 2010 at 1:59 PM, Brad Chapman wrote: > Peter; > >> You may recall some previous discussion about extending the >> Bio.SeqIO.index functionality. I'm particularly interested in >> keeping the index on disk to reduce the memory overhead >> and thus support NGS files with many millions of reads. e.g. > [...] >> I've been working on the follow idea on branches in github, >> and have something workable using SQLite3 to store a >> table of record identifiers, file offset, and file number >> (for where we have multiple files indexed together). > [...] >> https://github.com/peterjc/biopython/tree/index-many > > This is great and definitely needed. The implementation > looks nice and fits with the current index functionality, > and SQLite definitely seems like the right choice. > So a big +1 on all of this. > > My only suggestion would be the naming: index_file makes it a little > clearer about the intentions, instead of index_many (the best > naming would be 'index' for this functionality and 'index_memory' for > the in-memory indexing, but the ship has probably sailed on that). Yes, we've already used "index" for the in-memory index, and its API doesn't lend itself to being extended in this way. So too late now. What do you think of index_files (plural) rather than index_file? 
Peter From eric.talevich at gmail.com Tue Dec 7 10:40:09 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 7 Dec 2010 10:40:09 -0500 Subject: [Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many In-Reply-To: References: <20101207135941.GF4621@sobchak.mgh.harvard.edu> Message-ID: On Tue, Dec 7, 2010 at 10:11 AM, Peter wrote: > On Tue, Dec 7, 2010 at 1:59 PM, Brad Chapman wrote: > > Peter; > > > >> You may recall some previous discussion about extending the > >> Bio.SeqIO.index functionality. I'm particularly interested in > >> keeping the index on disk to reduce the memory overhead > >> and thus support NGS files with many millions of reads. e.g. > > [...] > >> I've been working on the follow idea on branches in github, > >> and have something workable using SQLite3 to store a > >> table of record identifiers, file offset, and file number > >> (for where we have multiple files indexed together). > > [...] > >> https://github.com/peterjc/biopython/tree/index-many > > > > This is great and definitely needed. The implementation > > looks nice and fits with the current index functionality, > > and SQLite definitely seems like the right choice. > > So a big +1 on all of this. > > > > My only suggestion would be the naming: index_file makes it a little > > clearer about the intentions, instead of index_many (the best > > naming would be 'index' for this functionality and 'index_memory' for > > the in-memory indexing, but the ship has probably sailed on that). > > Yes, we've already used "index" for the in-memory index, and > its API doesn't lend itself to being extended in this way. So too > late now. > > What do you think of index_files (plural) rather than index_file? > How about index_db or index_sqlite? The fact that it uses a SQLite database for storage seems significant enough to be noted in the name. Thanks for adding this feature, it will be very useful! 
-Eric From biopython at maubp.freeserve.co.uk Tue Dec 7 10:45:06 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Dec 2010 15:45:06 +0000 Subject: [Biopython-dev] Changing the details of the app wrapper private API In-Reply-To: <20101207132919.GE4621@sobchak.mgh.harvard.edu> References: <20101203183602.GK23468@sobchak.mgh.harvard.edu> <20101203220122.GP23468@sobchak.mgh.harvard.edu> <20101207132919.GE4621@sobchak.mgh.harvard.edu> Message-ID: On Tue, Dec 7, 2010 at 1:29 PM, Brad Chapman wrote: > Peter; > > [Internal Application API] > >> As you just suggested ("just dump that..."), I did (earlier today) >> remove the "input" and "output" labels given to the parameters >> via the types argument. These were only used in the old >> ApplicationResult object (deprecated and just removed after >> the release of Biopython 1.56). In addition to these two now >> useless tags (input and output), there was one other tag "file", >> and that is still present and used in most if not all the wrappers. >> >> It is being used for some important functionality - supporting >> nasty filenames, in particular those with spaces in them. > > Cool, sorry I totally missed this use of the types argument. Glad > it's actually being useful for something. Easily done if you haven't been looking at this code for a while ;) >> I want to keep this functionality, but change the current >> interface - which is to use types=["file"] or the default of >> types=[]. The simplest option is to replace it with a >> boolean (e.g. filename=True, or auto_quote=True). > > Definitely a good idea +1 for the filename=True argument to turn on > quoting. This makes it more clear what is happening, and does give > you room in case we need other adjustments in the future. I like > this better than subclassing, since hopefully it's a limited case > and we won't have to do too much manual adjustment of input > parameters. OK, filename=True it is (default False). 
https://github.com/biopython/biopython/commit/fd99b976d5775e35cd251a781fb601ffb6906014 Peter P.S. As part of this work I've also started adding a minimal doctest to each app wrapper to construct a command line but not actually call it. These will then be run regardless of the dependency, and check there are no stupid problems with the wrapper. e.g. https://github.com/biopython/biopython/commit/6a93cb64e0211a9e061019220f96030871832f9e The EMBOSS wrappers still need doctests. In hindsight the test coverage here was a bit lacking.... test_Emboss.py covers a lot but not all the wrappers. Note I'm trying to make each example doctest moderately useful to anyone trying to use the tool for the first time. From biopython at maubp.freeserve.co.uk Tue Dec 7 10:47:48 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Dec 2010 15:47:48 +0000 Subject: [Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many In-Reply-To: References: <20101207135941.GF4621@sobchak.mgh.harvard.edu> Message-ID: On Tue, Dec 7, 2010 at 3:40 PM, Eric Talevich wrote: > On Tue, Dec 7, 2010 at 10:11 AM, Peter wrote: >> On Tue, Dec 7, 2010 at 1:59 PM, Brad Chapman wrote: >> > My only suggestion would be the naming: index_file makes it a little >> > clearer about the intentions, instead of index_many (the best >> > naming would be 'index' for this functionality and 'index_memory' for >> > the in-memory indexing, but the ship has probably sailed on that). >> >> Yes, we've already used "index" for the in-memory index, and >> its API doesn't lend itself to being extended in this way. So too >> late now. >> >> What do you think of index_files (plural) rather than index_file? > > How about index_db or index_sqlite? The fact that it uses a SQLite > database for storage seems significant enough to be noted in the name. > > Thanks for adding this feature, it will be very useful! > > -Eric I'd actually wondered about index_sqlite as a name myself. 
However, does it really matter to the user that it is implemented in SQLite3? Also we might one day want to make the backend an option (e.g. SQLite3 or the OBDA BDB index format still used by other Bio* projects). Thanks for the positive comments guys :) Peter From devaniranjan at gmail.com Tue Dec 7 12:27:27 2010 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Dec 2010 12:27:27 -0500 Subject: [Biopython-dev] superimposition question Message-ID: Hello everyone, Suppose I want to see the conformational variation in protein loops and extracted say 2 loops of same length and want to superimpose the 1st and last residue (say like clamp them together like pivots) how will I go about doing that? I can use the superimposer and superimpose based on either the 1st/last residue calculate the rot/tran then apply to the entire molecule but don't know how I could do it for say the 1st and last while the intermediate loops are not superimposed but are free moving? Thanks for your help and sorry if it's written in a slightly confusing manner. From biopython at maubp.freeserve.co.uk Tue Dec 7 14:59:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Dec 2010 19:59:43 +0000 Subject: [Biopython-dev] superimposition question In-Reply-To: References: Message-ID: On Tue, Dec 7, 2010 at 5:27 PM, George Devaniranjan wrote: > Hello everyone, > > Suppose I want to see the conformational variation in protein loops and > extracted say 2 loops of same length and want to superimpose the 1st and > last residue (say like clamp them together like pivots) how will I go about > doing that? I can use the superimposer and superimpose based on either the > 1st/last residue calculate the rot/tran then apply to the entire molecule > but don't know how I could do it for say the 1st and last while the > intermediate loops are not superimposed but are free moving? > > Thanks for your help and sorry if it's written in a slightly confusing > manner.
Hello George, This kind of end user query would be better off asked on the main Biopython mailing list, rather than the Biopython development list (which is for discussing changes to Biopython itself). http://lists.open-bio.org/mailman/listinfo/biopython Could you re-ask the question there? If you can give an actual example (e.g. two PDB identifiers, and which residues you are trying to align) it would probably be clearer. Thanks, Peter P.S. Have you looked at this example? http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ From devaniranjan at gmail.com Tue Dec 7 15:05:44 2010 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Dec 2010 15:05:44 -0500 Subject: [Biopython-dev] superimposition question In-Reply-To: References: Message-ID: Sorry guys, I will repost in the correct one with 2 examples. Thanks George On Tue, Dec 7, 2010 at 2:59 PM, Peter wrote: > On Tue, Dec 7, 2010 at 5:27 PM, George Devaniranjan wrote: > > Hello everyone, > > > > Suppose I want to see the conformational variation in protein loops and > > extracted say 2 loops of same length and want to superimpose the 1st and > > last residue (say like clamp them together like pivots) how will I go about > > doing that? I can use the superimposer and superimpose based on either the > > 1st/last residue calculate the rot/tran then apply to the entire molecule > > but don't know how I could do it for say the 1st and last while the > > intermediate loops are not superimposed but are free moving? > > > > Thanks for your help and sorry if it's written in a slightly confusing > > manner. > > Hello George, > > This kind of end user query would be better off asked on the main > Biopython mailing list, rather than the Biopython development list > (which is for discussing changes to Biopython itself). > http://lists.open-bio.org/mailman/listinfo/biopython > > Could you re-ask the question there? If you can give an actual > example (e.g.
two PDB identifiers, and which residues you > are trying to align) it would probably be clearer. > > Thanks, > > Peter > > P.S. Have you looked at this example? > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > From chapmanb at 50mail.com Wed Dec 8 07:41:44 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Dec 2010 07:41:44 -0500 Subject: [Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many In-Reply-To: References: <20101207135941.GF4621@sobchak.mgh.harvard.edu> Message-ID: <20101208124144.GL4621@sobchak.mgh.harvard.edu> Peter and Eric; [naming for on-disk indexing function] > > > My only suggestion would be the naming: index_file makes it a little > > > clearer about the intentions, instead of index_many > > What do you think of index_files (plural) rather than index_file? > How about index_db or index_sqlite? The fact that it uses a SQLite database > for storage seems significant enough to be noted in the name. +1 for index_db. That's clearer than index_file(s), which sort of just implies you are indexing something but not that it is non-memory. It also allows you to have multiple backends in addition to SQLite. Nice. Brad From biopython at maubp.freeserve.co.uk Wed Dec 8 08:00:21 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Dec 2010 13:00:21 +0000 Subject: [Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many In-Reply-To: <20101208124144.GL4621@sobchak.mgh.harvard.edu> References: <20101207135941.GF4621@sobchak.mgh.harvard.edu> <20101208124144.GL4621@sobchak.mgh.harvard.edu> Message-ID: On Wed, Dec 8, 2010 at 12:41 PM, Brad Chapman wrote: > > Peter and Eric; > > [naming for on-disk indexing function] > >> > > My only suggestion would be the naming: index_file makes it a little >> > > clearer about the intentions, instead of index_many > >> > What do you think of index_files (plural) rather than index_file? > >> How about index_db or index_sqlite? 
The fact that it uses a SQLite database >> for storage seems significant enough to be noted in the name. > > +1 for index_db. That's clearer than index_file(s), which sort of > just implies you are indexing something but not that it is > non-memory. It also allows you to have multiple backends in addition > to SQLite. Nice. > > Brad OK, index_db works for me. Good suggestion Eric :) Peter From tiagoantao at gmail.com Thu Dec 9 11:34:38 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 9 Dec 2010 16:34:38 +0000 Subject: [Biopython-dev] text files and mac Message-ID: Hi, I am writing this email to ask for some advice (many thanks in advance for any suggestion). I was contacted regarding a bug on the genepop parser on mac (I have no Mac, so I really do not have how to test, neither experience developing there). The parser opens a text file but, on the mac, sometimes the file is CR terminated (is this from old Mac versions? I would expect Mac OS X to be text a-la unix - ie CR-LF). Well, I've found a recommendation to open files with the U modifier ( http://www.gossamer-threads.com/lists/python/dev/755361 ). My question is simple (actually two questions): 1. What is the best practice to open text files in terms of modifiers for open? 2. How common is this format on Mac? Is it old stuff or still used? I have around 20% Mac users and this was never a reported problem. Many thanks, Tiago -- "If you want to get laid, go to college.? If you want an education, go to the library." - Frank Zappa From nicolas.rapin at bric.ku.dk Thu Dec 9 11:54:34 2010 From: nicolas.rapin at bric.ku.dk (Nicolas Rapin) Date: Thu, 9 Dec 2010 17:54:34 +0100 Subject: [Biopython-dev] text files and mac In-Reply-To: References: Message-ID: <01F31B93-B45C-4374-90AA-9312F626AD71@bric.ku.dk> I have had this problem actually, from txt files exported from excel (for mac obviously). The only work around I could find was to convert the CR terminated into LF terminated files. 
(textmate does that) hope that helps in some way. n On Dec 9, 2010, at 5:34 PM, Tiago Ant?o wrote: > Hi, > > I am writing this email to ask for some advice (many thanks in advance > for any suggestion). > I was contacted regarding a bug on the genepop parser on mac (I have > no Mac, so I really do not have how to test, neither experience > developing there). > > The parser opens a text file but, on the mac, sometimes the file is CR > terminated (is this from old Mac versions? I would expect Mac OS X to > be text a-la unix - ie CR-LF). > Well, I've found a recommendation to open files with the U modifier ( > http://www.gossamer-threads.com/lists/python/dev/755361 ). > > My question is simple (actually two questions): > 1. What is the best practice to open text files in terms of modifiers for open? > 2. How common is this format on Mac? Is it old stuff or still used? I > have around 20% Mac users and this was never a reported problem. > > Many thanks, > Tiago > > -- > "If you want to get laid, go to college. If you want an education, go > to the library." - Frank Zappa > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From biopython at maubp.freeserve.co.uk Thu Dec 9 12:07:57 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Dec 2010 17:07:57 +0000 Subject: [Biopython-dev] text files and mac In-Reply-To: References: Message-ID: 2010/12/9 Tiago Ant?o : > Hi, > > I am writing this email to ask for some advice (many thanks in advance > for any suggestion). > I was contacted regarding a bug on the genepop parser on mac (I have > no Mac, so I really do not have how to test, neither experience > developing there). > > The parser opens a text file but, on the mac, sometimes the file is CR > terminated (is this from old Mac versions? I would expect Mac OS X to > be text a-la unix - ie CR-LF). 
> Well, I've found a recommendation to open files with the U modifier ( > http://www.gossamer-threads.com/lists/python/dev/755361 ). > > My question is simple (actually two questions): > 1. What is the best practice to open text files in terms of modifiers > for open? For text files, the universal read lines mode is a sensible default. It is particularly useful for Unix vs Windows. > 2. How common is this format on Mac? Is it old stuff or still used? I > have around 20% Mac users and this was never a reported problem. I've had some line ending problems, usually from copy/paste between applications. I've not sat down to try and work out if there is a pattern. However, simple CR newlines shouldn't really be used with Mac OS X, but it wouldn't surprise me if there were buggy programs out there. Certainly the norm on Mac OS X is to use Unix style LF as the new line character (compared to DOS/Windows which is CR/LF). Peter From eric.talevich at gmail.com Thu Dec 9 14:03:40 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 9 Dec 2010 14:03:40 -0500 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: On Tue, Nov 30, 2010 at 10:45 AM, Jo?o Rodrigues wrote: > Hello all, > > I've been looking at the code I wrote for the GSOC to see what is ready to > be merged in the main branch. I have to thank Kristian and whoever > participated in the Python & Friends for the input. > > Hi Jo?o, It sounds like everyone is happy with this branch: https://github.com/JoaoRodrigues/biopython/tree/atom-element So I will try to fetch your branch, correct the spelling of the IUPACData.atom_weights references (unless you beat me to it), test, rebase onto biopython/biopython/master, and merge it this weekend. Regarding the other two stable features: - renumber_residues looks simple and useful; I'm looking forward to having that feature on the biopython trunk. 
A labmate of mine recently had a problem where he wanted to renumber just a portion of the residues in a structure -- I don't think we need to extend the function to that use case, though. - biological_unit: I still haven't tried this one myself. Does it work well enough for your needs? Using models to represent each unit of the assembly is the concept I want to be sure about -- e.g. will we be able to detect hydrogen bonds between units, or at least calculate distances between atoms in different units? We talked about creating new chain IDs as an alternative at one point, and iirc the issue was that the original structure might have multiple chains, and the names could collide. Are chain names restricted to single alphabetical characters, and if not, could you get the same effect with appending numbers to the original chain ID, e.g. A -> A0, A1, A2, ... ? (Sorry to rehash this.) Thanks for all your work on this project. -Eric From anaryin at gmail.com Thu Dec 9 15:31:35 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 9 Dec 2010 21:31:35 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: Hello Eric, Regarding renumber_residues, I could add the following arguments to make it more flexible: - *first *and *last*: to allow renumbering of specific sections of the structure - *chain* : to limit the renumbering to one chain only. As for the biological units I'll have to look at them again... to be honest I really need to retest them :x Thanks for the comments :) Best! Jo?o From anaryin at gmail.com Thu Dec 9 18:48:25 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 10 Dec 2010 00:48:25 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: I remembered how I sorted the biological unit issue: creating a new model per rotation. This way any chain conflicts are resolved. 
The downside is a less straightforward visualization in PyMol, for example (split_states is required).

From updates at feedmyinbox.com Fri Dec 10 04:05:40 2010
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Fri, 10 Dec 2010 04:05:40 -0500
Subject: [Biopython-dev] 12/10 biopython Questions - BioStar
Message-ID: <4d44f9c1d6d5375d550df186919245cd@74.63.51.88>

// Biopython 1.56 - Strange behaviour of SeqIO.parse and SeqIO.write
// December 9, 2010 at 5:53 AM

http://biostar.stackexchange.com/questions/4160/biopython-1-56-strange-behaviour-of-seqio-parse-and-seqio-write

Hello everybody,

I'm using Biopython 1.56 compiled from source on Ubuntu 10.10 64-bit.
It's a great piece of software and I love to work with it.

But there is a very strange behavior of the "Bio.SeqIO.parse()" sequence
parser, which cost me several hours to find out: If I uncomment the
for-statement with the print commands, "Bio.SeqIO.write()" refuses to
write the sequences to the file.

Is this the desired behavior? Is the for-loop iterating the parser object
to its end and leaving it there?

Could anyone help me out? What am I getting wrong?

Thanks in advance!
Markus

Example:

import Bio.SeqIO

handle = open("ls_orchid.fasta", "r")
parsedfasta = Bio.SeqIO.parse(handle, "fasta")

# If you uncomment the following part, 0 Sequences
# (instead of many more) are written to the file!
#for seq_record in parsedfasta:
    #print "Description:\t%s..." % seq_record.description
    #print "Length:\t\t%d" % len(seq_record)
    #print "Sequence:\t%s\n" % seq_record.seq[:40], seq_record.seq[44:50]

output_handle = open("example.fasta", "w")
count = Bio.SeqIO.write(parsedfasta, output_handle, "fasta")
output_handle.close()
print "%d Sequences written to file" % count

From tiagoantao at gmail.com Fri Dec 10 14:51:19 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 10 Dec 2010 19:51:19 +0000
Subject: [Biopython-dev] text files and mac
In-Reply-To: <01F31B93-B45C-4374-90AA-9312F626AD71@bric.ku.dk>
References: <01F31B93-B45C-4374-90AA-9312F626AD71@bric.ku.dk>
Message-ID: 

Hi,

2010/12/9 Nicolas Rapin :
> I have had this problem actually, from txt files exported from excel (for mac obviously).
> The only work around I could find was to convert the CR terminated into LF terminated files. (textmate does that)

Yes, Excel seems to be the culprit here. The genepop file of the person
was generated with GenAlEx ( http://www.anu.edu.au/BoZo/GenAlEx/ ). It
seems to generate CR-terminated files.
Old Macs also used CR as the end-of-line character, but the new ones are
LF only, as per the BSD base of Mac OS X.
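For readers following the line-ending question in this thread, here is a minimal sketch of reading a CR-terminated text file. It assumes Python 3, where text-mode open() applies universal newlines by default; the "U" modifier recommended earlier was the Python 2 spelling of the same idea and, to my knowledge, is no longer accepted by recent Python 3 releases. The helper name read_lines_any_newline and the sample file contents are made up for illustration and are not a real Genepop file.

```python
import os
import tempfile


def read_lines_any_newline(path):
    """Return the lines of a text file, whatever its line-ending style.

    Hypothetical helper for illustration; not part of Biopython.
    """
    # newline=None (the default) turns on universal newlines in Python 3:
    # "\r", "\n" and "\r\n" are all translated to "\n" while reading.
    with open(path, "r", newline=None) as handle:
        return [line.rstrip("\n") for line in handle]


# Write a small CR-terminated file in binary mode so the bytes are exact
# (invented sample content, only meant to mimic an old-Mac style export).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"Title line\rLoc1\rLoc2\rPop\r")
    cr_path = tmp.name

cr_lines = read_lines_any_newline(cr_path)
os.remove(cr_path)
print(cr_lines)
```

With this approach the parser itself never sees a bare CR, so no conversion step in an external editor should be needed.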
Thanks,
Tiago

From bugzilla-daemon at portal.open-bio.org Sun Dec 12 13:26:56 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 12 Dec 2010 13:26:56 -0500
Subject: [Biopython-dev] [Bug 2578] The GenBank SeqRecord parser does not record molecule type or if circular
In-Reply-To: 
Message-ID: <201012121826.oBCIQumd010190@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2578

n.j.loman at bham.ac.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |n.j.loman at bham.ac.uk

------- Comment #3 from n.j.loman at bham.ac.uk 2010-12-12 13:26 EST -------
I needed this functionality for my project, and I got what I needed through the
modified method definition in Bio.GenBank._FeatureConsumer:

    def residue_type(self, type):
        """Record the sequence type so we can choose an appropriate alphabet.
        """
        self._seq_type = type
        if 'circular' in type:
            self.data.annotations['molecule'] = 'circular'
        else:
            self.data.annotations['molecule'] = 'linear'

This obviously doesn't address the need to write GenBank files.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Sun Dec 12 13:58:10 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 12 Dec 2010 13:58:10 -0500
Subject: [Biopython-dev] [Bug 2578] The GenBank SeqRecord parser does not record molecule type or if circular
In-Reply-To: 
Message-ID: <201012121858.oBCIwAPP011089@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2578

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-12-12 13:58 EST -------
(In reply to comment #3)
> I needed this functionality for my project, and I got what I needed through the
> modified method definition in Bio.GenBank._FeatureConsumer:
>
>     def residue_type(self, type):
>         """Record the sequence type so we can choose an appropriate alphabet.
>         """
>         self._seq_type = type
>         if 'circular' in type:
>             self.data.annotations['molecule'] = 'circular'
>         else:
>             self.data.annotations['molecule'] = 'linear'
>
> This obviously doesn't address the need to write GenBank files.
>

It would be safer to use an explicit "elif 'linear' in type" branch rather
than a bare else, but sure, something like that looks sensible. I'd
wondered about a boolean as well (is it circular or not). What's your
preference here?
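
The two variants discussed in comments #3 and #4 (an explicit check for 'linear', and a boolean flag) can be sketched as a stand-alone function. This is only an illustration under stated assumptions: describe_topology is a hypothetical name, not a Biopython function, the 'molecule' and 'is_circular' keys are simply the names floated in this bug, and the default of "linear" for an unlabeled LOCUS line is a choice the thread had not settled. Unlike the substring test in comment #3, this sketch matches whole words.

```python
def describe_topology(residue_type, default="linear"):
    """Classify a GenBank LOCUS residue-type string as circular or linear.

    Hypothetical helper for illustration; not part of Biopython.
    """
    # Compare whole words so a token that merely contains "linear" or
    # "circular" as a substring is not misread.
    words = residue_type.lower().split()
    if "circular" in words:
        molecule = "circular"
    elif "linear" in words:
        molecule = "linear"
    else:
        # Assumption: fall back to a caller-supplied default when the
        # LOCUS line carries no topology word at all.
        molecule = default
    return {"molecule": molecule, "is_circular": molecule == "circular"}


plasmid = describe_topology("DNA circular")
unlabeled = describe_topology("DNA")
print(plasmid)
print(unlabeled)
```

Returning both the string and the boolean keeps either storage convention open until the BioPerl/BioSQL naming question is answered.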

From bugzilla-daemon at portal.open-bio.org Sun Dec 12 14:08:00 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 12 Dec 2010 14:08:00 -0500
Subject: [Biopython-dev] [Bug 2578] The GenBank SeqRecord parser does not record molecule type or if circular
In-Reply-To: 
Message-ID: <201012121908.oBCJ80WD011495@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2578

------- Comment #5 from n.j.loman at bham.ac.uk 2010-12-12 14:07 EST -------
I guess is_circular = True/False might be a bit nicer and avoids unnecessary
string comparisons in user code. I don't really mind though.

I thought it was fair enough to assume the sequence is linear if no qualifier
is specified in the GenBank file (sequences written by Biopython are like
that). That assumption works for my app, in any case. It would be good if I
didn't have to worry about is_circular or molecule being absent / None, not
sure if you think that is potentially misleading.

From bugzilla-daemon at portal.open-bio.org Sun Dec 12 14:51:15 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 12 Dec 2010 14:51:15 -0500
Subject: [Biopython-dev] [Bug 2578] The GenBank SeqRecord parser does not record molecule type or if circular
In-Reply-To: 
Message-ID: <201012121951.oBCJpF5t012668@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2578

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-12-12 14:51 EST -------
Nick - Do you happen to know off hand what BioPerl uses for circular/linear
when loading a GenBank file into BioSQL?
As I noted in comment #1 we should try to be consistent, and the easiest way
is to store it in the SeqRecord annotations dictionary under the same key name
as BioPerl uses, with the same values (that way we don't need any special case
code in our BioSQL interface).

From bugzilla-daemon at portal.open-bio.org Sun Dec 12 16:31:44 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 12 Dec 2010 16:31:44 -0500
Subject: [Biopython-dev] [Bug 2578] The GenBank SeqRecord parser does not record molecule type or if circular
In-Reply-To: 
Message-ID: <201012122131.oBCLVihj015058@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2578

------- Comment #7 from n.j.loman at bham.ac.uk 2010-12-12 16:31 EST -------
I know the Seq class equivalent (PrimarySeq) has a property is_circular which
is boolean, but am not sure if it is also stored as an annotation.

http://search.cpan.org/~birney/bioperl-1.2.3/Bio/PrimarySeq.pm#is_circular

I guess is_circular as an annotation would be a pragmatic compatible solution
for BioPython.

From krother at rubor.de Mon Dec 13 10:13:48 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 13 Dec 2010 16:13:48 +0100
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To: 
References: 
Message-ID: <50f7146808d7321f16cb1b179aefc712-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQlpQWwpdVw==-webmailer1@server01.webmailer.hosteurope.de>

Hi Joao,

I agree with Eric that renumbering residues should be possible chain-wise.
In my daily work, I am frequently using an interface similar to

renumber( chain, start_id )

rather than

renumber( structure, start_id )

But having these two in BioPython would be quite cool. We also do
numbering of portions

renumber (chain, start_id, from_id, to_id)

and using insertion codes --> 1A, 1B, 1C, 1D...
But I agree these are not so important.

Best regards,
Kristian

> Hello Eric,
>
> Regarding renumber_residues, I could add the following arguments to make it
> more flexible:
> - *first* and *last*: to allow renumbering of specific sections of the
> structure
> - *chain*: to limit the renumbering to one chain only.
>
> As for the biological units I'll have to look at them again... to be honest
> I really need to retest them :x
>
> Thanks for the comments :)
>
> Best!
>
> João
>

From anaryin at gmail.com Mon Dec 13 10:32:04 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Mon, 13 Dec 2010 16:32:04 +0100
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To: <50f7146808d7321f16cb1b179aefc712-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQlpQWwpdVw==-webmailer1@server01.webmailer.hosteurope.de>
References: <50f7146808d7321f16cb1b179aefc712-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQlpQWwpdVw==-webmailer1@server01.webmailer.hosteurope.de>
Message-ID: 

Hey Kristian, all,

What about this: creating a new file under Bio.PDB called Misc.py or
Utilities.py that contains functions like renumber_residues. I'm thinking of
other functions like joining PDB files, extracting chains from PDB files,
renaming chains etc.

These are all operations that one can easily do programmatically using
Bio.PDB, however they do require coding (10 lines each I'd say).
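
To make the renumber(chain, start_id) and renumber(chain, start_id, from_id, to_id) interfaces mentioned above a little more concrete, here is a rough sketch that works on plain tuples rather than on Bio.PDB objects. It is an assumption-laden illustration, not the GSoC implementation: residue identifiers are modelled as (hetero_flag, number, insertion_code) tuples in the style Bio.PDB uses for residue ids, the function takes a list of such ids instead of a Chain, and the semantics chosen here (number consecutively from start_id, clear insertion codes, leave residues outside the range alone) are only one reasonable reading of the discussion.

```python
def renumber(residue_ids, start_id=1, from_id=None, to_id=None):
    """Return a renumbered copy of a list of residue ids.

    Hypothetical helper for illustration; not part of Biopython.
    Each id is a (hetero_flag, number, insertion_code) tuple.
    """
    renumbered = []
    next_number = start_id
    for hetero_flag, number, insertion_code in residue_ids:
        # A missing bound means "open on that side", so the defaults
        # renumber every residue in the list.
        in_range = ((from_id is None or number >= from_id)
                    and (to_id is None or number <= to_id))
        if in_range:
            # Assumption: every renumbered residue gets its own number,
            # so the insertion code is reset to blank.
            renumbered.append((hetero_flag, next_number, " "))
            next_number += 1
        else:
            renumbered.append((hetero_flag, number, insertion_code))
    return renumbered


# Invented example chain with one insertion code (11A).
chain_a = [(" ", 10, " "), (" ", 11, " "), (" ", 11, "A"), (" ", 12, " ")]
whole_chain = renumber(chain_a, start_id=1)
partial = renumber(chain_a, start_id=100, from_id=11, to_id=11)
print(whole_chain)
print(partial)
```

Note that a partial renumbering like this can produce ids that collide with untouched residues if start_id is chosen carelessly; a real implementation would probably need to check for that.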
This "collection" of useful snippets could be included in a separate file and even provide an interface to the command line. This was one of Eric's ideas for GSOC I didn't pursue in my project. Regarding renumber_residues explicitly, this would allow using it structure-wide or chain-wide without duplicating code (Right now it sits on Structure.py). I can also keep it as is now and just add chain, from_id, and to_id arguments. Let me know of your opinions, Cheers Jo?o [...] Rodrigues http://doeidoei.wordpress.com On Mon, Dec 13, 2010 at 4:13 PM, Kristian Rother wrote: > > Hi Joao, > > I agree with Eric that renumbering residues should be possible chain-wise. > > In my daily work, I am frequently using an interface similar to > > renumber( chain, start_id ) > > rather than > > renumber( structure, start_id ) > > But having these two in BioPython would be quite cool. We also do > numbering of portions > > renumber (chain, start_id, from_id, to_id) > > and using insertion codes --> 1A, 1B, 1C, 1D... > But I agree these are not so important. > > Best regards, > Kristian > > > > > > > Hello Eric, > > > > Regarding renumber_residues, I could add the following arguments to make > > it > > more flexible: > > - *first *and *last*: to allow renumbering of specific sections of the > > structure > > - *chain* : to limit the renumbering to one chain only. > > > > As for the biological units I'll have to look at them again... to be > > honest > > I really need to retest them :x > > > > Thanks for the comments :) > > > > Best! 
> > > > Jo?o > > > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > > From eric.talevich at gmail.com Mon Dec 13 13:39:49 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 13 Dec 2010 13:39:49 -0500 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: <50f7146808d7321f16cb1b179aefc712-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQlpQWwpdVw==-webmailer1@server01.webmailer.hosteurope.de> Message-ID: Hey folks, 1. For stand-alone functions that don't seem to fit anywhere else, I suggest creating a file called _utils.py (rather than Utils.py or Misc.py) -- this means it's protected, i.e. users know they're not supposed to access it directly -- then import it to the top level in __init__.py: from Bio.PDB._utils import renumber_residues, center_of_mass, ... Then you're allowed to move the functions in that module somewhere else later without going through the usual deprecation process. 2. For renumber_residues in particular, adding a keyword argument like "chain=None" would solve the immediate problem. Would anyone want to select a model for renumbering? We might as well make it possible. So the three sensible ways to do that are: (a) Add a "model=None" keyword, too, and select the appropriate chain(s)/model(s) based on the combination of those two arguments. I think this isn't so bad. (b) Move the method from Structure to Entity. Then Residue inherits the method, too, and I guess it's a trivial operation there. (c) Turn the method into a separate function, accessible as Bio.PDB.renumber_residues(). Does this method make sense on RNA or DNA structures? Do RNA people call bases "residues" sometimes? 
- If so: keeping the method on PDB.Structure is good - If not: When Bio.Struct lands, the method might need to move to Bio.Struct.Protein I think the best route is (a), leaving it on PDB.Structure. The three cases of Structure, Model and Chain cover everything in Bio.PDB that you'd want to renumber, and two optional keyword arguments aren't so bad. Also, most of the other functionality in Bio.PDB is accessed through the Structure or Entity objects, rather than through top-level functions. I.e., you do "from Bio.PDB import stuff_i_need" rather than "from Bio import PDB" usually. Let's also leave out renumbering specified residue ranges, at least for this merge -- having both "start" and "first" as keyword arguments could be really confusing. Maybe post it as a cookbook entry first? Cheers, Eric On Mon, Dec 13, 2010 at 10:32 AM, Jo?o Rodrigues wrote: > Hey Kristian, all, > > What about this: creating a new file under Bio.PDB called Misc.py or > Utilities.py that contains functions like renumber_residues. I'm thinking on > other functions like joining PDB files, extracting chains from PDB files, > renaming chains etc. > > These are all operations that one can easily do programatically using > Bio.PDB, however they do require coding (10 lines each I'd say). This > "collection" of useful snippets could be included in a separate file and > even provide an interface to the command line. This was one of Eric's ideas > for GSOC I didn't pursue in my project. > > Regarding renumber_residues explicitly, this would allow using it > structure-wide or chain-wide without duplicating code (Right now it sits on > Structure.py). I can also keep it as is now and just add chain, from_id, and > to_id arguments. > > Let me know of your opinions, > > Cheers > > > Jo?o [...] Rodrigues > http://doeidoei.wordpress.com > > > > On Mon, Dec 13, 2010 at 4:13 PM, Kristian Rother wrote: > >> >> Hi Joao, >> >> I agree with Eric that renumbering residues should be possible chain-wise. 
>> >> In my daily work, I am frequently using an interface similar to >> >> renumber( chain, start_id ) >> >> rather than >> >> renumber( structure, start_id ) >> >> But having these two in BioPython would be quite cool. We also do >> numbering of portions >> >> renumber (chain, start_id, from_id, to_id) >> >> and using insertion codes --> 1A, 1B, 1C, 1D... >> But I agree these are not so important. >> >> Best regards, >> Kristian >> >> >> >> >> >> > Hello Eric, >> > >> > Regarding renumber_residues, I could add the following arguments to make >> > it >> > more flexible: >> > - *first *and *last*: to allow renumbering of specific sections of the >> > structure >> > - *chain* : to limit the renumbering to one chain only. >> > >> > As for the biological units I'll have to look at them again... to be >> > honest >> > I really need to retest them :x >> > >> > Thanks for the comments :) >> > >> > Best! >> > >> > Jo?o >> > >> > _______________________________________________ >> > Biopython-dev mailing list >> > Biopython-dev at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > >> >> > From biopython at maubp.freeserve.co.uk Mon Dec 13 13:46:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Dec 2010 18:46:05 +0000 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: <50f7146808d7321f16cb1b179aefc712-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQlpQWwpdVw==-webmailer1@server01.webmailer.hosteurope.de>

Message-ID: 2010/12/13 Eric Talevich : > Hey folks, > > 1. For stand-alone functions that don't seem to fit anywhere else, I suggest > creating a file called _utils.py (rather than Utils.py or Misc.py) -- this > means it's protected, i.e. users know they're not supposed to access it > directly -- then import it to the top level in __init__.py: > > from Bio.PDB._utils import renumber_residues, center_of_mass, ... > > Then you're allowed to move the functions in that module somewhere else > later without going through the usual deprecation process. On the down side, you'd be adding even more top level functions to Bio.PDB. Shouldn't some of these be methods instead (either returning modified objects or acting in situ as appropriate)? Peter From eric.talevich at gmail.com Mon Dec 13 16:28:57 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 13 Dec 2010 16:28:57 -0500 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: <50f7146808d7321f16cb1b179aefc712-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQlpQWwpdVw==-webmailer1@server01.webmailer.hosteurope.de>

Message-ID: 

2010/12/13 Peter

> 2010/12/13 Eric Talevich :
> > Hey folks,
> >
> > 1. For stand-alone functions that don't seem to fit anywhere else, I suggest
> > creating a file called _utils.py (rather than Utils.py or Misc.py) -- this
> > means it's protected, i.e. users know they're not supposed to access it
> > directly -- then import it to the top level in __init__.py:
> >
> > from Bio.PDB._utils import renumber_residues, center_of_mass, ...
> >
> > Then you're allowed to move the functions in that module somewhere else
> > later without going through the usual deprecation process.
>
> On the down side, you'd be adding even more top level functions to
> Bio.PDB.
>
> Shouldn't some of these be methods instead (either returning modified
> objects or acting in situ as appropriate)?
>

Agreed, wherever that's possible. I'm just recommending keeping a
"utils/misc" module as protected in absence of any more informative name for
its contents. If renumber_residues stays as a method on Structure or Entity,
then we don't need a _utils.py yet.

I think the top level is a good place for functions that don't operate on
any particular one of the objects defined in that sub-package -- to use
João's example, joining several PDB files together.

-E

From rodrigo_faccioli at uol.com.br Mon Dec 13 19:55:10 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Mon, 13 Dec 2010 22:55:10 -0200
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To: 
References: <50f7146808d7321f16cb1b179aefc712-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQlpQWwpdVw==-webmailer1@server01.webmailer.hosteurope.de>

Message-ID: Hello, I saw this post about some features that we have already implemented. One example is split the pdb into chains. You can see [1] my own example. Does Joao want to talk to us some details about it? Unfortunately, I had talked to Eric about our idea to contribute for BioPython project. Although, I don't have time, in this moment, because I have to finish some issues for my PhD, I believe that Joao and us can work together in some issues. [1] https://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/scripts/split_pdb_chains.py Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 On Mon, Dec 13, 2010 at 7:28 PM, Eric Talevich wrote: > 2010/12/13 Peter > > > 2010/12/13 Eric Talevich : > > > Hey folks, > > > > > > 1. For stand-alone functions that don't seem to fit anywhere else, I > > suggest > > > creating a file called _utils.py (rather than Utils.py or Misc.py) -- > > this > > > means it's protected, i.e. users know they're not supposed to access it > > > directly -- then import it to the top level in __init__.py: > > > > > > from Bio.PDB._utils import renumber_residues, center_of_mass, ... > > > > > > Then you're allowed to move the functions in that module somewhere else > > > later without going through the usual deprecation process. > > > > On the down side, you'd be adding even more top level functions to > > Bio.PDB. > > > > Shouldn't some of these be methods instead (either returning modified > > objects or acting in situ as appropriate)? > > > > Agreed, wherever that's possible. 
I'm just recommending keeping a > "utils/misc" module as protected in absence of any more informative name > for > its contents. If renumber_residues stays as a method on Structure or > Entity, > then we don't need a _utils.py yet. > > I think the top level is a good place for functions that don't operate on > any particular one of the objects defined in that sub-package -- to use > Jo?o's example, joining several PDB files together. > > -E > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From anaryin at gmail.com Mon Dec 13 20:34:08 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 14 Dec 2010 02:34:08 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: <50f7146808d7321f16cb1b179aefc712-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhQQlpQWwpdVw==-webmailer1@server01.webmailer.hosteurope.de>

Message-ID:

Hello everyone,

My suggestion with this _utils.py file was only to provide a temporary home for these functions until Struct.py is mature enough. I'd say they would fit perfectly under it, since they are pretty general functions (i.e. not restricted to DNA, protein, etc.). The downside, as Peter noted, is that Bio.PDB grows larger.

Indeed, as you suggested, renumber_residues is a method of Structure. I could move it to Entity to allow its use on chains, but I think that would add an unnecessary (and perhaps confusing) method to Residue.py.

Some of what we've been discussing has been implemented by Rodrigo in his git branch and would indeed be a good addition. I'd like to see these routine actions "translated" from Bio.PDB classes to an easy interface callable from the command line. For me at least, this would be extremely valuable (and allow me to drop those weird FORTRAN files :).

As for how this should be structured, I guess the best place would be inside a _utils.py file under Bio.Struct. However, since this is not yet in the main trunk, I'd say we can (for now) create a similar file under Bio.PDB and then just move it to Bio.Struct when the time comes.

Regarding an immediate solution, adding the keyword arguments seems perfect IMO. I'd also add start_id and last_id to allow renumbering of specific parts.

Best

João

From eric.talevich at gmail.com Wed Dec 15 21:50:30 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 15 Dec 2010 21:50:30 -0500
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References:
Message-ID:

On Thu, Dec 9, 2010 at 2:03 PM, Eric Talevich wrote:

> On Tue, Nov 30, 2010 at 10:45 AM, João Rodrigues wrote:
>
>> Hello all,
>>
>> I've been looking at the code I wrote for the GSOC to see what is ready to
>> be merged in the main branch. I have to thank Kristian and whoever
>> participated in the Python & Friends for the input.
>> >> > Hi Jo?o, > > It sounds like everyone is happy with this branch: > https://github.com/JoaoRodrigues/biopython/tree/atom-element > > So I will try to fetch your branch, correct the spelling of the > IUPACData.atom_weights references (unless you beat me to it), test, rebase > onto biopython/biopython/master, and merge it this weekend. > I just did this and pushed it to biopython/biopython/master. The Biopython network graph on GitHub looks reasonable and the tests pass, so I think it all went OK. https://github.com/biopython/biopython/commits/master Cheers, Eric From biopython at maubp.freeserve.co.uk Thu Dec 16 05:24:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Dec 2010 10:24:03 +0000 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: 2010/12/16 Eric Talevich : > On Thu, Dec 9, 2010 at 2:03 PM, Eric Talevich wrote: > >> On Tue, Nov 30, 2010 at 10:45 AM, Jo?o Rodrigues wrote: >> >>> Hello all, >>> >>> I've been looking at the code I wrote for the GSOC to see what is ready to >>> be merged in the main branch. I have to thank Kristian and whoever >>> participated in the Python & Friends for the input. >>> >>> >> Hi Jo?o, >> >> It sounds like everyone is happy with this branch: >> https://github.com/JoaoRodrigues/biopython/tree/atom-element >> >> So I will try to fetch your branch, correct the spelling of the >> IUPACData.atom_weights references (unless you beat me to it), test, rebase >> onto biopython/biopython/master, and merge it this weekend. >> > > > I just did this and pushed it to biopython/biopython/master. The Biopython > network graph on GitHub looks reasonable and the tests pass, so I think it > all went OK. > https://github.com/biopython/biopython/commits/master Hi Eric, Yeah, it looks OK. 
For future reference (and I'm trying to be constructive rather than critical), there were a few things like these commits which could have been omitted:

https://github.com/biopython/biopython/commit/06008c298c14a2178bebf1f9795c0740f02937b1
https://github.com/biopython/biopython/commit/f4568562a914d31f04a55148b9aa927e849b101d
https://github.com/biopython/biopython/commit/6c7ef358e5f93599ca165ce8e7b46261106e2b06
https://github.com/biopython/biopython/commit/9a4a3a5b968e8918fd6ca02b436382c1e3c5d604

Also there appears to have been a net change to Tests/PDB/1A8O.pdb (just white space so probably harmless).

In this particular case I would have been tempted to collapse it all into just one or two commits - it is actually quite a small change overall, and would be much easier to follow that way. As it stands we see several iterations of the code (e.g. adding then removing the hetatm arg, removing then restoring elements in one of the test files). In some respects this is more a matter of taste; the end result is the same. What are your thoughts on this, Eric (et al)?

Regards,

Peter

From anaryin at gmail.com Thu Dec 16 06:20:20 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 16 Dec 2010 12:20:20 +0100
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References:

Message-ID:

Hey Peter,

I also noticed that last night. In fact, what happened was that some changes were actually made months apart (e.g. the hetatm_flag), but my git just "reset" all commits to the date I created the branch...

I'd actually love to do some sort of "commit" cleaning, but I don't know how to do that...

Sorry for the hassle.

João

From biopython at maubp.freeserve.co.uk Thu Dec 16 06:32:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Dec 2010 11:32:57 +0000
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References:

Message-ID:

On Thu, Dec 16, 2010 at 11:20 AM, João Rodrigues wrote:
> Hey Peter,
>
> I also noticed that last night. In fact what happened was that some changes
> were actually done months apart (e.g. the hetatm_flag) but my git just
> "reset" all commits to the date I created the branch...
>
> I'd actually love to do some sort of "commit" cleaning but I don't know how
> to do that...

Well, once a series of commits is published on github, it is possible but really not a good idea to change them (rewrite history).

One simple way to convert a branch into a single commit is to do the merge with the squash option (I've not tried this yet I confess). Something like this:

    git checkout my_stuff
    git rebase master
    (make sure it is all fine)
    git checkout master
    git diff my_stuff
    (make sure the changes look sane)
    git merge --squash my_stuff
    (make sure the changes look sane)
    git commit
    (the squash merge only stages the changes, so they still need committing)
    git push origin master

Another useful trick is doing a diff between branches, saving it to a patch, then applying the patch. It depends what you are aiming to do.

Peter

From anaryin at gmail.com Thu Dec 16 06:39:45 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 16 Dec 2010 12:39:45 +0100
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References:

Message-ID:

I did the latter. But somehow along the way the commits made their way in there... What I can do from now on is try to rewrite history *before* pushing to github and see if something is redundant. And of course, go on a commit diet.

Thanks for the tip Peter, and again, I apologize for the mess.

From biopython at maubp.freeserve.co.uk Thu Dec 16 06:52:44 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Dec 2010 11:52:44 +0000
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References:

Message-ID:

On Thu, Dec 16, 2010 at 11:39 AM, João Rodrigues wrote:
> I did the latter. But somehow along the way the commits made their way into
> there... what I can do from now on is just try to rewrite history *before*
> pushing to github and see if something is redundant. And of course, go on a
> commit diet.

While developing code it is a *good* thing to use separate commits for each logical unit of work. I was talking about what happened during the merge process, in particular the "pointless" commits where the functional change had already been applied to the master.

> Thanks for the tip Peter, and again, I apologize for the mess.

Not at all - it wasn't your fault.

Peter

From eric.talevich at gmail.com Thu Dec 16 13:02:56 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 16 Dec 2010 10:02:56 -0800
Subject: [Biopython-dev] Features of the GSOC branch ready to be merged
In-Reply-To:
References:

Message-ID: 2010/12/16 Peter > 2010/12/16 Eric Talevich : > > On Thu, Dec 9, 2010 at 2:03 PM, Eric Talevich >wrote: > > > >> On Tue, Nov 30, 2010 at 10:45 AM, Jo?o Rodrigues >wrote: > >> > >>> Hello all, > >>> > >>> I've been looking at the code I wrote for the GSOC to see what is ready > to > >>> be merged in the main branch. I have to thank Kristian and whoever > >>> participated in the Python & Friends for the input. > >>> > >>> > >> Hi Jo?o, > >> > >> It sounds like everyone is happy with this branch: > >> https://github.com/JoaoRodrigues/biopython/tree/atom-element > >> > >> So I will try to fetch your branch, correct the spelling of the > >> IUPACData.atom_weights references (unless you beat me to it), test, > rebase > >> onto biopython/biopython/master, and merge it this weekend. > >> > > > > > > I just did this and pushed it to biopython/biopython/master. The > Biopython > > network graph on GitHub looks reasonable and the tests pass, so I think > it > > all went OK. > > https://github.com/biopython/biopython/commits/master > > Hi Eric, > > Yeah, it looks OK. For future reference (and I'm trying to be constructive > rather than critical), there were a few things like these commits which > could have been omitted: > > > https://github.com/biopython/biopython/commit/06008c298c14a2178bebf1f9795c0740f02937b1 > > https://github.com/biopython/biopython/commit/f4568562a914d31f04a55148b9aa927e849b101d > > https://github.com/biopython/biopython/commit/6c7ef358e5f93599ca165ce8e7b46261106e2b06 > > https://github.com/biopython/biopython/commit/9a4a3a5b968e8918fd6ca02b436382c1e3c5d604 > > Also there appears to have been a net change to Tests/PDB/1A8O.pdb > (just white space so probably harmless). > > In this particular case I would have been tempted to have collapsed > it all into just one or two commits - it is actually quite a small > change overall, and much easier to follow. As it stands we see > several iterations of the code (e.g. 
adding then removing the
> hetatm arg, removing then restoring elements in one of the test
> files). In some respects this is more a matter of taste, the end
> result is the same. What are your thoughts on this Eric (et al)?
>
> Regards,
>
> Peter
>

Ah, sorry. I think we had a difference in understanding what I was planning to do -- "git rebase" can straighten out a branching history, and also squash multiple commits into one with the "-i" option. I only did the former, though in retrospect the latter would have been easy to do, too.

For the public benefit, these are good ways to keep a commit history clean:

1. git commit --amend

Squash the current change set onto the previous commit. This is great for eliminating "fixed a typo" commits while you're working.

2. git rebase -i

Rearrange and simplify a series of commits (interactive rebasing). This is the best way to clean up a personal branch just before publishing; I could have done this to João's atom-element branch after rebasing onto the current master and fixing the merge conflicts, just before adding my own small correction.

3. git merge --squash

Flatten a branch into a single commit, which is then applied to master (or another branch). I didn't know about this until Peter mentioned it.

I use the first two regularly but haven't tried the third yet. In the future, for João's other feature branches, I'll use "git merge --squash" for small overall changes and a gentle degree of "git rebase -i" for larger changes. But interactive rebasing does require a solid understanding of how the commits in a series fit together, so I think it's best if the original author does the heavy lifting there.

I'll note that the Mercurial and Git communities differ in opinion here -- Mercurial supports single-commit rollbacks but strongly discourages editing the commit history beyond that.
Having been burned by history editing several times, I can understand their perspective, but I think using Git's features carefully yields a nicer result. Best, Eric From biopython at maubp.freeserve.co.uk Thu Dec 16 13:09:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Dec 2010 18:09:36 +0000 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References:

Message-ID: 2010/12/16 Eric Talevich : > > Ah, sorry. I think we had a difference in understanding what I was planning > to do -- "git rebase" can straighten out a branching history, and also > squash multiple commits into one with the "-i" option. I only did the > former, though in retrospect the latter would have been easy to do, too. Got it. > For the public benefit, these are good ways to keep a commit history clean: > > 1. git commit --amend > > Squash the current change set onto the previous commit. This is great for > eliminating "fixed a typo" commits while you're working. I should probably start doing that more often myself. > 2. git rebase -i > > Rearrange and simplify a series of commits (interactive rebasing). This is > the best way to clean up a personal branch just before publishing; I could > have done this to Jo?o's atom-element branch after rebasing onto the current > master and fixing the merge conflicts, just before adding my own small > correction. > > 3. git merge --squash > > Flatten a branch into a single commit, which is then applied to master (or > another branch). I didn't know about this until Peter mentioned it. > > > I use the first two regularly but haven't tried the third yet. In the > future, for Jo?o's other feature branches, I'll use "git merge --squash" for > small overall changes and a gentle degree of "git rebase -i" for larger > changes. But interactive rebasing does require a solid understanding of how > the commits in a series fit together, so I think it's best if the original > author does the heavy lifting there. > > I'll note here that the Mercurial and Git communities differ in opinion here > -- Mercurial supports single-commit rollbacks but strongly discourages > editing the commit history beyond that. Having been burned by history > editing several times, I can understand their perspective, but I think using > Git's features carefully yields a nicer result. > > Best, > Eric > Sounds good. 
Peter

From mjldehoon at yahoo.com Fri Dec 17 22:16:54 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 17 Dec 2010 19:16:54 -0800 (PST)
Subject: [Biopython-dev] Pending deprecations
Message-ID: <66511.47063.qm@web62403.mail.re1.yahoo.com>

Hi everybody,

The following bits of code have a PendingDeprecationWarning in release 1.56. Any objections to upgrading those to BiopythonDeprecationWarnings for the next release? That would mean that users see a deprecation warning by default when using those pieces of code.

Bio.Align.MultipleSeqAlignment.get_column
Bio.Align.MultipleSeqAlignment.add_sequence
Bio.Align.Generic.Alignment.__init__
Bio.Align.Generic.Alignment.get_seq_by_num
Bio.Blast.Applications.BlastallCommandline
Bio.Blast.Applications.BlastpgpCommandline
Bio.Blast.Applications.RpsBlastCommandline
Bio.Blast.NCBIStandalone
Bio.Blast.NCBIStandalone.blastall
Bio.Blast.NCBIStandalone.blastpgp
Bio.Blast.NCBIStandalone.rpsblast
Bio.Clustalw.parse_file
Bio.Clustalw.do_alignment
Bio.ClustalW.ClustalAlignment
Bio.ClustalW.MultipleAlignCL
Bio.File.SGMLStripper
Bio.Nexus.Nexus._kill_comments_and_break_lines
Bio.PDB.AbstractPropertyMap.AbstractPropertyMap.has_key
Bio.PDB.FragmentMapper.FragmentMapper.has_key
BioSQL.BioSeqDatabase.DBServer.remove_database
BioSQL.BioSeqDatabase.BioSeqDatabase.get_all_primary_ids
BioSQL.BioSeqDatabase.BioSeqDatabase.get_Seq_by_primary_id

We were also planning to add a PendingDeprecationWarning to Bio.Seq.Seq.tostring, but this will require replacing all calls to my_seq.tostring() in Biopython with str(my_seq). It would be good to start doing this in your favorite Biopython module.

Best,
--Michiel.
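To illustrate the visibility point above, here is a minimal sketch using only the standard warnings module. Python's default filters silence PendingDeprecationWarning, whereas a warning class derived directly from Warning is shown to users. The class SketchDeprecationWarning and the functions get_column and call_and_collect are hypothetical stand-ins for illustration, not Biopython's actual classes or code.

```python
import warnings


class SketchDeprecationWarning(Warning):
    """Hypothetical stand-in for a project-specific deprecation warning.

    It derives from Warning rather than DeprecationWarning, so Python's
    default filters do not hide it from end users.
    """


def get_column(rows, col):
    """Return one alignment column as a string, warning that the call is deprecated."""
    warnings.warn(
        "get_column is deprecated; slice the alignment instead",
        SketchDeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return "".join(row[col] for row in rows)


def call_and_collect(func, *args):
    """Run func(*args) and return (result, list of warning categories raised)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record everything, whatever the defaults
        result = func(*args)
    return result, [w.category for w in caught]
```

The helper call_and_collect is the kind of thing a unit test might use to confirm that a deprecated entry point still works and still warns.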
From mjldehoon at yahoo.com Fri Dec 17 22:20:38 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 17 Dec 2010 19:20:38 -0800 (PST) Subject: [Biopython-dev] Updating C extensions to Python 3 Message-ID: <147850.46084.qm@web62408.mail.re1.yahoo.com> Hi everybody, I looked at what will be needed to make Biopython's C extensions ready for Python 3. It doesn't seem too bad, so if there are no objections then I will go ahead and make the necessary changes over the next two weeks or so. If you are maintaining one of the C extensions and would like to make the required changes yourself, please let us know, then I won't touch those modules. Best, --Michiel. From biopython at maubp.freeserve.co.uk Sat Dec 18 07:15:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 18 Dec 2010 12:15:31 +0000 Subject: [Biopython-dev] Pending deprecations In-Reply-To: <66511.47063.qm@web62403.mail.re1.yahoo.com> References: <66511.47063.qm@web62403.mail.re1.yahoo.com> Message-ID: On Sat, Dec 18, 2010 at 3:16 AM, Michiel de Hoon wrote: > Hi everybody, > > The following bits of code have a PendingDeprecationWarning in release 1.56. > Any objections to upgrading those to BiopythonDeprecationWarnings for the > next release? That would mean that users see a deprecation warning by > default when using those pieces of code. > > Bio.Align.MultipleSeqAlignment.get_column > Bio.Align.MultipleSeqAlignment.add_sequence > Bio.Align.Generic.Alignment.__init__ > Bio.Align.Generic.Alignment.get_seq_by_num Sounds fine. I hope to add some stuff for column iteration which would help here too. 
> Bio.Blast.Applications.BlastallCommandline > Bio.Blast.Applications.BlastpgpCommandline > Bio.Blast.Applications.RpsBlastCommandline > Bio.Blast.NCBIStandalone > Bio.Blast.NCBIStandalone.blastall > Bio.Blast.NCBIStandalone.blastpgp > Bio.Blast.NCBIStandalone.rpsblast The NCBI are still supporting their "legacy" BLAST suite, so I think we should keep the wrappers as "pending deprecation" for a little longer. > Bio.Clustalw.parse_file > Bio.Clustalw.do_alignment > Bio.ClustalW.ClustalAlignment > Bio.ClustalW.MultipleAlignCL The whole of Bio.ClustalW is already deprecated, isn't it? It looks like it currently has double warnings which was probably just an oversight. > Bio.File.SGMLStripper Yep, deprecate SGMLStripper > Bio.Nexus.Nexus._kill_comments_and_break_lines That's a private method so we can just remove it. > Bio.PDB.AbstractPropertyMap.AbstractPropertyMap.has_key > Bio.PDB.FragmentMapper.FragmentMapper.has_key The has_key thing is being deprecated in Python itself (use in instead), I don't recall what the timeline is for that, but it may influence us for when we remove has_key too. > BioSQL.BioSeqDatabase.DBServer.remove_database > BioSQL.BioSeqDatabase.BioSeqDatabase.get_all_primary_ids > BioSQL.BioSeqDatabase.BioSeqDatabase.get_Seq_by_primary_id The above are "obsolete" in the sense that I added dictionary like methods instead. Adding a PendingDeprecationWarning is probably OK. > We were also planning to add a PendingDeprecationWarning to > Bio.Seq.Seq.tostring, but this will require to replace all calls to > my_seq.tostring() in Biopython by str(my_seq). It would be > good to start doing this in your favorite Biopython module. One reason why some code does my_seq.tostring() is it will fail with an attribute error for a non-Seq like object, thus doubles as a type check. Sneaky, but in some ways safer than str(my_seq). Still, I agree in principle we should try not to use tostring(). 
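A minimal sketch of the trade-off just described, assuming a hypothetical SketchSeq class rather than the real Bio.Seq.Seq: calling tostring() fails loudly on objects that lack the method, while str() accepts anything, so code that switches to str() may want an explicit check instead.

```python
class SketchSeq:
    """Hypothetical stand-in for a sequence object (not Bio.Seq.Seq)."""

    def __init__(self, data):
        self._data = data

    def __str__(self):
        return self._data

    def tostring(self):
        return self._data


def as_text_strict(obj):
    """Mimic my_seq.tostring(): raises AttributeError for objects lacking the method."""
    return obj.tostring()


def as_text_checked(obj):
    """Use str(), but with an explicit type check replacing the implicit one."""
    if not isinstance(obj, SketchSeq):
        raise TypeError("expected a SketchSeq, got %s" % type(obj).__name__)
    return str(obj)
```

Whether an isinstance check or duck typing is preferable here is a design choice; the sketch only shows that the implicit check can be made explicit.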
Peter

From biopython at maubp.freeserve.co.uk Sat Dec 18 07:18:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 18 Dec 2010 12:18:47 +0000
Subject: [Biopython-dev] Updating C extensions to Python 3
In-Reply-To: <147850.46084.qm@web62408.mail.re1.yahoo.com>
References: <147850.46084.qm@web62408.mail.re1.yahoo.com>
Message-ID:

On Sat, Dec 18, 2010 at 3:20 AM, Michiel de Hoon wrote:
> Hi everybody,
>
> I looked at what will be needed to make Biopython's C extensions
> ready for Python 3. It doesn't seem too bad, so if there are no
> objections then I will go ahead and make the necessary changes
> over the next two weeks or so. If you are maintaining one of the
> C extensions and would like to make the required changes yourself,
> please let us know, then I won't touch those modules.
>
> Best,
> --Michiel.

Excellent news :)

I did have a try at this once before, but it seemed tricky to me. You know a lot more about the Python C API, so I was hoping you'd be able to look at this.

If you haven't already, check out NumPy for inspiration - although in C land you should just need some #if defined checks on the Python version, they may have solved many of the issues we will face. I suspect the string/unicode handling will be hardest - so if I were you I'd start on a purely numerical C module.

Peter

From biopython at maubp.freeserve.co.uk Mon Dec 20 14:09:04 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Dec 2010 19:09:04 +0000
Subject: [Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many
In-Reply-To:
References:
Message-ID:

On Tue, Nov 30, 2010 at 11:24 PM, Peter wrote:
>
> One thing I haven't done yet (any volunteers?)
is any
> benchmarking - for example comparing the index
> build and retrieval times for some large files using
> Biopython 1.55 (recent baseline), Biopython 1.56
> (should be faster on retrieval) and the branch to
> check for any regressions in Bio.SeqIO.index(), and
> compare this to Bio.SeqIO.index_many() which being
> disk based will be slower but require much less RAM.
>

Testing here is complicated because each file format can behave differently. I've noticed a slight regression for GenBank indexing, particularly for building the index, where I also now track the end of each record (although this is not used for the Bio.SeqIO.index code); this can probably be improved on.

e.g. Using the current trunk code for the 240MB GenBank file gbvrt1.seq with 31065 records and Bio.SeqIO.index() we have:

    Indexed in 5.2s
    All with get_raw took 5.53s
    All as SeqRecord objects took 24.08s

Using the branch, and Bio.SeqIO.index():

    gbvrt1.seq contains 31065 records
    Indexed in 7.1s
    All with get_raw took 6.08s
    All as SeqRecord objects took 24.60s

Using the branch, and Bio.SeqIO.index_db():

    Indexed in 7.2s
    All with get_raw took 1.75s
    All as SeqRecord objects took 25.15s

I haven't looked at EMBL, SwissProt or UniProt XML files yet - but I expect their behaviour to be similar.

The major use case for indexing large files is probably FASTA and FASTQ. Testing on FASTQ files with 7 million or so entries shows very little change - which is good :)

I really should have made a note of the timings, but I don't have time right now to repeat them, maybe tomorrow. Here are timings from a smaller file, containing 1253960 records from a Roche 454 run in FASTQ format.
Using the trunk and Bio.SeqIO.index():

    Indexed in 20.1s
    All with get_raw took 34.70s
    All as SeqRecord objects took 234.68s

Using the branch and Bio.SeqIO.index():

    Indexed in 20.8s
    All with get_raw took 35.86s
    All as SeqRecord objects took 238.28s

Using the branch and Bio.SeqIO.index_db():

    Indexed in 41.9s
    All with get_raw took 41.20s
    All as SeqRecord objects took 271.26s

This example shows Bio.SeqIO.index() remains about the same speed as before for FASTQ files.

The other general message is that for large files (many records), using the SQLite back end does slow down the index building step, but access to the records remains very competitive with the in memory Python dict. And of course you can scale to index files bigger than you could otherwise.

Peter

From n.j.loman at bham.ac.uk Tue Dec 21 06:17:33 2010
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Tue, 21 Dec 2010 11:17:33 +0000
Subject: [Biopython-dev] Reverse complement SeqRecord and features
Message-ID: <4D108CCD.8020408@bham.ac.uk>

Hi Peter

I see there is experimental code to allow SeqRecords to be reverse_complemented, preserving features. That's just what I need for my project.

But should I be using branch 'seqrecords' or 'seqrecords-rc' ?

And do you think this will make it to master soon? I realise there are (potentially insurmountable) issues with certain types of annotations but the basic functionality which can deal with exact locations is all I need.

Cheers

Nick

From biopython at maubp.freeserve.co.uk Tue Dec 21 08:56:00 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Dec 2010 13:56:00 +0000
Subject: [Biopython-dev] Reverse complement SeqRecord and features
In-Reply-To: <4D108CCD.8020408@bham.ac.uk>
References: <4D108CCD.8020408@bham.ac.uk>
Message-ID:

On Tue, Dec 21, 2010 at 11:17 AM, Nick Loman wrote:
> Hi Peter
>
> I see there is experimental code to allow SeqRecords to be
> reverse_complemented, preserving features. That's just what I need for my
> project.
> > But should I be using branch 'seqrecords' or 'seqrecords-rc' ? > > And do you think this will make it to master soon? I realise there are > (potentially insurmountable) issues with certain types of annotations but > the basic functionality which can deal with exact locations is all I need. > > Cheers > > Nick Oh right - this had dropped off my personal priority list ;) http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007850.html Based on the above old posts, and a quick check this is most recent branch (and it looks like I deleted the other branch "seqrecords"): http://github.com/peterjc/biopython/commits/seqrecord-rc You can read the reverse_complement method docstring online here: https://github.com/peterjc/biopython/blob/7f17bbfef9882ef039d02ff04908d01ab400b71b/Bio/SeqRecord.py I've had a quick look, and to rebase it or merge it requires some manual merge conflict resolution due to changes in Bio/SeqFeature.py and its unit tests. However, I'll make time to update the branch if you are willing to test this, and we should be able to get this into the master shortly - perhaps even before the new year? If not, certainly in January. We'll need to add a warning to the docstring about strand specific annotation in features (the SNP feature problem raised by Jose in June). 
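For readers unfamiliar with the coordinate handling involved, here is a minimal sketch of reverse-complementing a sequence together with a feature that has an exact location. It is not the code in the seqrecord-rc branch: SketchFeature, reverse_complement and flip_feature are hypothetical names, locations are assumed to be half-open [start, end) intervals, and fuzzy locations or strand-specific annotation (such as the SNP case mentioned above) are deliberately not handled.

```python
from collections import namedtuple

# Hypothetical minimal feature: half-open [start, end) interval, strand +1 or -1.
SketchFeature = namedtuple("SketchFeature", ["start", "end", "strand"])

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse-complement an unambiguous upper-case DNA string."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def flip_feature(feature, seq_length):
    """Map an exact feature location onto the reverse-complemented sequence.

    Position i on the original strand lands at seq_length - 1 - i, so the
    half-open interval [start, end) becomes [seq_length - end, seq_length - start)
    and the strand sign is inverted.
    """
    return SketchFeature(
        start=seq_length - feature.end,
        end=seq_length - feature.start,
        strand=-feature.strand,
    )
```

A useful sanity check under these assumptions is that slicing the reverse-complemented sequence with the flipped coordinates gives the reverse complement of the original feature's bases.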
Regards,

Peter

From bugzilla-daemon at portal.open-bio.org Tue Dec 21 09:16:06 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Dec 2010 09:16:06 -0500
Subject: [Biopython-dev] [Bug 3161] New: MEME Parser fails for large MEME files
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3161

Summary: MEME Parser fails for large MEME files
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: lpritc at scri.sari.ac.uk

When using the MEME parser for MEME (4.5.0) text output containing more than 99 sequences, the parser fails to read motif header lines for motifs 100+:

In [1]: from Bio import Motif

In [2]: data = list(Motif.parse(open('meme.txt'), 'MEME'))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/Volumes/RAID_Mirror/Organisms/Phytophthora infestans/RXLR/rxlr_meme/purge_clustering/rxlr_full/ in ()

/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/Motif/__init__.pyc in parse(handle, format)
     76             yield reader(handle)
     77     else: # we have a proper reader
---> 78         for m in parser(handle).motifs:
     79             yield m
     80

/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/Motif/Parsers/MEME.pyc in read(handle)
     39         raise ValueError('Unexpected end of stream')
     40     while True:
---> 41         motif = __create_motif(line)
     42         motif.alphabet = record.alphabet
     43         record.motifs.append(motif)

/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/Motif/Parsers/MEME.pyc in __create_motif(line)
    260     ls = line.split()
    261     motif = MEMEMotif()
--> 262     motif.length = int(ls[4])
    263     motif._numoccurrences(ls[7])
    264     motif._evalue(ls[13])

ValueError: invalid literal for int() with base 10: 'sites'

This happens because for motifs with number greater than 99 there is no whitespace between 'MOTIF' and the motif number in the motif header, e.g.:

********************************************************************************
MOTIF 99 width = 29 sites = 4 llr = 286 E-value = 4.0e-016
********************************************************************************

********************************************************************************
MOTIF100 width = 29 sites = 3 llr = 253 E-value = 1.4e-023
********************************************************************************

which throws off the indexing of the parser's __create_motif function. This can be fixed by offsetting the header line by five characters to remove the MOTIF string, and changing the indexing accordingly:

    def __create_motif(line):
        line = line[5:].strip()
        ls = line.split()
        motif = MEMEMotif()
        motif.length = int(ls[3])
        motif._numoccurrences(ls[6])
        motif._evalue(ls[12])
        return motif

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Tue Dec 21 09:17:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Dec 2010 14:17:03 +0000
Subject: [Biopython-dev] Reverse complement SeqRecord and features
In-Reply-To:
References: <4D108CCD.8020408@bham.ac.uk>
Message-ID:

On Tue, Dec 21, 2010 at 1:56 PM, Peter wrote:
>
> http://github.com/peterjc/biopython/commits/seqrecord-rc
>
> You can read the reverse_complement method docstring online here:
>
> https://github.com/peterjc/biopython/blob/7f17bbfef9882ef039d02ff04908d01ab400b71b/Bio/SeqRecord.py
>
> I've had a quick look, and to rebase it or merge it requires some manual
> merge conflict resolution due to changes in Bio/SeqFeature.py and its
> unit tests. However, I'll make time to update the branch if you are willing
> to test this, ...
http://github.com/peterjc/biopython/commits/seqrecord-rc has been rebased to the master, so please test this new branch: http://github.com/peterjc/biopython/commits/seqrecord-rc2 I need to double check the SeqRecord import in Seq.py is sane, but the unit tests are passing. Peter From mjldehoon at yahoo.com Tue Dec 21 10:15:33 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 21 Dec 2010 07:15:33 -0800 (PST) Subject: [Biopython-dev] Pending deprecations In-Reply-To: Message-ID: <49679.10755.qm@web62403.mail.re1.yahoo.com> --- On Sat, 12/18/10, Peter wrote: > > Bio.Nexus.Nexus._kill_comments_and_break_lines > > That's a private method so we can just remove it. > Well, looking at this code it seems that this method is still needed if cnexus (C module) is not available, for example in case of Jython. So I think we should remove the PendingDeprecationWarning for this function because we cannot remove it. --Michiel. From bugzilla-daemon at portal.open-bio.org Tue Dec 21 10:47:38 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Dec 2010 10:47:38 -0500 Subject: [Biopython-dev] [Bug 3162] New: Recording log-likelihood ratio of MEME motifs Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3162 Summary: Recording log-likelihood ratio of MEME motifs Product: Biopython Version: Not Applicable Platform: PC OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: lpritc at scri.sari.ac.uk The MEMEMotif object in Bio.Motif.MEME does not currently have an attribute for recording log likelihood ratio. This is perhaps a more reliable metric for ranking motifs after parsing, since the E-value may be truncated to zero for very small E-values, e.g. 
********************************************************************************
MOTIF 1 width = 11 sites = 331 llr = 4357 E-value = 3.1e-554
********************************************************************************

has reported E-value after parsing of zero:

In [24]: data[0].name
Out[24]: 'Motif 1'

In [25]: data[0].length
Out[25]: 11

In [26]: data[0].num_occurrences
Out[26]: 331

In [27]: data[0].evalue
Out[27]: 0.0

As does the next motif:

********************************************************************************
MOTIF 2 width = 15 sites = 259 llr = 4456 E-value = 3.1e-743
********************************************************************************

This can easily be handled in __create_motif with the addition at line 265 of

motif.llr = int(ls[9])

but it may be more elegantly handled by a method in MEMEMotif.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Tue Dec 21 10:55:33 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Dec 2010 10:55:33 -0500
Subject: [Biopython-dev] [Bug 3162] Recording log-likelihood ratio of MEME motifs
In-Reply-To:
Message-ID: <201012211555.oBLFtXht014709@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3162

------- Comment #1 from lpritc at scri.sari.ac.uk 2010-12-21 10:55 EST -------
I should note that the indexing for ls[9] assumes that the fix in bug3161 has
been implemented.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
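
A minimal sketch of the point made in bug 3162, not the actual Bio.Motif.MEME
code. It assumes a regular expression in place of the token indexing used by
the parser, and the names HEADER and parse_motif_header, as well as the dict
keys, are hypothetical. It shows that a very small E-value underflows to 0.0
when converted with float(), while the llr field still ranks the motifs;
keeping the E-value as a decimal.Decimal would be one way to preserve it.

```python
import re
from decimal import Decimal

# Hypothetical pattern, not part of Bio.Motif.MEME.
# "MOTIF" may be followed directly by the number (e.g. "MOTIF100").
HEADER = re.compile(
    r"^MOTIF\s*(?P<number>\d+)\s+"
    r"width\s*=\s*(?P<width>\d+)\s+"
    r"sites\s*=\s*(?P<sites>\d+)\s+"
    r"llr\s*=\s*(?P<llr>\d+)\s+"
    r"E-value\s*=\s*(?P<evalue>\S+)"
)


def parse_motif_header(line):
    """Return the fields of a MEME motif header line as a dict."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError("Not a MEME motif header: %r" % line)
    return {
        "number": int(match.group("number")),
        "length": int(match.group("width")),
        "num_occurrences": int(match.group("sites")),
        "llr": int(match.group("llr")),
        # float() silently underflows to 0.0 for values this small ...
        "evalue": float(match.group("evalue")),
        # ... whereas Decimal keeps the reported value exactly.
        "evalue_exact": Decimal(match.group("evalue")),
    }


headers = [
    "MOTIF 1 width = 11 sites = 331 llr = 4357 E-value = 3.1e-554",
    "MOTIF 2 width = 15 sites = 259 llr = 4456 E-value = 3.1e-743",
]
motifs = [parse_motif_header(h) for h in headers]

# Both float E-values are 0.0, so they cannot rank the motifs; llr can.
ranked = sorted(motifs, key=lambda m: m["llr"], reverse=True)
for m in ranked:
    print(m["number"], m["llr"], m["evalue"], m["evalue_exact"])
```

Whether llr or an exact E-value is the better ranking key is a design choice
for the parser; the sketch only shows that both survive parsing where the
float E-value does not.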
From n.j.loman at bham.ac.uk Tue Dec 21 12:45:30 2010
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Tue, 21 Dec 2010 17:45:30 +0000
Subject: [Biopython-dev] Reverse complement SeqRecord and features
In-Reply-To:
References: <4D108CCD.8020408@bham.ac.uk>
Message-ID: <4D10E7BA.9070805@bham.ac.uk>

Peter wrote:
> On Tue, Dec 21, 2010 at 1:56 PM, Peter wrote:
>
>> http://github.com/peterjc/biopython/commits/seqrecord-rc
>>
>> You can read the reverse_complement method docstring online here:
>>
>> https://github.com/peterjc/biopython/blob/7f17bbfef9882ef039d02ff04908d01ab400b71b/Bio/SeqRecord.py
>>
>> I've had a quick look, and to rebase it or merge it requires some manual
>> merge conflict resolution due to changes in Bio/SeqFeature.py and its
>> unit tests. However, I'll make time to update the branch if you are willing
>> to test this, ...
>>
Hi Peter

> http://github.com/peterjc/biopython/commits/seqrecord-rc
> has been rebased to the master, so please test this new branch:
> http://github.com/peterjc/biopython/commits/seqrecord-rc2
>
> I need to double check the SeqRecord import in Seq.py is sane, but
> the unit tests are passing.
>
The branch seems to work fine on my application, although that's
obviously not an exhaustive test.

Cheers

Nick

From biopython at maubp.freeserve.co.uk Tue Dec 21 13:21:38 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Dec 2010 18:21:38 +0000
Subject: [Biopython-dev] Pending deprecations
In-Reply-To: <49679.10755.qm@web62403.mail.re1.yahoo.com>
References: <49679.10755.qm@web62403.mail.re1.yahoo.com>
Message-ID:

On Tue, Dec 21, 2010 at 3:15 PM, Michiel de Hoon wrote:
> --- On Sat, 12/18/10, Peter wrote:
>> > Bio.Nexus.Nexus._kill_comments_and_break_lines
>>
>> That's a private method so we can just remove it.
>>
> Well, looking at this code it seems that this method is still needed if
> cnexus (C module) is not available, for example in case of Jython.
> So I think we should remove the PendingDeprecationWarning for
> this function because we cannot remove it.

I haven't checked, but that sounds sensible.

Peter

From krother at rubor.de Wed Dec 22 05:37:33 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 22 Dec 2010 11:37:33 +0100
Subject: [Biopython-dev] Bio.PDB.Structure tests added
In-Reply-To:
References:
Message-ID:

Hi,

I was just running tests on the functions get_chains, get_residues, and
get_atoms in Bio.PDB.Structure. It all seemed fine, and I decided to
create a branch with test functions, because I didn't find any so far.

(the testing included manually counting through the atoms. I recognized
that there are some specific rules how insertion codes are interpreted
(e.g. always takes the second residue), but they seemed consistent to me.

see also:
https://github.com/krother/biopython/commits/bugfix_getresidue

https://github.com/krother/biopython/commit/2609230e5f661abf0d0ca1aa9f0e8592bc2141c7

Best regards,
   Kristian

From biopython at maubp.freeserve.co.uk Wed Dec 22 05:46:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Dec 2010 10:46:54 +0000
Subject: [Biopython-dev] Bio.PDB.Structure tests added
In-Reply-To:
References:

Message-ID:

On Wed, Dec 22, 2010 at 10:37 AM, Kristian Rother wrote:
>
> Hi,
>
> I was just running tests on the functions get_chains, get_residues, and
> get_atoms in Bio.PDB.Structure. It all seemed fine, and I decided to
> create a branch with test functions, because I didn't find any so far.
>
> (the testing included manually counting through the atoms. I recognized
> that there are some specific rules how insertion codes are interpreted
> (e.g. always takes the second residue), but they seemed consistent to me.
>
> see also:
> https://github.com/krother/biopython/commits/bugfix_getresidue
>
> https://github.com/krother/biopython/commit/2609230e5f661abf0d0ca1aa9f0e8592bc2141c7
>
> Best regards,
>    Kristian

Cherry-picked, thanks.

Peter

From bugzilla-daemon at portal.open-bio.org Wed Dec 22 06:12:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 22 Dec 2010 06:12:45 -0500
Subject: [Biopython-dev] [Bug 3161] MEME Parser fails for large MEME files
In-Reply-To:
Message-ID: <201012221112.oBMBCjiR006180@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3161

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-12-22 06:12 EST -------
Fix applied, thanks:
https://github.com/biopython/biopython/commit/350cb6bd14aebbf3a3b99c26cc4e20c4bbe712e3

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
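
The indexing fix proposed in bug 3161 can be illustrated with a short sketch.
This is only an illustration under stated assumptions: it returns plain token
lists rather than a MEMEMotif, and the helper name header_tokens is
hypothetical. It shows that dropping the first five characters ('MOTIF') makes
both header variants split into the same token positions, which is what the
ls[3], ls[6], ls[9] and ls[12] indices rely on.

```python
def header_tokens(line):
    """Split a MEME motif header after removing the leading 'MOTIF'."""
    # Hypothetical helper; mirrors line[5:].strip() followed by split().
    return line[5:].strip().split()


spaced = "MOTIF 99 width = 29 sites = 4 llr = 286 E-value = 4.0e-016"
fused = "MOTIF100 width = 29 sites = 3 llr = 253 E-value = 1.4e-023"

for header in (spaced, fused):
    ls = header_tokens(header)
    # Same indices whether or not whitespace follows 'MOTIF':
    # ls[3] is the width, ls[6] the sites, ls[9] the llr, ls[12] the E-value.
    print(ls[0], int(ls[3]), int(ls[6]), int(ls[9]), float(ls[12]))

# Without the offset, the fused header yields one token fewer,
# which is what shifted the original indices.
print(len(spaced.split()), len(fused.split()))
```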
From fkauff at biologie.uni-kl.de Wed Dec 22 07:41:09 2010
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Wed, 22 Dec 2010 13:41:09 +0100
Subject: [Biopython-dev] Bio.AlignIO, Bio.Nexus, MrBayes, polymorphic sites, maximum line length
In-Reply-To:
References: <4CF77A0B.9050204@bham.ac.uk>

<4CF7BACA.8050206@bham.ac.uk>
Message-ID: <4D11F1E5.1000906@biologie.uni-kl.de>

On 12/02/2010 05:22 PM, Peter wrote:
> On Thu, Dec 2, 2010 at 3:27 PM, Nick Loman wrote:
>> ...
>> raise NexusError('Unknown partition: '+interleave_by_partition)
>> TypeError: cannot concatenate 'str' and 'bool' objects
>>
> That should probably be something like this to avoid the TypeError
> in the exception:
>
> raise NexusError('Unknown partition: %r' % interleave_by_partition)
>
>> Which suggests that combine does not add partitions for each
>> alignment. I could of course work around this with extra code.
>>
> Or that the code isn't expecting True for interleave_by_partition?
> At first glance the expected argument type isn't obvious to me...

interleave_by_partition is in fact expecting the name of the partition
to be used for the split. Nexus.combine() creates a partition named
"combined" which contains the character delimiting the original files
that were used for the combination. This partition could then be used
for the interleave_by_partition parameter.

Frank

> Peter
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From fkauff at biologie.uni-kl.de Wed Dec 22 07:32:49 2010
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Wed, 22 Dec 2010 13:32:49 +0100
Subject: [Biopython-dev] Pending deprecations
In-Reply-To:
References: <49679.10755.qm@web62403.mail.re1.yahoo.com>
Message-ID: <4D11EFF1.3020704@biologie.uni-kl.de>

On 12/21/2010 07:21 PM, Peter wrote:
> On Tue, Dec 21, 2010 at 3:15 PM, Michiel de Hoon wrote:
>> --- On Sat, 12/18/10, Peter wrote:
>>>> Bio.Nexus.Nexus._kill_comments_and_break_lines
>>> That's a private method so we can just remove it.
>>>
>> Well, looking at this code it seems that this method is still needed if
>> cnexus (C module) is not available, for example in case of Jython.
>> So I think we should remove the PendingDeprecationWarning for
>> this function because we cannot remove it.

Yep, that's what it's for. Not needed if cnexus code is available.
However, if it is needed, the method takes a prohibitive amount of time
for larger nexus files anyway. We could still remove it and make cnexus
mandatory, I don't see any reason why cnexus could not be available.

Frank

> I haven't checked, but that sounds sensible.
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From biopython at maubp.freeserve.co.uk Wed Dec 22 07:58:00 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Dec 2010 12:58:00 +0000
Subject: [Biopython-dev] Pending deprecations
In-Reply-To: <4D11EFF1.3020704@biologie.uni-kl.de>
References: <49679.10755.qm@web62403.mail.re1.yahoo.com> <4D11EFF1.3020704@biologie.uni-kl.de>
Message-ID:

On Wed, Dec 22, 2010 at 12:32 PM, Frank Kauff wrote:
>
> Yep, that's what it's for. Not needed if cnexus code is available. However,
> if it is needed, the method takes a prohibitive amount of time for larger
> nexus files anyway. We could still remove it and make cnexus mandatory, I
> don't see any reason why cnexus could not be available.
>
> Frank

Two major reasons, Jython has no compiled C modules (Java based),
and so far we haven't ported any of the C extensions to Python 3.

Peter

From biopython at maubp.freeserve.co.uk Wed Dec 22 08:02:56 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Dec 2010 13:02:56 +0000
Subject: [Biopython-dev] Bio.AlignIO, Bio.Nexus, MrBayes, polymorphic sites, maximum line length
In-Reply-To: <4D11F1E5.1000906@biologie.uni-kl.de>
References: <4CF77A0B.9050204@bham.ac.uk>

<4CF7BACA.8050206@bham.ac.uk> <4D11F1E5.1000906@biologie.uni-kl.de>
Message-ID:

On Wed, Dec 22, 2010 at 12:41 PM, Frank Kauff wrote:
>
> On 12/02/2010 05:22 PM, Peter wrote:
>>
>> On Thu, Dec 2, 2010 at 3:27 PM, Nick Loman wrote:
>>>
>>> ...
>>>   raise NexusError('Unknown partition: '+interleave_by_partition)
>>> TypeError: cannot concatenate 'str' and 'bool' objects
>>>
>> That should probably be something like this to avoid the TypeError
>> in the exception:
>>
>> raise NexusError('Unknown partition: %r' % interleave_by_partition)
>>
>>> Which suggests that combine does not add partitions for each
>>> alignment. I could of course work around this with extra code.
>>>
>> Or that the code isn't expecting True for interleave_by_partition?
>> At first glance the expected argument type isn't obvious to me...
>
> interleave_by_partition is in fact expecting the name of the partition to be
> used for the split. Nexus.combine() creates a partition named "combined"
> which contains the character delimiting the original files that were used
> for the combination. This partition could then be used for the
> interleave_by_partition parameter.
>

Thanks Frank - I added that to the docstring,
https://github.com/biopython/biopython/commit/c73b0321b26a1377a351466fe9cf927a38943b62

Peter

From biopython at maubp.freeserve.co.uk Fri Dec 24 12:39:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Dec 2010 17:39:52 +0000
Subject: [Biopython-dev] Reverse complement SeqRecord and features
In-Reply-To: <4D10E7BA.9070805@bham.ac.uk>
References: <4D108CCD.8020408@bham.ac.uk> <4D10E7BA.9070805@bham.ac.uk>
Message-ID: