From bugzilla-daemon at portal.open-bio.org Mon Mar 1 13:14:45 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 1 Mar 2010 13:14:45 -0500 Subject: [Biopython-dev] [Bug 2551] Adding advanced __getitem__ to generic alignment, e.g. align[1:2, 5:-5] In-Reply-To: Message-ID: <201003011814.o21IEjcK024496@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2551 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-01 13:14 EST ------- I've started a possible implementation of an improved multiple sequence alignment object on a github branch: http://github.com/peterjc/biopython/commits/alignment-obj This now covers: Bug 2551 - Adding advanced __getitem__ e.g. align[1:2,5:-5] Bug 2552 - Adding alignments Bug 2553 - Adding SeqRecord objects to an alignment (append or extend) Bug 2554 - Creating an Alignment from a list of SeqRecord objects -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bioinformed at gmail.com Mon Mar 1 18:22:42 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Mon, 1 Mar 2010 18:22:42 -0500 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> Message-ID: <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> On Thu, Feb 11, 2010 at 12:29 AM, Peter wrote: > On Mon, Jan 11, 2010 at 5:11 PM, Peter > wrote: > > I didn't want to rush the SFF support into Biopython 1.53, but its been > > waiting "ready" for a while now. Any objections or comments about > > me merging this now? 
> > There were no objections, and I ran this by Brad and Michiel and > have just merged this into the master branch. Time for some more > testing! > > I've tried out the recently landed SFF SeqIO code and am pleased to report that it works very well. I am parsing gsMapper 454PairAlign.txt output and converting it to SAM/BAM format to view in IGV (among other things) and wanted to include per-based quality score information from the SFF files. The only glitch so far is that the indexed access mode yields sequences with no alphabet assigned. The solution is to add the following to the beginning of SffDict.__init__: if alphabet is None: alphabet = Alphabet.generic_dna My only other comment is that several file reads and struct.unpacks can be merged in _sff_read_seq_record. Given the number of records in most 454 SFF files, I suspect the micro-optimization effort will be worth the slight cost in code clarity. Thanks to Peter and Jose for all of their hard work! Best regards, -Kevin From biopython at maubp.freeserve.co.uk Tue Mar 2 05:08:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Mar 2010 10:08:27 +0000 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> Message-ID: <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs wrote: > On Thu, Feb 11, 2010 at 12:29 AM, Peter > wrote: >> >> On Mon, Jan 11, 2010 at 5:11 PM, Peter >> wrote: >> > I didn't want to rush the SFF support into Biopython 1.53, but its been >> > waiting "ready" for a while now. Any objections or comments about >> > me merging this now? 
>>
>> There were no objections, and I ran this by Brad and Michiel and
>> have just merged this into the master branch. Time for some more
>> testing!
>>
>
> I've tried out the recently landed SFF SeqIO code and am pleased to
> report that it works very well.

Great :)

If you have suggestions for the documentation please voice them.
Also did the handling of trimmed reads seem sensible? Until we
release this we can tweak the API.

> I am parsing gsMapper 454PairAlign.txt output and
> converting it to SAM/BAM format to view in IGV (among other things) and
> wanted to include per-based quality score information from the SFF files.

Are you reading and writing SAM/BAM format with Python? Looking
into this is on my (long) todo list.

> The only glitch so far is that the indexed access mode yields sequences
> with no alphabet assigned. The solution is to add the following to the
> beginning of SffDict.__init__:
>
>     if alphabet is None:
>         alphabet = Alphabet.generic_dna

Thanks - I'll look at that.

> My only other comment is that several file reads and struct.unpacks can be
> merged in _sff_read_seq_record. Given the number of records in most 454 SFF
> files, I suspect the micro-optimization effort will be worth the slight cost
> in code clarity.

I did try to spend some effort on the run time, but it wouldn't
surprise me if there were still room for improvement. Since most
of my SFF files were only up to 2GB with under a million reads, this
wasn't such an issue (compared to FASTQ files with Solexa data).

I guess you mean the flowgram values, flowgram index, bases
and qualities might be loaded with a single read? That would
be worth trying.

> Thanks to Peter and Jose for all of their hard work!
> Best regards, > -Kevin And thanks for the feedback :) Peter From biopython at maubp.freeserve.co.uk Tue Mar 2 07:02:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Mar 2010 12:02:53 +0000 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> Message-ID: <320fb6e01003020402v6b2fab6j88f4c6fc90da15a9@mail.gmail.com> On Tue, Mar 2, 2010 at 10:08 AM, Peter wrote: > On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs wrote: >>?The only glitch so far is that the indexed access mode yields sequences >> with no alphabet assigned. ?The solution is to add the following to the >> beginning of SffDict.__init__: >> ?? ? ? ?if alphabet is None: >> ?? ? ? ? ?alphabet = Alphabet.generic_dna > > Thanks - I'll look at that. Yes, that looks sensible - change commited. Would you like to be credited in our NEWS and CONTRIB file for this little bug fix? Peter From biopython at maubp.freeserve.co.uk Tue Mar 2 07:25:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Mar 2010 12:25:05 +0000 Subject: [Biopython-dev] Alignment object In-Reply-To: <20091028121833.GC22395@sobchak.mgh.harvard.edu> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> On Wed, Oct 28, 2009 at 12:18 PM, Brad Chapman wrote: >Peter wrote: >> My rough work in progress in on github - at the moment I'm still trying >> things out, and don't assume anything is set in stone. 
If you want to >> have a play with this code, feedback is very welcome - probably best >> on the dev list rather than here. See: >> >> http://github.com/peterjc/biopython/tree/seqrecords >> >> (a lot of the alignment things I want to support, like slicing and adding >> are very closely linked to doing the same operations to SeqRecords) Here is a new branch implementing a multiple-sequence-alignment class (living under Bio.Align for now) based on the recent support for slicing and adding SeqRecord objects: http://github.com/peterjc/biopython/tree/alignment-obj This handles most of the basic tasks I want to be able to easily do with classical alignments, based on previous discussions on the mailing list and/or bugzilla: http://bugzilla.open-bio.org/show_bug.cgi?id=2551 http://bugzilla.open-bio.org/show_bug.cgi?id=2552 http://bugzilla.open-bio.org/show_bug.cgi?id=2553 http://bugzilla.open-bio.org/show_bug.cgi?id=2554 At its core, the alignment is still held as a list of SeqRecord objects, which should mean minimal problems with backwards compatibility. If anyone would like to try out the code, comments would be very welcome. There are plenty of doctests in the docstrings which should explain how I expect things to work. > The bx-python alignment object is nice and goes to/from MAF > and AXT formats: > > http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/core.py > > This supports slicing by alignment coordinates and by reference > coordinates for a species in the alignment. Some other useful > features are limiting the alignment to specific species and removing > all gap columns that can result. The representation is a high level > Alignment object containing multiple Components. My code does not (yet) attempt to deal with next-gen sequencing alignments, which would require padding all the (short) reads with leading and trailing gaps to ensure all rows of the alignment have the same length. 
Doing this in a memory efficient way could be
done with a PaddedSeq object, or a very different alignment object
(holding reads and their offsets in memory). I'm not sure what is best,
but the bx-python model looks worth understanding to help decide.

Perhaps until this is settled, it would be premature to merge my
alignment class to the trunk. After all, we may need to tweak the
alignment object class hierarchy.

Peter

From bioinformed at gmail.com Tue Mar 2 07:29:38 2010
From: bioinformed at gmail.com (Kevin Jacobs)
Date: Tue, 2 Mar 2010 07:29:38 -0500
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020402v6b2fab6j88f4c6fc90da15a9@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
 <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
 <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
 <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
 <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
 <320fb6e01003020402v6b2fab6j88f4c6fc90da15a9@mail.gmail.com>
Message-ID: <2e1434c11003020429y37343796oddf02ad433ab82ea@mail.gmail.com>

On Tue, Mar 2, 2010 at 7:02 AM, Peter wrote:
> On Tue, Mar 2, 2010 at 10:08 AM, Peter wrote:
> > On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs wrote:
> >> The only glitch so far is that the indexed access mode yields sequences
> >> with no alphabet assigned. The solution is to add the following to the
> >> beginning of SffDict.__init__:
> >>     if alphabet is None:
> >>         alphabet = Alphabet.generic_dna
> >
> > Thanks - I'll look at that.
>
> Yes, that looks sensible - change commited. Would you like to be credited
> in our NEWS and CONTRIB file for this little bug fix?
>
>
I'm happy to contribute and be listed in the credits.
Thanks, -Kevin From bioinformed at gmail.com Tue Mar 2 07:36:27 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 2 Mar 2010 07:36:27 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> Message-ID: <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> On Tue, Mar 2, 2010 at 7:25 AM, Peter wrote: > On Wed, Oct 28, 2009 at 12:18 PM, Brad Chapman > wrote:My code does not (yet) attempt to deal with next-gen sequencing > alignments, which would require padding all the (short) reads with > leading and trailing gaps to ensure all rows of the alignment have > the same length. Doing this in a memory efficient way could be > done with a PaddedSeq object, or a very different alignment object > (hold read and their offsets in memory). I'm not sure what is best, > but the bx-python model looks worth understanding to help decide. > > Perhaps until this is settled, it would be premature to merge my > alignment class to the trunk. After all, we may need to tweak the > alignment object class heirachy. Hi Peter, I'm just jumping in here and have not yet read all of the background material. However, I am working with next-gen alignments and am curious as to what you have in mind. At first glance, it sounds like you want to access aligned reads in a 'pileup' format (i.e., an object model akin to http://samtools.sourceforge.net/pileup.shtml). Or are you thinking of something different entirely? Best regards, -Kevin From bioinformed at gmail.com Tue Mar 2 07:28:22 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 2 Mar 2010 07:28:22 -0500 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? 
In-Reply-To: <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> Message-ID: <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com> On Tue, Mar 2, 2010 at 5:08 AM, Peter wrote: > On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs > wrote: > > I've tried out the recently landed SFF SeqIO code and am pleased to > > report that it works very well. > > Great :) > > If you have suggestions for the documentation please voice them. > Also did the handling of trimmed reads seem sensible? Until we > release this we can tweak the API. I only looked at the module documentation and it was more than sufficient to get started. I've never really used BioPython before, so I was pleasantly surprised at how easy it was to get started. The BioPython SFF parser and indexed access replaced a hairy process of extracting data using 454's sffinfo and packing it into a BDB file. > > I am parsing gsMapper 454PairAlign.txt output and > > converting it to SAM/BAM format to view in IGV (among other things) and > > wanted to include per-based quality score information from the SFF files. > > Are you reading and writing SAM/BAM format with Python? Looking > into this is on my (long) todo list. > Yes-- so far I have code to populate the basic data for unpaired reads, but none of the optional annotations. My script reads the 454 pairwise alignment data, finds each read in the source SFF file, figures out if extra trimming was applied by gsMapper, and extracts the matching PHRED quality scores. Uniquely mapped reads are given a mapping quality (MAPQ) of 60 and non-unique reads are assigned MAPQ of 0 (as recommended by the SAMtools FAQ). 
The script can output SAM records or create a subprocess to sort the records and recode to BAM format using samtools. I've attached the current version script and you are welcome to use it for any purpose. > My only other comment is that several file reads and struct.unpacks can be > > merged in _sff_read_seq_record. Given the number of records in most 454 > SFF > > files, I suspect the micro-optimization effort will be worth the slight > cost > > in code clarity. > > [...]I guess you mean the flowgram values, flowgram index, bases and qualities might be loaded with a single read? That would > be worth trying. > Exactly! Also, flowgrams do not need to be unpacked when trimming. My own bias is to encode the quality scores and flowgrams in numpy arrays rather than lists, however I understand that the goal is to keep the external dependencies to a minimum (although NumPy is required elsewhere). Also, the test "chr(0)*padding != handle.read(padding)" could be written just as clearly as "handle.read(padding).count('\0') != padding" and not generate as many temporary objects. Best regards, -Kevin -------------- next part -------------- # -*- coding: utf-8 -*- # Convert 454PairAlign.txt and the corresponding SFF files into SAM/BAM format import re import sys from operator import getitem, itemgetter from itertools import izip, imap, groupby, repeat from subprocess import Popen, PIPE import numpy as np try: # Import fancy versions of basic IO functions from my GLU package # see http://code.google.com/p/glu-genetics from glu.lib.fileutils import autofile,hyphen,table_writer,table_reader except ImportError: import csv # The real version handles automatic gz/bz2 (de)compression autofile = file def hyphen(filename,default): if filename=='-' and default is not None: return default return filename # Write a tab-delimited ASCII file # The real version handles many more formats (CSV, XLS, Stata), column # selection, header optionds, row filters, and other toys. 
def table_writer(filename,hyphen=None): if filename=='-' and hyphen is not None: dest = hyphen else: dest = autofile(filename,'wb') return csv.writer(dest, dialect='excel-tab') # Read a tab-delimited ASCII file # The real version handles many more formats (CSV, XLS, Stata), column # selection, header optionds, row filters, and other toys. def table_reader(filename,hyphen=None): if filename=='-' and hyphen is not None: dest = hyphen else: dest = autofile(filename,'rb') return csv.reader(dest, dialect='excel-tab') CIGAR_map = { ('-','-'):'P' } for a in 'NACGTacgt': CIGAR_map[a,'-'] = 'I' CIGAR_map['-',a] = 'D' for b in 'NACGTacgt': CIGAR_map[a,b] = 'M' def make_cigar_py(query,ref): assert len(query)==len(ref) igar = imap(getitem, repeat(CIGAR_map), izip(query,ref)) cigar = ''.join('%d%s' % (len(list(run)),code) for code,run in groupby(igar)) return cigar # Try to import the optimized Cython version # The Python version is pretty fast, but I wanted to play with Cython. try: from cigar import make_cigar except ImportError: make_cigar = make_cigar_py class SFFIndex(object): def __init__(self, sfffiles): self.sffindex = sffindex = {} for sfffile in sfffiles: from Bio import SeqIO prefix,ext = sfffile[-13:].split('.') assert ext=='sff' print >> sys.stderr,'Loading SFF index for',sfffile reads = SeqIO.index(sfffile, 'sff-trim') sffindex[prefix] = reads def get_quality(self, qname, query, qstart, qstop): prefix = qname[:9] sff = self.sffindex.get(prefix) if not sff: return '*' rec = sff[qname] phred = rec.letter_annotations['phred_quality'] sffqual = np.array(phred,dtype=np.uint8) sffqual += 33 sffqual = sffqual.tostring() # Align the query to the original read to find the matching quality # score information. This is complicated by the extra trimming done by # gsMapper. We could obtain this information by parsing the # 454TrimStatus.txt, but it is easier to search for the sub-sequence in # the reference. Ones hopes the read maps uniquely, but this is not # checked. 
# CASE 1: Forward read alignment if qstart> sys.stderr,'MATCHED TYPE F2: name=%s, qstart=%d(%d), qstop=%d, qlen=%d, len.query=%d' % (qname,start+1,qstart,qstop,qlen,len(query)) qual = sffqual[start:start+len(query)] # CASE 2: Backward read alignment else: # Try using specified cut-points read = str(rec.seq.complement()) seq = read[qstop-1:qstart][::-1] read = read[::-1] # If it matches, then compute quality if seq==query: qual = sffqual[qstop-1:qstart][::-1] else: # otherwise gsMapper applied extra trimming, so we have to manually find the offset start = read.index(query) seq = read[start:start+len(query)] if seq==query: #print >> sys.stderr,'MATCHED TYPE R2: name=%s, qstart=%d, qstop=%d(%d), qlen=%d, len.query=%d' % (qname,qstart,start+1,qstop,qlen,len(query)) qual = sffqual[::-1][start:start+len(query)] assert seq==query assert len(qual) == len(query) return qual def pair_align(filename, sffindex): records = autofile(filename) split = re.compile('[\t ,.]+').split mrnm = '*' mpos = 0 isize = 0 mapq = 60 for line in records: assert line.startswith('>') fields = split(line) qname = fields[0][1:] qstart = int(fields[1]) qstop = int(fields[2]) #qlen = int(fields[4]) rname = fields[6] rstart = int(fields[7]) rstop = int(fields[8]) #rlen = int(fields[10]) query = split(records.next())[2] qq = query.replace('-','') ref = split(records.next())[2] cigar = make_cigar(query,ref) qual = sffindex.get_quality(qname, qq, qstart, qstop) flag = 0 if qstart>qstop: flag |= 0x10 if rstart>rstop: flag |= 0x20 yield [qname, flag, rname, rstart, mapq, cigar, mrnm, mpos, isize, qq, qual] def option_parser(): import optparse usage = 'usage: %prog [options] 454PairAlign.txt[.gz] [SFFfiles.sff..]' parser = optparse.OptionParser(usage=usage) parser.add_option('-r', '--reflist', dest='reflist', metavar='FILE', help='Reference genome contig list') parser.add_option('-o', '--output', dest='output', metavar='FILE', default='-', help='Output SAM file') return parser def main(): parser = 
option_parser() options,args = parser.parse_args() if not args: parser.print_help(sys.stderr) sys.exit(2) sffindex = SFFIndex(args[1:]) alignment = pair_align(hyphen(args[0],sys.stdin), sffindex) write_bam = options.output.endswith('.bam') if write_bam: if not options.reflist: raise ValueError('Conversion to BAM format requires a reference genome contig list (-r/--reflist)') # Creating the following two-stage pipeline deadlocks due to problems with subprocess # -- use the shell method below instead #sammer = Popen(['samtools','import',options.reflist,'-','-'],stdin=PIPE,stdout=PIPE) #bammer = Popen(['samtools','sort','-', options.output[:-4]], stdin=sammer.stdout) cmd = 'samtools import "%s" - - | samtools sort - "%s"' % (options.reflist,options.output[:-4]) bammer = Popen(cmd,stdin=PIPE,shell=True,bufsize=-1) out = table_writer(bammer.stdin) else: out = table_writer(options.output,hyphen=sys.stdout) out.writerow(['@HD', 'VN:1.0']) if options.reflist: reflist = table_reader(options.reflist) for row in reflist: if len(row)<2: continue contig_name = row[0] contig_len = int(row[1]) out.writerow(['@SQ', 'SN:%s' % contig_name, 'LN:%d' % contig_len]) print >> sys.stderr, 'Generating alignment from %s to %s' % (args[0],options.output) for qname,qalign in groupby(alignment,itemgetter(0)): qalign = list(qalign) if len(qalign)>1: # Set MAPQ to 0 for multiply aligned reads for row in qalign: row[4] = 0 out.writerow(row) else: out.writerow(qalign[0]) if write_bam: print >> sys.stderr,'Finishing BAM encoding...' 
bammer.communicate() if __name__=='__main__': if 1: main() else: try: import cProfile as profile except ImportError: import profile import pstats prof = profile.Profile() try: prof.runcall(main) finally: stats = pstats.Stats(prof) stats.strip_dirs() stats.sort_stats('time', 'calls') stats.print_stats(25) From biopython at maubp.freeserve.co.uk Tue Mar 2 08:01:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Mar 2010 13:01:53 +0000 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com> Message-ID: <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com> Kevin wrote: > I only looked at the module documentation and it was more than sufficient to > get started. ?I've never really used BioPython before, so I was pleasantly > surprised at how easy it was to get started. ?The BioPython SFF parser and > indexed access replaced a hairy process of extracting data using 454's > sffinfo and packing it into a BDB file. Great :) >> > I am parsing gsMapper 454PairAlign.txt output and >> > converting it to SAM/BAM format to view in IGV (among other things) and >> > wanted to include per-based quality score information from the SFF >> > files. >> >> Are you reading and writing SAM/BAM format with Python? Looking >> into this is on my (long) todo list. > > Yes-- so far I have code to populate the basic data for unpaired reads, but > none of the optional annotations. 
My script reads the 454 pairwise
> alignment data, finds each read in the source SFF file, figures out if extra
> trimming was applied by gsMapper, and extracts the matching PHRED quality
> scores. Uniquely mapped reads are given a mapping quality (MAPQ) of 60 and
> non-unique reads are assigned MAPQ of 0 (as recommended by the SAMtools
> FAQ). The script can output SAM records or create a subprocess to sort the
> records and recode to BAM format using samtools. I've attached the current
> version script and you are welcome to use it for any purpose.

I'll take a look...

>> [...] I guess you mean the flowgram values, flowgram index, bases
>> and qualities might be loaded with a single read? That would
>> be worth trying.
>
> Exactly!

If I recall correctly, I felt the unpacking was more complicated (and
not needed for the sequence bases), but I agree that if this is faster
it is worthwhile.

> Also, flowgrams do not need to be unpacked when trimming.

True, that shouldn't make the function much more complex. I'll
try to look at that later today.

> My own bias is to encode the quality scores and flowgrams in numpy
> arrays rather than lists, however I understand that the goal is to keep
> the external dependencies to a minimum (although NumPy is required
> elsewhere).

Yes, I did wonder about using NumPy here but wanted to ensure that
the core of Biopython remains without an external dependency.

> Also, the test "chr(0)*padding != handle.read(padding)" could be written
> just as clearly as "handle.read(padding).count('\0') != padding" and not
> generate as many temporary objects.
Good point, done - and you're in the contributors list now ;)

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Tue Mar 2 09:34:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 14:34:07 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
 <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
 <20091028121833.GC22395@sobchak.mgh.harvard.edu>
 <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
 <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
Message-ID: <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>

On Tue, Mar 2, 2010 at 12:36 PM, Kevin Jacobs wrote:
> On Tue, Mar 2, 2010 at 7:25 AM, Peter wrote:
>> My code does not (yet) attempt to deal with next-gen sequencing
>> alignments, which would require padding all the (short) reads with
>> leading and trailing gaps to ensure all rows of the alignment have
>> the same length. Doing this in a memory efficient way could be
>> done with a PaddedSeq object, or a very different alignment object
>> (hold read and their offsets in memory). I'm not sure what is best,
>> but the bx-python model looks worth understanding to help decide.
>>
>> Perhaps until this is settled, it would be premature to merge my
>> alignment class to the trunk. After all, we may need to tweak the
>> alignment object class heirachy.
>
>
> Hi Peter,
>
> I'm just jumping in here and have not yet read all of the background
> material. However, I am working with next-gen alignments and am
> curious as to what you have in mind. At first glance, it sounds like
> you want to access aligned reads in a 'pileup' format (i.e., an object
> model akin to http://samtools.sourceforge.net/pileup.shtml). Or are
> you thinking of something different entirely?

Probably something different.
My general concern boils down to
the fact that the current Alignment model as an enhanced "list of
SeqRecord objects" is potentially limiting.

The alignment code in Biopython (and my branch, which is basically
an extension to that) deals with classical multiple sequence alignments
like ClustalW etc. You can think of the alignment as a matrix of letters:
each row is a sequence (e.g. a gene), and there will be some gap
characters for insertions, and padding for leading/trailing omissions.
There may or may not be a consensus sequence too.

With assemblies you have a (long) consensus with many (short) reads
aligned to it. In order to hold this as a "matrix" representation, all
the (short) reads would require (lots of) leading/trailing padding.
The same applies when mapping reads to a reference genome.

So, while the current object model may work, all this extra padding
might mean too much of a memory overhead (especially as all the
rows are currently stored as SeqRecord objects).

Instead, we might just store the (short) read sequence, name, and
its offset (and perhaps the strand). We can then reconstruct columns
or rows mimicking the "matrix" interpretation on demand. However,
the API should make it easy to get the unpadded reads and their
offsets too - so the current alignment API might either be extended
or perhaps changed.

Related to this, a "Lite" version of the alignment object might be
useful when there is no annotation requiring SeqRecord objects,
e.g. for ClustalW, FASTA and PHYLIP alignments all we need is the
sequence and identifiers.

Regarding one of your points, accessing aligned reads (or rows) from
an alignment - currently this is only supported by index (row number).
In most cases the reads (rows) have a unique identifier/name, and
thus one idea I am considering for this branch is overloading the
align[...] syntax further to allow a record's id to be used as an
alternative, i.e. more like a dictionary.
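
The offset-based storage described above can be sketched as follows. This is only an illustration under stated assumptions: the names Read, OffsetAlignment, padded_row and column are hypothetical and are not part of Biopython or of the alignment-obj branch. Each read keeps its unpadded sequence and its start offset; padded rows and columns of the "matrix" view are rebuilt only when asked for, and rows can be fetched by number or by identifier.

```python
from collections import namedtuple

# Hypothetical names for illustration only; nothing here is Biopython API.
Read = namedtuple("Read", ["id", "seq", "offset"])


class OffsetAlignment(object):
    """Hold short reads plus their start offsets instead of padded rows."""

    def __init__(self, length, reads, gap="-"):
        self.length = length  # total number of alignment columns
        self.gap = gap
        self._reads = list(reads)
        # Map identifier -> row number so rows can also be fetched by id.
        self._index = dict((r.id, i) for i, r in enumerate(self._reads))

    def __len__(self):
        return len(self._reads)

    def __getitem__(self, key):
        """Return the unpadded read for a row number or a read identifier."""
        if isinstance(key, str):
            key = self._index[key]
        return self._reads[key]

    def padded_row(self, key):
        """Rebuild one full-width row of the matrix view on demand."""
        read = self[key]
        trailing = self.length - read.offset - len(read.seq)
        return self.gap * read.offset + read.seq + self.gap * trailing

    def column(self, col):
        """Rebuild one column; reads not covering it contribute a gap."""
        letters = []
        for read in self._reads:
            pos = col - read.offset
            covered = 0 <= pos < len(read.seq)
            letters.append(read.seq[pos] if covered else self.gap)
        return "".join(letters)


align = OffsetAlignment(
    12,
    [Read("r1", "ACGT", 0), Read("r2", "GTTA", 2), Read("r3", "TACC", 8)],
)
print(align.padded_row("r2"))  # padded view of one read
print(align.column(3))         # one column across all reads
print(align["r3"].offset)      # unpadded read and its offset stay accessible
```

Memory use here grows with the total read length rather than with rows times columns, which is the trade-off being weighed; whether this should be a separate class or an extension of the existing alignment object is left open, as in the message.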
Other ideas for enhancements on this branch including sorting the rows (with a list like sort method, defaulting to sorting on the record's id strings), per-column annotation (useful for PFAM alignments and the match string in pairwise alignments), and a general annotations dictionary (like we have on SeqRecord objects). Peter From bioinformed at gmail.com Tue Mar 2 09:36:32 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 2 Mar 2010 09:36:32 -0500 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com> <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com> Message-ID: <2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com> On Tue, Mar 2, 2010 at 8:01 AM, Peter wrote: > Kevin wrote:> My own bias is to encode the quality scores and flowgrams in > numpy > > arrays rather than lists, however I understand that the goal is to keep > > the external dependencies to a minimum (although NumPy is required > > elsewhere). > > Yes, I did wonder about using NumPy here but wanted to ensure that > the core of Biopython remains without an external dependency here. > In addition to not creating many little objects, my leanings toward using NumPy are also due to the generality of tricks like the following to recode quality scores to Sanger ASCII-33 format: sffqual = np.array(rec.letter_annotations['phred_quality'],dtype=np.uint8) sffqual += 33 sffqual = sffqual.tostring() That said, the alternatives aren't that slow and small integers are shared from a pre-allocated pool, so this is not as big a concern. 
-Kevin From biopython at maubp.freeserve.co.uk Tue Mar 2 09:44:13 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Mar 2010 14:44:13 +0000 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com> <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com> <2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com> Message-ID: <320fb6e01003020644u229e6353ufee054403e562915@mail.gmail.com> On Tue, Mar 2, 2010 at 2:36 PM, Kevin Jacobs wrote: > On Tue, Mar 2, 2010 at 8:01 AM, Peter wrote: >> Yes, I did wonder about using NumPy here but wanted to ensure that >> the core of Biopython remains without an external dependency here. > > In addition to not creating many little objects, my leanings toward using > NumPy are also due to the generality of tricks like the following to recode > quality scores to Sanger ASCII-33 format: > > sffqual = np.array(rec.letter_annotations['phred_quality'],dtype=np.uint8) > sffqual += 33 > sffqual = sffqual.tostring() > Yeah - I had this kind of thing in mind for the qualities, both when looking at the SFF files and earlier when doing the FASTQ and QUAL stuff. You can probably make that more efficient with one line: sffqual = (np.array(rec.letter_annotations['phred_quality'],dtype=np.uint8) + 33).tostring() Not sure if it will make a measurable difference mind you ;) > That said, the alternatives aren't that slow and small integers are shared > from a pre-allocated pool, so this is not as big a concern. Indeed.
Peter From bioinformed at gmail.com Tue Mar 2 09:51:04 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 2 Mar 2010 09:51:04 -0500 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <320fb6e01003020644u229e6353ufee054403e562915@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com> <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com> <2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com> <320fb6e01003020644u229e6353ufee054403e562915@mail.gmail.com> Message-ID: <2e1434c11003020651y541ce3e5q92fb0fea308a59e9@mail.gmail.com> On Tue, Mar 2, 2010 at 9:44 AM, Peter wrote: > You can probably make that more efficient with one line: > > sffqual = > (np.array(rec.letter_annotations['phred_quality'],dtype=np.uint8) > + 33).tostring() > > Not sure if it will make a measurable difference mind you ;) > I haven't measured, but my understanding is that the inplace "+= 33" will avoid creating a temporary copy and thus be quicker. But as you said, not likely to make a difference in practice. 
-Kevin From chapmanb at 50mail.com Tue Mar 2 10:03:08 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 2 Mar 2010 10:03:08 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> Message-ID: <20100302150308.GP98028@sobchak.mgh.harvard.edu> Peter and Kevin; > >> My code does not (yet) attempt to deal with next-gen sequencing > >> alignments, [...] > >> Perhaps until this is settled, it would be premature to merge my > >> alignment class to the trunk. After all, we may need to tweak the > >> alignment object class hierarchy. My vote would be to merge what you've done in for handling standard multiple alignments, and then look at next-generation read representation as an analogous but separate problem. All of the SeqRecord objects which are useful for drilling in on multiple alignments are likely going to be memory hogs for any real world next gen work. > > I'm just jumping in here and have not yet read all of the background > > material. However, I am working with next-gen alignments and am > > curious as to what you have in mind. At first glance, it sounds like > > you want to access aligned reads in a 'pileup' format (i.e., an object > > model akin to http://samtools.sourceforge.net/pileup.shtml). Or are > > you thinking of something different entirely? This is a good way to go.
SAM is at least an emerging standard that people are adopting, and samtools and the pysam module do a good job of dealing with them: http://code.google.com/p/pysam/ pysam exposes a Pileup style API from sorted and indexed BAM files and scales great for large alignment files: http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html This is a good starting point for providing interoperability with Biopython; it would be great to re-use what we can from these projects. Brad From biopython at maubp.freeserve.co.uk Tue Mar 2 10:28:45 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Mar 2010 15:28:45 +0000 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com> Message-ID: <320fb6e01003020728v760e8208h5da4288dfaef7ed7@mail.gmail.com> On Tue, Mar 2, 2010 at 12:28 PM, Kevin Jacobs wrote: > > Also, flowgrams do not need to be unpacked when trimming. > True - change made on the trunk, should make parsing SFF files as trimmed records a little bit faster.
Thanks Peter From biopython at maubp.freeserve.co.uk Tue Mar 2 11:43:18 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Mar 2010 16:43:18 +0000 Subject: [Biopython-dev] Alignment object In-Reply-To: <20100302150308.GP98028@sobchak.mgh.harvard.edu> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01003020843n72a23176wa023786c46ffb7b3@mail.gmail.com> On Tue, Mar 2, 2010 at 3:03 PM, Brad Chapman wrote: > Peter and Kevin; > >> >> My code does not (yet) attempt to deal with next-gen sequencing >> >> alignments, > [...] >> >> Perhaps until this is settled, it would be premature to merge my >> >> alignment class to the trunk. After all, we may need to tweak the >> >> alignment object class hierarchy. > > My vote would be to merge what you've done in for handling > standard multiple alignments, and then look at next-generation read > representation as an analogous but separate problem. All of the > SeqRecord objects which are useful for drilling in on multiple > alignments are likely going to be memory hogs for any real world > next gen work. OK - that is what I was leaning towards. What do you think about the fact I am introducing an "improved" version of the existing Bio.Align.Generic.Alignment class under Bio.Align.MultipleSeqAlignment? That's actually several questions in one - should this be a new object or just enhance the old one? I favour a new object here because I want to *enforce* the fact that all the rows are the same length, but I doubt people are using the flexibility of the current alignment object in this way. Next where should the new object live?
I find the current use of Bio.Align.Generic somewhat hidden away, thus my suggestion of using Bio.Align directly. Next, what should the new object be called? We could reuse the old name of Alignment but it is a bit vague and would cause confusion given the existing object is also called that. I have used MultipleSeqAlignment but am open to suggestions (e.g. MulSeqAlignment is shorter). Peter From bioinformed at gmail.com Tue Mar 2 12:07:03 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 2 Mar 2010 12:07:03 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <20100302150308.GP98028@sobchak.mgh.harvard.edu> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> Message-ID: <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> On Tue, Mar 2, 2010 at 10:03 AM, Brad Chapman wrote: > Kevin; > > > I'm just jumping in here and have not yet read all of the background > > > material. However, I am working with next-gen alignments and am > > > curious as to what you have in mind. At first glance, it sounds like > > > you want to access aligned reads in a 'pileup' format (i.e., an object > > > model akin to http://samtools.sourceforge.net/pileup.shtml). Or are > > > you thinking of something different entirely? > > This is a good way to go. SAM is at least an emerging standard that > people are adopting, and samtools and the pysam module do a good job > of dealing with them: > > http://code.google.com/p/pysam/ > > I find pysam pretty limited for doing more than reading and subsetting SAM/BAM files. I'm planning to add a constructor and helper functions for creating new aligned reads. 
The current AlignedRead object is also read-only, which will need to be relaxed for many serious applications. Until then, I'm writing (text) SAM records and piping them to samtools to encode in BAM format (see the script attached to one of my earlier emails). > pysam exposes a Pileup style API from sorted and indexed BAM files > and scales great for large alignment files: > > http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html Scalability is okay for conversion to pileup format, but not what I'd consider great. But I agree, pysam is a good starting point. I just wish that the read identifiers and attributes were available via the C API, since those are often needed when, e.g., writing a genotype caller. -Kevin From chapmanb at 50mail.com Wed Mar 3 09:12:15 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 3 Mar 2010 09:12:15 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> Message-ID: <20100303141215.GZ98028@sobchak.mgh.harvard.edu> Kevin and Peter; > I find pysam pretty limited for doing more than reading and subsetting > SAM/BAM files. I'm planning to add a constructor and helper functions for > creating new aligned reads. The current AlignedRead object is also > read-only, which will need to be relaxed for many serious applications. > Until then, I'm writing (text) SAM records and piping them to samtools to > encode in BAM format (see the script attached to one of my earlier emails). 
Agreed. These sound like good improvements. > Scalability is okay for conversion to pileup format, but not what I'd > consider great. But I agree, pysam is a good starting point. I just wish > that the read identifiers and attributes were available via the C API, > since those are often needed when, e.g., writing a genotype caller. Do you think we could build off of what pysam has? The project hasn't seemed especially active, but it would be great to have a unified code base in python for dealing with BAM files. They use mercurial for revision control, so worst case we can always fork this on bitbucket and work off of that. Galaxy has a fork for their use: http://bitbucket.org/kanwei/kanwei-pysam/ The bioconductor folks also seem to be standardizing around SAM/BAM for their analysis pipelines, so practically we may be able to borrow some of their APIs once they have a released version of Rsamtools. > What do you think about the fact I am introducing an "improved" > version of the existing Bio.Align.Generic.Alignment class under > Bio.Align.MultipleSeqAlignment? Yes please. I don't think Generic is that great and am happy to see it improved upon. > That's actually several questions in one - should this be a new > object or just enhance the old one? I favour a new object here > because I want to *enforce* the fact that all the rows are the > same length, but I doubt people are using the flexibility of > the current alignment object in this way. > > Next where should the new object live? I find the current use > of Bio.Align.Generic somewhat hidden away, thus my > suggestion of using Bio.Align directly. > > Next, what should the new object be called? We could reuse > the old name of Alignment but it is a bit vague and would > cause confusion given the existing object is also called that. > I have used MultipleSeqAlignment but am open to suggestions > (e.g. MulSeqAlignment is shorter). 
I like MultipleSeqAlignment, and agree it should be as top level as possible in Bio.Align. If you think a new object is better, go for that and we can move Generic on a deprecation path. It's great you are cleaning this up. Brad From biopython at maubp.freeserve.co.uk Wed Mar 3 10:03:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 3 Mar 2010 15:03:38 +0000 Subject: [Biopython-dev] EMBOSS eprimer3 parser In-Reply-To: <320fb6e01001180833l6396cf23meb7e160fd6814e26@mail.gmail.com> References: <320fb6e01001180833l6396cf23meb7e160fd6814e26@mail.gmail.com> Message-ID: <320fb6e01003030703k691fdbe8i3ab3dfd5ba1640a6@mail.gmail.com> On Mon, Jan 18, 2010 at 4:33 PM, Peter wrote: > Hi all, > > Who on the dev list makes heavy use of the EMBOSS eprimer3 parser in > Biopython? I'd like someone to look over Leighton's proposed enhancements > to this code: http://bugzilla.open-bio.org/show_bug.cgi?id=2968 > > There are two main issues. First, the current code doesn't cope with multiple > primer sets (so Leighton introduces read/parse functions in line with other > modules for single or multiple sets of primers). This seems entirely sensible > to me, and worthwhile in itself. I've made changes on github to do this based on Leighton's code. > Second, Leighton makes some changes to the primer record objects. > I'm not so sure about the necessity here, even if it is backwards > compatible, but I haven't really used this code. What do the rest of > you think? I expect to doing some work with eprimer3 this month, so will feel I can make a more informed choice later. 
Peter From bugzilla-daemon at portal.open-bio.org Wed Mar 3 10:06:47 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Mar 2010 10:06:47 -0500 Subject: [Biopython-dev] [Bug 2968] Modifications to Emboss eprimer3 parser and associated files In-Reply-To: Message-ID: <201003031506.o23F6lgb005243@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2968 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-03 10:06 EST ------- (In reply to comment #0) > The existing Emboss primer3/eprimer3 code has a couple of issues, and some > scope for improvement: > > - The existing Primer3.py parser code can only parse output when eprimer3 is > applied to a single sequence. When eprimer3 is applied to multiple sequence > input, it groups all primers for all sequences into a single record, which may > incorrectly associate primers with the wrong sequences in downstream analysis. > - The current parser lacks an iterator for iterating over multiple sequence > output I've made changes on github to support multiple targets (with a read and a parse function) this based on Leighton's code which addresses the above issues. > - The current parser creates 'ghost' primers for all primer pairs, with length > zero and sequence as an empty string; it does not do this for internal oligos. > A more intuitive solution might be to return None for absent primers/oligos > - The current data model stores all primer data as individual attributes. It > might be more useful to group the attributes of individual primers into their > natural associations Regarding the object changes, I'll be doing some work with eprimer3 this month, so will feel I can make a more informed choice later. 
See also: http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007255.html http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007398.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Mar 3 10:57:09 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 3 Mar 2010 15:57:09 +0000 Subject: [Biopython-dev] Alignment object In-Reply-To: <20100303141215.GZ98028@sobchak.mgh.harvard.edu> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> On Wed, Mar 3, 2010 at 2:12 PM, Brad Chapman wrote: > Kevin and Peter; > >> I find pysam pretty limited for doing more than reading and subsetting >> SAM/BAM files. I'm planning to add a constructor and helper functions for >> creating new aligned reads. The current AlignedRead object is also >> read-only, which will need to be relaxed for many serious applications. >> Until then, I'm writing (text) SAM records and piping them to samtools to >> encode in BAM format (see the script attached to one of my earlier emails). > > Agreed. These sound like good improvements. > >> Scalability is okay for conversion to pileup format, but not what I'd >> consider great. But I agree, pysam is a good starting point.
I just wish >> that the read identifiers and attributes were available via the C API, >> since those are often needed when, e.g., writing a genotype caller. > > Do you think we could build off of what pysam has? The project hasn't > seemed especially active, but it would be great to have a unified > code base in python for dealing with BAM files. They use mercurial > for revision control, so worst case we can always fork this on > bitbucket and work off of that. Galaxy has a fork for their use: > > http://bitbucket.org/kanwei/kanwei-pysam/ > > The bioconductor folks also seem to be standardizing around > SAM/BAM for their analysis pipelines, so practically we may be > able to borrow some of their APIs once they have a released > version of Rsamtools. I agree that we should work towards supporting SAM (and perhaps also BAM) in Biopython, and other projects APIs can be very useful for inspiration or guidance. I was aware of pysam but am concerned about the dependencies: pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools itself - which may all be fine on Linux, but will likely be trouble for us on other platforms (especially Windows). Is anyone aware of any other SAM/BAM parser in Python? >> What do you think about the fact I am introducing an "improved" >> version of the existing Bio.Align.Generic.Alignment class under >> Bio.Align.MultipleSeqAlignment? > > Yes please. I don't think Generic is that great and am happy to see > it improved upon. > >> That's actually several questions in one - should this be a new >> object or just enhance the old one? I favour a new object here >> because I want to *enforce* the fact that all the rows are the >> same length, but I doubt people are using the flexibility of >> the current alignment object in this way. >> >> Next where should the new object live? I find the current use >> of Bio.Align.Generic somewhat hidden away, thus my >> suggestion of using Bio.Align directly.
>> >> Next, what should the new object be called? We could reuse >> the old name of Alignment but it is a bit vague and would >> cause confusion given the existing object is also called that. >> I have used MultipleSeqAlignment but am open to suggestions >> (e.g. MulSeqAlignment is shorter). > > I like MultipleSeqAlignment, and agree it should be as top level as > possible in Bio.Align. If you think a new object is better, go for > that and we can move Generic on a deprecation path. It's great you > are cleaning this up. OK then - I've been wanting to "clean this up" for some time. I'll make time to merge what I have so far (which shouldn't be controversial) and update the tutorial. I would also like to investigate moving the useful bits of the SummaryInfo class into methods of the main alignment class. Testing would be very welcome! Peter From biopython at maubp.freeserve.co.uk Wed Mar 3 12:51:41 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 3 Mar 2010 17:51:41 +0000 Subject: [Biopython-dev] Alignment object In-Reply-To: <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> Message-ID: <320fb6e01003030951n261c124bq31578bc9cc5814c9@mail.gmail.com> On Wed, Mar 3, 2010 at 3:57 PM, Peter wrote: > > OK then - I've been wanting to "clean this up" for some time. 
> I'll make time to merge what I have so far (which shouldn't be > controversial) and update the tutorial. The merge is done, updates to the tutorial to show how to use the new object pending (but already in the doctests). Peter From bioinformed at gmail.com Wed Mar 3 13:30:49 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 3 Mar 2010 13:30:49 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> Message-ID: <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> On Wed, Mar 3, 2010 at 10:57 AM, Peter wrote: > I agree that we should work towards supporting SAM (and perhaps > also BAM) in Biopython, and other projects APIs can be very > useful for inspiration or guidance. > > Honestly, the SAM/BAM format specification is pretty dodgy. Thankfully between samtools and Picard source code, I've been able to work out most of the tricky bits. I'm glad to know that the R folks are also working on this, since they're usually very good about generating clear documentation. > I was aware of pysam but am concerned about the dependencies: > pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools > itself - which may all be fine on Linux, but will likely be trouble for > us on other platforms (especially Windows). > > Is anyone aware of any other SAM/BAM parser in Python? 
Parsing SAM is pretty simple and I can certainly help with gluing it into Biopython (with some help on the Biopython side, since I'm still a newb). I'm about half-way to having a BAM reader and writer for my own purposes. I'm coding the time-critical parts in Cython with a fallback to pure Python, so it may not be ideal for use in Biopython. -Kevin From chapmanb at 50mail.com Thu Mar 4 08:13:52 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 4 Mar 2010 08:13:52 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> Message-ID: <20100304131352.GB19053@sobchak.mgh.harvard.edu> Kevin and Peter; > I was aware of pysam but am concerned about the dependencies: > pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools > itself - which may all be fine on Linux, but will likely be trouble for > us on other platforms (especially Windows). I believe you can remove the pyrex requirement by shipping the generated C file with the distribution. Samtools itself may be an issue; however, right now it is probably a practical need for dealing with SAM/BAM since it implements a lot of BAM generation, sorting, merging and indexing you need in workflows. Also, the C code is included with the distribution so it is more a matter of getting it compiled than introducing extra dependencies. 
The bioconductor work appears to do the same thing. > > I agree that we should work towards supporting SAM (and perhaps > > also BAM) in Biopython, and other projects APIs can be very > > useful for inspiration or guidance. All of my work converts SAM directly into sorted and indexed BAM, and then build from that. For me, direct SAM parsing wouldn't be as useful as BAM. > Honestly, the SAM/BAM format specification is pretty dodgy. Thankfully > between samtools and Picard source code, I've been able to work out most of > the tricky bits. I'm glad to know that the R folks are also working on > this, since they're usually very good about generating clear documentation. Agreed, but at least we are converging on something instead of having to write a parser every time you use a new aligner. The bioconductor SVN is here: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/Rsamtools/ (user: readonly, pass: readonly) I think the pysam API does a decent job for reading and exposing this. The higher level things that would be nice to add are: - Converting the CIGAR string into something more useful. - Smartly dealing with the X? fields from various aligners. These often contain very useful information missing from the SAM specification. Where the data actually is will be aligner specific. - More generally easing dealing with the optional fields. > Parsing SAM is pretty simple and I can certainly help with gluing it into > Biopython (with some help on the Biopython side, since I'm still a newb). > I'm about half-way to having a BAM reader and writer for my own purposes. > I'm coding the time-critical parts in Cython with a fallback to pure > Python, so it may not be ideal for use in Biopython. Cool. Does the BAM reader require samtools C code or is it independent of that? 
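As one illustration of "converting the CIGAR string into something more useful", here is a minimal, dependency-free sketch. `parse_cigar` and `reference_span` are hypothetical helper names, not pysam or Biopython functions, and the operation letters follow the current SAM specification (which includes `=` and `X`), so this is an assumption about what such a helper could look like rather than any existing API.

```python
import re

# One CIGAR element: a run length followed by a single operation letter.
_CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference sequence.
_CONSUMES_REFERENCE = frozenset("MDN=X")


def parse_cigar(cigar):
    """Turn a CIGAR string such as '3M1I2D4M' into (length, op) tuples.

    Hypothetical helper. '*' (CIGAR unavailable) gives an empty list.
    """
    if cigar == "*":
        return []
    ops = []
    pos = 0
    for match in _CIGAR_ELEMENT.finditer(cigar):
        # Any gap between matches means there was unparseable text.
        if match.start() != pos:
            raise ValueError("malformed CIGAR string: %r" % cigar)
        ops.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    if pos != len(cigar) or not ops:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return ops


def reference_span(ops):
    """Number of reference bases covered by the parsed CIGAR operations."""
    return sum(length for length, op in ops if op in _CONSUMES_REFERENCE)
```

Together with the 1-based POS field of a SAM record, `reference_span` gives the last reference position covered by the read as `pos + reference_span(ops) - 1`.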
Brad From aaronquinlan at gmail.com Thu Mar 4 08:33:40 2010 From: aaronquinlan at gmail.com (Aaron Quinlan) Date: Thu, 4 Mar 2010 08:33:40 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <20100304131352.GB19053@sobchak.mgh.harvard.edu> References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> <20100304131352.GB19053@sobchak.mgh.harvard.edu> Message-ID: Just an FYI for those interested in developing tools to work with BAM: it may also be worth looking into the BamTools C++ API developed by Derek Barnett at Boston College (http://sourceforge.net/projects/bamtools/). The API is quite nice and has much of the necessary functionality for iterators, getters/setters, etc. I added BAM support for my BEDTools package (http://code.google.com/p/bedtools/) using the BAMTools libraries. Save for a few minor bugs along the way, it was rather straightforward to include. Aaron Aaron Quinlan, Ph.D. NRSA Postdoctoral Fellow Hall Laboratory University of Virginia Biochem. & Mol. Genetics aaronquinlan at gmail.com On Mar 4, 2010, at 8:13 AM, Brad Chapman wrote: > Kevin and Peter; > >> I was aware of pysam but am concerned about the dependencies: >> pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools >> itself - which may all be fine on Linux, but will likely be trouble for >> us on other platforms (especially Windows). > > I believe you can remove the pyrex requirement by shipping the > generated C file with the distribution. 
Samtools itself may be an > issue; however, right now it is probably a practical need for dealing > with SAM/BAM since it implements a lot of BAM generation, sorting, > merging and indexing you need in workflows. Also, the C code is > included with the distribution so it is more a matter of getting it > compiled than introducing extra dependencies. The bioconductor work > appears to do the same thing. > >>> I agree that we should work towards supporting SAM (and perhaps >>> also BAM) in Biopython, and other projects APIs can be very >>> useful for inspiration or guidance. > > All of my work converts SAM directly into sorted and indexed BAM, > and then build from that. For me, direct SAM parsing wouldn't be as > useful as BAM. > >> Honestly, the SAM/BAM format specification is pretty dodgy. Thankfully >> between samtools and Picard source code, I've been able to work out most of >> the tricky bits. I'm glad to know that the R folks are also working on >> this, since they're usually very good about generating clear documentation. > > Agreed, but at least we are converging on something instead of > having to write a parser every time you use a new aligner. The > bioconductor SVN is here: > > https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/Rsamtools/ > (user: readonly, pass: readonly) > > I think the pysam API does a decent job for reading and exposing > this. The higher level things that would be nice to add are: > > - Converting the CIGAR string into something more useful. > - Smartly dealing with the X? fields from various aligners. These > often contain very useful information missing from the SAM > specification. Where the data actually is will be aligner > specific. > - More generally easing dealing with the optional fields. > >> Parsing SAM is pretty simple and I can certainly help with gluing it into >> Biopython (with some help on the Biopython side, since I'm still a newb). 
>> I'm about half-way to having a BAM reader and writer for my own purposes. >> I'm coding the time-critical parts in Cython with a fallback to pure >> Python, so it may not be ideal for use in Biopython. > > Cool. Does the BAM reader require samtools C code or is it > independent of that? > > Brad > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From bioinformed at gmail.com Thu Mar 4 08:44:39 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Thu, 4 Mar 2010 08:44:39 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <20100304131352.GB19053@sobchak.mgh.harvard.edu> References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> <20100304131352.GB19053@sobchak.mgh.harvard.edu> Message-ID: <2e1434c11003040544j278ffb0fya984cd2668a6d278@mail.gmail.com> On Thu, Mar 4, 2010 at 8:13 AM, Brad Chapman wrote: > All of my work converts SAM directly into sorted and indexed BAM, > and then build from that. For me, direct SAM parsing wouldn't be as > useful as BAM. Same here-- I construct and unserialize alignment data into SAM-like records, but it would be foolish to actually store them natively to disk. > > > Parsing SAM is pretty simple and I can certainly help with gluing it into > > Biopython (with some help on the Biopython side, since I'm still a newb). > > I'm about half-way to having a BAM reader and writer for my own purposes. 
> > I'm coding the time-critical parts in Cython with a fallback to pure > > Python, so it may not be ideal for use in Biopython. > > Cool. Does the BAM reader require samtools C code or is it > independent of that? > It is intended to be independent of the samtools distribution, though some of the C code is currently duplicated (e.g., bgzf). Of course, a Cython/Python re-write would be simple enough, though obviously extra work. -Kevin From bioinformed at gmail.com Thu Mar 4 08:52:33 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Thu, 4 Mar 2010 08:52:33 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> <20100304131352.GB19053@sobchak.mgh.harvard.edu> Message-ID: <2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com> On Thu, Mar 4, 2010 at 8:33 AM, Aaron Quinlan wrote: > Just an FYI for those interested in developing tools to work with BAM: it > may also be worth looking into the BamTools C++ API developed by Derek > Barnett at Boston College (http://sourceforge.net/projects/bamtools/). > The API is quite nice and has much of the necessary functionality for > iterators, getters/setters, etc. > > I added BAM support for my BEDTools package ( > http://code.google.com/p/bedtools/) using the BAMTools libraries. Save > for a few minor bugs along the way, it was rather straightforward to > include. > Thanks for the tip, Aaron. I was unaware of both bamtools and bedtools. The bamtools code looks well designed and quite similar to my emerging Cython/Python rendition. 
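On the earlier point in this thread about converting the CIGAR string into something more useful, a minimal pure-Python sketch follows. It assumes the standard SAM operation letters (MIDNSHP=X); the names parse_cigar and reference_length are hypothetical and are not part of pysam, samtools, or Biopython.

```python
import re

# A CIGAR string is a series of run lengths, each followed by one operation letter.
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that consume reference positions, per the SAM specification.
_CONSUMES_REF = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '5M2I3M' into (length, op) tuples."""
    if cigar == "*":  # SAM uses '*' when no alignment information is available
        return []
    ops = []
    pos = 0
    for match in _CIGAR_RE.finditer(cigar):
        if match.start() != pos:  # unrecognised characters between operations
            raise ValueError("Malformed CIGAR string: %r" % cigar)
        ops.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    if pos != len(cigar) or not ops:
        raise ValueError("Malformed CIGAR string: %r" % cigar)
    return ops


def reference_length(ops):
    """Number of reference bases spanned by the parsed operations."""
    return sum(length for length, op in ops if op in _CONSUMES_REF)


ops = parse_cigar("3S10M1D4M")
print(ops)
print(reference_length(ops))
```

A list of tuples keeps the original order of operations, which is what a caller would need to walk the read and the reference in step.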
-Kevin From bioinformed at gmail.com Thu Mar 4 09:07:03 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Thu, 4 Mar 2010 09:07:03 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com> References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> <20100304131352.GB19053@sobchak.mgh.harvard.edu> <2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com> Message-ID: <2e1434c11003040607i68904329rc122e3acad9cdbe3@mail.gmail.com> On Thu, Mar 4, 2010 at 8:52 AM, Kevin Jacobs < bioinformed at gmail.com> wrote: > On Thu, Mar 4, 2010 at 8:33 AM, Aaron Quinlan wrote: > >> Just an FYI for those interested in developing tools to work with BAM: it >> may also be worth looking into the BamTools C++ API developed by Derek >> Barnett at Boston College (http://sourceforge.net/projects/bamtools/). >> The API is quite nice and has much of the necessary functionality for >> iterators, getters/setters, etc. >> >> I added BAM support for my BEDTools package ( >> http://code.google.com/p/bedtools/) using the BAMTools libraries. Save >> for a few minor bugs along the way, it was rather straightforward to >> include. >> > > Thanks for the tip, Aaron. I was unaware of both bamtools and bedtools. > The bamtools code looks well designed and quite similar to my emerging > Cython/Python rendition. > > Ouch-- never mind. The bamtools code isn't endian-clean -- it will only work correctly on native little-endian architectures. 
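To illustrate what endian-clean means here: BAM stores its multi-byte integers little-endian, so a reader has to state the byte order explicitly rather than rely on the host CPU. The sketch below is an assumption-laden example rather than anyone's actual reader: it works on already-decompressed bytes, only decodes the magic string and the header text length, and the function name is hypothetical.

```python
import struct

# "<" forces little-endian with standard sizes, regardless of the host CPU.
# 4s = four-byte magic string, i = signed 32-bit length of the header text.
_BAM_HEADER = struct.Struct("<4si")


def read_bam_magic_and_text_length(data):
    """Decode the magic string and header-text length from decompressed BAM bytes."""
    magic, l_text = _BAM_HEADER.unpack_from(data, 0)
    if magic != b"BAM\x01":
        raise ValueError("Not a BAM stream")
    return magic, l_text


# A tiny hand-built header: the magic plus the value 258 stored little-endian.
example = b"BAM\x01" + b"\x02\x01\x00\x00"
print(read_bam_magic_and_text_length(example))
```

Without the "<" prefix, struct would use native byte order and the same bytes would decode to a different length on a big-endian machine.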
-Kevin From bugzilla-daemon at portal.open-bio.org Fri Mar 5 05:47:36 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 05:47:36 -0500 Subject: [Biopython-dev] [Bug 2551] Adding advanced __getitem__ to generic alignment, e.g. align[1:2, 5:-5] In-Reply-To: Message-ID: <201003051047.o25Ala5W006656@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2551 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:47 EST ------- Git branch merged to trunk as discussed on the dev mailing list, marking this enhancement as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Mar 5 05:48:18 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 05:48:18 -0500 Subject: [Biopython-dev] [Bug 2552] Adding alignments In-Reply-To: Message-ID: <201003051048.o25AmIoF006689@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2552 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:48 EST ------- Git branch merged to trunk as discussed on the dev mailing list, marking this enhancement as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
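As a rough picture of the kind of two-dimensional indexing Bug 2551 asks for (align[1:2, 5:-5]), here is a toy sketch. ToyAlignment is a hypothetical stand-in built on plain strings; it is not the code merged to trunk and makes no claim about how the real alignment object is implemented.

```python
class ToyAlignment(object):
    """Rows of equal-length strings; a stand-in for a real alignment object."""

    def __init__(self, rows):
        rows = list(rows)
        if len(set(len(row) for row in rows)) > 1:
            raise ValueError("All rows must have the same length")
        self.rows = rows

    def __getitem__(self, index):
        if isinstance(index, int):
            return self.rows[index]  # align[i] -> one row
        if isinstance(index, slice):
            return ToyAlignment(self.rows[index])  # align[i:j] -> some rows
        if not (isinstance(index, tuple) and len(index) == 2):
            raise TypeError("Expected align[rows] or align[rows, columns]")
        row_index, col_index = index
        if isinstance(row_index, int):
            # align[i, j] -> one letter; align[i, a:b] -> part of one row
            return self.rows[row_index][col_index]
        selected = self.rows[row_index]
        if isinstance(col_index, int):
            # align[i:j, k] -> one column, returned as a string
            return "".join(row[col_index] for row in selected)
        # align[i:j, a:b] -> a smaller alignment
        return ToyAlignment([row[col_index] for row in selected])


align = ToyAlignment([
    "ACGTACGTACGTAC",
    "ACGTAGGTACGTAC",
    "ACG-ACGTACGTAC",
])
print(align[1:2, 5:-5].rows)
print(align[:, 3])
```

Python passes align[1:2, 5:-5] to __getitem__ as a tuple of two slice objects, which is why the method only has to branch on the types of the two parts.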
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 05:48:34 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 05:48:34 -0500 Subject: [Biopython-dev] [Bug 2553] Adding SeqRecord objects to an alignment (append or extend) In-Reply-To: Message-ID: <201003051048.o25AmYYH006723@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2553 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:48 EST ------- Git branch merged to trunk as discussed on the dev mailing list, marking this enhancement as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Mar 5 05:48:36 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 05:48:36 -0500 Subject: [Biopython-dev] [Bug 2554] Creating an Alignment from a list of SeqRecord objects In-Reply-To: Message-ID: <201003051048.o25AmaIn006735@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2554 Bug 2554 depends on bug 2553, which changed state. Bug 2553 Summary: Adding SeqRecord objects to an alignment (append or extend) http://bugzilla.open-bio.org/show_bug.cgi?id=2553 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 05:48:50 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 05:48:50 -0500 Subject: [Biopython-dev] [Bug 2554] Creating an Alignment from a list of SeqRecord objects In-Reply-To: Message-ID: <201003051048.o25AmoWN006761@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2554 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:48 EST ------- Git branch merged to trunk as discussed on the dev mailing list, marking this enhancement as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Mar 5 05:50:45 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 05:50:45 -0500 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: Message-ID: <201003051050.o25Aojkg006835@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2905 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Short read alignment format |Short read alignment format | |SAM / BAM ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:50 EST ------- Updating summary to include SAM and BAM keywords. 
See also recent mailing list discussions such as this thread: http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007397.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Mar 5 06:40:05 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 06:40:05 -0500 Subject: [Biopython-dev] [Bug 3010] Bio.KDTree is leaking memory In-Reply-To: Message-ID: <201003051140.o25Be532008197@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3010 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 06:40 EST ------- I suspect any memory leak is within KDTree.c function KDTree_set_data. Looking at this I wondered how the memory allocated by KDTree_add_point gets freed. The following *might* help, but even if I am right, this is at best only a partial fix:

diff --git a/Bio/KDTree/KDTree.c b/Bio/KDTree/KDTree.c
index d074f26..07cdc1f 100644
--- a/Bio/KDTree/KDTree.c
+++ b/Bio/KDTree/KDTree.c
@@ -621,9 +621,14 @@ int KDTree_set_data(struct KDTree* tree, float *coords, long
 tree->_radius_list = NULL;
 }
 tree->_count=0;
+ if (tree->_data_point_list) {
+ free(tree->_data_point_list);
+ tree->_data_point_list = NULL;
+ tree->_data_point_list_size = 0;
+ }
 /* keep pointer to coords to delete it */
 tree->_coords=coords;

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From p.j.a.cock at googlemail.com Wed Mar 10 09:30:57 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 10 Mar 2010 14:30:57 +0000 Subject: [Biopython-dev] Biopython & Google Summer of Code 2010 (GSoc) Message-ID: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com> Dear Biopythoneers, The Open Bioinformatics Foundation (the Bio* umbrella organisation) is preparing an application for the 2010 Google Summer of Code (GSoC). http://code.google.com/soc/ If you are interested in becoming a mentor for a Biopython related project, you can join us in the application. If you are a student and are interested in a project (or would like to propose one), please take a look at these pages: http://www.open-bio.org/wiki/Google_Summer_of_Code http://biopython.org/wiki/Google_Summer_of_Code Regards, Brad & Peter From biopython at maubp.freeserve.co.uk Thu Mar 11 06:21:50 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 11 Mar 2010 11:21:50 +0000 Subject: [Biopython-dev] Bio.Phylo.Applications? Message-ID: <320fb6e01003110321u6ac77a89uce77306d332e675c@mail.gmail.com> Hi Eric et al, We have started a collection of command line tool wrappers for multiple sequence alignments under Bio.Align.Applications, so I was thinking about where to put wrappers for phylogenetic tree command line tools. How does Bio.Phylo.Applications sound (following the same structure as the Bio.Align.Applications module). The kind of things I am thinking about include: QuickTree (neighbour joining, NJ) http://www.sanger.ac.uk/resources/software/quicktree/ QuickJoin (NJ) http://www.daimi.au.dk/~mailund/quick-join.html RaxML (maximum likelihood, ML), http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm [We should talk to Biopython contributor Frank Kauff as he uses this with Python] And so on. Plus pointers in the documentation to the EMBOSS module for PHYLIP tools. 
Peter From biopython at maubp.freeserve.co.uk Thu Mar 11 06:30:04 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 11 Mar 2010 11:30:04 +0000 Subject: [Biopython-dev] Adding format method to phylo tree object? Message-ID: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com> Hi Eric (et al), Are you familiar with the format method of the SeqRecord and alignment objects (plus the __format__ method which does the same thing, aiming to work nicely with the Python 2.6 built-in function format)? This allows the user to turn their data into a string in a specified output format. Internally the method calls Bio.SeqIO.write (or AlignIO) with a StringIO handle. Do you think it would make sense to have this for the tree objects in Bio.Phylo, allowing easy access to the object as a Newick tree format etc? For people using IPython, the __pretty__ method looks related. I know the Bio.Nexus tree has a "pretty print" method which might be exposed like this. I wonder if this convention will become more widespread? http://ipython.scipy.org/doc/stable/html/api/generated/IPython.external.pretty.html Peter From p.j.a.cock at googlemail.com Thu Mar 11 10:34:07 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 11 Mar 2010 15:34:07 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 Message-ID: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> Hi all, It is probably time to start getting ready for Biopython 1.54, perhaps aiming to release within about a month's time? This means not landing any major additions to the trunk for now (keep things like GFF and Geography on branches for now). Other than finishing up any documentation for new stuff (especially the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon, are there any important issues we should address before the release?
Regards, Peter From tiagoantao at gmail.com Thu Mar 11 10:42:21 2010 From: tiagoantao at gmail.com (Tiago Antão) Date: Thu, 11 Mar 2010 15:42:21 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> Message-ID: <6d941f121003110742p524a2d86wbe111ccf880d8bb@mail.gmail.com> On Thu, Mar 11, 2010 at 3:34 PM, Peter Cock wrote: > Other than finishing up any documentation for new stuff (especially > the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon, > are there any important issues we should address before the release? I think I will be able to commit my code around the 20th. Currently I need to address the issue of supporting thousands of markers in the genepop parser as people do complain about that (like a couple of times a month or so, not more). -- "Heavier than air flying machines are impossible" Lord Kelvin, President, Royal Society, c. 1895 From andrea at biocomp.unibo.it Thu Mar 11 12:11:00 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Thu, 11 Mar 2010 18:11:00 +0100 (CET) Subject: [Biopython-dev] Planning for Biopython 1.54 Message-ID: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it> What about the UniProt XML format parser? The code is functional, and was reviewed, but it would be nice to have some beta testing. The only remaining "issue" is where to save the comment fields. The current implementation works with the BioSQL schema, and stores most of the data in the comment fields.
Andrea From p.j.a.cock at googlemail.com Thu Mar 11 12:31:08 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 11 Mar 2010 17:31:08 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it> References: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it> Message-ID: <320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com> On Thu, Mar 11, 2010 at 5:11 PM, Andrea Pierleoni wrote: > What about the Uniprot XML format parser? > The code is functional, and was reviewd, but it would be nice to have some > beta testing. > The only remaining "issue" is where to save the comment fields. > The actual implementation will work for biosql schema, and store most > of the data in the comment fields. > > Andrea Hi Andrea, Your UnitProt XML parser was one of the things I thought we should delay until after getting Biopython 1.54 out the door, but I would expect it to be included in Biopython 1.55. There are at least two remaining issues, (1) where to save the comment fields, and (2) what to call the format in SeqIO. Both of these should ideally be run by BioPerl and EMBOSS on the openbio-l mailing list to ensure the OBF projects which use simple strings for file formats are consistent. Would you like me to start a discussion there regarding the format name? e.g. Should it be "uniprot", "uniprot-xml", or maybe even "unitprotxml". Personally, "uniprot" seems fine provided this is going to be the primary file format for UniProt records in the short to medium term. Also I don't think any of the current Biopython developers have sat down to review the code. As the Bio.SeqIO maintainer, I will do this, but right now I think getting Biopython 1.54 out should be prioritised. From a very quick look just now, the recent merging of the SFF support to the trunk will require a few tweaks in test_SeqIO.py (e.g. an empty file is not valid for SFF files as well as the UniProt XML). 
Also including a UniProt XML file in test_BioSQL_SeqIO.py would be worthwhile. Regards, Peter From andrea at biocomp.unibo.it Thu Mar 11 12:43:13 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Thu, 11 Mar 2010 18:43:13 +0100 (CET) Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com> References: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it> <320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com> Message-ID: <4ee0d56a0ed98ff87b2dcf00b2c0d6e8.squirrel@lipid.biocomp.unibo.it> > > Hi Andrea, > > Your UnitProt XML parser was one of the things I thought we should > delay until after getting Biopython 1.54 out the door, but I would > expect it to be included in Biopython 1.55. > > There are at least two remaining issues, (1) where to save the comment > fields, and (2) what to call the format in SeqIO. Both of these should > ideally be run by BioPerl and EMBOSS on the openbio-l mailing list to > ensure the OBF projects which use simple strings for file formats are > consistent. Would you like me to start a discussion there regarding > the format name? e.g. Should it be "uniprot", "uniprot-xml", or maybe > even "unitprotxml". Personally, "uniprot" seems fine provided this is > going to be the primary file format for UniProt records in the short > to medium term. > Of course you are free to open a discussion. I used 'uniprot' for sake of simplicity, but then I noticed that the format is called 'uniprotxml' in EBI REST web services. A common name will easier for everybody. > Also I don't think any of the current Biopython developers have sat > down to review the code. The code was reviewed by Mauro Amico, I don't know if he is one of the "current Biopython developers", anyhow any additional review is welcome. > As the Bio.SeqIO maintainer, I will do this, > but right now I think getting Biopython 1.54 out should be > prioritised. 
From a very quick look just now, the recent merging of > the SFF support to the trunk will require a few tweaks in > test_SeqIO.py (e.g. an empty file is not valid for SFF files as well > as the UniProt XML). Also including a UniProt XML file in > test_BioSQL_SeqIO.py would be worthwhile. > Mauro also added some unit testing that should be useful for this. Let me know if you need any help/info. Bests, Andrea From p.j.a.cock at googlemail.com Thu Mar 11 12:49:50 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 11 Mar 2010 17:49:50 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <4ee0d56a0ed98ff87b2dcf00b2c0d6e8.squirrel@lipid.biocomp.unibo.it> References: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it> <320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com> <4ee0d56a0ed98ff87b2dcf00b2c0d6e8.squirrel@lipid.biocomp.unibo.it> Message-ID: <320fb6e01003110949v206a1868g6360002198a41ddd@mail.gmail.com> On Thu, Mar 11, 2010 at 5:43 PM, Andrea Pierleoni wrote: > >> >> Hi Andrea, >> >> Your UnitProt XML parser was one of the things I thought we should >> delay until after getting Biopython 1.54 out the door, but I would >> expect it to be included in Biopython 1.55. >> >> There are at least two remaining issues, (1) where to save the comment >> fields, and (2) what to call the format in SeqIO. Both of these should >> ideally be run by BioPerl and EMBOSS on the openbio-l mailing list to >> ensure the OBF projects which use simple strings for file formats are >> consistent. Would you like me to start a discussion there regarding >> the format name? e.g. Should it be "uniprot", "uniprot-xml", or maybe >> even "unitprotxml". Personally, "uniprot" seems fine provided this is >> going to be the primary file format for UniProt records in the short >> to medium term. > > Of course you are free to open a discussion. 
I used 'uniprot' for sake of > simplicity, but then I noticed that the format is called 'uniprotxml' in > EBI REST web services. A common name will easier for everybody. In that case, given the EBI REST convention, uniprotxml may be wise. >> Also I don't think any of the current Biopython developers have sat >> down to review the code. > > The code was reviewed by Mauro Amico, I don't know if he is one of the > "current Biopython developers", anyhow any additional review is welcome. I don't recall Mauro Amico contributing to Biopython in the past, but as you say, the more eyes on the code the better :) Peter From eric.talevich at gmail.com Thu Mar 11 17:54:38 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 11 Mar 2010 17:54:38 -0500 Subject: [Biopython-dev] Adding format method to phylo tree object? In-Reply-To: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com> References: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com> Message-ID: <3f6baf361003111454l6d1f0409pcb732e006a8b8f67@mail.gmail.com> On Thu, Mar 11, 2010 at 6:30 AM, Peter wrote: > Hi Eric (et al), > > Are you familiar with the format method of the SeqRecord and alignment > object (plus the __format__ method which does the same thing aiming to > work nicely with the Python 2.6 built in function format)? This allows > the user to turn their data into a string in a specified output > format. Internally the method calls Bio.SeqIO.write (or AlignIO) with > a StringIO handle. > > Do you think it would it make sense to have this for the tree objects > in Bio.Phylo, allowing easy access to the object as a Newick tree > format etc? > Sure, I could do that. It makes a lot of sense for Newick trees, and could be useful with the XML formats for debugging. > > For people using IPython, the __pretty__ method looks related. I know > the Bio.Nexus tree has a "prity print" method which might be exposed > like this. I wonder if this convention will become more widespread? 
> > http://ipython.scipy.org/doc/stable/html/api/generated/IPython.external.pretty.html > I didn't know about that. I also have a pretty_print method in Bio.Phylo which does something much different from the Bio.Nexus printer -- the Nexus one looks more like it's more useful for debugging the Tree object's internal structure in terms of references, so (highly biased judgment) I'm inclined to use the code from Bio.Phylo._utils.pretty_print to implement __pretty__ for IPython. But I'll play with this IPython feature to see how it's supposed to behave in general. -Eric From eric.talevich at gmail.com Thu Mar 11 18:03:59 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 11 Mar 2010 18:03:59 -0500 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> Message-ID: <3f6baf361003111503m4656258av721852264516818f@mail.gmail.com> On Thu, Mar 11, 2010 at 10:34 AM, Peter Cock wrote: > Hi all, > > It is probably time to starting getting ready for Biopython 1.54, > perhaps aiming to release within about a months time? > > This means not landing any major additions to the trunk for now (keep > things like GFF and Geography on branches for now). > > Other than finishing up any documentation for new stuff (especially > the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon, > are there any important issues we should address before the release? > Is it all right to leave the documentation for Bio.Phylo on the wiki for now, or should I try to add something to the main tutorial? 
-Eric From p.j.a.cock at googlemail.com Thu Mar 11 18:18:18 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 11 Mar 2010 23:18:18 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <3f6baf361003111503m4656258av721852264516818f@mail.gmail.com> References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> <3f6baf361003111503m4656258av721852264516818f@mail.gmail.com> Message-ID: <320fb6e01003111518o3f50b95bw6b2446611fbb9bf5@mail.gmail.com> On Thu, Mar 11, 2010 at 11:03 PM, Eric Talevich wrote: >> Other than finishing up any documentation for new stuff (especially >> the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon, >> are there any important issues we should address before the release? > > Is it all right to leave the documentation for Bio.Phylo on the wiki > for now, or should I try to add something to the main tutorial? I would like at least a short section in the tutorial mentioning the new module with a link to the wiki. That way people just browsing the tutorial to get an idea of what Biopython covers will be made aware of it. In the long term I think the module deserves a chapter (which can be based on the wiki text). Are you familiar with LaTeX? (The mark up language the tutorial is written in). Also, I think it would be great to have a post on the news server (which we can link to in the release announcement) talking about what Bio.Phylo adds (and thank GSoC and NESCent etc). A little advertising ;) How does that sound? Regards, Peter From biopython at maubp.freeserve.co.uk Thu Mar 11 18:23:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 11 Mar 2010 23:23:54 +0000 Subject: [Biopython-dev] Adding format method to phylo tree object? 
In-Reply-To: <3f6baf361003111454l6d1f0409pcb732e006a8b8f67@mail.gmail.com> References: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com> <3f6baf361003111454l6d1f0409pcb732e006a8b8f67@mail.gmail.com> Message-ID: <320fb6e01003111523r4fe5f4c7va9f77e089385ba0c@mail.gmail.com> On Thu, Mar 11, 2010 at 10:54 PM, Eric Talevich wrote: > On Thu, Mar 11, 2010 at 6:30 AM, Peter wrote: > >> Hi Eric (et al), >> >> Are you familiar with the format method of the SeqRecord and alignment >> object (plus the __format__ method which does the same thing aiming to >> work nicely with the Python 2.6 built in function format)? This allows >> the user to turn their data into a string in a specified output >> format. Internally the method calls Bio.SeqIO.write (or AlignIO) with >> a StringIO handle. >> >> Do you think it would it make sense to have this for the tree objects >> in Bio.Phylo, allowing easy access to the object as a Newick tree >> format etc? >> > > Sure, I could do that. It makes a lot of sense for Newick trees, and could > be useful with the XML formats for debugging. > Great. >> For people using IPython, the __pretty__ method looks related. I know >> the Bio.Nexus tree has a "pretty print" method which might be exposed >> like this. I wonder if this convention will become more widespread? >> >> http://ipython.scipy.org/doc/stable/html/api/generated/IPython.external.pretty.html >> > > I didn't know about that. I only read about it recently myself - it may not be worth doing. (I'm not trying to invent work here *grin*, just looking for things we can polish before your code gets its first proper release.) 
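The format idea described above (hand a writer function an in-memory handle, then return what it wrote) can be sketched as follows. This is only an illustration under stated assumptions: ToyTree, write_newick and the _WRITERS table are hypothetical names, the tree is a nested tuple, and none of this is the Bio.Phylo API. It uses io.StringIO from current Python rather than the StringIO module of the Python 2 era.

```python
from io import StringIO


def write_newick(tree, handle):
    """Hypothetical writer: serialise a nested-tuple tree to a file-like handle."""
    def render(node):
        if isinstance(node, tuple):
            return "(" + ",".join(render(child) for child in node) + ")"
        return str(node)  # a leaf is just its label

    handle.write(render(tree.root) + ";\n")
    return 1  # number of trees written


# Maps a lower-case format name to the function that writes it.
_WRITERS = {"newick": write_newick}


class ToyTree(object):
    def __init__(self, root):
        self.root = root

    def format(self, format_name):
        """Return the tree as a string in the requested file format."""
        try:
            writer = _WRITERS[format_name]
        except KeyError:
            raise ValueError("Unknown format: %r" % format_name)
        handle = StringIO()  # in-memory stand-in for an open file
        writer(self, handle)
        return handle.getvalue()

    def __format__(self, format_spec):
        """Let the built-in format(tree, 'newick') reuse the same code path."""
        if not format_spec:
            return str(self)
        return self.format(format_spec)


tree = ToyTree((("A", "B"), "C"))
print(tree.format("newick"))
print(format(tree, "newick"))
```

Routing everything through the writer means the string output cannot drift from what would be written to a real file.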
Thanks, Peter From lpritc at scri.ac.uk Fri Mar 12 03:18:09 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Fri, 12 Mar 2010 08:18:09 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> Message-ID: On 11/03/2010 Thursday, March 11, 15:34, "Peter Cock" wrote: > Other than finishing up any documentation for new stuff (especially > the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon, > are there any important issues we should address before the release? There are those updates to ePrimer3/PrimerSearch EMBOSS interaction (that you'll need for that differential primer script, BTW...) Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From biopython at maubp.freeserve.co.uk Fri Mar 12 08:22:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 12 Mar 2010 13:22:55 +0000 Subject: [Biopython-dev] Daily builds of the Tutorial (PDF and HTML) Message-ID: <320fb6e01003120522q22377f52nc0769ceb4e3add13@mail.gmail.com> Hi all, Back in November I set up a simple pair of cron jobs to update the code snapshot on http://biopython.open-bio.org/SRC/biopython/ every hour: http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007002.html I've just added another job which takes the latest Tutorial.tex file and compiles it with pdflatex (already installed) and hevea (installed from source under my user account) to make the PDF and HTML files. These are then copied to the webserver and published as: http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf These are currently updated once a day (at 2:40am which shouldn't be too busy whichever USA timezone the server uses). Assuming I got my crontab settings right - in the short term I'll keep an eye on it to check ;) In comparison the "official" versions at the following URLs are generally updated only for releases: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf I know that not everyone has latex or hevea installed (installing hevea from source is a bit of a hassle even on Linux), and further more proof reading the raw markup in Tutorial.tex isn't that easy.
So, the point of all this effort is now anyone can help proofread the latest version of the tutorial - this should also be of use to those users/contributors actually running the latest code from git rather than the official releases. Regards, Peter From biopython at maubp.freeserve.co.uk Fri Mar 12 08:32:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 12 Mar 2010 13:32:32 +0000 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com> References: <200911250945.20870.jblanca@btc.upv.es> <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com> <200911251220.53881.jblanca@btc.upv.es> <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com> <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com> <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com> <3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com> <320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com> <320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com> Message-ID: <320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com> Hi all, I'd like to proceed as outlined below for Biopython 1.54, i.e. don't change the current Seq equality but add a warning that we plan to change it. Should we have a discussion on the main list first? Peter On Mon, Feb 22, 2010 at 2:48 PM, Peter wrote: > Hi all, > > I've just got back from Japan - Brad and I were fortunate to be > able to attend the DBCLS BioHackathon 2010 held in Tokyo, > http://hackathon3.dbcls.jp/ > > As Brad already mentioned in passing, we also managed to have > dinner one evening with Michiel, and had an informal chat about > Biopython plans. Expect a few more emails on other topics to > follow. > > One of the short term aims we agreed on was to press ahead > with the Seq equality changes outlined on this thread late last > year. 
Mailing list archive link:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007021.html
>
> To recap, the agreed best behaviour was to make Seq equality
> act like string equality, but to raise a Python warning when
> incompatible alphabets are compared (e.g. DNA to Protein).
> This also applies to all the other comparison operators:
> not equal, less than, greater than, less than or equal, and
> greater than or equal.
>
> This is my outline plan for the change:
>
> For Biopython up to 1.53, Seq class uses object equality,
> seq1==seq2 acts as id(seq1)==id(seq2)
>
> For Biopython 1.54 (and perhaps a few more releases),
> the Seq classes will still use object equality but will trigger
> a warning suggesting explicit use of id(seq1)==id(seq2)
> or str(seq1)==str(seq2) as appropriate.
>
> For Biopython 1.xx (maybe 1.55 or 1.56?) the Seq classes
> will switch to using string equality (with an alphabet aware
> warning for comparing DNA to RNA etc), but will also trigger
> a warning that this is a change from previous releases, and
> suggest in the short term the continued explicit use of either
> id(seq1)==id(seq2) for object identity or str(seq1)==str(seq2)
> for string identity.
>
> For Biopython 1.yy (maybe 1.57?) the Seq classes will
> use string equality (with an alphabet aware warning for
> comparing DNA to RNA etc), without any warning about
> this being a change from historic behaviour.
>
> These warning messages could also point at a wiki page,
> and we'd need a FAQ entry in the tutorial as well. The
> aim of this slightly drawn out switch is to try and make
> sure all users are aware of the change, even if they
> only update their copy of Biopython every few releases.
>
> Does that all sound sensible? If so, we should probably
> have an announcement on the main mailing list, in case
> there are any other views.
> > Other more complex options include a flag for switching > between the modes - but that complexity doesn't seem > such a good idea to me. All my own code and most of > the unit tests use str(seq1)==str(seq2) explicitly anyway. > The only exception is some of the genetic algorithm unit > tests which do seem to want explicit object identity. > > Regards, > > Peter > From kellrott at gmail.com Fri Mar 12 13:00:45 2010 From: kellrott at gmail.com (Kyle) Date: Fri, 12 Mar 2010 10:00:45 -0800 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> Message-ID: > > > It is probably time to starting getting ready for Biopython 1.54, > perhaps aiming to release within about a months time? > > This means not landing any major additions to the trunk for now (keep > things like GFF and Geography on branches for now). > I think the zxJDBC support (Jython MySQL for BioSQL) was almost done. I don't think it counts as a major addition. I think to finish it off, we just needed to finalize the driver names. For post 1.54 stuff, I have some HMMER3, Pfam, and GO parsing code (Chris Lasher has a GO fork as well). But I need some community feedback to fill in the interface holes. Kyle From p.j.a.cock at googlemail.com Fri Mar 12 13:09:39 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 12 Mar 2010 18:09:39 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> Message-ID: <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com> On Fri, Mar 12, 2010 at 6:00 PM, Kyle wrote: >> >> It is probably time to starting getting ready for Biopython 1.54, >> perhaps aiming to release within about a months time? >> >> This means not landing any major additions to the trunk for now (keep >> things like GFF and Geography on branches for now). 
>
> I think the zxJDBC support (Jython MySQL for BioSQL) was almost done. I
> don't think it counts as a major addition. I think to finish it off, we
> just needed to finalize the driver names.

Oh yeah - I confess I'd forgotten about that. Has there been any news
on the Jython front about SQLite support?

> For post 1.54 stuff, I have some HMMER3, Pfam, and GO parsing code (Chris
> Lasher has a GO fork as well). But I need some community feedback to fill in
> the interface holes.
> Kyle

Lots of exciting stuff to come then :)

Peter

From kellrott at gmail.com Fri Mar 12 13:28:45 2010
From: kellrott at gmail.com (Kyle)
Date: Fri, 12 Mar 2010 10:28:45 -0800
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
 <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
Message-ID:

>
>
> Oh yeah - I confess I'd forgotten about that. Has there been any news
> on the Jython front about SQLite support?
>

There is no official support, but you can always work through existing Java
packages (
http://old.nabble.com/SQLite-%2B-JDBC-%2B-Jython.-Example-td13322270.html ).

Kyle

From eric.talevich at gmail.com Fri Mar 12 14:14:51 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 12 Mar 2010 14:14:51 -0500
Subject: [Biopython-dev] Bio.Phylo.Applications?
In-Reply-To: <320fb6e01003110321u6ac77a89uce77306d332e675c@mail.gmail.com>
References: <320fb6e01003110321u6ac77a89uce77306d332e675c@mail.gmail.com>
Message-ID: <3f6baf361003121114v36b8a311i5b4dc9cee27961c2@mail.gmail.com>

On Thu, Mar 11, 2010 at 6:21 AM, Peter wrote:

> Hi Eric et al,
>
> We have started a collection of command line tool wrappers for
> multiple sequence alignments under Bio.Align.Applications, so I was
> thinking about where to put wrappers for phylogenetic tree command
> line tools.
How does Bio.Phylo.Applications sound (following the same > structure as the Bio.Align.Applications module). > Sounds great to me! I don't have any code that would go there yet, but feel free to add the directory and any new code you have. -Eric From bugzilla-daemon at portal.open-bio.org Fri Mar 12 16:57:53 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 12 Mar 2010 16:57:53 -0500 Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry? In-Reply-To: Message-ID: <201003122157.o2CLvrtP008861@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3000 ------- Comment #4 from mmokrejs at ribosome.natur.cuni.cz 2010-03-12 16:57 EST ------- Hi Peter, I finally got back to this. Thank your for all your work. I would be glad if one could use the accession without the trailing ".1", etc for get_raw() and get(). I think just any version of the record should be returned, and maybe a list if there were multiple versions of the same. >>> print data.get_raw("BC035166") Traceback (most recent call last): File "", line 1, in File "Bio/SeqIO/_index.py", line 280, in get_raw handle.seek(dict.__getitem__(self, key)) KeyError: 'BC035166' >>> Similarly, if I loop over the entries I have to do: >>> mylist = ['ACC1', 'ACC2', 'ACC3'] >>> sequences = [] >>> for acc in data.keys(): ... if data.get(acc).id.split('.')[0] in mylist: ... 
        sequences.append(data.get(acc))

Oh no, this is not what I wanted, in full:

from Bio import SeqIO
data = SeqIO.index("full.gb", "gb")
mylist = ['AC11111.1', 'AC2222.2', 'AC3333.3']
sequences = []
for acc in mylist:
    if acc in map(lambda x: x.split('.')[0], data.keys()):
        print "Found %s" % acc
        if data.get(acc + '.1'):
            sequences.append(data.get(acc + '.1'))
        else:
            if data.get(acc + '.2'):
                sequences.append(data.get(acc + '.2'))
            else:
                sequences.append(data.get(acc + '.3'))
    else:
        print "Missing %s" % acc
output_handle = open("filtered.gb", "w")
SeqIO.write(sequences, output_handle, "genbank")

There was already a discussion on the user mailing list, I do not think forcing
uppercase letters for genbank files is a good idea. Just stick with what was
supplied. Myself, I use mixed typically to emphasize ORFs, but sometimes in
lower-case low-quality regions. Anyway, I provided original NCBI-web GenBank
file of an EST and the DNA sequence was in lowercase, biopython returned
uppercase. In turn, diff(1) command returns too many changed lines,
unnecessarily. I suggest giving users an opportunity to specify on input parsing
whether to keep mixed-case/lower-case or force uppercase. Also, protein
sequences I have often seen in lower-case, which is ugly to my eyes, btw.

Finally, the remaining differences are here (probably the first is in bug
#2578):

--- /tmp/orig.gb        2010-03-12 21:09:24.000000000 +0100
+++ /tmp/new.gb 2010-03-12 21:09:38.000000000 +0100
@@ -1,4 +1,4 @@
-LOCUS       CR603932                1625 bp    mRNA    linear   HTC 16-OCT-2008
+LOCUS       CR603932                1625 bp    DNA              HTC 16-OCT-2008
 DEFINITION  full-length cDNA clone CS0DK007YH24 of HeLa cells Cot 25-normalized
             of Homo sapiens (human).
 ACCESSION   CR603932
@@ -29,39 +29,39 @@
             division of Invitrogen.
FEATURES Location/Qualifiers source 1..1625 - /organism="Homo sapiens" /mol_type="mRNA" - /db_xref="taxon:9606" /clone="CS0DK007YH24" + /db_xref="taxon:9606" /tissue_type="HeLa cells Cot 25-normalized" /plasmid="pCMVSPORT_6" + /organism="Homo sapiens" ORIGIN Thanks for all you work on this, it is a great service. ;-) Next, I will try to filter by .features['tissue_type'] but sadly will have to search for the very same string through COMMENT string as well. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Mar 12 17:05:39 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 12 Mar 2010 17:05:39 -0500 Subject: [Biopython-dev] [Bug 3026] New: Bio.SeqIO.InsdcIO._split_multi_line(): Your description cannot be broken into nice lines! Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3026 Summary: Bio.SeqIO.InsdcIO._split_multi_line(): Your description cannot be broken into nice lines! 
Product: Biopython Version: 1.53 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: mmokrejs at ribosome.natur.cuni.cz Traceback (most recent call last): File "/home/mmokrejs/bin/filter-accessions.py", line 22, in SeqIO.write(sequences, output_handle, "genbank") File "/usr/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", line 363, in write count = writer_class(handle).write_file(sequences) File "/usr/lib/python2.6/site-packages/Bio/SeqIO/Interfaces.py", line 271, in write_file count = self.write_records(records) File "/usr/lib/python2.6/site-packages/Bio/SeqIO/Interfaces.py", line 256, in write_records self.write_record(record) File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 691, in write_record self._write_comment(record) File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 579, in _write_comment self._write_multi_line("", line) File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 335, in _write_multi_line lines = self._split_multi_line(text, max_len) File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 279, in _split_multi_line "Your description cannot be broken into nice lines!" AssertionError: Your description cannot be broken into nice lines! Please fix the message so it prints out the accession/version number. ;-) LOCUS BF378302 501 bp mRNA linear EST 27-NOV-2000 DEFINITION CM0-UM0001-060300-270-g07 UM0001 Homo sapiens cDNA, mRNA sequence. ACCESSION BF378302 VERSION BF378302.1 GI:11367336 KEYWORDS EST. SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 501) AUTHORS Dias Neto,E., Garcia Correa,R., Verjovski-Almeida,S., Briones,M.R., Nagai,M.A., da Silva,W. 
Jr., Zago,M.A., Bordin,S., Costa,F.F., Goldman,G.H., Carvalho,A.F., Matsukuma,A., Baia,G.S., Simpson,D.H., Brunstein,A., deOliveira,P.S., Bucher,P., Jongeneel,C.V., O'Hare ,M.J., Soares,F., Brentani,R.R., Reis,L.F., de Souza,S.J. and Simpson,A.J. TITLE Shotgun sequencing of the human transcriptome with ORF expressed sequence tags JOURNAL Proc. Natl. Acad. Sci. U.S.A. 97 (7), 3491-3496 (2000) PUBMED 10737800 COMMENT Contact: Simpson A.J.G. Laboratory of Cancer Genetics Ludwig Institute for Cancer Research Rua Prof. Antonio Prudente 109, 4 andar, 01509-010, Sao Paulo-SP, Brazil Tel: +55-11-2704922 Fax: +55-11-2707001 Email: asimpson at ludwig.org.br This sequence was derived from the FAPESP/LICR Human Cancer Genome Project. This entry can be seen in the following URL (http://www.ludwig.org.br/scripts/gethtml2.pl?t1=CM0&t2=CM0-UM0001-060300-270-g07&t3=2000-03-06&t4=1 ) Seq primer: puc 18 forward. FEATURES Location/Qualifiers [cut] I have few more example slike this from some dbEST data, I think all from a same project, though. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sat Mar 13 08:43:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 13 Mar 2010 13:43:53 +0000 Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry? In-Reply-To: <4B9AA432.2050407@fold.natur.cuni.cz> References: <201002021840.o12Ie88i015906@portal.open-bio.org> <4B6995D0.3030405@fold.natur.cuni.cz> <320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com> <4B9AA432.2050407@fold.natur.cuni.cz> Message-ID: <320fb6e01003130543p10a43e32kdb073879dc406e11@mail.gmail.com> On Fri, Mar 12, 2010 at 8:29 PM, Martin MOKREJ? wrote: > > I finally got back to this. Thank your for all your work. 
> I would be glad if one could use the accession without
> the trailing ".1", etc for get_raw() and get(). I think
> just any version of the record should be returned,
> and maybe a list if there were multiple versions of
> the same.

This is just a quick reply to answer this part of your email.

It would be unwise to try and be clever with the key matching - in this
case yes, for GenBank files we know what the names mean,
accession.version - but this is not true in general.

In this case the answer for your needs would be to use the Bio.SeqIO.index
optional argument to specify the keys. e.g. something like this:

from Bio import SeqIO

def strip_version(identifier):
    return identifier.rsplit(".",1)[0]

my_dict = SeqIO.index(filename, "gb", key_function=strip_version)

That way all the keys will have just the accession without the version
(assuming there are no clashes which I think will raise an error).

Peter

From sbassi at clubdelarazon.org Sun Mar 14 03:16:25 2010
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Sun, 14 Mar 2010 04:16:25 -0300
Subject: [Biopython-dev] Biopython & Google Summer of Code 2010 (GSoc)
In-Reply-To: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com>
References: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com>
Message-ID: <9e2f512b1003132316j55a95ca7u6a87191ff877898d@mail.gmail.com>

On Wed, Mar 10, 2010 at 11:30 AM, Peter Cock wrote:
> related project, you can join us in the application. If you are
> a student and are interested in a project (or would like to
> propose one), please take a look at these pages:
> http://www.open-bio.org/wiki/Google_Summer_of_Code
> http://biopython.org/wiki/Google_Summer_of_Code

Regarding GSoC call in Biopython, I found the PDB-Tidy task pretty
interesting. I will study the proposal and write back to you.
I am working currently with microRNA but I use Bio.PDB a lot to help
my wife who does antigen structure prediction and works with modeller,
PyMol and PDB files.
A tool like the proposed PDB-Tidy could come in handy.

Best,
SB.

From biopython at maubp.freeserve.co.uk Sun Mar 14 09:50:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 14 Mar 2010 13:50:52 +0000
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry?
In-Reply-To: <4B9BB1F6.9000505@fold.natur.cuni.cz>
References: <201002021840.o12Ie88i015906@portal.open-bio.org>
 <4B6995D0.3030405@fold.natur.cuni.cz>
 <320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com>
 <4B9AA432.2050407@fold.natur.cuni.cz>
 <320fb6e01003130543p10a43e32kdb073879dc406e11@mail.gmail.com>
 <4B9BB1F6.9000505@fold.natur.cuni.cz>
Message-ID: <320fb6e01003140650o54a8eea2h66ea87abc42c754@mail.gmail.com>

On Sat, Mar 13, 2010 at 3:40 PM, Martin MOKREJŠ wrote:
>
> Thanks Peter,
>  yes, that is what I already ended-up with in a more awkward way. ;-)
> But basically I have the same workaround.
> Best,
> M.

So does using the Bio.SeqIO.index() function's key_function argument
seem like a good solution to your key problem?

Peter

From biopython at maubp.freeserve.co.uk Sun Mar 14 16:30:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 14 Mar 2010 20:30:45 +0000
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry?
In-Reply-To: <4B9AA432.2050407@fold.natur.cuni.cz>
References: <201002021840.o12Ie88i015906@portal.open-bio.org>
 <4B6995D0.3030405@fold.natur.cuni.cz>
 <320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com>
 <4B9AA432.2050407@fold.natur.cuni.cz>
Message-ID: <320fb6e01003141330t199bbbcfm6bf32c5357b9fd77@mail.gmail.com>

On Fri, Mar 12, 2010 at 8:29 PM, Martin MOKREJŠ wrote:
>
> Finally, the remaining differences are here (probably the first is in bug #2578):
>
> --- /tmp/orig.gb        2010-03-12 21:09:24.000000000 +0100
> +++ /tmp/new.gb 2010-03-12 21:09:38.000000000 +0100
> @@ -1,4 +1,4 @@
> -LOCUS       CR603932                1625 bp    mRNA    linear
HTC 16-OCT-2008
> +LOCUS       CR603932                1625 bp    DNA              HTC 16-OCT-2008
>  DEFINITION  full-length cDNA clone CS0DK007YH24 of HeLa cells Cot 25-normalized
>              of Homo sapiens (human).
>  ACCESSION   CR603932
> @@ -29,39 +29,39 @@
>              division of Invitrogen.
>  FEATURES             Location/Qualifiers
>       source          1..1625
> -                     /organism="Homo sapiens"
>                       /mol_type="mRNA"
> -                     /db_xref="taxon:9606"
>                       /clone="CS0DK007YH24"
> +                     /db_xref="taxon:9606"
>                       /tissue_type="HeLa cells Cot 25-normalized"
>                       /plasmid="pCMVSPORT_6"
> +                     /organism="Homo sapiens"
>  ORIGIN
>

Yes, the LOCUS line issue would be part of Bug 2578.

As to the order of the feature qualifiers, these are stored in a Python
dictionary which does not preserve the order. I personally don't think
the order of the qualifiers is important and thus don't care that it can
change like this.

Assuming the NCBI have a defined sort order for the qualifiers (I'm not
aware of one), then we could sort the feature qualifiers on output. Another
option would be to store the qualifiers in an ordered-dictionary. Or just
leave it as it is ;)

Peter

From bugzilla-daemon at portal.open-bio.org Sun Mar 14 19:31:51 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 14 Mar 2010 19:31:51 -0400
Subject: [Biopython-dev] [Bug 3026] Bio.SeqIO.InsdcIO._split_multi_line(): Your description cannot be broken into nice lines!
In-Reply-To:
Message-ID: <201003142331.o2ENVp3v015452@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3026

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-14 19:31 EST -------
I just used the Entrez web interface, and it comes with the URL split already
to meet the 80 column limit.
Also doing it via the API: >>> from Bio import Entrez >>> data = Entrez.efetch("nucest", id="BF378302", rettype="gb").read() >>> print data[1095:1800] PUBMED 10737800 COMMENT Contact: Simpson A.J.G. Laboratory of Cancer Genetics Ludwig Institute for Cancer Research Rua Prof. Antonio Prudente 109, 4 andar, 01509-010, Sao Paulo-SP, Brazil Tel: +55-11-2704922 Fax: +55-11-2707001 Email: asimpson at ludwig.org.br This sequence was derived from the FAPESP/LICR Human Cancer Genome Project. This entry can be seen in the following URL (http://www.ludwig.org.br/scripts/gethtml2.pl?t1=CM0&t2=CM0-UM0001- 060300-270-g07&t3=2000-03-06&t4=1) Seq primer: puc 18 forward. FEATURES Location/Qualifiers In this particular case, it looks like splitting the string on a hyphen would be a reasonable option (i.e. copy what the NCBI seems to be doing). Did you just cut and paste it from the NCBI's HTML page where it does seem to be shown with the URL is shown unbroken (giving a line more than 80 characters)? Or can we download a "broken" GenBank file from the NCBI somewhere (maybe the FTP site)? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Mar 14 20:44:59 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 14 Mar 2010 20:44:59 -0400 Subject: [Biopython-dev] [Bug 3026] Bio.SeqIO.InsdcIO._split_multi_line(): Your description cannot be broken into nice lines! In-Reply-To: Message-ID: <201003150044.o2F0ixwP017517@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3026 ------- Comment #2 from mmokrejs at ribosome.natur.cuni.cz 2010-03-14 20:44 EST ------- Most I copy&pasted from their web, so this is probably the case. 
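As an illustration of the hyphen fallback suggested in Comment #1, a line splitter could prefer spaces and only break after a hyphen when a single word is too long. The sketch below is a stand-alone example under that assumption: `split_multi_line` is a hypothetical function written for this note, not the actual Bio.SeqIO.InsdcIO._split_multi_line code, and it raises ValueError where the real code fails an assertion.

```python
def split_multi_line(text, max_len):
    """Split text into lines of at most max_len characters.

    Breaks at a space where possible. If a word is longer than
    max_len, falls back to breaking just after a hyphen (the hyphen
    stays at the end of the line). Raises ValueError if neither a
    space nor a hyphen is available within the limit.
    """
    lines = []
    text = text.strip()
    while len(text) > max_len:
        # Last space that still leaves a line of at most max_len chars.
        cut = text.rfind(" ", 0, max_len + 1)
        if cut > 0:
            lines.append(text[:cut].rstrip())
            text = text[cut:].lstrip()
            continue
        # No usable space: try the last hyphen inside the limit.
        cut = text.rfind("-", 0, max_len)
        if cut >= 0:
            lines.append(text[:cut + 1])
            text = text[cut + 1:]
            continue
        raise ValueError("Cannot split %r into lines of at most %i "
                         "characters" % (text, max_len))
    if text:
        lines.append(text)
    return lines
```

For the URL in this bug report, a limit of 68 characters (80 columns minus the 12 column GenBank indent) would give a break after one of the hyphens in the clone name, much like the NCBI output shown in Comment #1.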
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Mar 15 11:40:20 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 15 Mar 2010 15:40:20 +0000 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? Message-ID: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> Hi all (especially Eric), As recently discussed SeqIO and AlignIO will now take filenames as well as handles. This matches the existing behaviour of Bio.Nexus, Eric's Bio.Phylo, and several big 3rd partly libraries like ReportLab. http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007352.html I've updated most of the tutorial to take advantage of this, and quickly got used less typing when working at the Python prompt. It does make things easier, and I probably should have conceded this earlier. It made me wonder about relaxing another restraint of the SeqIO and AlignIO write functions - they currently insist on a list or iterator of records or alignments. Giving a single object raises an error, but we could handle this unambiguously. Amusingly Eric just updated Bio.Phylo to match this strict behaviour - one reason I sat down and wrote this email. So, should we continue to insist on: record = SeqRecord(...) SeqIO.write([record], filename, format) or should be relax a little more and allow this too?: record = SeqRecord(...) SeqIO.write(record, filename, format) For SeqIO and AlignIO we can do a simple isinstance check for a SeqRecord or alignment object - there isn't really a problem with ambiguity here. Probably also try for Phylo? What's the general consensus on the dev list? 
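To make the proposal concrete, the relaxed behaviour could be as small as one isinstance check at the top of the write function. The sketch below is only an illustration: `SeqRecord` and `write` are minimal hypothetical stand-ins (FASTA only, handles only) so that it runs without Biopython, and they are not the real Bio.SeqIO code.

```python
from io import StringIO


class SeqRecord:
    """Hypothetical minimal stand-in for Bio.SeqRecord.SeqRecord."""

    def __init__(self, seq, id):
        self.seq = seq
        self.id = id


def write(sequences, handle, fmt):
    """Write one record, or an iterable of records, to a handle.

    Returns the number of records written.
    """
    if fmt != "fasta":
        raise ValueError("Unsupported format: %r" % fmt)
    if isinstance(sequences, SeqRecord):
        # A single record is unambiguous, so treat it as a list of one.
        sequences = [sequences]
    count = 0
    for record in sequences:
        handle.write(">%s\n%s\n" % (record.id, record.seq))
        count += 1
    return count
```

Both `write(record, handle, "fasta")` and `write([record], handle, "fasta")` then give the same output, so existing code that passes a list or iterator keeps working unchanged.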
Peter From updates at feedmyinbox.com Tue Mar 16 02:16:42 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Tue, 16 Mar 2010 02:16:42 -0400 Subject: [Biopython-dev] 3/16 BioStar - Biopython Questions Message-ID: <0ef45bfc18dff2fe627af99c71f3b412@74.63.51.88> ================================================== 1. Compare two protein sequences using local BLAST ================================================== March 15, 2010 at 7:24 PM Hi, I have been given a task to compare the all the protein sequences of a strain of campylobacter with a strain of E.coli. I would like to do this locally using Biopython and the inbuilt Blast tools. However, I'm stuck on how to program this and what tools I should be using. If anybody could point me in the right direction, I would be thankful! Cheers http://biostar.stackexchange.com/questions/302/compare-two-protein-sequences-using-local-blast -------------------------------------------------- =========================================================== Source: http://biostar.stackexchange.com/questions/tagged/biopython This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com. 230 Franklin Road Suite 814 Franklin, TN 37064 From mhampton at d.umn.edu Tue Mar 16 12:01:41 2010 From: mhampton at d.umn.edu (Marshall Hampton) Date: Tue, 16 Mar 2010 11:01:41 -0500 (CDT) Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? In-Reply-To: References: Message-ID: I'm strongly in favor of such relaxations. It would also be convenient if SeqRecords had a write function. -Marshall Hampton >So, should we continue to insist on: > >record = SeqRecord(...) 
>SeqIO.write([record], filename, format) >or should be relax a little more and allow this too?: >record = SeqRecord(...) >SeqIO.write(record, filename, format) >For SeqIO and AlignIO we can do a simple isinstance check >for a SeqRecord or alignment object - there isn't really a >problem with ambiguity here. Probably also try for Phylo? >What's the general consensus on the dev list? From rodrigo_faccioli at uol.com.br Tue Mar 16 15:24:58 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Tue, 16 Mar 2010 16:24:58 -0300 Subject: [Biopython-dev] Primary Sequence of all protein (help) Message-ID: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com> Hi all, I want to know the primary sequence (fasta file) of all proteins. In other the words, I would like a database which contain the fasta files of all proteins. I'm a computer scientist and I don't know how hard it is. However, we have worked with SEQRES section of PDB files and BioPython. So, we want to work with fasta files and BioPython to check our results. I searched the NCBI web-site where I found a lot of databases. I confess I'm lost with them :) Sorry if my email is a basic question. But, I'm very lost. 
Thanks in advance,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218

From biopython at maubp.freeserve.co.uk Tue Mar 16 15:42:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Mar 2010 19:42:43 +0000
Subject: [Biopython-dev] Primary Sequence of all protein (help)
In-Reply-To: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
Message-ID: <320fb6e01003161242w2f111653y6dceb9853412c649@mail.gmail.com>

On Tue, Mar 16, 2010 at 7:24 PM, Rodrigo Faccioli wrote:
>
> Hi all,
>
> I want to know the primary sequence (fasta file) of all proteins. In other
> the words, I would like a database which contain the fasta files of all
> proteins.
>
> I'm a computer scientist and I don't know how hard it is. However, we have
> worked with SEQRES section of PDB files and BioPython. So, we want to work
> with fasta files and BioPython to check our results.

A single FASTA file of all known proteins would be enormous. Even the
non-redundant ("nr") dataset used by the NCBI for their hugely popular
BLAST search is pretty big.

It sounds like maybe all you need is a FASTA file containing all the
sequences with structures in the PDB - something you may be
able to download directly from the PDB FTP site.
Peter From rodrigo_faccioli at uol.com.br Tue Mar 16 21:01:01 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Tue, 16 Mar 2010 22:01:01 -0300 Subject: [Biopython-dev] Primary Sequence of all protein (help) In-Reply-To: <320fb6e01003161242w2f111653y6dceb9853412c649@mail.gmail.com> References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com> <320fb6e01003161242w2f111653y6dceb9853412c649@mail.gmail.com> Message-ID: <3715adb71003161801n294d15ccwb3a52f6d5ea83c23@mail.gmail.com> Peter, Thank you for your reply. Actually, we want to store the sequence of the fasta files in a relational database which has been developed by my research group. So, we have developed some calculations with primary sequence of proteins.
We did not download the PDB database because our computation of protein properties are based on their primary sequence. Therefore, our idea is to work with the primary sequence of all proteins. My understanding is the PDB database contains the proteins which is known their tearty structure. The others are in other database. Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Tue, Mar 16, 2010 at 4:42 PM, Peter wrote: > On Tue, Mar 16, 2010 at 7:24 PM, Rodrigo Faccioli > wrote: > > > > Hi all, > > > > I want to know the primary sequence (fasta file) of all proteins. In > other > > the words, I would like a database which contain the fasta files of all > > proteins. > > > > I'm a computer scientist and I don't know how hard it is. However, we > have > > worked with SEQRES section of PDB files and BioPython. So, we want to > work > > with fasta files and BioPython to check our results. > > A single FASTA file of all know proteins would be enormous. Even the > non-redundant ("nr") dataset used by the NCBI for their hugely popular > BLAST search is pretty big. > > It sounds like many all you need is a FASTA file containing all the > sequences with structures in the PDB - something you may be > able to download directly from the PDB FTP site. 
> > Peter > From bugzilla-daemon at portal.open-bio.org Wed Mar 17 07:33:09 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 17 Mar 2010 07:33:09 -0400 Subject: [Biopython-dev] [Bug 2966] Primer3Commandline does not use EMBOSS 6.1.0 arguments In-Reply-To: Message-ID: <201003171133.o2HBX9kO004765@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2966 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-17 07:33 EST ------- (In reply to comment #2) > I also found an issue with the PrimerSearchCommandline. The command line > options -sequences and -primers do not appear to be used in EMBOSS6.1.0, having > been replaced by -seqall and -infile, respectively. I changed the options > accordingly, and the modified files are available at > http://github.com/widdowquinn/biopython/tree/emboss-branch. I've merged that fix on the master, http://github.com/biopython/biopython/commit/39708be130eb771eacccf96eed3e8ce0a44ea4f0 Will have a look at eprimer3 next. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
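As an illustration of the option rename described in comment #2 above, here is a minimal sketch that builds a primersearch argument list with either set of names. Only the four option names (-sequences, -primers, -seqall, -infile) come from the bug report; the helper name, the old_style switch and the file names are hypothetical, and this is not how Bio.Emboss constructs its command lines.

```python
def primersearch_args(sequences, primers, old_style=False):
    """Return an argv-style list for EMBOSS primersearch (illustrative only).

    old_style=True uses the pre-6.1.0 option names from the bug report;
    the default uses the names reported to work with EMBOSS 6.1.0.
    """
    if old_style:
        seq_opt, primer_opt = "-sequences", "-primers"
    else:
        seq_opt, primer_opt = "-seqall", "-infile"
    return ["primersearch", seq_opt, sequences, primer_opt, primers]


# Hypothetical file names, just to show the resulting argument list.
print(primersearch_args("genome.fasta", "primers.txt"))
```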
From bugzilla-daemon at portal.open-bio.org Wed Mar 17 08:13:46 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 17 Mar 2010 08:13:46 -0400 Subject: [Biopython-dev] [Bug 2966] Primer3Commandline does not use EMBOSS 6.1.0 arguments In-Reply-To: Message-ID: <201003171213.o2HCDkf4006396@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2966 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-17 08:13 EST ------- (In reply to comment #1) > I have made changes to Primer3Commandline that involve adding the EMBOSS 6.1.0 > arguments, and docstrings to each argument. I have also added doctests. > > The proposed code can be inspected at my GitHub repository: > > http://github.com/widdowquinn/biopython/commit/9c0643e333b0cafb4e356426fb4902e0e9d2385c > Cherry picked to merge to the trunk. Marking bug as fixed - thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sbassi at clubdelarazon.org Wed Mar 17 14:32:17 2010 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 17 Mar 2010 15:32:17 -0300 Subject: [Biopython-dev] Primary Sequence of all protein (help) In-Reply-To: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com> References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com> Message-ID: <9e2f512b1003171132k680a52e5ob052e84d89e68c0b@mail.gmail.com> On Tue, Mar 16, 2010 at 4:24 PM, Rodrigo Faccioli wrote: > I want to know the primary sequence (fasta file) of all proteins. In other > the words, I would like a database which contain the fasta files of all > proteins. You don't need Biopython to get this file. 
Just download NR database and use "fastacmd", a program found in the blast suite. BLAST FTP is not working for me right now so I can't give you the exact URL to download, but look from here: ftp://ftp.ncbi.nih.gov/blast/ Here is how to use fastacmd to retrieve sequences from NR database: http://pwet.fr/man/linux/commandes/fastacmd From kellrott at gmail.com Wed Mar 17 18:14:25 2010 From: kellrott at gmail.com (Kyle) Date: Wed, 17 Mar 2010 15:14:25 -0700 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com> References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com> Message-ID: > > > > I think the zxJDBC support (Jython MySQL for BioSQL) was almost done. I > > don't think it counts as a major addition. I think to finish it off, we > > just needed to finalize the driver names. > > Oh yeah - I confess I'd forgotten about that. > I've posted a fork from the master branch on github ( http://github.com/kellrott/biopython/tree/zxjdbc ) with only the changes related to zxjdbc. I've added two driver requests, "MySQL" and "PostgreSQL", that select the appropriate driver based on the platform. Kyle From tiagoantao at gmail.com Wed Mar 17 18:28:36 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 17 Mar 2010 22:28:36 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <6d941f121003110742p524a2d86wbe111ccf880d8bb@mail.gmail.com> References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> <6d941f121003110742p524a2d86wbe111ccf880d8bb@mail.gmail.com> Message-ID: <6d941f121003171528p1e60fbb8q419485f6c6f171c2@mail.gmail.com> Hi, 2010/3/11 Tiago Antão : > I think I will be able to commit my code around the 20th.
Currently I > need to address the issue of supporting thousands of markers in the > genepop parser as people do complain about that (like a couple of > times a month or so, not more). I am going to add this and support for haploid markers also. I would like to ask, when its done (soon!) a code review on the part of support of thousands of markers (The parser will change in nature, and files will be maintained open during the whole existence of the parser object). No need for domain knowledge, just comments on code quality. Also some help with merging with the main trunk would be appreciated, as I don' t use github for my stuff (bazaar fan here ;) ). Thanks, Tiago -- "Heavier than air flying machines are impossible" Lord Kelvin, President, Royal Society, c. 1895 From rodrigo_faccioli at uol.com.br Wed Mar 17 20:59:49 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Wed, 17 Mar 2010 21:59:49 -0300 Subject: [Biopython-dev] Primary Sequence of all protein (help) In-Reply-To: <9e2f512b1003171132k680a52e5ob052e84d89e68c0b@mail.gmail.com> References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com> <9e2f512b1003171132k680a52e5ob052e84d89e68c0b@mail.gmail.com> Message-ID: <3715adb71003171759p7107f2cbod85339a5335374d5@mail.gmail.com> Sebastian, Thank you for your reply. I'll study it. -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Wed, Mar 17, 2010 at 3:32 PM, Sebastian Bassi wrote: > On Tue, Mar 16, 2010 at 4:24 PM, Rodrigo Faccioli > wrote: > > I want to know the primary sequence (fasta file) of all proteins. In > other > > the words, I would like a database which contain the fasta files of all > > proteins. 
> > You don't need Biopython to get this file. Just download NR database y > use "fastacmd", a program found in the blast suite. > BLAST FTP is not working for me right now so I can't give you the > exact URL to download, but look from here: > ftp://ftp.ncbi.nih.gov/blast/ > Here is how to use fastacmd to retrieve sequences from NR database: > http://pwet.fr/man/linux/commandes/fastacmd > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Thu Mar 18 07:19:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Mar 2010 11:19:03 +0000 Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54 Message-ID: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> On Wed, Mar 17, 2010 at 10:14 PM, Kyle wrote: >> >>> I think the zxJDBC support (Jython MySQL for BioSQL) was almost >>> done. I don't think it counts as a major addition. I think to finish it off, >>> we just needed to finalize the driver names. >> >> Oh yeah - I confess I'd forgotten about that. > > I've posted a fork from the master branch on github ( > http://github.com/kellrott/biopython/tree/zxjdbc ) with only the changes > related to zxjdbc. I've added two driver requests, "MySQL" and > "PostgreSQL", that select the appropriate driver based on the platform. > Kyle Hmm. I think it might be cleaner to have a new optional argument like database back end (MySQL, PostgreSQL, SQLite3). If the back end is specified without the driver (which would be the encouraged usage) then we will pick the driver at run time (based on if in Jython, or for PostgreSQL which drivers are installed). Existing scripts can continue to specify the driver directly (but we can eventually deprecate this?).
Peter From anaryin at gmail.com Thu Mar 18 07:33:05 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 18 Mar 2010 04:33:05 -0700 Subject: [Biopython-dev] Small Typo in PDBParser Message-ID: Hello All, There's a small typo in the Bio.PDB PDBParser module. Line 159: "PDBContructionError" should be "PDBConstructionError" So that I learn, how do I submit a bug and a patch to the project, such as in this case? Best! João [...] Rodrigues @ http://stanford.edu/~joaor/ From anaryin at gmail.com Thu Mar 18 07:36:15 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 18 Mar 2010 04:36:15 -0700 Subject: [Biopython-dev] Small Typo in PDBParser In-Reply-To: References: Message-ID: Well, actually, PDBConstructionError is not even defined.. It should likely be PDBConstructionException. João [...] Rodrigues @ http://stanford.edu/~joaor/ On Thu, Mar 18, 2010 at 4:33 AM, João Rodrigues wrote: > Hello All, > > There's a small typo in the Bio.PDB PDBParser module. Line 159: > > "PDBContructionError" should be "PDBConstructionError" > > So that I learn, how do I submit a bug and a patch to the project, such as > in this case? > > Best! > > João [...] Rodrigues > @ http://stanford.edu/~joaor/ > > From biopython at maubp.freeserve.co.uk Thu Mar 18 08:02:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Mar 2010 12:02:32 +0000 Subject: [Biopython-dev] Small Typo in PDBParser In-Reply-To: References: Message-ID: <320fb6e01003180502w573baa84od9924f4b8e2486c8@mail.gmail.com> On Thu, Mar 18, 2010 at 11:33 AM, João Rodrigues wrote: > Hello All, > > There's a small typo in the Bio.PDB PDBParser module. Line 159: > > "PDBContructionError" should be "PDBConstructionError" > > So that I learn, how do I submit a bug and a patch to the project, such as > in this case? > > Best!
Hi João, If you've found a bug in a release, and worked out how to fix it, one of the first steps would be to try the latest code from the repository to see if the bug is still there (and if your fix would need changing). In this case the problem has already been fixed (February 23, 2010), see: http://github.com/biopython/biopython/commits/master/Bio/PDB/PDBParser.py For a simple change like this, you can use the command line tool diff to generate a patch file (see "man diff" for details), which you can then attach to a bug report on our bugzilla. The basic diff usage would be: diff original_file.py fixed_file.py > bug_fix.patch For more complex changes, I would suggest you look at learning git. If you make a change locally you can get a patch file with this: git diff > bug_fix.patch Or, publish the fix to a public copy of the repository (e.g. on github). See also http://biopython.org/wiki/GitUsage I hope that helps, and that you'll have more patches for us in future :) Peter From biopython at maubp.freeserve.co.uk Thu Mar 18 15:01:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Mar 2010 19:01:32 +0000 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? In-Reply-To: <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> Message-ID: <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> On Mon, Mar 15, 2010 at 5:26 PM, Eric Talevich wrote: > On Mon, Mar 15, 2010 at 11:40 AM, Peter wrote: >> >> So, should we continue to insist on: >> >> record = SeqRecord(...) >> SeqIO.write([record], filename, format) >> >> or should be relax a little more and allow this too?: >> >> record = SeqRecord(...)
>> SeqIO.write(record, filename, format) >> >> For SeqIO and AlignIO we can do a simple isinstance check >> for a SeqRecord or alignment object - there isn't really a >> problem with ambiguity here. Probably also try for Phylo? >> >> What's the general consensus on the dev list? > > Sounds good to me! The code I just deleted from Bio.Phylo._io > was doing something foolish anyway (testing whether the > argument is iterable) -- now that Bio.Phylo has a single legitimate > base class, I can restore the feature with an isinstance(trees, > BaseTree.Tree) check if we have a consensus here. > > -Eric There was another +1 vote from Marshall Hampton, and no comments against (so far). Let's leave it a few days, but unless anyone speaks out in favour of the status-quo (keep the current strict check in the write function), then make the change. Peter From biopython at maubp.freeserve.co.uk Thu Mar 18 15:04:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Mar 2010 19:04:10 +0000 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com> References: <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com> <200911251220.53881.jblanca@btc.upv.es> <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com> <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com> <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com> <3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com> <320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com> <320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com> <320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com> Message-ID: <320fb6e01003181204l5902cf37yc0cf9387b74994fd@mail.gmail.com> On Fri, Mar 12, 2010 at 1:32 PM, Peter wrote: > Hi all, > > I'd like to proceed as outlined below for Biopython 1.54, > i.e. don't change the current Seq equality but add a warning > that we plan to change it. 
I've done that to Bio/Seq.py on the trunk (added two FutureWarnings and docstring explanation). Assuming this doesn't trigger any regressions, we'd need to work on the documentation (in particular the tutorial, but also perhaps a news post?) and fix the GA unit test before the release. If anyone on the dev list thinks this is a bad idea, please speak up (sooner rather than later). Thanks, Peter From kellrott at gmail.com Thu Mar 18 15:28:58 2010 From: kellrott at gmail.com (Kyle) Date: Thu, 18 Mar 2010 12:28:58 -0700 Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54 In-Reply-To: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> Message-ID: What should the parameter by called? Possibilities: 'backend', 'dbtype', ... ideas anyone? Kyle On Thu, Mar 18, 2010 at 4:19 AM, Peter wrote: > Hmm. I think it might be cleaner to have a new optional argument like > batabase back end (MySQL, PostgreSQL, SQLite3). If the back end > is specified without the driver (which would be the encouraged usage) > then we will pick the driver at run time (based on if in Jython, or for > PostgreSQL which drivers are installed). Existing scripts can continue > to specify the driver directly (but we can eventually deprecated this?). > > Peter > From biopython at maubp.freeserve.co.uk Thu Mar 18 15:34:39 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Mar 2010 19:34:39 +0000 Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54 In-Reply-To: References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> Message-ID: <320fb6e01003181234m71cc777bxaf5f29f2fbe1f21f@mail.gmail.com> On Thu, Mar 18, 2010 at 7:28 PM, Kyle wrote: > What should the parameter be called? Possibilities: > 'backend', 'dbtype', ... ideas anyone? Just database would be too vague. I quite like backend. 
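The run-time driver selection proposed in this thread could look roughly like the sketch below, using the `backend` name favoured above. This is only a sketch under stated assumptions: `pick_driver`, the candidate table and the "zxJDBC" label are made up for illustration and are not the BioSQL API; the CPython module names (MySQLdb, psycopg2, pgdb, sqlite3) are real, but the preference order is an assumption.

```python
import importlib.util
import sys

# Candidate DB-API modules per backend, most preferred first.
# The table and its ordering are assumptions made for this sketch.
CANDIDATES = {
    "mysql": ["MySQLdb"],
    "postgresql": ["psycopg2", "pgdb"],
    "sqlite3": ["sqlite3"],
}
JYTHON_DRIVER = "zxJDBC"  # placeholder label for the Jython driver


def module_available(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def pick_driver(backend=None, driver=None, jython=None,
                available=module_available):
    """Choose a driver name from a backend name (illustrative only)."""
    if driver is not None:
        # Existing scripts that name the driver directly keep working.
        return driver
    if backend is None:
        raise ValueError("Specify either backend or driver")
    key = backend.lower()
    if key not in CANDIDATES:
        raise ValueError("Unknown backend: %r" % backend)
    if jython is None:
        jython = sys.platform.startswith("java")
    if jython and key in ("mysql", "postgresql"):
        return JYTHON_DRIVER
    for name in CANDIDATES[key]:
        if available(name):
            return name
    raise ImportError("No driver installed for backend %r" % backend)


# Pretend only pgdb is installed, to show the PostgreSQL fallback.
print(pick_driver("PostgreSQL", jython=False,
                  available=lambda name: name == "pgdb"))  # pgdb
```

Passing `available` and `jython` explicitly is only there to make the choice easy to exercise without any database drivers installed.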
Peter From sbassi at clubdelarazon.org Thu Mar 18 15:39:40 2010 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Thu, 18 Mar 2010 16:39:40 -0300 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? In-Reply-To: <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> Message-ID: <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> On Thu, Mar 18, 2010 at 4:01 PM, Peter wrote: > There was another +1 vote from Marshall Hampton, and no > comments against (so far). Let's leave it a few days, but unless > anyone speaks out in favour of the status-quo (keep the > current strict check in the write function), then make the change. If we are going to change this, why not setting "fasta" as default input/output format? This would also results in less typing when processing fasta files (most of the time in my workflow at least). From bugzilla-daemon at portal.open-bio.org Thu Mar 18 17:27:48 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 18 Mar 2010 17:27:48 -0400 Subject: [Biopython-dev] [Bug 3029] New: PhyloXML.Phylogeny.is_preterminal() fails Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3029 Summary: PhyloXML.Phylogeny.is_preterminal() fails Product: Biopython Version: Not Applicable Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: joelb at lanl.gov tree.is_preterminal() raises an AttributeError "'Phylogeny' object has no attribute 'clades'" File BaseTree.py line 442. git fetch on Feb. 22. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Thu Mar 18 18:03:09 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 18 Mar 2010 22:03:09 +0000 Subject: [Biopython-dev] Google Summer of Code is *ON* for OBF projects! In-Reply-To: <4BA29706.8040606@cornell.edu> References: <4BA29706.8040606@cornell.edu> Message-ID: <320fb6e01003181503j7e3030aao7bce7ebf4d8be06@mail.gmail.com> Good news for GSoC 2010 :) ---------- Forwarded message ---------- From: Robert Buels Date: Thu, Mar 18, 2010 at 9:11 PM Subject: Google Summer of Code is *ON* for OBF projects! Hi all, Great news: Google announced today that the Open Bioinformatics Foundation has been accepted as a mentoring organization for this summer's Google Summer of Code! GSoC is a Google-sponsored student internship program for open-source projects, open to students from around the world (not just US residents). Students are paid a $5000 USD stipend to work as a developer on an open-source project for the summer. For more on GSoC, see GSoC 2010 FAQ at http://tinyurl.com/yzemdfo Student applications are due April 9, 2010 at 19:00 UTC. Students who are interested in participating should look at the OBF's GSoC page at http://open-bio.org/wiki/Google_Summer_of_Code, which lists project ideas, and who to contact about applying. For current developers on OBF projects, please consider volunteering to be a mentor if you have not already, and contribute project ideas. Just list your name and project ideas on OBF wiki and on the relevant project's GSoC wiki page. Thanks to all who helped make OBF's application to GSoC a success, and let's have a great, productive summer of code!
Rob Buels OBF GSoC 2010 Administrator From biopython at maubp.freeserve.co.uk Fri Mar 19 06:45:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Mar 2010 10:45:55 +0000 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? In-Reply-To: <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> Message-ID: <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> Hi Sebastian, On Thu, Mar 18, 2010 at 7:39 PM, Sebastian Bassi wrote: > On Thu, Mar 18, 2010 at 4:01 PM, Peter wrote: >> There was another +1 vote from Marshall Hampton, and no >> comments against (so far). Let's leave it a few days, but unless >> anyone speaks out in favour of the status-quo (keep the >> current strict check in the write function), then make the change. > > If we are going to change this, why not setting "fasta" as default > input/output format? This would also results in less typing when > processing fasta files (most of the time in my workflow at least). Give an inch and they'll take a mile ;) I agree that FASTA is likely to be the most common file format for most users, but I don't think we should make it the default. One specific reason is because the FASTA parser will allow and ignore a header comment, you will get confusing results if the file is not actually a FASTA file (typically it will parse other text files like GenBank, EMBL or FASTQ with no errors, but will return no records). I am worried that people will assume that if they don't specify the format that Biopython will determine it automatically - which it won't. [Yes, I'm talking about the read/parse functions here, but it would be odd if the write function defaulted to FASTA but they did not.] 
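The failure mode described above (a FASTA parser that skips anything before the first ">" quietly yields nothing for a non-FASTA file) can be reproduced with a toy parser. This is a deliberately simplified sketch, not Biopython's FASTA parser; `toy_fasta_records` and both sample inputs are made up for illustration.

```python
def toy_fasta_records(lines):
    """Yield (title, sequence) pairs; text before the first '>' is ignored."""
    title, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append(line.strip())
        # Otherwise we are still in the ignored "header comment" region.
    if title is not None:
        yield title, "".join(chunks)


fasta_text = ">seq1 example\nACGT\nACGT\n"
# A made-up GenBank-like snippet: no line starts with '>'.
genbank_like = "LOCUS       DEMO   8 bp    DNA\nORIGIN\n        1 acgtacgt\n//\n"

print(list(toy_fasta_records(fasta_text.splitlines())))    # [('seq1 example', 'ACGTACGT')]
print(list(toy_fasta_records(genbank_like.splitlines())))  # []
```

The second call raises no error and returns no records, which is the confusing outcome a silent FASTA default would make more likely.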
Also, could you clarify if you are in favour of relaxing the requirement that the write function takes a list/iterator of records/alignments to allow a single SeqRecord or alignment? Thanks, Peter From bugzilla-daemon at portal.open-bio.org Fri Mar 19 09:22:51 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Mar 2010 09:22:51 -0400 Subject: [Biopython-dev] [Bug 3029] PhyloXML.Phylogeny.is_preterminal() fails In-Reply-To: Message-ID: <201003191322.o2JDMpYW015069@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3029 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from eric.talevich at gmail.com 2010-03-19 09:22 EST ------- (In reply to comment #0) > tree.is_preterminal() > > raises an AttributeError > "'Phylogeny' object has no attribute 'clades'" > File BaseTree.py line 442. > > git fetch on Feb. 22. > Thanks for catching this. It's fixed on the trunk now. I also checked the rest of TreeMixin for other occurrences of the same problem (accessing self.clades directly instead of going through self.root.clades) and found none, so it shouldn't happen again. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sbassi at clubdelarazon.org Fri Mar 19 18:08:17 2010 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Fri, 19 Mar 2010 19:08:17 -0300 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? 
In-Reply-To: <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> Message-ID: <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com> On Fri, Mar 19, 2010 at 7:45 AM, Peter wrote: > Give an inch and they'll take a mile ;) In Spanish we say: Give a hand and they'll take the whole arm :) > that if they don't specify the format that Biopython will > determine it automatically - which it won't. In this respect, Python zen favours being explicit,so I see your point. > Also, could you clarify if you are in favour of relaxing the > requirement that the write function takes a list/iterator of > records/alignments to allow a single SeqRecord or alignment? Is OK for me to allow a single record instead of a iterable, this change will not break any existing code so it is OK for me. Best, SB. From bugzilla-daemon at portal.open-bio.org Sat Mar 20 02:24:39 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 20 Mar 2010 02:24:39 -0400 Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling In-Reply-To: Message-ID: <201003200624.o2K6OdCd010209@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2948 ------- Comment #1 from crosvera at gmail.com 2010-03-20 02:24 EST ------- Created an attachment (id=1463) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1463&action=view) propose patch bug2948.patch -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
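For readers following bug 2948, the sketch below compares the original and patched regular expressions from this bug report on two sample lines. The patterns are taken from the report; the sample lines are assumptions (the first is reconstructed from the 2BEG title quoted in the report, the second is a made-up continuation line), and `strip_prefix` is a hypothetical helper, not part of Bio.PDB.

```python
import re

ORIGINAL = r"\A\w+\s+\d*\s*"  # pattern before the proposed patch
PATCHED = r"\A\w+\s+\d*\s+"   # pattern after the proposed patch


def strip_prefix(pattern, line):
    """Remove the record name and optional continuation number."""
    return re.sub(pattern, "", line)


first = "TITLE     3D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS"
continuation = "TITLE    2 MORE TEXT"  # hypothetical continuation line

# The original pattern treats the leading "3" of "3D" as a continuation number.
print(strip_prefix(ORIGINAL, first))         # D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS
# Requiring whitespace after the digits keeps "3D" intact.
print(strip_prefix(PATCHED, first))          # 3D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS
# Both patterns still strip a real continuation number.
print(strip_prefix(ORIGINAL, continuation))  # MORE TEXT
print(strip_prefix(PATCHED, continuation))   # MORE TEXT
```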
From bugzilla-daemon at portal.open-bio.org Sat Mar 20 02:26:10 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 20 Mar 2010 02:26:10 -0400 Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling In-Reply-To: Message-ID: <201003200626.o2K6QAoV010279@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2948 crosvera at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |crosvera at gmail.com ------- Comment #2 from crosvera at gmail.com 2010-03-20 02:26 EST ------- Here is an example of what Paul describes: bash-4.0$ python Python 2.6.4 (r264:75706, Mar 10 2010, 15:54:34) [GCC 4.4.1 (CRUX)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.PDB import * >>> parser=PDBParser() >>> structure = parser.get_structure("2beg", "2BEG.pdb") >>> structure.header.keys() ['structure_method', 'head', 'journal', 'journal_reference', 'compound', 'keywords', 'name', 'author', 'deposition_date', 'release_date', 'source', 'resolution', 'structure_reference'] >>> structure.header['head'] 'protein fibril' >>> structure.header['name'] " d structure of alzheimer's abeta(1-42) fibrils" I made a patch, which changes the regex. From: tail=re.sub("\A\w+\s+\d*\s*","",h) To: tail=re.sub("\A\w+\s+\d*\s+","",h) The patch seems to work. The result I got is this: bash-4.0$ python Python 2.6.4 (r264:75706, Mar 10 2010, 15:54:34) [GCC 4.4.1 (CRUX)] on linux2 Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.PDB import * >>> parser=PDBParser() >>> structure = parser.get_structure("2beg", "2BEG.pdb") >>> structure.header.keys() ['structure_method', 'head', 'journal', 'journal_reference', 'compound', 'keywords', 'name', 'author', 'deposition_date', 'release_date', 'source', 'resolution', 'structure_reference'] >>> structure.header['head'] 'protein fibril' >>> structure.header['name'] " 3d structure of alzheimer's abeta(1-42) fibrils" >>> I propose this patch. (my first one). -- Carlos Ríos V. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Mar 20 02:56:53 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 20 Mar 2010 02:56:53 -0400 Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for oldest entry. In-Reply-To: Message-ID: <201003200656.o2K6urZa011050@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2949 crosvera at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |crosvera at gmail.com ------- Comment #1 from crosvera at gmail.com 2010-03-20 02:56 EST ------- (In reply to comment #0) > [...] > elif key=="REVDAT": > #Modified by Paul T. Bathen to get most recent date instead of > oldest date. > #Also added additional dict entries > if dict['release_date'] == "1909-01-08": #set in init > rr=re.search("\d\d-\w\w\w-\d\d",tail) > if rr!=None: > dict['release_date']=_format_date(_nice_case(rr.group())) > > dict['mod_number'] = hh[7:10].strip() > dict['mod_id'] = hh[23:28].strip() > dict['mod_type'] = hh[31:32].strip() The Protein Data Bank Contents Guide (Version 3.20, http://www.wwpdb.org/documentation/format32/sect2.html#REVDAT) says that modNum uses columns 8-10. modId uses columns 24-27.
And modType use the colum 32. So the last part of your code should change to: dict['mod_number'] = hh[7:9] dict['mod_id'] = hh[23:26] dict['mod_type'] = hh[31] Regards. -- Carlos Rios V. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Mar 20 19:02:16 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 20 Mar 2010 19:02:16 -0400 Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for oldest entry. In-Reply-To: Message-ID: <201003202302.o2KN2GFb006461@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2949 ------- Comment #2 from crosvera at gmail.com 2010-03-20 19:02 EST ------- Currently I got this with the actual code: bash-4.0$ python Python 2.6.4 (r264:75706, Mar 10 2010, 15:54:34) [GCC 4.4.1 (CRUX)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.PDB import * >>> parser=PDBParser() >>> structure = parser.get_structure("2beg", "../2BEG.pdb") >>> structure.header.keys() ['structure_method', 'head', 'journal', 'journal_reference', 'compound', 'keywords', 'name', 'author', 'deposition_date', 'release_date', 'source', 'resolution', 'structure_reference'] >>> structure.header['release_date'] '2005-11-22' >>> but the grep command returns this: bash-4.0$ grep REVDAT ../2BEG.pdb REVDAT 3 24-FEB-09 2BEG 1 VERSN REVDAT 2 20-DEC-05 2BEG 1 JRNL REVDAT 1 22-NOV-05 2BEG 0 So, the actual code is showing the oldest date from REVDAT. I don't know if you (the developer) are trying to say with 'release_date' if is the first version or the last. But I think, as Paul said, that should be the most current date. 
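
To make the oldest-versus-newest point concrete, here is a minimal sketch that slices REVDAT records by fixed columns and picks the most recent revision. It is not the Bio.PDB code: `make_revdat`, `parse_revdat` and `latest_revision` are hypothetical names, the two-digit-year pivot is an assumption, and continuation records are not handled. The columns are the ones discussed in this thread (modNum 8-10, modId 24-27, modType 32), with the date taken from columns 14-22.

```python
import datetime

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}


def make_revdat(mod_num, mod_date, mod_id, mod_type, detail=""):
    """Build a fixed-column REVDAT line for the demo (hypothetical helper)."""
    line = ("REVDAT " + str(mod_num).rjust(3) + "   " + mod_date + " "
            + mod_id.ljust(4) + "    " + str(mod_type) + "       " + detail)
    return line.rstrip()


def parse_revdat(line):
    """Slice one REVDAT line into a dict (0-based, end-exclusive slices)."""
    day, month, year = line[13:22].split("-")      # columns 14-22
    yy = int(year)
    # Assumption: two-digit years 70-99 mean 19xx, everything else 20xx.
    full_year = 1900 + yy if yy >= 70 else 2000 + yy
    return {
        "mod_number": int(line[7:10]),             # columns 8-10
        "mod_date": datetime.date(full_year, MONTHS[month.upper()], int(day)),
        "mod_id": line[23:27].strip(),             # columns 24-27
        "mod_type": line[31:32].strip(),           # column 32
    }


def latest_revision(lines):
    """Return the parsed REVDAT record with the most recent date."""
    revisions = [parse_revdat(l) for l in lines if l.startswith("REVDAT")]
    return max(revisions, key=lambda rev: rev["mod_date"])


# The three records shown by grep for 2BEG, rebuilt in fixed columns.
records = [
    make_revdat(3, "24-FEB-09", "2BEG", 1, "VERSN"),
    make_revdat(2, "20-DEC-05", "2BEG", 1, "JRNL"),
    make_revdat(1, "22-NOV-05", "2BEG", 0),
]
latest = latest_revision(records)
print(latest["mod_number"], latest["mod_date"])
```

Keeping every parsed record, rather than a single date, leaves the choice between first and last revision to the caller.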
By the way, in my previous comment I said that the last part of the code
pasted by Paul should be:

dict['mod_number'] = hh[7:9]
dict['mod_id'] = hh[23:26]
dict['mod_type'] = hh[31]

But it has to be:

dict['mod_number'] = hh[7:10]
dict['mod_id'] = hh[23:27]
dict['mod_type'] = hh[31]

Another thing: instead of adding 'mod_number', 'mod_id' and 'mod_type'
directly to dict, I think these keys should be inside a 'release_data' key:

dict={'name':"",
      [...]
      'release_date' : "1909-01-08",
      'release_data' : {'mod_number' : "", 'mod_id' : "", 'mod_type' : ""},
      'structure_method' : "unknown",
      [...]
     }

Please comment :)

Regards.
--
Carlos Rios V.

From k.okonechnikov at gmail.com Sun Mar 21 00:29:30 2010
From: k.okonechnikov at gmail.com (Konstantin Okonechnikov)
Date: Sun, 21 Mar 2010 10:29:30 +0600
Subject: [Biopython-dev] GSOC 2010 Tiny-PDB project
Message-ID: <1408c93c1003202129l64a4c674t8f14185845ce6e25@mail.gmail.com>

Dear BioPython developers,
my name is Konstantin. I am a first-year master's student at Novosibirsk
State University, Russia.
The subject of my bachelor's diploma work was the development of a 3D
biological macromolecular structure visualization tool for an open-source
bioinformatics project called UGENE. This work was successfully finished
about a year ago.
The task included a lot of work with the PDB format: parsing, correctness
testing etc. For testing purposes, even the whole PDB database was downloaded
and tested for simple assertions. Such stress testing revealed a lot of
problems and helped to improve the code significantly. So, one may say, I have
some experience with the PDB format :)
I used BioPython when I was studying bioinformatics basics and really liked
it. I would like to contribute to the project by improving the Bio.PDB module
and implementing a set of convenient tools to work with PDB files.
Best regards,
Konstantin

From tiagoantao at gmail.com Sun Mar 21 08:59:31 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Sun, 21 Mar 2010 12:59:31 +0000
Subject: [Biopython-dev] Changes to the main repo
Message-ID: <6d941f121003210559o506b853ci381927fed3aa836f@mail.gmail.com>

Hi,

I've made some changes in the main repository (my first changes with
github), some comments:
1. Many thanks for the GitUsage wiki page. REALLY useful.
2. That being said, if I made any mistakes, they are my own fault.
3. I've added support for big genepop files, something I tend to be
asked quite a lot
4. And support for haploid data (nobody really asked this)
5. I remember Peter sending an email about needed corrections to the
code. I am afraid I've lost that email :( . If you send it to me, I
will do them ASAP
6. New test cases and test data files
7. I might add support, in the future, to Arlequin (file format and
application). Allowing for statistics over sequences and other goodies
with sequence data.

Regards,
Tiago

--
"If you want to get laid, go to college. If you want an education, go to
the library." - Frank Zappa

From eric.talevich at gmail.com Sun Mar 21 21:54:27 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 21 Mar 2010 21:54:27 -0400
Subject: [Biopython-dev] GSoC: Refining the PDB-Tidy project idea
Message-ID: <3f6baf361003211854g41a4d358pc7fc49c156dcbb7b@mail.gmail.com>

Hi GSoC'ers,

The PDB-Tidy idea on Biopython's Summer of Code page seems to have attracted
interest from a number of highly qualified students:
http://biopython.org/wiki/Google_Summer_of_Code#PDB-Tidy:_command-line_tools_for_manipulating_PDB_files

Please, don't let this deter you from applying! Google allocates student
slots to each organization based on the number of applications received, so
if OBF receives more applications, we can accept more students.

However, I'm also concerned that I've made the project description too
general. (Or is it too specific?)
This article describes the characteristics of a well-defined GSoC project
idea:
http://en.flossmanuals.net/GSoCMentoring/SelectingProjects

In the interest of improving the opportunities for each student, I'm
suggesting that the proposals that are submitted under the PDB-Tidy theme
focus on a specific goal beyond the manipulation of PDB files. At the risk of
being "That Guy", I'll give some examples of what I mean:

(a) Improve interoperability with external tools like AutoDock or Modeller;
(b) Port some MolProbity-like functionality to Biopython;
(c) Improve interoperability and consistency between Bio.PDB and the rest of
Biopython;
(d) Write a parser for some useful format.

Also, would anyone else be interested in co-mentoring one of these projects?
It's good for a GSoC project to have a secondary mentor -- not required, but
helpful -- and I think some support from a more experienced structural
biologist would be valuable here.

Thanks & best regards,
Eric

From bugzilla-daemon at portal.open-bio.org Sun Mar 21 22:50:24 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 21 Mar 2010 22:50:24 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling
In-Reply-To:
Message-ID: <201003220250.o2M2oOoP003409@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2948

------- Comment #3 from eric.talevich at gmail.com 2010-03-21 22:50 EST -------
(In reply to comment #2)
>
> I made a patch, which change the regex.
> From: tail=re.sub("\A\w+\s+\d*\s*","",h)
> TO: tail=re.sub("\A\w+\s+\d*\s+","",h
> Seems that this patch works. The result I got is this:
>
> ...

Thanks for triaging this, Carlos.
However, I think it would be better if the code is a direct reflection of the actual PDB specification: http://www.wwpdb.org/documentation/format32/sect2.html It looks like "continuation" numbers are ignored by this code, so only the text starting in column 11 onward (hh[10:]) is ever used, also dropping leading spaces. Similarly, the key found by regexp is just the first whitespace-delimited word. Can you change your patch to use string methods instead of regular expressions? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Sun Mar 21 23:22:32 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 21 Mar 2010 23:22:32 -0400 Subject: [Biopython-dev] GSOC 2010 Tiny-PDB project In-Reply-To: <1408c93c1003202129l64a4c674t8f14185845ce6e25@mail.gmail.com> References: <1408c93c1003202129l64a4c674t8f14185845ce6e25@mail.gmail.com> Message-ID: <3f6baf361003212022m73aeff9kdedd2a949871d5b@mail.gmail.com> On Sun, Mar 21, 2010 at 12:29 AM, Konstantin Okonechnikov < k.okonechnikov at gmail.com> wrote: > Dear BioPython developers, > my name is Konstantin. I am first-year master's student at Novosibirsk > State > University, Russia. > The subject of my bachelor diploma work was development of 3D biological > macromolecular structure visualization tool for open-source bioinformatics > project called UGENE . This work was successfully > finished about a > year > ago. > The task included a lot of work with PDB format: parsing, correctness > testing etc. For testing purposes even whole PDB database was downloaded > and > tested for simple assertions. Such stress testing revealed a lot of > problems > and helped to improve code significantly. So, one may say, I have some > experience with PDB format :) > I used BioPython when I was studying bioinformatics basics and really liked > it. 
I would like to contribute to the project by improving Bio.PDB module > and implementing a set of convenient tools to work with PDB files. > Best regards, > Konstantin > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > Hi Konstantin, That's really cool. You might also be interested in a project based on this idea from another GSoC organization, NESCent: https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#Visualization_of_Protein_3D_Structure_Evolution (It's OK to apply to more than one GSoC mentoring organization as a student.) I sent an e-mail earlier today describing some possible refinements to the PDB-Tidy project; did any of those interest you? While we're at it, here's a good place to start improving Bio.PDB before Summer of Code begins: http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&bug_status=NEW&bug_status=REOPENED Feel free to e-mail or gchat me with any questions you have. Thanks, Eric From bugzilla-daemon at portal.open-bio.org Sun Mar 21 23:50:47 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 21 Mar 2010 23:50:47 -0400 Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for oldest entry. In-Reply-To: Message-ID: <201003220350.o2M3olGt004920@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2949 ------- Comment #3 from eric.talevich at gmail.com 2010-03-21 23:50 EST ------- (In reply to comment #2) > So, the actual code is showing the oldest date from REVDAT. I don't know if you > (the developer) are trying to say with 'release_date' if is the first version > or the last. But I think, as Paul said, that should be the most current date. It's probably an accidental result of repeatedly setting the same field. 
Surely the most recent revision date is at least as important as the date of
the first revision, given that the initial deposition date is recorded
separately.

I'm not the original developer, but I'd say it would be best to keep a list
or dictionary in a new "revisions" attribute, leaving release_date alone or
deprecating it in case someone is actually relying on the current behavior.
We should discuss this on biopython-dev before implementing it.

> Other thing, instead to add 'mod_number', 'mod_id', 'mod_type' directly into
> dict. I think that these keys should be inside a 'release_data' key:

That name could lead to some typo-related confusion... but yes, a
list-of-dicts or dict-of-dicts would be a nice way to store this info.

From k.okonechnikov at gmail.com Mon Mar 22 02:04:50 2010
From: k.okonechnikov at gmail.com (Konstantin Okonechnikov)
Date: Mon, 22 Mar 2010 12:04:50 +0600
Subject: [Biopython-dev] GSOC 2010 Tiny-PDB project
In-Reply-To: <3f6baf361003212022m73aeff9kdedd2a949871d5b@mail.gmail.com>
References: <1408c93c1003202129l64a4c674t8f14185845ce6e25@mail.gmail.com>
 <3f6baf361003212022m73aeff9kdedd2a949871d5b@mail.gmail.com>
Message-ID: <884d1faa1003212304vcdc86d6t4a6931adce8214fc@mail.gmail.com>

Hi Eric!

Hi Konstantin,
>
> That's really cool. You might also be interested in a project based on this
> idea from another GSoC organization, NESCent:
>
> https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#Visualization_of_Protein_3D_Structure_Evolution
>
>

This project looks really nice, though it requires some proficiency in Java.
Actually, I don't like the idea of applying to many organizations; I would
rather choose one project and concentrate my efforts on it.

> (It's OK to apply to more than one GSoC mentoring organization as a
> student.)
>
> I sent an e-mail earlier today describing some possible refinements to the
> PDB-Tidy project; did any of those interest you?
>
>

I need some time to investigate them. There is one question so far: what
"useful formats" do you have in mind? AFAIK, there are not so many data
formats for storing 3D structures. I know about PDB XML and the NCBI data
format. The latter is an ASN.1 variant; it is used for different kinds of
data (sequences etc.).

> While we're at it, here's a good place to start improving Bio.PDB before
> Summer of Code begins:
>
>
> http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&bug_status=NEW&bug_status=REOPENED
>
>

OK, I will look at it. Fixing a couple of bugs is a good way to get
acquainted with the code :)

>
>
> Feel free to e-mail or gchat me with any questions you have.
>
> Thanks,
> Eric
>

p.s. Sorry for the misprint in the letter subject; I hope that the project
won't be that small :)

--
Best regards,
Okonechnikov Konstantin

From p.j.a.cock at googlemail.com Mon Mar 22 05:19:38 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 22 Mar 2010 09:19:38 +0000
Subject: [Biopython-dev] pylint, was: Changes to the main repo
Message-ID: <320fb6e01003220219u5f2020e1v6826a4e331ceb96d@mail.gmail.com>

2010/3/21 Tiago Antão:
> Hi,
>
> I've made some changes in the main repository (my first changes with
> github), some comments:
> 1. Many thanks for the GitUsage wiki page. REALLY useful.
> 2. That being said, if I did any mistakes, they are my own fault.
> 3. I've added support for big genepop files, something I tend do be
> asked quite a lot
> 4. And support for haploid data (nobody really asked this)
> 5. I remember Peter sending an email about needed corrections to the
> code. I am afraid I've lost that email :( . If you send it to me, I
> will do them ASAP
> 6. New test cases and test data files
> 7. I might add support, in the future, to Arlequin (file format and
> application).
Allowing for statistics over sequences and other goodies > with sequence data. > > Regards, > Tiago Hi Tiago, That sounds good. Regarding point 5, running pylint over the code reported some possible errors in Bio.PopGen. Have a look at this - they are all undefined variable issues: http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007354.html I just ran it again on the latest code, and the line numbers have changed a tiny bit but that is all: $ pylint --disable-msg-cat=CRW --include-ids=y --disable-msg=E1101,E1103,E0102 -r n Bio.PopGen No config file found, using default configuration ************* Module Bio.PopGen.Async E0602: 78:Async.get_result: Undefined variable 'done' E0602: 79:Async.get_result: Undefined variable 'done' ************* Module Bio.PopGen.GenePop E0602:166:Record.split_in_pops: Undefined variable 'GenePop' E0602:183:Record.split_in_loci: Undefined variable 'GenePop' ************* Module Bio.PopGen.GenePop.Controller E0602: 41:_read_allele_freq_table: Undefined variable 'self' E0602:133:_hw_func: Undefined variable 'self' E0602:393:GenePopController.test_pop_hw_prob: Undefined variable 'ext' E0602:458:GenePopController.test_ld.ld_pop_func: Undefined variable 'currrent_pop' ************* Module Bio.PopGen.GenePop.FileParser E1120:219:FileRecord.remove_locus_by_name: No value passed for parameter 'fw' in function call ************* Module Bio.PopGen.SimCoal.Cache E0602: 79:SimCoalCache.getSimulation: Undefined variable 'Config' E0602: 88: Undefined variable 'Cache' ************* Module Bio.PopGen.SimCoal.Controller E0602: 47:SimCoalController.run_simcoal: Undefined variable 'Config' Peter From krother at rubor.de Mon Mar 22 11:27:30 2010 From: krother at rubor.de (Kristian Rother) Date: Mon, 22 Mar 2010 16:27:30 +0100 Subject: [Biopython-dev] RNA secondary structure parsing Message-ID: Hi, Took me a while to do some basic clean up in my code - finally managed to contribute something. 
I just added a branch 'rna' with basic RNA 2D format parsers (Vienna, CT,
BPSEQ), and a module that can extract 2D structure elements (helices, loops,
bulges, junctions).

http://github.com/krother/biopython/tree/rna

It's all in:

Bio.RNA
Tests.test_RNA_*

Any kind of feedback is welcome.

Best Regards,
Kristian

From biopython at maubp.freeserve.co.uk Mon Mar 22 12:08:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 16:08:27 +0000
Subject: [Biopython-dev] Daily builds of the Tutorial (PDF and HTML)
In-Reply-To: <320fb6e01003120522q22377f52nc0769ceb4e3add13@mail.gmail.com>
References: <320fb6e01003120522q22377f52nc0769ceb4e3add13@mail.gmail.com>
Message-ID: <320fb6e01003220908s264401e6s3dab9aa7f2a3f87b@mail.gmail.com>

On Fri, Mar 12, 2010 at 1:22 PM, Peter wrote:
> Hi all,
>
> Back in November I set up a simple pair of cron jobs to update the code
> snapshot on http://biopython.open-bio.org/SRC/biopython/ every hour:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007002.html
>
> I've just added another job which takes the latest Tutorial.tex file and
> compiles it with pdflatex (already installed) and hevea (installed from
> source under my user account) to make the PDF and HTML files.
> These are then copied to the webserver and published as:
>
> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html
> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf
>
> These are currently updated once a day (at 2:40am which shouldn't
> be too busy whichever USA timezone the server uses). Assuming
> I got my crontab settings right - in the short term I'll keep an eye on
> it to check ;)

It looks like the PDF is working (which happens first in the script),
but not the HTML. I'll look into this...
Peter From biopython at maubp.freeserve.co.uk Mon Mar 22 12:21:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Mar 2010 16:21:16 +0000 Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo Message-ID: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> Hi Eric, I've got a real example of a simple tree manipulation that I would like to handle via your new module. I have a (small) unrooted tree from a gene family in Newick format, which by construction includes an out-group (the same gene but from a more distant organism). I would like to reroot the tree so that this out-group is at the basal level. Can Bio.Phylo help me here? Thanks, Peter P.S. Why is Bio.Phylo.trim_str a public method? From bugzilla-daemon at portal.open-bio.org Mon Mar 22 12:28:24 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 22 Mar 2010 12:28:24 -0400 Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for oldest entry. In-Reply-To: Message-ID: <201003221628.o2MGSOqs027450@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2949 krother at rubor.de changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |krother at rubor.de -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Mar 22 12:39:01 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 22 Mar 2010 12:39:01 -0400 Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for oldest entry. 
In-Reply-To:
Message-ID: <201003221639.o2MGd1mk027807@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2949

------- Comment #4 from krother at rubor.de 2010-03-22 12:39 EST -------
I originally contributed the parse_pdb_header module a long time ago. I think
one or two people have added some changes in the meantime.

I like Eric's idea of adding a separate 'revisions' attribute.

Once the code does what is needed, I think it's time for me to do some
cleanup work.

From bugzilla-daemon at portal.open-bio.org Mon Mar 22 13:24:32 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 22 Mar 2010 13:24:32 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling
In-Reply-To:
Message-ID: <201003221724.o2MHOWY7029072@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2948

crosvera at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1463 is|0                           |1
           obsolete|                            |

------- Comment #4 from crosvera at gmail.com 2010-03-22 13:24 EST -------
Created an attachment (id=1464)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1464&action=view)
new proposed patch for bug2948
From bugzilla-daemon at portal.open-bio.org Mon Mar 22 13:25:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 22 Mar 2010 13:25:14 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling
In-Reply-To:
Message-ID: <201003221725.o2MHPEbj029122@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2948

------- Comment #5 from crosvera at gmail.com 2010-03-22 13:25 EST -------
OK, I made another patch; this one replaces some of the regexes with string
slicing.

What I got:

crosvera at cabernet:~/programming/biopython/Bio$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from PDB import *
>>> parser = PDBParser()
>>> structure = parser.get_structure("2beg", "PDB/2BEG.pdb")
>>> structure.header['name']
" 3d structure of alzheimer's abeta(1-42) fibrils"
>>>

Patch file: 0001-modified-parse_pdb_header.py.patch
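
As a rough illustration of the string-slicing approach suggested in comment #3, the sketch below splits a header line by fixed columns and compares it with the original regex. It is not the attached patch: `split_header_line` is a hypothetical name, and the TITLE text is simply the 2BEG title from the sessions above, upper-cased and padded so that it starts in column 11.

```python
import re


def split_header_line(line):
    """Return (record name, continuation, text) using fixed columns."""
    key = line[:6].strip()               # columns 1-6: record name
    continuation = line[8:10].strip()    # columns 9-10: continuation number
    text = line[10:].strip()             # column 11 onward: the payload
    return key, continuation, text


# First TITLE line: blank continuation field, text starts in column 11.
title = "TITLE".ljust(10) + "3D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS"

# The original regex treats the leading "3" of "3D" as a continuation number.
old_tail = re.sub(r"\A\w+\s+\d*\s*", "", title)

key, continuation, new_tail = split_header_line(title)

# A made-up continuation line, to show the number is kept apart from the text.
second = "TITLE".ljust(8) + " 2" + " SECOND LINE"
key2, continuation2, tail2 = split_header_line(second)

print(repr(old_tail))
print(repr(new_tail))
```

Slicing by column cannot confuse a title that begins with a digit with a continuation number, which is the failure the regex version shows here.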
From bugzilla-daemon at portal.open-bio.org Mon Mar 22 14:17:34 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 22 Mar 2010 14:17:34 -0400 Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling In-Reply-To: Message-ID: <201003221817.o2MIHYpm030968@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2948 crosvera at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1464 is|0 |1 obsolete| | ------- Comment #6 from crosvera at gmail.com 2010-03-22 14:17 EST ------- Created an attachment (id=1465) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1465&action=view) new proposed patch for bug2948 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Mon Mar 22 16:28:21 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 22 Mar 2010 16:28:21 -0400 Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo In-Reply-To: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> Message-ID: <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com> On Mon, Mar 22, 2010 at 12:21 PM, Peter wrote: > Hi Eric, > > I've got a real example of a simple tree manipulation that I would like to > handle via your new module. I have a (small) unrooted tree from a gene > family in Newick format, which by construction includes an out-group > (the same gene but from a more distant organism). I would like to reroot > the tree so that this out-group is at the basal level. > > Can Bio.Phylo help me here? > In Bio.Nexus, would you normally have handled this with the method root_with_outgroup? 
I intend to port that method to Bio.Phylo once I understand it, but the
existing code has been kind of hard for me to figure out.

Let's address it here, then. Is there a detailed plain-text description
somewhere of how this operation should work in general?

Given that the outgroup taxon is already somewhere inside the existing
unrooted tree, I would guess something like:

0. Load the tree:

    tree = Phylo.read('example.nwk', 'newick')

1. Locate the outgroup in the tree, remembering the lineage for future
operations:

    outgroup_path = tree.get_path({'name': 'OUTGROUP'})
    # or however you can identify it

2. Tracing the outgroup lineage backwards, reattach the subclades to new
locations under a new root (or the old root, repurposed). Picturing the
unrooted tree as an arbitrarily rooted tree, invert everything above the
outgroup in the tree, but keep the descendants of those clades as they are:

    # Untested, hardly even thought through, danger danger!
    root = tree.root
    old_clades = root.clades  # needed?
    root.clades = []
    new_parent = root
    last = outgroup_path[-1]
    for parent in outgroup_path[-2::-1]:
        siblings = [kid for kid in parent.clades if kid != last]
        new_parent.clades = # TODO
        new_parent = last
        last = parent
    tree.rooted = True

Bio.Phylo does no internal bookkeeping, so it's OK (i.e. sometimes required)
to shuffle clades directly.

Is this what "root with outgroup" is supposed to do? What functionality in
Bio.Nexus.Trees.root_with_outgroup is missing here? And, do you happen to
have an example of a tree with edge cases that I could use for testing?

> P.S. Why is Bio.Phylo.trim_str a public method?
>

Oops, I'll fix it.
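
The rerooting steps guessed at above could be sketched on a toy tree like this. Nothing here is the Bio.Phylo or Bio.Nexus API: `Clade`, `find_path`, `root_with_outgroup` and `to_newick` are hypothetical stand-ins that borrow only the `clades` attribute name. The sketch ignores branch lengths, assumes the starting root has at least two children, and modifies the input tree in place.

```python
class Clade(object):
    """Toy tree node: an optional name plus a list of child clades."""

    def __init__(self, name=None, clades=None):
        self.name = name
        self.clades = clades or []


def find_path(clade, name):
    """Return the nodes from `clade` down to the node called `name`."""
    if clade.name == name:
        return [clade]
    for kid in clade.clades:
        sub = find_path(kid, name)
        if sub:
            return [clade] + sub
    return None


def root_with_outgroup(root, name):
    """Reroot on the branch leading to `name`; returns the new root."""
    path = find_path(root, name)
    if path is None or len(path) < 2:
        raise ValueError("outgroup %r not found below the root" % name)
    outgroup, ancestors = path[-1], path[:-1]
    # Cut every edge along the outgroup lineage, from the top down.
    for parent, kid in zip(path[:-1], path[1:]):
        parent.clades.remove(kid)
    # Invert the lineage: each ancestor adopts its former parent as a child.
    for former_parent, node in zip(ancestors[:-1], ancestors[1:]):
        node.clades.append(former_parent)
    # The old root may be left with a single child; splice it out.
    old_root, ingroup = ancestors[0], ancestors[-1]
    if len(old_root.clades) == 1:
        only_kid = old_root.clades[0]
        if len(ancestors) == 1:
            ingroup = only_kid
        else:
            kids = ancestors[1].clades
            kids[kids.index(old_root)] = only_kid
    return Clade(clades=[outgroup, ingroup])


def to_newick(clade):
    """Render topology only (no branch lengths, no trailing semicolon)."""
    if not clade.clades:
        return clade.name
    return "(" + ",".join(to_newick(kid) for kid in clade.clades) + ")"


# Unrooted tree drawn with an arbitrary trifurcating root: (A,B,(C,OUT))
tree = Clade(clades=[Clade("A"), Clade("B"),
                     Clade(clades=[Clade("C"), Clade("OUT")])])
rerooted = to_newick(root_with_outgroup(tree, "OUT"))

# Same topology drawn with a bifurcating root: ((A,B),(C,OUT))
tree2 = Clade(clades=[Clade(clades=[Clade("A"), Clade("B")]),
                      Clade(clades=[Clade("C"), Clade("OUT")])])
rerooted2 = to_newick(root_with_outgroup(tree2, "OUT"))

print(rerooted)
print(rerooted2)
```

A real implementation would also have to move branch lengths and support values along the inverted edges, which this sketch leaves out.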
Thanks, Eric From biopython at maubp.freeserve.co.uk Mon Mar 22 17:48:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Mar 2010 21:48:31 +0000 Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo In-Reply-To: <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com> References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com> Message-ID: <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com> On Mon, Mar 22, 2010 at 8:28 PM, Eric Talevich wrote: > On Mon, Mar 22, 2010 at 12:21 PM, Peter wrote: > >> Hi Eric, >> >> I've got a real example of a simple tree manipulation that I would like to >> handle via your new module. I have a (small) unrooted tree from a gene >> family in Newick format, which by construction includes an out-group >> (the same gene but from a more distant organism). I would like to reroot >> the tree so that this out-group is at the basal level. >> >> Can Bio.Phylo help me here? >> > > In Bio.Nexus, would you normally have handled this with the method > root_with_outgroup? I intend to port that method to Bio.Phylo once I > understand it, but the existing code has been kind of hard for me to figure > out. > > Let's address it here, then. Is there a detailed plain-text description > somewhere of how this operation should work in general? I've just got a quick answer for you now tonight: I've not used Bio.Nexus to try and do this - I'll try to get back to you in more depth tomorrow. 
Thanks,
Peter

From p.j.a.cock at googlemail.com  Tue Mar 23 07:50:24 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 23 Mar 2010 11:50:24 +0000
Subject: [Biopython-dev] pylint, was: Changes to the main repo
In-Reply-To: <320fb6e01003220219u5f2020e1v6826a4e331ceb96d@mail.gmail.com>
References: <320fb6e01003220219u5f2020e1v6826a4e331ceb96d@mail.gmail.com>
Message-ID: <320fb6e01003230450h502adce0p27080d3a00ddda23@mail.gmail.com>

2010/3/22 Peter Cock:
> 2010/3/21 Tiago Antão:
>> Hi,
>>
>> I've made some changes in the main repository (my first changes with
>> github), some comments:
>> 1. Many thanks for the GitUsage wiki page. REALLY useful.
>> 2. That being said, if I made any mistakes, they are my own fault.
>> 3. I've added support for big genepop files, something I tend to be
>> asked quite a lot
>> 4. And support for haploid data (nobody really asked this)
>> 5. I remember Peter sending an email about needed corrections to the
>> code. I am afraid I've lost that email :( . If you send it to me, I
>> will do them ASAP
>> 6. New test cases and test data files
>> 7. I might add support, in the future, to Arlequin (file format and
>> application). Allowing for statistics over sequences and other goodies
>> with sequence data.
>>
>> Regards,
>> Tiago
>
> Hi Tiago,
>
> That sounds good. Regarding point 5, running pylint over the
> code reported some possible errors in Bio.PopGen. Have a
> look at this - they are all undefined variable issues:
> http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007354.html
>
> I just ran it again on the latest code, and the line numbers have
> changed a tiny bit but that is all:
>
> $ pylint --disable-msg-cat=CRW --include-ids=y --disable-msg=E1101,E1103,E0102 -r n Bio.PopGen
> No config file found, using default configuration
> ************* Module Bio.PopGen.Async
> E0602: 78:Async.get_result: Undefined variable 'done'
> E0602: 79:Async.get_result: Undefined variable 'done'
> ************* Module Bio.PopGen.GenePop
> E0602:166:Record.split_in_pops: Undefined variable 'GenePop'
> E0602:183:Record.split_in_loci: Undefined variable 'GenePop'
> ************* Module Bio.PopGen.GenePop.Controller
> E0602: 41:_read_allele_freq_table: Undefined variable 'self'
> E0602:133:_hw_func: Undefined variable 'self'
> E0602:393:GenePopController.test_pop_hw_prob: Undefined variable 'ext'
> E0602:458:GenePopController.test_ld.ld_pop_func: Undefined variable 'currrent_pop'
> ************* Module Bio.PopGen.GenePop.FileParser
> E1120:219:FileRecord.remove_locus_by_name: No value passed for parameter 'fw' in function call
> ************* Module Bio.PopGen.SimCoal.Cache
> E0602: 79:SimCoalCache.getSimulation: Undefined variable 'Config'
> E0602: 88: Undefined variable 'Cache'
> ************* Module Bio.PopGen.SimCoal.Controller
> E0602: 47:SimCoalController.run_simcoal: Undefined variable 'Config'
>
> Peter
>

Hi Tiago,

This is looking much better after your fixes last night - just one left:

$ pylint --disable-msg-cat=CRW --include-ids=y --disable-msg=E1101,E1103,E0102 -r n Bio.PopGen
No config file found, using default configuration
************* Module Bio.PopGen.GenePop.Controller
E0602: 41:_read_allele_freq_table: Undefined variable 'self'

Note that if I stop suppressing those particular error messages (E1101,
E1103 and E0102), which in other situations I had tentatively tagged as
false positives, there could be a few more issues:

$ pylint --disable-msg-cat=CRW --include-ids=y -r n Bio.PopGen
No config file found, using default configuration
************* Module Bio.PopGen.Async
E1101: 59:Async.run_program: Instance of 'Async' has no '_run_program' member
************* Module Bio.PopGen.GenePop.Controller
E0602: 41:_read_allele_freq_table: Undefined variable 'self'
************* Module Bio.PopGen.GenePop.EasyController
E1101: 33:EasyController.get_basic_info: Module 'Bio.PopGen.GenePop' has no 'parse' member
E1101: 43:EasyController.test_hw_pop: Instance of 'GenePopController' has no 'test_pop_hz_prob' member
************* Module Bio.PopGen.GenePop.FileParser
E1101:197:FileRecord.remove_population: Instance of 'FileRecord' has no 'populations' member
E1101:206:FileRecord.remove_locus_by_position: Instance of 'FileRecord' has no 'populations' member

Some of these may be harmless. For example, the Async class has a
run_program method which calls _run_program, which you expect to be
implemented in any subclass. You could add a dummy method to show the
expected arguments and just raise a NotImplementedError exception, with
a comment that the subclass should implement it, e.g.

def _run_program(self, program, parameters, input_files):
    """Actually run the program, handled by a subclass (PRIVATE).

    This method should be replaced by any derived class to do
    something useful. It will be called by the run_program method.
    """
    raise NotImplementedError("This object should be subclassed")

That particular change is probably worth doing anyway from a code
clarity point of view.
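
To make the pattern concrete, here is a minimal, self-contained sketch
of the same idea. The class names BaseRunner and EchoRunner are made up
for illustration and are not the Bio.PopGen.Async code; only the
NotImplementedError placeholder mirrors the suggestion above.

```python
class BaseRunner(object):
    """Hypothetical base class, standing in for something like Async."""

    def run_program(self, program, parameters, input_files):
        # The public method delegates to the private hook below.
        return self._run_program(program, parameters, input_files)

    def _run_program(self, program, parameters, input_files):
        """Actually run the program, handled by a subclass (PRIVATE)."""
        raise NotImplementedError("This object should be subclassed")


class EchoRunner(BaseRunner):
    """Hypothetical subclass that just reports what it was asked to run."""

    def _run_program(self, program, parameters, input_files):
        return (program, parameters, input_files)
```

With the placeholder in place, pylint can see that _run_program exists,
and anyone calling the base class directly gets a clear error instead of
an AttributeError.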
Peter From biopython at maubp.freeserve.co.uk Tue Mar 23 11:26:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Mar 2010 15:26:56 +0000 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <320fb6e01003181204l5902cf37yc0cf9387b74994fd@mail.gmail.com> References: <200911251220.53881.jblanca@btc.upv.es> <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com> <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com> <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com> <3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com> <320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com> <320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com> <320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com> <320fb6e01003181204l5902cf37yc0cf9387b74994fd@mail.gmail.com> Message-ID: <320fb6e01003230826r6080746el3327f05079f2651a@mail.gmail.com> On Thu, Mar 18, 2010 at 7:04 PM, Peter wrote: > > I've done that to Bio/Seq.py on the trunk (added two > FutureWarnings and docstring explanation). Assuming > this doesn't trigger any regressions, we'd need to work > on the documentation (in particular the tutorial, but also > perhaps a news post?) and fix the GA unit test before > the release. > I've fixed the GA unit tests, generally by explicit use of string comparison when working with sequence objects. In the case of test_GAQueens.py, this required me to "correct" the "abuse" of the alphabet object (letters was a list of integers, not a string) and thus indirectly the way the MutableSeq was being created. This has always struck me as a very odd example - but should perhaps be kept in mind for more complex sequence like objects (e.g. sequences with 3-letter protein codes). 
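
As a rough illustration of what explicit string comparison looks like in
a test, here is a toy sketch. ToySeq is a hypothetical stand-in written
for this example, not Bio.Seq.Seq, and it deliberately defines no __eq__
so that == falls back to object identity.

```python
class ToySeq(object):
    """Hypothetical sequence-like object (not Bio.Seq.Seq)."""

    def __init__(self, data, alphabet=None):
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        return self.data


def same_letters(seq_a, seq_b):
    """Compare two sequence-like objects by their string form only."""
    return str(seq_a) == str(seq_b)
```

Two ToySeq objects holding the same letters are not equal under ==, but
same_letters() says they match, which is the comparison the unit tests
actually care about.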
Peter

From tiagoantao at gmail.com  Wed Mar 24 06:39:09 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 24 Mar 2010 10:39:09 +0000
Subject: [Biopython-dev] Spam on wiki
Message-ID: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>

Hi,

I think we are being attacked, spam wise. The popgen_dev page was full
of external links.
I am clearing that page, but others might have the same problem. Maybe
there is some way to automate the deletion of contributions from spam
authors?

Tiago

--
"If you want to get laid, go to college. If you want an education, go
to the library." - Frank Zappa

From tiagoantao at gmail.com  Wed Mar 24 06:47:43 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 24 Mar 2010 10:47:43 +0000
Subject: [Biopython-dev] Spam on wiki
In-Reply-To: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
References: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
Message-ID: <6d941f121003240347v1ff3f9d4uf46d6a6e4f4254ff@mail.gmail.com>

2010/3/24 Tiago Antão:
> I think we are being attacked, spam wise. The popgen_dev page was full
> of external links.
> I am clearing that page, but others might have the same problem. Maybe
> there is some way to automate the deletion of contributions from spam
> authors?

I am clearing this
http://www.biopython.org/wiki/Special:Contributions/Wiki0808
by hand, not much.

From biopython at maubp.freeserve.co.uk  Wed Mar 24 06:47:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 10:47:43 +0000
Subject: [Biopython-dev] Spam on wiki
In-Reply-To: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
References: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
Message-ID: <320fb6e01003240347x5d10a1f3sa4c2c84fa9edcfbe@mail.gmail.com>

2010/3/24 Tiago Antão:
> Hi,
>
> I think we are being attacked, spam wise. The popgen_dev page was full
> of external links.
> I am clearing that page, but others might have the same problem. Maybe
> there is some way to automate the deletion of contributions from spam
> authors?
>
> Tiago

Hi,

I'm subscribed to the wiki RSS feed, but this happened overnight so I
hadn't seen it yet. This seems to happen about once a month or so - I
haven't noticed a big rise in attacks or anything. This guy did about
ten pages - normally only one or two get abused.

Dealing with it is fairly easy - you click on the page history, roll
back to the last good page, and ban the user. Tell me your wiki username
and I should be able to give you the rights needed to ban people.

Peter

From biopython at maubp.freeserve.co.uk  Wed Mar 24 07:13:12 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 11:13:12 +0000
Subject: [Biopython-dev] Spam on wiki
In-Reply-To: <6d941f121003240347v1ff3f9d4uf46d6a6e4f4254ff@mail.gmail.com>
References: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
 <6d941f121003240347v1ff3f9d4uf46d6a6e4f4254ff@mail.gmail.com>
Message-ID: <320fb6e01003240413g27f3d87dp4762bd9f8c32befe@mail.gmail.com>

2010/3/24 Tiago Antão:
> 2010/3/24 Tiago Antão:
>> I think we are being attacked, spam wise. The popgen_dev page was full
>> of external links.
>> I am clearing that page, but others might have the same problem. Maybe
>> there is some way to automate the deletion of contributions from spam
>> authors?
>
> I am clearing this
> http://www.biopython.org/wiki/Special:Contributions/Wiki0808
> by hand, not much.
>

The "rollback" link on the page history is the simplest route (you are
now an administrator on the wiki so should be able to do this, and ban
spammers). I don't know if there is a shortcut to revert all of a user's
recent changes.

I think between us we have fixed all the pages now.
Thanks,
Peter

From peter at maubp.freeserve.co.uk  Wed Mar 24 10:08:26 2010
From: peter at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 14:08:26 +0000
Subject: [Biopython-dev] Fwd: [Utilities-announce] NCBI Revised E-utility Usage Policy
In-Reply-To:
References:
Message-ID: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com>

Hi,

This is probably of interest to all the Bio* projects offering access
to the NCBI Entrez utilities. See forwarded message below.

I *think* the new guidelines basically say that the email & tool
parameters are optional BUT if your IP address ever gets banned for
excessive use you then have to register an email & tool combination.

Regarding the email address, the NCBI say to use the email of the
developer (not the end user). However, they do not distinguish between
the developers of a library (like us), and the developers of an
application or script using a library (who may also be the end user).

Currently we (Biopython) and I think BioPerl ask developers using our
libraries to populate the email address themselves. I *think* this is
still the right action.

Peter

---------- Forwarded message ----------
From:
Date: Wed, Mar 24, 2010 at 1:53 PM
Subject: [Utilities-announce] NCBI Revised E-utility Usage Policy
To: NLM/NCBI List utilities-announce

New E-utility documentation now on the NCBI Bookshelf

The Entrez Programming Utilities (E-Utilities) Help documentation has
been added to the NCBI Bookshelf, and so is now fully integrated with
the Entrez search and retrieval system as a part of the Bookshelf
database. This help document has been divided into chapters for better
organization and includes several new sample Perl scripts. At present
this book covers the standard URL interface for the E-utilities;
material about the SOAP interface will be added soon and is still
available at the same URL:
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html.

Revised E-utility usage policy

In December, 2009 NCBI announced a change to the usage policy for the
E-utilities that would require all requests to contain non-null values
for both the &email and &tool parameters. After several consultations
with our users and developers, we have decided to revise this policy
change, and the revised policy is described in detail at the following
link:

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpeutils&part=chapter2#chapter2.Usage_Guidelines_and_Requiremen

Please let us know if you have any questions or concerns about this
policy change.

Thank you,
The E-Utilities Team
NIH/NLM/NCBI
eutilities at ncbi.nlm.nih.gov.
_______________________________________________
Utilities-announce mailing list
http://www.ncbi.nlm.nih.gov/mailman/listinfo/utilities-announce

From biopython at maubp.freeserve.co.uk  Wed Mar 24 10:51:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 14:51:46 +0000
Subject: [Biopython-dev] [Bioperl-l] Fwd: [Utilities-announce] NCBI Revised E-utility Usage Policy
In-Reply-To: <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu>
References: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com>
 <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu>
Message-ID: <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com>

On Wed, Mar 24, 2010 at 2:37 PM, Chris Fields wrote:
>
> On Mar 24, 2010, at 9:08 AM, Peter wrote:
>
>> Hi,
>>
>> This is probably of interest to all the Bio* projects offering access
>> to the NCBI Entrez utilities. See forwarded message below.
>>
>> I *think* the new guidelines basically say that the email & tool
>> parameters are optional BUT if your IP address ever gets banned for
>> excessive use you then have to register an email & tool combination.
>>
>> Regarding the email address, the NCBI say to use the email of the
>> developer (not the end user). However, they do not distinguish between
>> the developers of a library (like us), and the developers of an
>> application or script using a library (who may also be the end user).
>>
>> Currently we (Biopython) and I think BioPerl ask developers using our
>> libraries to populate the email address themselves. I *think* this is
>> still the right action.
>>
>> Peter
>
>
> Basically, that's the same tactic I'm going with for Bio::DB::EUtilities
> (and I think with the SOAP-based ones as well). We're providing a
> specific set of tools for users to write up their own end applications.
> I can try contacting them regarding this to get an official response to
> clarify this somewhat.

Please give the NCBI an email - you can CC me too if you like.

> Re: the tool parameter, we currently set the tool itself to 'BioPerl' as a
> default, but always leave the email blank and issue a warning if it isn't
> set. We could just as easily leave both blank and issue warnings for both.

We currently leave out the email and set the tool parameter to
"Biopython" by default but this can be overridden. Currently leaving out
the email does cause Biopython to give a warning.

Peter

From biopython at maubp.freeserve.co.uk  Wed Mar 24 11:16:51 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 15:16:51 +0000
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com>
References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com>
 <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com>
 <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com>
Message-ID: <320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com>

On Mon, Mar 22, 2010 at 9:48 PM, Peter wrote:
>> In Bio.Nexus, would you normally have handled this with the method
>> root_with_outgroup?
>> I intend to port that method to Bio.Phylo once I
>> understand it, but the existing code has been kind of hard for me to
>> figure out.
>
> I've just got a quick answer for you now tonight: I've not used Bio.Nexus
> to try and do this - I'll try to get back to you in more depth tomorrow.

Here is an example using Bio.Nexus.Trees to reroot with an outgroup.

# I have encoded the tree here as a string:
tree_string = """((gi|6273291|gb|AF191665.1|AF191:0.00418,
(gi|6273290|gb|AF191664.1|AF191:0.00189,
gi|6273289|gb|AF191663.1|AF191:0.00145)
:0.00083):0.00770,
(gi|6273287|gb|AF191661.1|AF191:0.00489,
gi|6273286|gb|AF191660.1|AF191:0.00295)
:0.00014,
(gi|6273285|gb|AF191659.1|AF191:0.00094,
gi|6273284|gb|AF191658.1|AF191:0.00018)
:0.00125);"""

from Bio.Nexus import Trees
tree = Trees.Tree(tree_string)
print "Old"
print tree
print tree.display()
print
print "New"
# This acts in situ:
tree.root_with_outgroup(["gi|6273289|gb|AF191663.1|AF191"])
print tree
print tree.display()

Old
tree a_tree = ((gi|6273291|gb|AF191665.1|AF191,(gi|6273290|gb|AF191664.1|AF191,gi|6273289|gb|AF191663.1|AF191)),(gi|6273287|gb|AF191661.1|AF191,gi|6273286|gb|AF191660.1|AF191),(gi|6273285|gb|AF191659.1|AF191,gi|6273284|gb|AF191658.1|AF191));
#   taxon                           prev  succ      brlen    blen (sum)  support  comment
0   -                               None  [1, 6, 9] 0.0      0.0         -        -
1   -                               0     [2, 3]    0.0077   0.0077      -        -
2   gi|6273291|gb|AF191665.1|AF191  1     []        0.00418  0.01188     -        -
3   -                               1     [4, 5]    0.00083  0.00853     -        -
4   gi|6273290|gb|AF191664.1|AF191  3     []        0.00189  0.01042     -        -
5   gi|6273289|gb|AF191663.1|AF191  3     []        0.00145  0.00998     -        -
6   -                               0     [7, 8]    0.00014  0.00014     -        -
7   gi|6273287|gb|AF191661.1|AF191  6     []        0.00489  0.00503     -        -
8   gi|6273286|gb|AF191660.1|AF191  6     []        0.00295  0.00309     -        -
9   -                               0     [10, 11]  0.00125  0.00125     -        -
10  gi|6273285|gb|AF191659.1|AF191  9     []        0.00094  0.00219     -        -
11  gi|6273284|gb|AF191658.1|AF191  9     []        0.00018  0.00143     -        -

Root:  0
None

New
tree a_tree = (((((gi|6273287|gb|AF191661.1|AF191,gi|6273286|gb|AF191660.1|AF191),(gi|6273285|gb|AF191659.1|AF191,gi|6273284|gb|AF191658.1|AF191)),gi|6273291|gb|AF191665.1|AF191),gi|6273290|gb|AF191664.1|AF191),gi|6273289|gb|AF191663.1|AF191);
#   taxon                           prev  succ      brlen    blen (sum)  support  comment
0   -                               1     [6, 9]    0.0077   0.00998     -        -
1   -                               3     [0, 2]    0.00083  0.00228     -        -
2   gi|6273291|gb|AF191665.1|AF191  1     []        0.00418  0.00646     -        -
3   -                               12    [1, 4]    0.00145  0.00145     -        -
4   gi|6273290|gb|AF191664.1|AF191  3     []        0.00189  0.00334     -        -
5   gi|6273289|gb|AF191663.1|AF191  12    []        0.0      0.0         0.0      -
6   -                               0     [7, 8]    0.00014  0.01012     -        -
7   gi|6273287|gb|AF191661.1|AF191  6     []        0.00489  0.01501     -        -
8   gi|6273286|gb|AF191660.1|AF191  6     []        0.00295  0.01307     -        -
9   -                               0     [10, 11]  0.00125  0.01123     -        -
10  gi|6273285|gb|AF191659.1|AF191  9     []        0.00094  0.01217     -        -
11  gi|6273284|gb|AF191658.1|AF191  9     []        0.00018  0.01141     -        -
12  -                               None  [3, 5]    0.0      0.0         -        -

Root:  12
None

Here the root_with_outgroup method acts in situ, and returns the new
root ID number (not applicable to Bio.Phylo). The outgroup argument
seems to be a list of taxon names (here just one).

In my example, the outgroup originally has a branch length of 0.00145.
A new root node was created (here #12) with two children, one with a
branch length of zero (#5, the outgroup) and one with the full length
(#3, branch length 0.00145). Essentially this new root node (#12) and
the outgroup (#5) are now both right at the base of the tree.

There is more than one way to do this though. For example FigTree
seems to introduce a new root node half way along the outgroup branch
(replacing the edge with two edges of half its length). This way the
new root node represents the last common ancestor of the outgroup and
the ingroup (everything else), although putting it at the mid point is
perhaps a little arbitrary.
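
The two placements described above differ only in where along the
outgroup branch the new root sits. A small sketch of that arithmetic
follows; split_outgroup_branch and its fraction parameter are
hypothetical names for this example, not part of Bio.Nexus or FigTree.

```python
def split_outgroup_branch(length, fraction):
    """Split the outgroup branch at a new root node.

    fraction is the share of the original branch kept on the outgroup
    side of the new root (0.0 puts the root right at the outgroup,
    0.5 puts it at the mid point). Returns a tuple of
    (outgroup_side_length, ingroup_side_length).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    outgroup_side = length * fraction
    ingroup_side = length - outgroup_side
    return outgroup_side, ingroup_side


# What the Bio.Nexus output above shows: outgroup 0.0, other child 0.00145.
nexus_style = split_outgroup_branch(0.00145, 0.0)

# The mid point placement that FigTree appears to use for display.
midpoint_style = split_outgroup_branch(0.00145, 0.5)
```

Either way the two new edges add up to the original 0.00145, so the
distance between the outgroup and the rest of the tree is unchanged.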
Peter

From nicolas.rapin at bric.ku.dk  Thu Mar 25 07:58:53 2010
From: nicolas.rapin at bric.ku.dk (Nicolas Rapin)
Date: Thu, 25 Mar 2010 12:58:53 +0100
Subject: [Biopython-dev] GEO database and bio-python
Message-ID:

Dear all,

I just started Python, and use Biopython quite a lot lately. It's a nice
package, and is very convenient. Oh, and I'm also new on the mailing
list...

I need to get access to a lot of data from GEO, and I noticed that it
might be a good idea to have the database locally, which led me to write
a little class that can download the compressed files from NCBI (the
GSE/GPLxxx_family.tgz files), and parse the MINiML sort of XML they have
in there together with the actual data that is in the compressed files.
In the end I have a nicely organized hdf5 file, that I can use to do
data mining.

I wondered if that was something for Biopython.

If yes, how do I contribute?

best,

Nico

From biopython at maubp.freeserve.co.uk  Thu Mar 25 08:22:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 25 Mar 2010 12:22:40 +0000
Subject: [Biopython-dev] GEO database and bio-python
In-Reply-To:
References:
Message-ID: <320fb6e01003250522l3c730081y143cc4799f038754@mail.gmail.com>

On Thu, Mar 25, 2010 at 11:58 AM, Nicolas Rapin wrote:
> Dear all,
>
> I just started Python, and use Biopython quite a lot lately. It's a nice
> package, and is very convenient. Oh, and I'm also new on the mailing list...

Great, and welcome :)

> I need to get access to a lot of data from GEO, and I noticed that it might be
> a good idea to have the database locally, which led me to write a little class
> that can download the compressed files from NCBI (the GSE/GPLxxx_family.tgz
> files), and parse the MINiML sort of XML they have in there together with the
> actual data that is in the compressed files. In the end I have a nicely organized
> hdf5 file, that I can use to do data mining.

Have you looked at the existing Bio.GEO module? It hasn't got an active
maintainer at the moment, and in some ways is rather simplistic. I found
that Sean Davis' GEOquery package for R/Bioconductor was much more
complete.

> I wondered if that was something for Biopython.

This sounds like a useful addition.

> If yes, how do I contribute?

First of all we use the public mailing lists to discuss things. In terms
of code, starting a branch on github would let you show us what you are
working on and makes it easier to eventually merge things.

See http://biopython.org/wiki/GitUsage

Peter

From sdavis2 at mail.nih.gov  Thu Mar 25 08:29:52 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 25 Mar 2010 08:29:52 -0400
Subject: [Biopython-dev] GEO database and bio-python
In-Reply-To:
References:
Message-ID: <264855a01003250529p7d2290f3qc441228c34f5e720@mail.gmail.com>

On Thu, Mar 25, 2010 at 7:58 AM, Nicolas Rapin wrote:
> Dear all,
>
> I just started Python, and use Biopython quite a lot lately. It's a nice package, and is very convenient. Oh, and I'm also new on the mailing list...
>
> I need to get access to a lot of data from GEO, and I noticed that it might be a good idea to have the database locally, which led me to write a little class that can download the compressed files from NCBI (the GSE/GPLxxx_family.tgz files), and parse the MINiML sort of XML they have in there together with the actual data that is in the compressed files. In the end I have a nicely organized hdf5 file, that I can use to do data mining.
>
> I wondered if that was something for Biopython.

Hi, Nico. Not a direct answer to your question, but have a look at the
Bioconductor package GEOmetadb. (There is also an online version.) We
have parsed all of GEO metadata into a SQLite database and made it
available within R. However, the SQLite database can be used standalone
and Python has built-in support for SQLite, as of late.
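
A minimal sketch of the standalone route using Python's sqlite3 module.
The helper name list_tables is a placeholder for this example, and
nothing here assumes a particular GEOmetadb table layout; it only asks
SQLite which tables a database file contains.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in an open SQLite database."""
    cursor = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in cursor.fetchall()]
```

After uncompressing the download listed below, something like
list_tables(sqlite3.connect("GEOmetadb.sqlite")) should show which
tables are there to query (the local file name is an assumption based
on the download URL).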
http://gbnci.abcc.ncifcrf.gov/geo/
http://gbnci.abcc.ncifcrf.gov/geo/GEOmetadb.sqlite.gz
http://watson.nci.nih.gov/bioc_mirror/packages/2.6/bioc/html/GEOmetadb.html

Also, as for the data, if you are inclined to use R for anything (or
rpy2), the GEOquery package can download and parse all the record types
in GEO into objects within R, and the number of tools for data analysis
of microarray data in R/Bioconductor is enormous.

http://watson.nci.nih.gov/bioc_mirror/packages/2.6/bioc/html/GEOquery.html

Sorry for the advertisement-like email....

Sean

> If yes, how do I contribute?
>
>
> best,
>
> Nico
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From biopython at maubp.freeserve.co.uk  Thu Mar 25 12:25:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 25 Mar 2010 16:25:01 +0000
Subject: [Biopython-dev] NCBI E-utility 100 requests rule in Bio.Entrez?
Message-ID: <320fb6e01003250925h76fc91d3r541092eb540af112@mail.gmail.com>

Hi all,

The NCBI recently announced revised guidelines for the Entrez
utilities, which we've started discussing on the OBF mailing list:
http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007499.html
http://lists.open-bio.org/pipermail/open-bio-l/2010-March/000623.html

As part of this I decided to look at the peak hour rules:
http://lists.open-bio.org/pipermail/open-bio-l/2010-March/000644.html

The old guideline was:

http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements
"Run retrieval scripts on weekends or between 9 pm and 5 am
Eastern Time weekdays for any series of more than 100 requests."

This doesn't define a series - for example, would it be OK to run
a script making 75 requests every two hours? This could be regarded
as multiple separate series each under 100 requests, but the
cumulative count over the 8 peak hours is 600 requests.

Sadly the new guidelines are even more vague:

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpeutils&part=chapter2#chapter2.Usage_Guidelines_and_Requiremen
"... and limit large jobs to either weekends or between 9:00 PM
and 5:00 AM Eastern time during weekdays."

Not very helpful.

Also neither version raises the issue of summer/winter time
(daylight saving time) but simply gives Eastern Time (EST).

While we may get clarification from the NCBI, the following patch
to Bio.Entrez may be worth considering. It simply counts the
number of Entrez requests during peak hours, and issues a
warning if this exceeds 100 (based on a strict interpretation of
the older guidelines).

Does this seem worth checking in, or should we try to get some
clarification from the NCBI first?

Peter

diff --git a/Bio/Entrez/__init__.py b/Bio/Entrez/__init__.py
index 33d8d14..f670354 100644
--- a/Bio/Entrez/__init__.py
+++ b/Bio/Entrez/__init__.py
@@ -285,6 +285,26 @@ def _open(cgi, params={}, post=False):
         _open.previous = current + wait
     else:
         _open.previous = current
+
+    # Max 100 requests from 09:00 to 17:00 Eastern Time (EST), which is
+    # 5 hours behind Coordinated Universal Time (UTC) aka Greenwich
+    # Mean Time (GMT), thus 14:00 to 22:00 UTC/GMT. The NCBI don't
+    # mention summer/winter time (daylight saving time), so ignore that.
+    if 14 <= time.gmtime(current).tm_hour < 22 \
+    and time.gmtime(current).tm_wday <= 4:
+        # Peak time (Monday = 0, Friday = 4)
+        _open.peak_requests += 1
+        if _open.peak_requests > 100:
+            import warnings
+            warnings.warn("The NCBI request you make at most 100 Entrez "
+                          "requests during the peak time 9AM to 5PM EST "
+                          "(which is 14:00 to 22:00 UTC/GMT). "
+                          "You have exceeded this limit.")
+    else:
+        # Off peak
+        # Reset the counter (in case this is a long running script)
+        _open.peak_requests = 0
+
     # Remove None values from the parameters
     for key, value in params.items():
         if value is None:
@@ -368,3 +388,4 @@ E-utilities.""", UserWarning)
     return uhandle

 _open.previous = 0
+_open.peak_requests = 0

From eric.talevich at gmail.com  Thu Mar 25 16:27:23 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 25 Mar 2010 16:27:23 -0400
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com>
References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com>
 <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com>
 <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com>
 <320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com>
Message-ID: <3f6baf361003251327o20cdda2bkeac3c3a6a87b468a@mail.gmail.com>

On Wed, Mar 24, 2010 at 11:16 AM, Peter wrote:

> On Mon, Mar 22, 2010 at 9:48 PM, Peter wrote:
> >> In Bio.Nexus, would you normally have handled this with the method
> >> root_with_outgroup? I intend to port that method to Bio.Phylo once I
> >> understand it, but the existing code has been kind of hard for me to
> >> figure out.
> >
> > I've just got a quick answer for you now tonight: I've not used Bio.Nexus
> > to try and do this - I'll try to get back to you in more depth tomorrow.
>
> Here is an example using Bio.Nexus.Trees to reroot with an outgroup.
>
> [...]
>
> In my example, the outgroup originally has a branch length of 0.00145.
> A new root node was created (here #12) with two children, one with a
> branch length of zero (#5, the outgroup) and one with the full length
> (#3, branch length 0.00145). Essentially this new root node (#12) and
> the outgroup (#5) are now both right at the base of the tree.
>
> There is more than one way to do this though. For example FigTree
> seems to introduce a new root node half way along the outgroup branch
> (replacing the edge with two edges of half its length).
> This way the
> new root node represents the last common ancestor of the outgroup and
> the ingroup (everything else), although putting it at the mid point is
> perhaps a little arbitrary.
>
> Peter
>

I looked up this section in *Inferring Phylogenies* and found no
decisive statement on how it should be done. I gathered:

1. The new root can be placed anywhere along the branch between the
   outgroup and its ancestor.
2. Another way to root a tree is by assuming a molecular clock -- place
   the root so that the distances to all the tips are roughly equal.

So FigTree and Bio.Nexus are both doing reasonable things. (PyCogent
doesn't seem to support this operation, as far as I can tell.)

Thinking of this operation as extending the tree further back in time,
where the (monophyletic) tree without the outgroup is a sub-clade of the
larger rooted tree we're introducing -- it makes sense to me that the
branch length of the outgroup should represent the total evolutionary
distance from the root of the monophyletic sub-clade to the outgroup.

Based on that, I'm tempted to do the opposite of Bio.Nexus, letting the
outgroup keep its original branch length, and assigning a length of 0 to
the branch leading to the remaining sub-clade. Then by default we get
something resembling a trifurcating root, and the user can shift the
actual location of the root further back without too much difficulty.

Alternatives:

- Take a hint from the molecular clock, and try to equalize the distance
  from the root to the outgroup and the farthest tip of the main
  subclade. Problem: in your example the outgroup is not the longest
  branch, so this would be equivalent to the version I proposed above.
  The root->subclade branch would only be nonzero sometimes, and it
  might surprise you when that happens.
- Offer a separate method, root_by_clock, which does the expected thing,
  and can be used to determine good branch lengths at the root after the
  outgroup operation, if desired.
- Combine: add a keyword argument to root_with_outgroup (like molecular_clock=False) which triggers Alternative #1. I'll play with this some more and post an example implementation for you to review. Thanks for your help, Eric From mjldehoon at yahoo.com Thu Mar 25 21:37:26 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 25 Mar 2010 18:37:26 -0700 (PDT) Subject: [Biopython-dev] NCBI E-utility 100 requests rule in Bio.Entrez? In-Reply-To: <320fb6e01003250925h76fc91d3r541092eb540af112@mail.gmail.com> Message-ID: <996381.46391.qm@web62407.mail.re1.yahoo.com> I have no objections, but basically I think that this can be left to the responsibility of the end user. --Michiel. --- On Thu, 3/25/10, Peter wrote: > From: Peter > Subject: NCBI E-utility 100 requests rule in Bio.Entrez? > To: "Biopython-Dev Mailing List" , "Michiel de Hoon" > Date: Thursday, March 25, 2010, 12:25 PM > Hi all, > > The NCBI recently announced revised guidlines for the > Entrez > utilities, which we've started discussing on the OBF > mailing list: > http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007499.html > http://lists.open-bio.org/pipermail/open-bio-l/2010-March/000623.html > > As part of this I decided to look at the peak hour rules: > http://lists.open-bio.org/pipermail/open-bio-l/2010-March/000644.html > > The old guideline was: > > http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements > "Run retrieval scripts on weekends or between 9 pm and 5 > am > Eastern Time weekdays for any series of more than 100 > requests." > > This doesn't define a series - for example, would it be OK > to run > a script making 75 requests every two hours? This could be > regarded > as multiple separate series each under 100 requests, but > the > cumulative count over the 8 peak hours is 600 requests. 
>
> Sadly the new guidelines are even more vague:
>
> http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpeutils?=chapter2#chapter2.Usage_Guidelines_and_Requiremen
> "... and limit large jobs to either weekends or between 9:00 PM
> and 5:00 AM Eastern time during weekdays."
>
> Not very helpful.
>
> Also neither version raises the issue of summer/winter time
> (daylight savings times) but simply gives Eastern Time (EST).
>
> While we may get clarification from the NCBI, the following patch
> to Bio.Entrez may be worth considering. It simply counts the
> number of Entrez requestes during peak hours, and issues a
> warning if this exceeds 100 (based on a strict interpretation of
> the older guidelines).
>
> Does this seem worth checking in, or should we try to get some
> clarification from the NCBI first?
>
> Peter
>
> diff --git a/Bio/Entrez/__init__.py b/Bio/Entrez/__init__.py
> index 33d8d14..f670354 100644
> --- a/Bio/Entrez/__init__.py
> +++ b/Bio/Entrez/__init__.py
> @@ -285,6 +285,26 @@ def _open(cgi, params={}, post=False):
>          _open.previous = current + wait
>      else:
>          _open.previous = current
> +
> +    # Max 100 requests from 09:00 to 17:00 Eastern Time (EST), which is
> +    # 5 hours behind Coordinated Universal Time (UTC) aka Greenwich
> +    # Mean Time (GMT), thus 14:00 to 22:00 UTC/GMT. The NCBI don't
> +    # mention summer/winter time (daylight saving time), so ignore that.
> +    if 14 <= time.gmtime(current).tm_hour < 22 \
> +    and time.gmtime(current).tm_wday <= 5:
> +        # Peak time (Monday = 0, Friday = 5)
> +        _open.peak_requests += 1
> +        if _open.peak_requests > 100:
> +            import warnings
> +            warnings.warn("The NCBI request you make at most 100 Entrez "
> +                          "requests during the peak time 9AM to 5PM EST "
> +                          "(which is 14:00 to 22:00 UTC/GMT). "
> +                          "You have exceeded this limit.")
> +    else:
> +        # Off peak
> +        # Reset the counter (in case this is a long running script)
> +        _open.peak_requests = 0
> +
>      # Remove None values from the parameters
>      for key, value in params.items():
>          if value is None:
> @@ -368,3 +388,4 @@ E-utilities.""", UserWarning)
>      return uhandle
>
>  _open.previous = 0
> +_open.peak_requests = 0
>

From cy at cymon.org Fri Mar 26 07:38:56 2010
From: cy at cymon.org (Cymon Cox)
Date: Fri, 26 Mar 2010 11:38:56 +0000
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <3f6baf361003251327o20cdda2bkeac3c3a6a87b468a@mail.gmail.com>
References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com> <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com> <320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com> <3f6baf361003251327o20cdda2bkeac3c3a6a87b468a@mail.gmail.com>
Message-ID: <7265d4f1003260438x100cc73nf80cc6b5a992691c@mail.gmail.com>

Hi Eric and Peter,

On 25 March 2010 20:27, Eric Talevich wrote:

> On Wed, Mar 24, 2010 at 11:16 AM, Peter wrote:
>
> > On Mon, Mar 22, 2010 at 9:48 PM, Peter wrote:
> > >> In Bio.Nexus, would you normally have handled this with the method
> > >> root_with_outgroup? I intend to port that method to Bio.Phylo once I
> > >> understand it, but the existing code has been kind of hard for me to figure
> > >> out.
> > >
> > > I've just got a quick answer for you now tonight: I've not used Bio.Nexus
> > > to try and do this - I'll try to get back to you in more depth tomorrow.
> >
> > Here is an example using Bio.Nexus.Trees to reroot with an outgroup.
> >
> > [...]
> >
> > In my example, the outgroup originally has a branch length of 0.00145.
> > A new root node was created (here #12) with two children, one with a > > branch length of zero (#5, the outgroup) and one with the full length > > (#3, branch length 0.00145). Essentially this new root node (#12) and > > the outgroup (#5) are now both right at the base of the tree. > > > > There is more than one what to do this though. For example FigTree > > seems to introduce a new root node half way along the outgroup branch > > (replacing the edge with two edges of half its length). This way the > > new root node represents the last common ancestor of the outgroup and > > the ingroup (everything else), although putting it at the mid point is > > perhaps a little arbitrary. > Yes, what FigTree is doing is arbitrary, it introduces information into the displayed tree that is not present, and is open to misinterpretation. But it's doing so purely for the graphical presentation because you are trying to root on a terminal branch. Thankfully, if you save this tree in FigTree it writes the original trifurcating tree. > I looked up this section in *Inferring Phylogenies* and found no decisive > statement on how it should be done. I gathered: > > 1. The new root can be placed anywhere along the branch between the > outgroup > and its ancestor. > The root may in biological reality be anywhere along that branch but, in the absence of further information, the question is where do you place it in this situation ie, rooting (making a bifurcating root node) on that terminal branch. > 2. Another way to root a tree is by assuming a molecular clock -- place the > root so that the distances to all the tips are roughly equal. > > So FigTree and Bio.Nexus are both doing reasonable things. (PyCogent > doesn't > seem to support this operation, as far as I can tell.) 
> > Thinking of this operation as extending the tree further back in time, > where > the (monophyletic) tree without the outgroup is a sub-clade of the larger > rooted tree we're introducing -- it makes sense to me that the branch > length > of the outgroup should represent the total evolutionary distance from the > root of the monophyletic sub-clade to the outgroup. Yes, the outgroup taxa are included in analyses to orientate the relationships (including br lens) of the ingroup. In this case, with a single outgroup taxon you do not a very good estimate of the ingroup br len (its presumably not the immediate ancestor of the ingroup), but its all you've got given the way the experiment was set up - including more outgroups would have been a good idea. Based on that, I'm > tempted to do the opposite of Bio.Nexus, Curious, because given that I think Bio.Nexus is doing the right thing ;) By using this function you are rooting (making a dichotomous root node) using an outgroup (1 taxon in this case), and the biological interpretation is that the length belongs to the ingroup. letting the outgroup keep its > original branch length, and assigning a length of 0 to the branch leading > to > the remaining sub-clade. Then by default we get something resembling a > trifucating root, and the user can shift the actual location of the root > further back without too much difficulty. > I dont understand what you are getting at here... Other points: They way that FigTree displays the rooted tree from root_with_outgroup() is how I would expect the tree to be presented if you only had a single outgroup taxon. There is a case to be made for not making a dichotomous root, but making the nearest trifurcating node to the designated outgroup the root node - this is what PAUP does (it wont write at dichotomously rooted tree even if you tell it to root it). 
I think the whole problem stems from only having a single outgroup (which when you root to it ends up 'looking' like the immediate ancestor of the ingroup). Typically, you would include multiple ougroups and present/display the tree with a trifurcating root node, one of which lineages is the ingroup - unless you are using a non-reversible model you dont need dichotomously rooted trees. Cheers, C. -- From bugzilla-daemon at portal.open-bio.org Fri Mar 26 18:28:10 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 26 Mar 2010 18:28:10 -0400 Subject: [Biopython-dev] [Bug 3036] New: PhyloXML cannot read node colors created by PhyloXML Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3036 Summary: PhyloXML cannot read node colors created by PhyloXML Product: Biopython Version: Not Applicable Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: joelb at lanl.gov Using a simple example file provided: >>> tree = Phylo.read('bcl_2.xml','phyloxml') >>> tree.clade[0].color = Phylo.PhyloXML.BranchColor(255,0,255) >>> tree.clade[0].color BranchColor(blue='255', green='0', red='255') Phylo.write(tree,'colored.phyloxml','phyloxml') 1 >>> tree2=Phylo.read('colored.phyloxml','phyloxml') Traceback (innermost last): File "", line 1, in File "/usr/lib64/python2.6/site-packages/Bio/Phylo/_io.py", line 57, in read tree = tree_gen.next() File "/usr/lib64/python2.6/site-packages/Bio/Phylo/_io.py", line 42, in parse for tree in getattr(supported_formats[format], 'parse')(file): File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 317, in parse yield self._parse_phylogeny(elem) File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 342, in _parse_phylogeny phylogeny.root = self._parse_clade(elem) File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 388, in _parse_clade 
clade.clades.append(self._parse_clade(elem)) File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 410, in _parse_clade setattr(clade, tag, getattr(self, tag)(elem)) File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 518, in color return PX.BranchColor(red, green, blue) File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXML.py", line 432, in __init__ ), "Color values must be integers between 0 and 255." AssertionError: Color values must be integers between 0 and 255. This is not a problem with an example file not written by biopython: >>> tree = Phylo.parse('made_up.xml','phyloxml').next() >>> tree.clade[0].color BranchColor(blue='28', green='220', red='128') Also, forester/archaeoptryx is able to correctly read colors written by biopython. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Mar 26 18:30:36 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 26 Mar 2010 18:30:36 -0400 Subject: [Biopython-dev] [Bug 3037] New: PhyloXMLIO creates extremely ugly xml Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3037 Summary: PhyloXMLIO creates extremely ugly xml Product: Biopython Version: Not Applicable Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P3 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: joelb at lanl.gov This is a request for an enhancement. The xml code PhyloXMLIO creates has no linefeeds. It would be very helpful for debugging if the XML is prettyprinted. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
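As an illustration of the enhancement requested in Bug 3037, here is a minimal sketch of an in-place indentation helper for ElementTree trees. The name indent_xml and the sample element names are hypothetical and are not part of PhyloXMLIO; the sketch assumes only the standard library xml.etree.ElementTree module and is one possible approach, not the project's implementation.

```python
import xml.etree.ElementTree as ET


def indent_xml(elem, level=0, step="  "):
    """Add newlines and indentation to an element tree, in place."""
    pad = "\n" + level * step
    children = list(elem)
    if not children:
        return
    # Whitespace between the opening tag and the first child.
    if not (elem.text or "").strip():
        elem.text = pad + step
    for child in children:
        indent_xml(child, level + 1, step)
        # Whitespace after each child lines up the next sibling.
        if not (child.tail or "").strip():
            child.tail = pad + step
    # The last child is followed by the parent's closing tag,
    # which sits one level further out.
    last = children[-1]
    if not (last.tail or "").strip():
        last.tail = pad


# Hypothetical three-element tree, just to exercise the helper.
root = ET.Element("phylogeny")
clade = ET.SubElement(root, "clade")
name = ET.SubElement(clade, "name")
name.text = "A"

indent_xml(root)
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

The helper only touches text and tail values that are empty or whitespace, so real text content is left alone. On Python 3.9 and later the standard library offers xml.etree.ElementTree.indent, which serves the same purpose.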
From biopython at maubp.freeserve.co.uk Sat Mar 27 08:45:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 27 Mar 2010 12:45:28 +0000 Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo In-Reply-To: <7265d4f1003260438x100cc73nf80cc6b5a992691c@mail.gmail.com> References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com> <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com> <320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com> <3f6baf361003251327o20cdda2bkeac3c3a6a87b468a@mail.gmail.com> <7265d4f1003260438x100cc73nf80cc6b5a992691c@mail.gmail.com> Message-ID: <320fb6e01003270545l7e43c3bbu7a1174397a45ce99@mail.gmail.com> On Fri, Mar 26, 2010 at 11:38 AM, Cymon Cox wrote: > > I think the whole problem stems from only having a single outgroup (which > when you root to it ends up 'looking' like the immediate ancestor of the > ingroup). Typically, you would include multiple ougroups and present/display > the tree with a trifurcating root node, one of which lineages is the ingroup > - unless you are using a non-reversible model you dont need dichotomously > rooted trees. > And I thought it would be simpler from this example to use a single out group ;) Thanks for the comments both of you. Peter From bugzilla-daemon at portal.open-bio.org Sat Mar 27 22:58:59 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 27 Mar 2010 22:58:59 -0400 Subject: [Biopython-dev] [Bug 3037] PhyloXMLIO creates extremely ugly xml In-Reply-To: Message-ID: <201003280258.o2S2wxut000915@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3037 ------- Comment #1 from eric.talevich at gmail.com 2010-03-27 22:58 EST ------- (In reply to comment #0) > This is a request for an enhancement. > > The xml code PhyloXMLIO creates has no linefeeds. It would be very helpful for > debugging if the XML is prettyprinted. 
> This is a shortcoming of the ElementTree module in the Python standard library -- the writer doesn't have an option for setting whitespace. But I agree it would be nice to have this feature, so I'll leave the bug open as a reminder to look for other ways to do this. In the meantime I recommend using some external tool to reformat the XML if you want to look at the raw data. XML Starlet can do this: http://xmlstar.sourceforge.net/ -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Mar 28 12:35:34 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 28 Mar 2010 12:35:34 -0400 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <201003281635.o2SGZYuv009361@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 k.okonechnikov at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|biopython-dev at biopython.org |k.okonechnikov at gmail.com Status|NEW |ASSIGNED ------- Comment #6 from k.okonechnikov at gmail.com 2010-03-28 12:35 EST ------- Created an attachment (id=1468) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1468&action=view) This patch solves this issue and also Bug 2951 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sun Mar 28 14:03:21 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 28 Mar 2010 14:03:21 -0400 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <201003281803.o2SI3LhC011261@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |biopython-dev at biopython.org -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at portal.open-bio.org Sun Mar 28 14:18:24 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 28 Mar 2010 14:18:24 -0400 Subject: [Biopython-dev] [Bug 3037] PhyloXMLIO creates extremely ugly xml In-Reply-To: Message-ID: <201003281818.o2SIIOLM011602@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3037 ------- Comment #2 from chapmanb at 50mail.com 2010-03-28 14:18 EST ------- Eric, check out Fredrik Lundh's indent function for ElementTree. I'm not sure this ever made it into the source, but it's small enough to copy/paste: http://effbot.org/zone/element-lib.htm#prettyprint -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Mar 29 06:58:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Mar 2010 11:58:28 +0100 Subject: [Biopython-dev] NCBI E-utility 100 requests rule in Bio.Entrez? 
In-Reply-To: <996381.46391.qm@web62407.mail.re1.yahoo.com> References: <320fb6e01003250925h76fc91d3r541092eb540af112@mail.gmail.com> <996381.46391.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e01003290358y1e30fc6eme2028a126a36cdb@mail.gmail.com> On Fri, Mar 26, 2010 at 2:37 AM, Michiel de Hoon wrote: > I have no objections, but basically I think that this can be left to the responsibility of the end user. > > --Michiel. OK, unless the NCBI decide to clarify what exactly they mean, then let's just leave this as is it (the responsibility of the end user). Peter From biopython at maubp.freeserve.co.uk Mon Mar 29 07:05:25 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Mar 2010 12:05:25 +0100 Subject: [Biopython-dev] Setting the NCBI Entrez tool parameter globally Message-ID: <320fb6e01003290405s2c76f875ucf39c077277f1916@mail.gmail.com> Hi Michiel et al, The NCBI looks to be encouraging more use of the email and tool parameters in their revised guidelines. To make this easy to use we have a global setting for the email - I think we should do the same for the tool (for when users are building their own application or script on top of Biopython). Something like this patch? What do you think? Peter ------------------------ diff --git a/Bio/Entrez/__init__.py b/Bio/Entrez/__init__.py index 33d8d14..f64015c 100644 --- a/Bio/Entrez/__init__.py +++ b/Bio/Entrez/__init__.py @@ -12,6 +12,9 @@ http://www.ncbi.nlm.nih.gov/Entrez/ A list of the Entrez utilities is available at: http://www.ncbi.nlm.nih.gov/entrez/utils/utils_index.html +Variables: +email Set the Entrez email parameter globally (default is not set). +tool Set the Entrez tool parameter globally (defaults to biopython). Functions: efetch Retrieves records in the requested format from a list of one or @@ -50,7 +53,7 @@ from Bio import File email = None - +tool = "biopython" # XXX retmode? 
def epost(db, **keywds): @@ -275,6 +278,7 @@ def _open(cgi, params={}, post=False): This function also enforces the "up to three queries per second rule" to avoid abusing the NCBI servers. """ + global tool, email # NCBI requirement: At most three queries per second. # Equivalently, at least a third of second between queries delay = 0.333333334 @@ -291,7 +295,7 @@ def _open(cgi, params={}, post=False): del params[key] # Tell Entrez that we are using Biopython if not "tool" in params: - params["tool"] = "biopython" + params["tool"] = tool # Tell Entrez who we are if not "email" in params: if email!=None: From biopython at maubp.freeserve.co.uk Mon Mar 29 08:36:19 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Mar 2010 13:36:19 +0100 Subject: [Biopython-dev] Deprecating PropertyManager, Encodings and Bio.utils? Message-ID: <320fb6e01003290536p4c61e1d1u8bc6e2ad83f1c9e5@mail.gmail.com> Hi all, I think we've done pretty well at carefully removing, fixing or replacing most of the dusty bits of code Biopython had acquired over the years. There are still things to clean up though... in particular modules Bio.PropertyManager and Bio.Encodings seem rather unnecessary. Bio.Encodings is tied into the old (and now deprecated) Bio.Translate and Bio.Transcribe code. Once they are removed (after the next release) we can at least cut a lot of Bio.Encodings. Bio.PropertyManager and Bio.Encodings only seem to be used by Bio.utils, which I would also like to deprecate. This is an undocumented module with no unit tests. It offers a few bits of sequence related functionality which would be better off in Bio.Seq or Bio.SeqUtils, and some fairly trivial functions we could just deprecate. These strike me as the only bits of functionality worth keeping in Bio.utils: Function verify_alphabet (which is being used by the code in Bio.NeuralNetwork.Gene) just checks a Seq object's sequence obeys the alphabet letters. 
This essentially is something I think the Seq object should do itself, during initialisation (Bug 2597). With that done, then Bio.utils.verify_alphabet could be deprecated. There are a few functions for getting molecular weights via the IUPAC alphabet objects. These could be reimplemented by using weight tables belonging to the IUPAC alphabet classes explicitly, perhaps exposed as new functions under Bio.SeqUtils. It would be interesting to look at refinements like handling the start/end of the sequence explicitly (i.e. the 5' and 3' ends of a nucleotide sequence, or the N and C terminals of a peptide). Function reduce_sequence (linked to Bio.Alphabet.Reduced) is for things like mapping a protein sequence to a simplified sequence using the Murphy alphabet (e.g. using a single letter for all the aliphatics: I,L,V). This is perhaps interesting enough to retain - again perhaps under Bio.SeqUtils. It does need documentation and unit tests though. Is anyone interested in updating, documenting and then testing the molecular weight and reduced alphabet code? [I suggest starting a new thread if you are.] If not, should we consider just deprecating Bio.utils, Bio.PropertyManager and Bio.Encodings in the next release? Peter From mjldehoon at yahoo.com Mon Mar 29 09:54:01 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 29 Mar 2010 06:54:01 -0700 (PDT) Subject: [Biopython-dev] Setting the NCBI Entrez tool parameter globally In-Reply-To: <320fb6e01003290405s2c76f875ucf39c077277f1916@mail.gmail.com> Message-ID: <614002.50617.qm@web62406.mail.re1.yahoo.com> Basically I think that this patch is OK, but why do tool and email need to be global inside the _open function? 
--Michiel

--- On Mon, 3/29/10, Peter wrote:

> From: Peter
> Subject: Setting the NCBI Entrez tool parameter globally
> To: "Michiel de Hoon" , "Biopython-Dev Mailing List"
> Date: Monday, March 29, 2010, 7:05 AM
> Hi Michiel et al,
>
> The NCBI looks to be encouraging more use of the email and tool
> parameters in their revised guidelines. To make this easy to use
> we have a global setting for the email - I think we should do the
> same for the tool (for when users are building their own application
> or script on top of Biopython). Something like this patch?
>
> What do you think?
>
> Peter
>
> ------------------------
>
> diff --git a/Bio/Entrez/__init__.py b/Bio/Entrez/__init__.py
> index 33d8d14..f64015c 100644
> --- a/Bio/Entrez/__init__.py
> +++ b/Bio/Entrez/__init__.py
> @@ -12,6 +12,9 @@ http://www.ncbi.nlm.nih.gov/Entrez/
>  A list of the Entrez utilities is available at:
>  http://www.ncbi.nlm.nih.gov/entrez/utils/utils_index.html
>
> +Variables:
> +email        Set the Entrez email parameter globally (default is not set).
> +tool         Set the Entrez tool parameter globally (defaults to biopython).
>
>  Functions:
>  efetch       Retrieves records in the requested format from a list of one or
> @@ -50,7 +53,7 @@ from Bio import File
>
>
>  email = None
> -
> +tool = "biopython"
>
>  # XXX retmode?
>  def epost(db, **keywds):
> @@ -275,6 +278,7 @@ def _open(cgi, params={}, post=False):
>      This function also enforces the "up to three queries per second rule"
>      to avoid abusing the NCBI servers.
>      """
> +    global tool, email
>      # NCBI requirement: At most three queries per second.
>      # Equivalently, at least a third of second between queries
>      delay = 0.333333334
> @@ -291,7 +295,7 @@ def _open(cgi, params={}, post=False):
>              del params[key]
>      # Tell Entrez that we are using Biopython
>      if not "tool" in params:
> -        params["tool"] = "biopython"
> +        params["tool"] = tool
>      # Tell Entrez who we are
>      if not "email" in params:
>          if email!=None:
>

From chapmanb at 50mail.com Mon Mar 29 09:50:23 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 29 Mar 2010 09:50:23 -0400
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
Message-ID: <20100329135023.GF42657@sobchak.mgh.harvard.edu>

Eric, Peter and Cymon;

> I've got a real example of a simple tree manipulation that I would
> like to handle via your new module. I have a (small) unrooted tree from a
> gene family in Newick format, which by construction includes an
> out-group (the same gene but from a more distant organism). I would like to
> reroot the tree so that this out-group is at the basal level.

Really enjoying the discussion on this. It's a bit outside my area
of expertise but I stumbled across DendroPy this weekend:

http://packages.python.org/DendroPy/index.html

which has a reroot_at function that might be worth looking into:

http://github.com/jeetsukumaran/DendroPy/blob/master/dendropy/dataobject/tree.py

Hope this helps,
Brad

From biopython at maubp.freeserve.co.uk Mon Mar 29 11:22:11 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 29 Mar 2010 16:22:11 +0100
Subject: [Biopython-dev] Setting the NCBI Entrez tool parameter globally
In-Reply-To: <614002.50617.qm@web62406.mail.re1.yahoo.com>
References: <320fb6e01003290405s2c76f875ucf39c077277f1916@mail.gmail.com> <614002.50617.qm@web62406.mail.re1.yahoo.com>
Message-ID: <320fb6e01003290822u62190365j9c147df71a9de46e@mail.gmail.com>

On Mon, Mar 29, 2010 at 2:54 PM, Michiel de Hoon wrote:
> Basically I think that this patch is OK, but why do tool and email
> need to be global inside the _open function?

I just thought it was clearer than implicit scope rules, I'll omit
that line and commit the rest.
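On the scoping question raised here: a Python function can read module-level names without a global statement, and the statement only matters when the function rebinds them. A minimal sketch follows, using hypothetical helper names (build_params, set_tool) rather than the actual Bio.Entrez code.

```python
# Module-level settings, analogous to the email and tool variables in the patch.
email = None
tool = "biopython"


def build_params(params=None):
    """Return a copy of params with default tool/email values filled in."""
    params = dict(params or {})
    # Reading module-level names works without a ``global`` statement.
    if "tool" not in params:
        params["tool"] = tool
    if "email" not in params and email is not None:
        params["email"] = email
    return params


def set_tool(name):
    """Rebind the module-level setting; this is where ``global`` is required."""
    global tool
    tool = name


defaults = build_params()
explicit = build_params({"tool": "myscript"})
set_tool("mytool")
after_change = build_params()
```

Because build_params looks the names up at call time, a later change to the module-level setting is picked up by subsequent calls, which matches how the global email setting is described in this thread.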
Peter From biopython at maubp.freeserve.co.uk Mon Mar 29 11:35:01 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Mar 2010 16:35:01 +0100 Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo In-Reply-To: <20100329135023.GF42657@sobchak.mgh.harvard.edu> References: <20100329135023.GF42657@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01003290835n25dd2a35kcc8c40587dd10c05@mail.gmail.com> On Mon, Mar 29, 2010 at 2:50 PM, Brad Chapman wrote: > Eric, Peter and Cymon; > >> I've got a real example of a simple tree manipulation that I would >> like to handle via your new module. I have a (small) unrooted tree from a >> gene family in Newick format, which by construction includes an >> out-group (the same gene but from a more distant organism). I would like to >> reroot the tree so that this out-group is at the basal level. > > Really enjoying the discussion on this. It's a bit outside my area > of expertise but I stumbled across DendroPy this weekend: > > http://packages.python.org/DendroPy/index.html > > which has a reroot_at function that might be worth looking into: > > http://github.com/jeetsukumaran/DendroPy/blob/master/dendropy/dataobject/tree.py > > Hope this helps, > Brad Hey Brad, I also spotted DendroPy recently (via a blog post or something), but hadn't yet looked to see how they handled this. It looks like their reroot_at function takes an *internal* node as the argument to specify the new root. This neatly avoids the problem about having to introduce a new node when rerooting with a given terminal node (taxon) as the out group. 
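For the terminal-outgroup case discussed earlier in this thread, the conventions mentioned differ only in where the new root sits along the outgroup's branch. The sketch below expresses that as a single fraction; the function name split_outgroup_branch and its parameter are hypothetical and do not correspond to any Bio.Nexus, Bio.Phylo or DendroPy API.

```python
def split_outgroup_branch(length, fraction_to_outgroup):
    """Split the outgroup's branch at a new root node.

    Returns (outgroup_length, ingroup_length), where fraction_to_outgroup
    is the share of the original branch kept on the outgroup side.
    """
    if not 0.0 <= fraction_to_outgroup <= 1.0:
        raise ValueError("fraction_to_outgroup must be between 0 and 1")
    outgroup_length = length * fraction_to_outgroup
    return outgroup_length, length - outgroup_length


original = 0.00145  # outgroup branch length from the example in this thread

# Behaviour described for Bio.Nexus: outgroup gets 0, ingroup gets everything.
nexus_style = split_outgroup_branch(original, 0.0)
# Behaviour described for FigTree's display: root at the midpoint.
midpoint_style = split_outgroup_branch(original, 0.5)
# The alternative Eric floated: outgroup keeps the full length.
outgroup_keeps_all = split_outgroup_branch(original, 1.0)
```

In all three cases the two lengths sum to the original branch length, so the choice changes only the placement of the root, not the total distance between the outgroup and the ingroup.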
Peter From bugzilla-daemon at portal.open-bio.org Mon Mar 29 12:07:12 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Mar 2010 12:07:12 -0400 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <201003291607.o2TG7CDf013375@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |biopython- | |bugzilla at maubp.freeserve.co. | |uk ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-29 12:07 EST ------- (In reply to comment #6) > Created an attachment (id=1468) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1468&action=view) [details] > This patch solves this issue and also Bug 2951 > Just by eye there is something wrong with your indentation in that patch. Maybe you have mixed tabs and spaces? Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at portal.open-bio.org Mon Mar 29 13:28:18 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Mar 2010 13:28:18 -0400 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <201003291728.o2THSIAf015768@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #8 from k.okonechnikov at gmail.com 2010-03-29 13:28 EST ------- Created an attachment (id=1469) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1469&action=view) Improved version of the patch Added default value for serial_num in Model constructor, fixed indentation issues. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at portal.open-bio.org Mon Mar 29 13:39:14 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Mar 2010 13:39:14 -0400 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <201003291739.o2THdEoU016014@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #9 from k.okonechnikov at gmail.com 2010-03-29 13:39 EST ------- Created an attachment (id=1470) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1470&action=view) Simple test script It downloads NMR structure, checks model serial numbers and writes structure to file. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From biopython at maubp.freeserve.co.uk Mon Mar 29 17:41:14 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Mar 2010 22:41:14 +0100 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? 
In-Reply-To: <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com> Message-ID: <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com> On Fri, Mar 19, 2010 at 11:08 PM, Sebastian Bassi wrote: > On Fri, Mar 19, 2010 at 7:45 AM, Peter wrote: >> Give an inch and they'll take a mile ;) > > In Spanish we say: Give a hand and they'll take the whole arm :) I think I like that version more :) >> that if they don't specify the format that Biopython will >> determine it automatically - which it won't. > > In this respect, Python zen favours being explicit,so I see your point. > >> Also, could you clarify if you are in favour of relaxing the >> requirement that the write function takes a list/iterator of >> records/alignments to allow a single SeqRecord or alignment? > > Is OK for me to allow a single record instead of a iterable, this > change will not break any existing code so it is OK for me. That sounds like you don't object, but are not strongly in favour either. No-one else has commented (other than Eric and Marshall who were in favour). Maybe it would be prudent to leave it? [Will this suggestion provoke any further comments I wonder?] Peter From eric.talevich at gmail.com Mon Mar 29 23:05:51 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 29 Mar 2010 23:05:51 -0400 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? 
In-Reply-To: <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com> <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com> Message-ID: <3f6baf361003292005y5075df0ch1fa929304f3c501a@mail.gmail.com> On Mon, Mar 29, 2010 at 5:41 PM, Peter wrote: > On Fri, Mar 19, 2010 at 11:08 PM, Sebastian Bassi wrote: > > On Fri, Mar 19, 2010 at 7:45 AM, Peter > wrote: > >> Also, could you clarify if you are in favour of relaxing the > >> requirement that the write function takes a list/iterator of > >> records/alignments to allow a single SeqRecord or alignment? > > > > Is OK for me to allow a single record instead of a iterable, this > > change will not break any existing code so it is OK for me. > > That sounds like you don't object, but are not strongly in > favour either. > > No-one else has commented (other than Eric and Marshall > who were in favour). > > Maybe it would be prudent to leave it? [Will this suggestion > provoke any further comments I wonder?] I know I've already voted, but here's another thought: if we're going to make this change eventually, it would be nice if the very first release of Bio.Phylo had the right behavior and retained the same behavior through later releases. Otherwise we'd have one or more isolated releases where Phylo.write doesn't handle single trees directly, and when documentation is updated to track later releases that do handle single trees, that could cause some confusion for some folks still using Biopython 1.54. 
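The relaxation being discussed amounts to a small normalisation step at the top of a write function. A minimal sketch, assuming a hypothetical Record class and helper names (as_record_list, write_records); this is not the Bio.SeqIO, Bio.AlignIO or Bio.Phylo implementation.

```python
class Record(object):
    """Hypothetical stand-in for a SeqRecord, alignment or tree object."""

    def __init__(self, identifier):
        self.id = identifier


def as_record_list(records, record_type):
    """Return a list of records, wrapping a lone record in a list."""
    if isinstance(records, record_type):
        return [records]
    # Lists, tuples and iterators/generators are all handled here.
    return list(records)


def write_records(records, record_type=Record):
    """Stand-in for a write function: returns how many records it was given."""
    return len(as_record_list(records, record_type))


single_count = write_records(Record("a"))
list_count = write_records([Record("a"), Record("b")])
generator_count = write_records(Record(name) for name in "abc")
```

Existing callers that pass a list or iterator see no change, which is the backwards-compatibility point made above; only the single-object case gains new behaviour.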
-Eric > From bugzilla-daemon at portal.open-bio.org Tue Mar 30 01:17:44 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 30 Mar 2010 01:17:44 -0400 Subject: [Biopython-dev] [Bug 3036] PhyloXML cannot read node colors created by PhyloXML In-Reply-To: Message-ID: <201003300517.o2U5HiVN001772@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3036 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from eric.talevich at gmail.com 2010-03-30 01:17 EST ------- (In reply to comment #0) > Using a simple example file provided: > ... > This is not a problem with an example file not written by biopython: > >>> tree = Phylo.parse('made_up.xml','phyloxml').next() > >>> tree.clade[0].color > BranchColor(blue='28', green='220', red='128') Thanks for catching this! I pushed a fix to GitHub: http://github.com/biopython/biopython/commit/6e2eac9612f600507491c3bb45fc19ffdc987169 The problem was occurring for color values of 0 -- PhyloXMLIO was using an inline and-or test instead of if-else (Py2.4 compatibility hack) to check and convert the node text to an integer. Since 0 evaluates as boolean False, the expression was returning None instead of integer 0, causing the BranchColor constructor to fail. > Also, forester/Archaeopteryx is able to correctly read colors written by > biopython. Good to know. Thanks again for testing this. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
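The and-or pitfall described in the Bug 3036 comment above can be sketched roughly as follows. This is only an illustration of the general idiom, not the actual PhyloXMLIO code: the helper names to_int_and_or and to_int_if_else are hypothetical, and the real parser may structure the conversion differently.

```python
def to_int_and_or(text):
    """Hypothetical helper using the pre-Python-2.5 'and-or' idiom."""
    # Intended meaning: int(text) if text else None.
    # Pitfall: when int(text) is 0 the 'and' part is falsy, so the
    # whole expression falls through to the 'or' branch and yields None.
    return text and int(text) or None


def to_int_if_else(text):
    """Same conversion with an explicit branch, so 0 is preserved."""
    if text:
        return int(text)
    return None


# Compare both helpers on a non-zero value, a zero value and empty text.
for value in ("128", "0", ""):
    print("%r: and-or -> %r, if-else -> %r"
          % (value, to_int_and_or(value), to_int_if_else(value)))
```

With these two helpers, the text "0" comes back as None from the and-or version but as the integer 0 from the if-else version, which matches the symptom reported for colour channels equal to zero.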
From bugzilla-daemon at portal.open-bio.org Tue Mar 30 01:24:09 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 30 Mar 2010 01:24:09 -0400 Subject: [Biopython-dev] [Bug 3037] PhyloXMLIO creates extremely ugly xml In-Reply-To: Message-ID: <201003300524.o2U5O9Hq001917@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3037 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from eric.talevich at gmail.com 2010-03-30 01:24 EST ------- (In reply to comment #2) > Eric, check out Fredrik Lundh's indent function for ElementTree. I'm not sure > this ever made it into the source, but it's small enough to copy/paste: > > http://effbot.org/zone/element-lib.htm#prettyprint > Thanks! I did just that: http://github.com/biopython/biopython/commit/3d892a39015c5659c91ff819ceeea043585f3607 The write() function in Phylo, PhyloXMLIO and Writer class now take an 'indent' argument, which defaults to False for the sake of I/O performance and file size. Side note: apparently the ElementTree module has just emerged from a 4-year hibernation, so new features (like this one) and bug fixes will begin appearing in the Python stdlib as of version 3.2. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Mar 30 05:46:04 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 30 Mar 2010 10:46:04 +0100 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? 
In-Reply-To: <3f6baf361003292005y5075df0ch1fa929304f3c501a@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com> <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com> <3f6baf361003292005y5075df0ch1fa929304f3c501a@mail.gmail.com> Message-ID: <320fb6e01003300246s61b75b9v123141ac23240b18@mail.gmail.com> On Tue, Mar 30, 2010 at 4:05 AM, Eric Talevich wrote: > > I know I've already voted, but here's another thought: if we're going to > make this change eventually, it would be nice if the very first release of > Bio.Phylo had the right behavior and retained the same behavior through > later releases. Otherwise we'd have one or more isolated releases where > Phylo.write doesn't handle single trees directly, and when documentation is > updated to track later releases that do handle single trees, that could > cause some confusion for some folks still using Biopython 1.54. > True. Another plus for doing it now is we're relaxing the filename/handle thing, so it makes sense to make this change now (get these changes all done in one go to reduce end user confusion). Peter From chapmanb at 50mail.com Tue Mar 30 08:16:00 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 30 Mar 2010 08:16:00 -0400 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? 
In-Reply-To: <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com> <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com> Message-ID: <20100330121600.GF35248@sobchak.mgh.harvard.edu> Peter; > >> Also, could you clarify if you are in favour of relaxing the > >> requirement that the write function takes a list/iterator of > >> records/alignments to allow a single SeqRecord or alignment? [...] > No-one else has commented (other than Eric and Marshall > who were in favour). > > Maybe it would be prudent to leave it? [Will this suggestion > provoke any further comments I wonder?] +1 from me for making it more flexible. I don't see a lot of downside: it helps avoid a common source of initial confusion and is fully back compatible. Brad From biopython at maubp.freeserve.co.uk Tue Mar 30 08:42:46 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 30 Mar 2010 13:42:46 +0100 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? 
In-Reply-To: <20100330121600.GF35248@sobchak.mgh.harvard.edu> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com> <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com> <20100330121600.GF35248@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01003300542m17c02e3fic15ccdfc60dcd33c@mail.gmail.com> On Tue, Mar 30, 2010 at 1:16 PM, Brad Chapman wrote: > Peter; > >> >> Also, could you clarify if you are in favour of relaxing the >> >> requirement that the write function takes a list/iterator of >> >> records/alignments to allow a single SeqRecord or alignment? > [...] >> No-one else has commented (other than Eric and Marshall >> who were in favour). >> >> Maybe it would be prudent to leave it? [Will this suggestion >> provoke any further comments I wonder?] > > +1 from me for making it more flexible. I don't see a lot of downside: > it helps avoid a common source of initial confusion and is fully back > compatible. OK, checked in. We'll need to at least add an FAQ entry to the tutorial on this, and the SeqIO/AlignIO chapters may need clarification too. Eric - could you look at Bio.Phylo.write() this week?
Thanks, Peter From aaronquinlan at gmail.com Tue Mar 30 13:03:56 2010 From: aaronquinlan at gmail.com (Aaron Quinlan) Date: Tue, 30 Mar 2010 13:03:56 -0400 Subject: [Biopython-dev] Alignment object In-Reply-To: <2e1434c11003040607i68904329rc122e3acad9cdbe3@mail.gmail.com> References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> <20100304131352.GB19053@sobchak.mgh.harvard.edu> <2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com> <2e1434c11003040607i68904329rc122e3acad9cdbe3@mail.gmail.com> Message-ID: Hi Kevin, Per the author, BamTools supposedly now supports all endian flavors. Aaron On Mar 4, 2010, at 9:07 AM, Kevin Jacobs wrote: > On Thu, Mar 4, 2010 at 8:52 AM, Kevin Jacobs wrote: > On Thu, Mar 4, 2010 at 8:33 AM, Aaron Quinlan wrote: > Just an FYI for those interested in developing tools to work with BAM: it may also be worth looking into the BamTools C++ API developed by Derek Barnett at Boston College (http://sourceforge.net/projects/bamtools/). The API is quite nice and has much of the necessary functionality for iterators, getters/setters, etc. > > I added BAM support for my BEDTools package (http://code.google.com/p/bedtools/) using the BAMTools libraries. Save for a few minor bugs along the way, it was rather straightforward to include. > > Thanks for the tip, Aaron. I was unaware of both bamtools and bedtools. The bamtools code looks well designed and quite similar to my emerging Cython/Python rendition. > > > Ouch-- never mind. The bamtools code isn't endian-clean -- it will only work correctly on native little-endian architectures. 
> > -Kevin > From bioinformed at gmail.com Tue Mar 30 20:57:50 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 30 Mar 2010 20:57:50 -0400 Subject: [Biopython-dev] Alignment object In-Reply-To: References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> <20100304131352.GB19053@sobchak.mgh.harvard.edu> <2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com> <2e1434c11003040607i68904329rc122e3acad9cdbe3@mail.gmail.com> Message-ID: On Tue, Mar 30, 2010 at 1:03 PM, Aaron Quinlan wrote: > Hi Kevin, > Per the author, BamTools supposedly now supports all endian flavors. > Aaron > > Thanks, Aaron. I've been in touch with Derek and am testing his new version on Power PC. I was very gratified to see him respond and address several of the minor issues I raised. -Kevin From updates at feedmyinbox.com Wed Mar 31 02:14:22 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Wed, 31 Mar 2010 02:14:22 -0400 Subject: [Biopython-dev] 3/31 BioStar - Biopython Questions Message-ID: ================================================== 1. Is there a non-perl alternative to accessing Ensembl's API? ================================================== March 30, 2010 at 11:06 AM I'm looking for a programmatic way to access Ensembl or UCSC's genome browser using something like python or ruby. Perl is just not my thing, sadly. PyCogent seems to have something that I have not yet tested properly, but just wanted to ping the community before going ahead and coding something that might already exist. Any ideas? 
http://biostar.stackexchange.com/questions/536/is-there-a-non-perl-alternative-to-accessing-ensembls-api -------------------------------------------------- =========================================================== Source: http://biostar.stackexchange.com/questions/tagged/biopython This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com. 230 Franklin Road Suite 814 Franklin, TN 37064 From crosvera at gmail.com Wed Mar 31 18:39:06 2010 From: crosvera at gmail.com (Carlos Ríos Vera) Date: Wed, 31 Mar 2010 18:39:06 -0400 Subject: [Biopython-dev] PDB-Tidy proposal Message-ID: Dear Biopythoners, I'm Carlos Ríos, a student from Chile. As some of you may know, I'm very interested in applying to the Google Summer of Code with the PDB-Tidy idea. So, I wrote a draft that is supposed to be my proposal. I'm open to receiving any comments, feedback, disagreement... here is the link to the draft: http://github.com/crosvera/pdbtidy_proposal/blob/master/proposal Regards. Ps: sorry if my English is not so good. -- http://crosvera.blogspot.com Carlos Ríos V. Estudiante de Ing. (E) en Computación e Informática. Universidad del Bío-Bío VIII Región, Chile Linux user number 425502 From bugzilla-daemon at portal.open-bio.org Mon Mar 1 18:14:45 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 1 Mar 2010 13:14:45 -0500 Subject: [Biopython-dev] [Bug 2551] Adding advanced __getitem__ to generic alignment, e.g.
align[1:2, 5:-5] In-Reply-To: Message-ID: <201003011814.o21IEjcK024496@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2551 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-01 13:14 EST ------- I've started a possible implementation of an improved multiple sequence alignment object on a github branch: http://github.com/peterjc/biopython/commits/alignment-obj This now covers: Bug 2551 - Adding advanced __getitem__ e.g. align[1:2,5:-5] Bug 2552 - Adding alignments Bug 2553 - Adding SeqRecord objects to an alignment (append or extend) Bug 2554 - Creating an Alignment from a list of SeqRecord objects -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bioinformed at gmail.com Mon Mar 1 23:22:42 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Mon, 1 Mar 2010 18:22:42 -0500 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> Message-ID: <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> On Thu, Feb 11, 2010 at 12:29 AM, Peter wrote: > On Mon, Jan 11, 2010 at 5:11 PM, Peter > wrote: > > I didn't want to rush the SFF support into Biopython 1.53, but its been > > waiting "ready" for a while now. Any objections or comments about > > me merging this now? > > There were no objections, and I ran this by Brad and Michiel and > have just merged this into the master branch. Time for some more > testing! > > I've tried out the recently landed SFF SeqIO code and am pleased to report that it works very well. 
I am parsing gsMapper 454PairAlign.txt output and converting it to SAM/BAM format to view in IGV (among other things) and wanted to include per-based quality score information from the SFF files. The only glitch so far is that the indexed access mode yields sequences with no alphabet assigned. The solution is to add the following to the beginning of SffDict.__init__: if alphabet is None: alphabet = Alphabet.generic_dna My only other comment is that several file reads and struct.unpacks can be merged in _sff_read_seq_record. Given the number of records in most 454 SFF files, I suspect the micro-optimization effort will be worth the slight cost in code clarity. Thanks to Peter and Jose for all of their hard work! Best regards, -Kevin From biopython at maubp.freeserve.co.uk Tue Mar 2 10:08:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Mar 2010 10:08:27 +0000 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> Message-ID: <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs wrote: > On Thu, Feb 11, 2010 at 12:29 AM, Peter > wrote: >> >> On Mon, Jan 11, 2010 at 5:11 PM, Peter >> wrote: >> > I didn't want to rush the SFF support into Biopython 1.53, but its been >> > waiting "ready" for a while now. Any objections or comments about >> > me merging this now? >> >> There were no objections, and I ran this by Brad and Michiel and >> have just merged this into the master branch. Time for some more >> testing! >> > > I've tried out the recently landed SFF SeqIO code and am pleased to > report that it works very well. 
Great :) If you have suggestions for the documentation please voice them. Also did the handling of trimmed reads seem sensible? Until we release this we can tweak the API. > I am parsing gsMapper 454PairAlign.txt output and > converting it to SAM/BAM format to view in IGV (among other things) and > wanted to include per-based quality score information from the SFF files. Are you reading and writing SAM/BAM format with Python? Looking into this is on my (long) todo list. > The only glitch so far is that the indexed access mode yields sequences > with no alphabet assigned. The solution is to add the following to the > beginning of SffDict.__init__: > if alphabet is None: > alphabet = Alphabet.generic_dna Thanks - I'll look at that. > My only other comment is that several file reads and struct.unpacks can be > merged in _sff_read_seq_record. Given the number of records in most 454 SFF > files, I suspect the micro-optimization effort will be worth the slight cost > in code clarity. I did try and spend some effort on the run time, but it wouldn't surprise me that there was still room for improvement. I found that since most of my SFF files were only up to 2GB with under a million reads, that this wasn't such an issue (compared to FASTQ files with Solexa data). I guess you mean the flowgram values, flowgram index, bases and qualities might be loaded with a single read? That would be worth trying. > Thanks to Peter and Jose for all of their hard work! > Best regards, > -Kevin And thanks for the feedback :) Peter From biopython at maubp.freeserve.co.uk Tue Mar 2 12:02:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Mar 2010 12:02:53 +0000 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> Message-ID: <320fb6e01003020402v6b2fab6j88f4c6fc90da15a9@mail.gmail.com> On Tue, Mar 2, 2010 at 10:08 AM, Peter wrote: > On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs wrote: >> The only glitch so far is that the indexed access mode yields sequences >> with no alphabet assigned. The solution is to add the following to the >> beginning of SffDict.__init__: >> if alphabet is None: >> alphabet = Alphabet.generic_dna > > Thanks - I'll look at that. Yes, that looks sensible - change committed. Would you like to be credited in our NEWS and CONTRIB file for this little bug fix? Peter From biopython at maubp.freeserve.co.uk Tue Mar 2 12:25:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Mar 2010 12:25:05 +0000 Subject: [Biopython-dev] Alignment object In-Reply-To: <20091028121833.GC22395@sobchak.mgh.harvard.edu> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> On Wed, Oct 28, 2009 at 12:18 PM, Brad Chapman wrote: > Peter wrote: >> My rough work in progress is on github - at the moment I'm still trying >> things out, and don't assume anything is set in stone. If you want to >> have a play with this code, feedback is very welcome - probably best >> on the dev list rather than here.
See: >> >> http://github.com/peterjc/biopython/tree/seqrecords >> >> (a lot of the alignment things I want to support, like slicing and adding >> are very closely linked to doing the same operations to SeqRecords) Here is a new branch implementing a multiple-sequence-alignment class (living under Bio.Align for now) based on the recent support for slicing and adding SeqRecord objects: http://github.com/peterjc/biopython/tree/alignment-obj This handles most of the basic tasks I want to be able to easily do with classical alignments, based on previous discussions on the mailing list and/or bugzilla: http://bugzilla.open-bio.org/show_bug.cgi?id=2551 http://bugzilla.open-bio.org/show_bug.cgi?id=2552 http://bugzilla.open-bio.org/show_bug.cgi?id=2553 http://bugzilla.open-bio.org/show_bug.cgi?id=2554 At its core, the alignment is still held as a list of SeqRecord objects, which should mean minimal problems with backwards compatibility. If anyone would like to try out the code, comments would be very welcome. There are plenty of doctests in the docstrings which should explain how I expect things to work. > The bx-python alignment object is nice and goes to/from MAF > and AXT formats: > > http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/core.py > > This supports slicing by alignment coordinates and by reference > coordinates for a species in the alignment. Some other useful > features are limiting the alignment to specific species and removing > all gap columns that can result. The representation is a high level > Alignment object containing multiple Components. My code does not (yet) attempt to deal with next-gen sequencing alignments, which would require padding all the (short) reads with leading and trailing gaps to ensure all rows of the alignment have the same length. Doing this in a memory efficient way could be done with a PaddedSeq object, or a very different alignment object (hold read and their offsets in memory). 
I'm not sure what is best, but the bx-python model looks worth understanding to help decide. Perhaps until this is settled, it would be premature to merge my alignment class to the trunk. After all, we may need to tweak the alignment object class hierarchy. Peter From bioinformed at gmail.com Tue Mar 2 12:29:38 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 2 Mar 2010 07:29:38 -0500 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <320fb6e01003020402v6b2fab6j88f4c6fc90da15a9@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> <320fb6e01003020402v6b2fab6j88f4c6fc90da15a9@mail.gmail.com> Message-ID: <2e1434c11003020429y37343796oddf02ad433ab82ea@mail.gmail.com> On Tue, Mar 2, 2010 at 7:02 AM, Peter wrote: > On Tue, Mar 2, 2010 at 10:08 AM, Peter wrote: > > On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs wrote: > >> The only glitch so far is that the indexed access mode yields sequences > >> with no alphabet assigned. The solution is to add the following to the > >> beginning of SffDict.__init__: > >> if alphabet is None: > >> alphabet = Alphabet.generic_dna > > > > Thanks - I'll look at that. > > Yes, that looks sensible - change committed. Would you like to be credited > in our NEWS and CONTRIB file for this little bug fix? > > I'm happy to contribute and be listed in the credits.
Thanks, -Kevin From bioinformed at gmail.com Tue Mar 2 12:36:27 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 2 Mar 2010 07:36:27 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> Message-ID: <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> On Tue, Mar 2, 2010 at 7:25 AM, Peter wrote: > On Wed, Oct 28, 2009 at 12:18 PM, Brad Chapman > wrote: My code does not (yet) attempt to deal with next-gen sequencing > alignments, which would require padding all the (short) reads with > leading and trailing gaps to ensure all rows of the alignment have > the same length. Doing this in a memory efficient way could be > done with a PaddedSeq object, or a very different alignment object > (hold read and their offsets in memory). I'm not sure what is best, > but the bx-python model looks worth understanding to help decide. > > Perhaps until this is settled, it would be premature to merge my > alignment class to the trunk. After all, we may need to tweak the > alignment object class hierarchy. Hi Peter, I'm just jumping in here and have not yet read all of the background material. However, I am working with next-gen alignments and am curious as to what you have in mind. At first glance, it sounds like you want to access aligned reads in a 'pileup' format (i.e., an object model akin to http://samtools.sourceforge.net/pileup.shtml). Or are you thinking of something different entirely? Best regards, -Kevin From bioinformed at gmail.com Tue Mar 2 12:28:22 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 2 Mar 2010 07:28:22 -0500 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> Message-ID: <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com> On Tue, Mar 2, 2010 at 5:08 AM, Peter wrote: > On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs > wrote: > > I've tried out the recently landed SFF SeqIO code and am pleased to > > report that it works very well. > > Great :) > > If you have suggestions for the documentation please voice them. > Also did the handling of trimmed reads seem sensible? Until we > release this we can tweak the API. I only looked at the module documentation and it was more than sufficient to get started. I've never really used BioPython before, so I was pleasantly surprised at how easy it was to get started. The BioPython SFF parser and indexed access replaced a hairy process of extracting data using 454's sffinfo and packing it into a BDB file. > > I am parsing gsMapper 454PairAlign.txt output and > > converting it to SAM/BAM format to view in IGV (among other things) and > > wanted to include per-based quality score information from the SFF files. > > Are you reading and writing SAM/BAM format with Python? Looking > into this is on my (long) todo list. > Yes-- so far I have code to populate the basic data for unpaired reads, but none of the optional annotations. My script reads the 454 pairwise alignment data, finds each read in the source SFF file, figures out if extra trimming was applied by gsMapper, and extracts the matching PHRED quality scores. Uniquely mapped reads are given a mapping quality (MAPQ) of 60 and non-unique reads are assigned MAPQ of 0 (as recommended by the SAMtools FAQ). 
The script can output SAM records or create a subprocess to sort the records and recode to BAM format using samtools. I've attached the current version of the script and you are welcome to use it for any purpose.

> > My only other comment is that several file reads and struct.unpacks can be
> > merged in _sff_read_seq_record. Given the number of records in most 454 SFF
> > files, I suspect the micro-optimization effort will be worth the slight cost
> > in code clarity.
>
> [...] I guess you mean the flowgram values, flowgram index, bases and
> qualities might be loaded with a single read? That would be worth trying.
>

Exactly! Also, flowgrams do not need to be unpacked when trimming. My own bias is to encode the quality scores and flowgrams in numpy arrays rather than lists, however I understand that the goal is to keep the external dependencies to a minimum (although NumPy is required elsewhere).

Also, the test "chr(0)*padding != handle.read(padding)" could be written just as clearly as "handle.read(padding).count('\0') != padding" and not generate as many temporary objects.

Best regards,
-Kevin
-------------- next part --------------
# -*- coding: utf-8 -*-

# Convert 454PairAlign.txt and the corresponding SFF files into SAM/BAM format

import re
import sys

from operator import getitem, itemgetter
from itertools import izip, imap, groupby, repeat
from subprocess import Popen, PIPE

import numpy as np

try:
  # Import fancy versions of basic IO functions from my GLU package
  # see http://code.google.com/p/glu-genetics
  from glu.lib.fileutils import autofile,hyphen,table_writer,table_reader

except ImportError:
  import csv

  # The real version handles automatic gz/bz2 (de)compression
  autofile = file

  def hyphen(filename,default):
    if filename=='-' and default is not None:
      return default
    return filename

  # Write a tab-delimited ASCII file
  # The real version handles many more formats (CSV, XLS, Stata), column
  # selection, header options, row filters, and other toys.
  def table_writer(filename,hyphen=None):
    if filename=='-' and hyphen is not None:
      dest = hyphen
    else:
      dest = autofile(filename,'wb')
    return csv.writer(dest, dialect='excel-tab')

  # Read a tab-delimited ASCII file
  # The real version handles many more formats (CSV, XLS, Stata), column
  # selection, header options, row filters, and other toys.
  def table_reader(filename,hyphen=None):
    if filename=='-' and hyphen is not None:
      dest = hyphen
    else:
      dest = autofile(filename,'rb')
    return csv.reader(dest, dialect='excel-tab')


CIGAR_map = { ('-','-'):'P' }
for a in 'NACGTacgt':
  CIGAR_map[a,'-'] = 'I'
  CIGAR_map['-',a] = 'D'
  for b in 'NACGTacgt':
    CIGAR_map[a,b] = 'M'


def make_cigar_py(query,ref):
  assert len(query)==len(ref)
  igar = imap(getitem, repeat(CIGAR_map), izip(query,ref))
  cigar = ''.join('%d%s' % (len(list(run)),code) for code,run in groupby(igar))
  return cigar


# Try to import the optimized Cython version
# The Python version is pretty fast, but I wanted to play with Cython.
try:
  from cigar import make_cigar
except ImportError:
  make_cigar = make_cigar_py


class SFFIndex(object):
  def __init__(self, sfffiles):
    self.sffindex = sffindex = {}
    for sfffile in sfffiles:
      from Bio import SeqIO
      prefix,ext = sfffile[-13:].split('.')
      assert ext=='sff'
      print >> sys.stderr,'Loading SFF index for',sfffile
      reads = SeqIO.index(sfffile, 'sff-trim')
      sffindex[prefix] = reads

  def get_quality(self, qname, query, qstart, qstop):
    prefix = qname[:9]
    sff = self.sffindex.get(prefix)

    if not sff:
      return '*'

    rec = sff[qname]
    phred = rec.letter_annotations['phred_quality']
    sffqual = np.array(phred,dtype=np.uint8)
    sffqual += 33
    sffqual = sffqual.tostring()

    # Align the query to the original read to find the matching quality
    # score information. This is complicated by the extra trimming done by
    # gsMapper. We could obtain this information by parsing the
    # 454TrimStatus.txt, but it is easier to search for the sub-sequence in
    # the reference. One hopes the read maps uniquely, but this is not
    # checked.

    # CASE 1: Forward read alignment
    # NOTE: the archived copy lost the text between "qstart<" and the
    # commented-out print below; the lines in between are reconstructed
    # as the mirror image of CASE 2 and should be treated as an assumption.
    if qstart<qstop:
      # Try using specified cut-points
      read = str(rec.seq)
      seq = read[qstart-1:qstop]

      # If it matches, then compute quality
      if seq==query:
        qual = sffqual[qstart-1:qstop]
      else:
        # otherwise gsMapper applied extra trimming, so we have to manually find the offset
        start = read.index(query)
        seq = read[start:start+len(query)]
        if seq==query:
          #print >> sys.stderr,'MATCHED TYPE F2: name=%s, qstart=%d(%d), qstop=%d, qlen=%d, len.query=%d' % (qname,start+1,qstart,qstop,qlen,len(query))
          qual = sffqual[start:start+len(query)]

    # CASE 2: Backward read alignment
    else:
      # Try using specified cut-points
      read = str(rec.seq.complement())
      seq = read[qstop-1:qstart][::-1]
      read = read[::-1]

      # If it matches, then compute quality
      if seq==query:
        qual = sffqual[qstop-1:qstart][::-1]
      else:
        # otherwise gsMapper applied extra trimming, so we have to manually find the offset
        start = read.index(query)
        seq = read[start:start+len(query)]
        if seq==query:
          #print >> sys.stderr,'MATCHED TYPE R2: name=%s, qstart=%d, qstop=%d(%d), qlen=%d, len.query=%d' % (qname,qstart,start+1,qstop,qlen,len(query))
          qual = sffqual[::-1][start:start+len(query)]

    assert seq==query
    assert len(qual) == len(query)

    return qual


def pair_align(filename, sffindex):
  records = autofile(filename)
  split = re.compile('[\t ,.]+').split

  mrnm = '*'
  mpos = 0
  isize = 0
  mapq = 60

  for line in records:
    assert line.startswith('>')

    fields = split(line)
    qname = fields[0][1:]
    qstart = int(fields[1])
    qstop = int(fields[2])
    #qlen = int(fields[4])
    rname = fields[6]
    rstart = int(fields[7])
    rstop = int(fields[8])
    #rlen = int(fields[10])
    query = split(records.next())[2]
    qq = query.replace('-','')
    ref = split(records.next())[2]
    cigar = make_cigar(query,ref)
    qual = sffindex.get_quality(qname, qq, qstart, qstop)

    flag = 0
    if qstart>qstop:
      flag |= 0x10
    if rstart>rstop:
      flag |= 0x20

    yield [qname, flag, rname, rstart, mapq, cigar, mrnm, mpos, isize, qq, qual]


def option_parser():
  import optparse

  usage = 'usage: %prog [options] 454PairAlign.txt[.gz] [SFFfiles.sff..]'
  parser = optparse.OptionParser(usage=usage)

  parser.add_option('-r', '--reflist', dest='reflist', metavar='FILE',
                    help='Reference genome contig list')
  parser.add_option('-o', '--output', dest='output', metavar='FILE', default='-',
                    help='Output SAM file')
  return parser


def main():
  parser = option_parser()
  options,args = parser.parse_args()

  if not args:
    parser.print_help(sys.stderr)
    sys.exit(2)

  sffindex = SFFIndex(args[1:])
  alignment = pair_align(hyphen(args[0],sys.stdin), sffindex)

  write_bam = options.output.endswith('.bam')

  if write_bam:
    if not options.reflist:
      raise ValueError('Conversion to BAM format requires a reference genome contig list (-r/--reflist)')

    # Creating the following two-stage pipeline deadlocks due to problems with subprocess
    # -- use the shell method below instead
    #sammer = Popen(['samtools','import',options.reflist,'-','-'],stdin=PIPE,stdout=PIPE)
    #bammer = Popen(['samtools','sort','-', options.output[:-4]], stdin=sammer.stdout)
    cmd = 'samtools import "%s" - - | samtools sort - "%s"' % (options.reflist,options.output[:-4])
    bammer = Popen(cmd,stdin=PIPE,shell=True,bufsize=-1)
    out = table_writer(bammer.stdin)
  else:
    out = table_writer(options.output,hyphen=sys.stdout)

  out.writerow(['@HD', 'VN:1.0'])

  if options.reflist:
    reflist = table_reader(options.reflist)
    for row in reflist:
      if len(row)<2:
        continue

      contig_name = row[0]
      contig_len = int(row[1])

      out.writerow(['@SQ', 'SN:%s' % contig_name, 'LN:%d' % contig_len])

  print >> sys.stderr, 'Generating alignment from %s to %s' % (args[0],options.output)

  for qname,qalign in groupby(alignment,itemgetter(0)):
    qalign = list(qalign)

    if len(qalign)>1:
      # Set MAPQ to 0 for multiply aligned reads
      for row in qalign:
        row[4] = 0
        out.writerow(row)
    else:
      out.writerow(qalign[0])

  if write_bam:
    print >> sys.stderr,'Finishing BAM encoding...'
        bammer.communicate()


if __name__=='__main__':
    if 1:
        main()
    else:
        try:
            import cProfile as profile
        except ImportError:
            import profile
        import pstats

        prof = profile.Profile()
        try:
            prof.runcall(main)
        finally:
            stats = pstats.Stats(prof)
            stats.strip_dirs()
            stats.sort_stats('time', 'calls')
            stats.print_stats(25)

From biopython at maubp.freeserve.co.uk Tue Mar 2 13:01:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 13:01:53 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
 <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
 <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
 <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
 <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
 <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
Message-ID: <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>

Kevin wrote:
> I only looked at the module documentation and it was more than sufficient to
> get started. I've never really used BioPython before, so I was pleasantly
> surprised at how easy it was to get started. The BioPython SFF parser and
> indexed access replaced a hairy process of extracting data using 454's
> sffinfo and packing it into a BDB file.

Great :)

>> > I am parsing gsMapper 454PairAlign.txt output and
>> > converting it to SAM/BAM format to view in IGV (among other things) and
>> > wanted to include per-base quality score information from the SFF
>> > files.
>>
>> Are you reading and writing SAM/BAM format with Python? Looking
>> into this is on my (long) todo list.
>
> Yes-- so far I have code to populate the basic data for unpaired reads, but
> none of the optional annotations.
My script reads the 454 pairwise
> alignment data, finds each read in the source SFF file, figures out if extra
> trimming was applied by gsMapper, and extracts the matching PHRED quality
> scores. Uniquely mapped reads are given a mapping quality (MAPQ) of 60 and
> non-unique reads are assigned MAPQ of 0 (as recommended by the SAMtools
> FAQ). The script can output SAM records or create a subprocess to sort the
> records and recode to BAM format using samtools. I've attached the current
> version script and you are welcome to use it for any purpose.

I'll take a look...

>> [...] I guess you mean the flowgram values, flowgram index, bases
>> and qualities might be loaded with a single read? That would
>> be worth trying.
>
> Exactly!

If I recall I felt the unpacking was more complicated (and not needed
for the sequence bases), but I agree if this is faster it is worthwhile.

> Also, flowgrams do not need to be unpacked when trimming.

True, that shouldn't make the function much more complex. I'll try
to look at that later today.

> My own bias is to encode the quality scores and flowgrams in numpy
> arrays rather than lists, however I understand that the goal is to keep
> the external dependencies to a minimum (although NumPy is required
> elsewhere).

Yes, I did wonder about using NumPy here but wanted to ensure that
the core of Biopython remains without an external dependency here.

> Also, the test "chr(0)*padding != handle.read(padding)" could be written
> just as clearly as "handle.read(padding).count('\0') != padding" and not
> generate as many temporary objects.
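
A minimal sketch of the two padding tests quoted above, assuming Python 3 and a binary file handle (the code in this thread is Python 2, where '\0' is an ordinary str). The helper names are hypothetical and are not Biopython functions; the only point is that both forms agree, including on a short read at end of file.

```python
import io


def padding_ok_concat(handle, padding):
    """Build a run of NUL bytes and compare it with what was read."""
    return handle.read(padding) == b"\0" * padding


def padding_ok_count(handle, padding):
    """Count NUL bytes in what was read; no comparison string is built."""
    return handle.read(padding).count(b"\0") == padding


# Three cases: valid padding, a corrupt byte, and a truncated file.
cases = [b"\0\0\0", b"\0\x01\0", b"\0\0"]
results = [
    (padding_ok_concat(io.BytesIO(data), 3), padding_ok_count(io.BytesIO(data), 3))
    for data in cases
]
```

In both forms a short read fails the check, because fewer than `padding` bytes come back from `read`.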
Good point, done - and you're in the contributors list now ;)

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Tue Mar 2 14:34:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 14:34:07 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
 <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
 <20091028121833.GC22395@sobchak.mgh.harvard.edu>
 <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
 <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
Message-ID: <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>

On Tue, Mar 2, 2010 at 12:36 PM, Kevin Jacobs wrote:
> On Tue, Mar 2, 2010 at 7:25 AM, Peter wrote:
>> My code does not (yet) attempt to deal with next-gen sequencing
>> alignments, which would require padding all the (short) reads with
>> leading and trailing gaps to ensure all rows of the alignment have
>> the same length. Doing this in a memory efficient way could be
>> done with a PaddedSeq object, or a very different alignment object
>> (hold reads and their offsets in memory). I'm not sure what is best,
>> but the bx-python model looks worth understanding to help decide.
>>
>> Perhaps until this is settled, it would be premature to merge my
>> alignment class to the trunk. After all, we may need to tweak the
>> alignment object class hierarchy.
>
>
> Hi Peter,
>
> I'm just jumping in here and have not yet read all of the background
> material. However, I am working with next-gen alignments and am
> curious as to what you have in mind. At first glance, it sounds like
> you want to access aligned reads in a 'pileup' format (i.e., an object
> model akin to http://samtools.sourceforge.net/pileup.shtml). Or are
> you thinking of something different entirely?

Probably something different.
My general concern boils down to the fact that the current Alignment model as an enhanced "list of SeqRecord objects" is potentially limiting.

The alignment code in Biopython (and my branch which is basically an extension to that) deals with classical multiple sequence alignments like ClustalW etc. You can think of the alignment as a matrix of letters, each row is a sequence (e.g. a gene), and there will be some gap characters for insertions, and padding for leading/trailing commissions. There may or may not be a consensus sequence too.

With assemblies you have a (long) consensus with many (short) reads aligned to it. In order to hold this as a "matrix" representation, all the (short) reads would require (lots of) leading/trailing padding. The same applies when mapping reads to a reference genome.

So, while the current object model may work, all this extra padding might mean too much of a memory overhead (especially as all the rows are currently stored as SeqRecord objects). Instead, we might just store the (short) read sequence, name, and its offset (and perhaps the strand). We can then reconstruct columns or rows mimicking the "matrix" interpretation on demand. However, the API should make it easy to get the unpadded reads and their offsets too - so the current alignment API might either be extended or perhaps changed.

Related to this, a "Lite" version of the alignment object might be useful when there is no annotation requiring using SeqRecord objects. e.g. For ClustalW, FASTA, PHYLIP alignments all we need is the sequence and identifiers.

Regarding one of your points, accessing aligned reads (or rows) from an alignment - currently this is only supported by index (row number). In most cases the reads (rows) have a unique identifier/name, and thus one idea I am considering for this branch is overloading the align[...] syntax further to allow a record's id to be used as an alternative. i.e. More like a dictionary.
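
To make the "store reads plus offsets, rebuild the matrix on demand" idea concrete, here is a minimal sketch in Python 3. The class name OffsetAlignment and its methods are hypothetical, invented for illustration only; this is not the code on the branch and not an existing Biopython class. Strand is left out to keep it short.

```python
class OffsetAlignment:
    """Hypothetical store of unpadded reads and their offsets.

    Padded rows and columns are rebuilt only when asked for.
    """

    def __init__(self, length, gap="-"):
        self.length = length      # number of columns in the notional matrix
        self.gap = gap
        self._reads = []          # list of (name, unpadded sequence, offset)
        self._by_name = {}        # name -> row index, for dictionary-style access

    def add(self, name, seq, offset):
        if offset < 0 or offset + len(seq) > self.length:
            raise ValueError("read %r does not fit in the alignment" % name)
        if name in self._by_name:
            raise ValueError("duplicate read name %r" % name)
        self._by_name[name] = len(self._reads)
        self._reads.append((name, seq, offset))

    def read(self, key):
        """Return (name, unpadded sequence, offset) by row number or by name."""
        if isinstance(key, str):
            key = self._by_name[key]
        return self._reads[key]

    def row(self, key):
        """Return the padded row, as the matrix view would show it."""
        name, seq, offset = self.read(key)
        trailing = self.length - offset - len(seq)
        return self.gap * offset + seq + self.gap * trailing

    def column(self, pos):
        """Return one column of the notional matrix as a string."""
        if not 0 <= pos < self.length:
            raise IndexError("column %d out of range" % pos)
        letters = []
        for name, seq, offset in self._reads:
            i = pos - offset
            letters.append(seq[i] if 0 <= i < len(seq) else self.gap)
        return "".join(letters)


aln = OffsetAlignment(10)
aln.add("r1", "ACGT", 0)
aln.add("r2", "GTTA", 2)
```

With these two reads, aln.row("r2") and aln.row(1) give the same padded string, while aln.read("r2") returns the unpadded read and its offset. Memory use grows with the total read length rather than with rows times columns.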
Other ideas for enhancements on this branch include sorting the rows (with a list like sort method, defaulting to sorting on the record's id strings), per-column annotation (useful for PFAM alignments and the match string in pairwise alignments), and a general annotations dictionary (like we have on SeqRecord objects).

Peter

From bioinformed at gmail.com Tue Mar 2 14:36:32 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 09:36:32 -0500
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
 <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
 <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
 <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
 <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
 <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
 <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
Message-ID: <2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com>

On Tue, Mar 2, 2010 at 8:01 AM, Peter wrote:

> Kevin wrote:
> > My own bias is to encode the quality scores and flowgrams in numpy
> > arrays rather than lists, however I understand that the goal is to keep
> > the external dependencies to a minimum (although NumPy is required
> > elsewhere).
>
> Yes, I did wonder about using NumPy here but wanted to ensure that
> the core of Biopython remains without an external dependency here.
>

In addition to not creating many little objects, my leanings toward using NumPy are also due to the generality of tricks like the following to recode quality scores to Sanger ASCII-33 format:

    sffqual = np.array(rec.letter_annotations['phred_quality'],dtype=np.uint8)
    sffqual += 33
    sffqual = sffqual.tostring()

That said, the alternatives aren't that slow and small integers are shared from a pre-allocated pool, so this is not as big a concern.
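
For comparison, a sketch of the pure-Python alternative alluded to above, with no NumPy dependency. The function names are hypothetical, and the code assumes PHRED scores small enough that score + 33 stays within a single byte.

```python
def phred_to_sanger(quals):
    """Encode a list of PHRED scores as a Sanger (ASCII offset 33) string."""
    return "".join(chr(q + 33) for q in quals)


def phred_to_sanger_bytes(quals):
    """Same encoding via bytearray, returning a byte string."""
    return bytes(bytearray(q + 33 for q in quals))


encoded = phred_to_sanger([0, 10, 40])
encoded_bytes = phred_to_sanger_bytes([0, 10, 40])
```

The bytearray form avoids building one small string per score, which is the cost that the NumPy version also sidesteps.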
-Kevin

From biopython at maubp.freeserve.co.uk Tue Mar 2 14:44:13 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 14:44:13 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
 <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
 <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
 <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
 <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
 <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
 <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
 <2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com>
Message-ID: <320fb6e01003020644u229e6353ufee054403e562915@mail.gmail.com>

On Tue, Mar 2, 2010 at 2:36 PM, Kevin Jacobs wrote:
> On Tue, Mar 2, 2010 at 8:01 AM, Peter wrote:
>> Yes, I did wonder about using NumPy here but wanted to ensure that
>> the core of Biopython remains without an external dependency here.
>
> In addition to not creating many little objects, my leanings toward using
> NumPy are also due to the generality of tricks like the following to recode
> quality scores to Sanger ASCII-33 format:
>
>    sffqual = np.array(rec.letter_annotations['phred_quality'],dtype=np.uint8)
>    sffqual += 33
>    sffqual = sffqual.tostring()
>

Yeah - I had this kind of thing in mind for the qualities, both when
looking at the SFF files and earlier when doing the FASTQ and QUAL
stuff.

You can probably make that more efficient with one line:

sffqual = (np.array(rec.letter_annotations['phred_quality'],dtype=np.uint8) + 33).tostring()

Not sure if it will make a measurable difference mind you ;)

> That said, the alternatives aren't that slow and small integers are shared
> from a pre-allocated pool, so this is not as big a concern.

Indeed.
Peter From bioinformed at gmail.com Tue Mar 2 14:51:04 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 2 Mar 2010 09:51:04 -0500 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <320fb6e01003020644u229e6353ufee054403e562915@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com> <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com> <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com> <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com> <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com> <2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com> <320fb6e01003020644u229e6353ufee054403e562915@mail.gmail.com> Message-ID: <2e1434c11003020651y541ce3e5q92fb0fea308a59e9@mail.gmail.com> On Tue, Mar 2, 2010 at 9:44 AM, Peter wrote: > You can probably make that more efficient with one line: > > sffqual = > (np.array(rec.letter_annotations['phred_quality'],dtype=np.uint8) > + 33).tostring() > > Not sure if it will make a measurable difference mind you ;) > I haven't measured, but my understanding is that the inplace "+= 33" will avoid creating a temporary copy and thus be quicker. But as you said, not likely to make a difference in practice. 
-Kevin

From chapmanb at 50mail.com Tue Mar 2 15:03:08 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 2 Mar 2010 10:03:08 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
 <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
 <20091028121833.GC22395@sobchak.mgh.harvard.edu>
 <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
 <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
 <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
Message-ID: <20100302150308.GP98028@sobchak.mgh.harvard.edu>

Peter and Kevin;

> >> My code does not (yet) attempt to deal with next-gen sequencing
> >> alignments,
[...]
> >> Perhaps until this is settled, it would be premature to merge my
> >> alignment class to the trunk. After all, we may need to tweak the
> >> alignment object class hierarchy.

My vote would be to merge what you've done in for handling
standard multiple alignments, and then look at next-generation read
representation as an analogous but separate problem. All of the
SeqRecord objects which are useful for drilling in on multiple
alignments are likely going to be memory hogs for any real world
next gen work.

> > I'm just jumping in here and have not yet read all of the background
> > material. However, I am working with next-gen alignments and am
> > curious as to what you have in mind. At first glance, it sounds like
> > you want to access aligned reads in a 'pileup' format (i.e., an object
> > model akin to http://samtools.sourceforge.net/pileup.shtml). Or are
> > you thinking of something different entirely?

This is a good way to go.
SAM is at least an emerging standard that
people are adopting, and samtools and the pysam module do a good job
of dealing with them:

http://code.google.com/p/pysam/

pysam exposes a Pileup style API from sorted and indexed BAM files
and scales great for large alignment files:

http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html

This is a good starting point for providing interoperability with
Biopython; it would be great to re-use what we can from these
projects.

Brad

From biopython at maubp.freeserve.co.uk Tue Mar 2 15:28:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 15:28:45 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
 <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
 <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
 <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
 <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
 <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
Message-ID: <320fb6e01003020728v760e8208h5da4288dfaef7ed7@mail.gmail.com>

On Tue, Mar 2, 2010 at 12:28 PM, Kevin Jacobs wrote:
>
> Also, flowgrams do not need to be unpacked when trimming.
>

True - change made on the trunk, should make parsing SFF files as
trimmed records a little bit faster.
Thanks Peter From biopython at maubp.freeserve.co.uk Tue Mar 2 16:43:18 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Mar 2010 16:43:18 +0000 Subject: [Biopython-dev] Alignment object In-Reply-To: <20100302150308.GP98028@sobchak.mgh.harvard.edu> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01003020843n72a23176wa023786c46ffb7b3@mail.gmail.com> On Tue, Mar 2, 2010 at 3:03 PM, Brad Chapman wrote: > Peter and Kevin; > >> >> My code does not (yet) attempt to deal with next-gen sequencing >> >> alignments, > [...] >> >> Perhaps until this is settled, it would be premature to merge my >> >> alignment class to the trunk. After all, we may need to tweak the >> >> alignment object class heirachy. > > My vote would be to merge what you've done in for handling > standard multiple alignments, and then look at next-generation read > representation as an analogous but separate problem. All of the > SeqRecord objects which are useful for drilling in on multiple > alignments are likely going to be memory hogs for any real world > next gen work. OK - that is what I was leaning towards. What do you think about the fact I am introducing an "improved" version of the existing Bio.Align.Generic.Alignment class under Bio.Align.MultipleSeqAlignment? That's actually several questions in one - should this be a new object or just enhance the old one? I favour a new object here because I want to *enforce* the fact that all the rows are the same length, but I doubt people are using the flexibility of the current alignment object in this way. Next where should the new object live? 
I find the current use of Bio.Align.Generic somewhat hidden away, thus my suggestion of using Bio.Align directly. Next, what should the new object be called? We could reuse the old name of Alignment but it is a bit vague and would cause confusion given the existing object is also called that. I have used MultipleSeqAlignment but am open to suggestions (e.g. MulSeqAlignment is shorter). Peter From bioinformed at gmail.com Tue Mar 2 17:07:03 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 2 Mar 2010 12:07:03 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <20100302150308.GP98028@sobchak.mgh.harvard.edu> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> Message-ID: <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> On Tue, Mar 2, 2010 at 10:03 AM, Brad Chapman wrote: > Kevin; > > > I'm just jumping in here and have not yet read all of the background > > > material. However, I am working with next-gen alignments and am > > > curious as to what you have in mind. At first glance, it sounds like > > > you want to access aligned reads in a 'pileup' format (i.e., an object > > > model akin to http://samtools.sourceforge.net/pileup.shtml). Or are > > > you thinking of something different entirely? > > This is a good way to go. SAM is at least an emerging standard that > people are adopting, and samtools and the pysam module do a good job > of dealing with them: > > http://code.google.com/p/pysam/ > > I find pysam pretty limited for doing more than reading and subsetting SAM/BAM files. I'm planning to add a constructor and helper functions for creating new aligned reads. 
The current AlignedRead object is also read-only, which will need to be relaxed for many serious applications. Until then, I'm writing (text) SAM records and piping them to samtools to encode in BAM format (see the script attached to one of my earlier emails). > pysam exposes a Pileup style API from sorted and indexed BAM files > and scales great for large alignment files: > > http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html Scalability is okay for conversion to pileup format, but not what I'd consider great. But I agree, pysam is a good starting point. I just wish that the read identifiers and attributes were available via the C API, since those are often needed when, e.g., writing a genotype caller. -Kevin From chapmanb at 50mail.com Wed Mar 3 14:12:15 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 3 Mar 2010 09:12:15 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> Message-ID: <20100303141215.GZ98028@sobchak.mgh.harvard.edu> Kevin and Peter; > I find pysam pretty limited for doing more than reading and subsetting > SAM/BAM files. I'm planning to add a constructor and helper functions for > creating new aligned reads. The current AlignedRead object is also > read-only, which will need to be relaxed for many serious applications. > Until then, I'm writing (text) SAM records and piping them to samtools to > encode in BAM format (see the script attached to one of my earlier emails). 
Agreed. These sound like good improvements. > Scalability is okay for conversion to pileup format, but not what I'd > consider great. But I agree, pysam is a good starting point. I just wish > that the read identifiers and attributes were available via the C API, > since those are often needed when, e.g., writing a genotype caller. Do you think we could build off of what pysam has? The project hasn't seemed especially active, but it would be great to have a unified code base in python for dealing with BAM files. They use mercurial for revision control, so worst case we can always fork this on bitbucket and work off of that. Galaxy has a fork for their use: http://bitbucket.org/kanwei/kanwei-pysam/ The bioconductor folks also seem to be standardizing around SAM/BAM for their analysis pipelines, so practically we may be able to borrow some of their APIs once they have a released version of Rsamtools. > What do you think about the fact I am introducing an "improved" > version of the existing Bio.Align.Generic.Alignment class under > Bio.Align.MultipleSeqAlignment? Yes please. I don't think Generic is that great and am happy to see it improved upon. > That's actually several questions in one - should this be a new > object or just enhance the old one? I favour a new object here > because I want to *enforce* the fact that all the rows are the > same length, but I doubt people are using the flexibility of > the current alignment object in this way. > > Next where should the new object live? I find the current use > of Bio.Align.Generic somewhat hidden away, thus my > suggestion of using Bio.Align directly. > > Next, what should the new object be called? We could reuse > the old name of Alignment but it is a bit vague and would > cause confusion given the existing object is also called that. > I have used MultipleSeqAlignment but am open to suggestions > (e.g. MulSeqAlignment is shorter). 
I like MultipleSeqAlignment, and agree it should be as top level as
possible in Bio.Align. If you think a new object is better, go for
that and we can move Generic on a deprecation path. It's great you
are cleaning this up.

Brad

From biopython at maubp.freeserve.co.uk Wed Mar 3 15:03:38 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 3 Mar 2010 15:03:38 +0000
Subject: [Biopython-dev] EMBOSS eprimer3 parser
In-Reply-To: <320fb6e01001180833l6396cf23meb7e160fd6814e26@mail.gmail.com>
References: <320fb6e01001180833l6396cf23meb7e160fd6814e26@mail.gmail.com>
Message-ID: <320fb6e01003030703k691fdbe8i3ab3dfd5ba1640a6@mail.gmail.com>

On Mon, Jan 18, 2010 at 4:33 PM, Peter wrote:
> Hi all,
>
> Who on the dev list makes heavy use of the EMBOSS eprimer3 parser in
> Biopython? I'd like someone to look over Leighton's proposed enhancements
> to this code: http://bugzilla.open-bio.org/show_bug.cgi?id=2968
>
> There are two main issues. First, the current code doesn't cope with multiple
> primer sets (so Leighton introduces read/parse functions in line with other
> modules for single or multiple sets of primers). This seems entirely sensible
> to me, and worthwhile in itself.

I've made changes on github to do this based on Leighton's code.

> Second, Leighton makes some changes to the primer record objects.
> I'm not so sure about the necessity here, even if it is backwards
> compatible, but I haven't really used this code. What do the rest of
> you think?

I expect to be doing some work with eprimer3 this month, so will feel
I can make a more informed choice later.
Peter From bugzilla-daemon at portal.open-bio.org Wed Mar 3 15:06:47 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Mar 2010 10:06:47 -0500 Subject: [Biopython-dev] [Bug 2968] Modifications to Emboss eprimer3 parser and associated files In-Reply-To: Message-ID: <201003031506.o23F6lgb005243@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2968 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-03 10:06 EST ------- (In reply to comment #0) > The existing Emboss primer3/eprimer3 code has a couple of issues, and some > scope for improvement: > > - The existing Primer3.py parser code can only parse output when eprimer3 is > applied to a single sequence. When eprimer3 is applied to multiple sequence > input, it groups all primers for all sequences into a single record, which may > incorrectly associate primers with the wrong sequences in downstream analysis. > - The current parser lacks an iterator for iterating over multiple sequence > output I've made changes on github to support multiple targets (with a read and a parse function) this based on Leighton's code which addresses the above issues. > - The current parser creates 'ghost' primers for all primer pairs, with length > zero and sequence as an empty string; it does not do this for internal oligos. > A more intuitive solution might be to return None for absent primers/oligos > - The current data model stores all primer data as individual attributes. It > might be more useful to group the attributes of individual primers into their > natural associations Regarding the object changes, I'll be doing some work with eprimer3 this month, so will feel I can make a more informed choice later. 
See also:
http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007255.html
http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007398.html

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Wed Mar 3 15:57:09 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 3 Mar 2010 15:57:09 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20100303141215.GZ98028@sobchak.mgh.harvard.edu>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
 <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
 <20091028121833.GC22395@sobchak.mgh.harvard.edu>
 <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
 <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
 <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
 <20100302150308.GP98028@sobchak.mgh.harvard.edu>
 <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
 <20100303141215.GZ98028@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>

On Wed, Mar 3, 2010 at 2:12 PM, Brad Chapman wrote:
> Kevin and Peter;
>
>> I find pysam pretty limited for doing more than reading and subsetting
>> SAM/BAM files. I'm planning to add a constructor and helper functions for
>> creating new aligned reads. The current AlignedRead object is also
>> read-only, which will need to be relaxed for many serious applications.
>> Until then, I'm writing (text) SAM records and piping them to samtools to
>> encode in BAM format (see the script attached to one of my earlier emails).
>
> Agreed. These sound like good improvements.
>
>> Scalability is okay for conversion to pileup format, but not what I'd
>> consider great. But I agree, pysam is a good starting point.
?I just wish >> that the read identifiers and attributes were ?available via the C API, >> since those are often needed when, e.g., writing a genotype caller. > > Do you think we could build off of what pysam has? The project hasn't > seemed especially active, but it would be great to have a unified > code base in python for dealing with BAM files. They use mercurial > for revision control, so worst case we can always fork this on > bitbucket and work off of that. Galaxy has a fork for their use: > > http://bitbucket.org/kanwei/kanwei-pysam/ > > The bioconductor folks also seem to be standardizing around > SAM/BAM for their analysis pipelines, so practically we may be > able to borrow some of their APIs once they have a released > version of Rsamtools. I agree that we should work towards supporting SAM (and perhaps also BAM) in Biopython, and other projects APIs can be very useful for inspiration or guidance. I was aware of pysam but am concerned about the dependencies: pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools itself - which may all be fine on Linux, but will likely be trouble for us on other platforms (especially Windows). Is anyone aware of any other SAM/BAM parser in Python? >> What do you think about the fact I am introducing an "improved" >> version of the existing Bio.Align.Generic.Alignment class under >> Bio.Align.MultipleSeqAlignment? > > Yes please. I don't think Generic is that great and am happy to see > it improved upon. > >> That's actually several questions in one - should this be a new >> object or just enhance the old one? I favour a new object here >> because I want to *enforce* the fact that all the rows are the >> same length, but I doubt people are using the flexibility of >> the current alignment object in this way. >> >> Next where should the new object live? I find the current use >> of Bio.Align.Generic somewhat hidden away, thus my >> suggestion of using Bio.Align directly. 
>> >> Next, what should the new object be called? We could reuse >> the old name of Alignment but it is a bit vague and would >> cause confusion given the existing object is also called that. >> I have used MultipleSeqAlignment but am open to suggestions >> (e.g. MulSeqAlignment is shorter). > > I like MultipleSeqAlignment, and agree it should be as top level as > possible in Bio.Align. If you think a new object is better, go for > that and we can move Generic on a deprecation path. It's great you > are cleaning this up. OK then - I've been wanting to "clean this up" for some time. I'll make time to merge what I have so far (which shouldn't be controversial) and update the tutorial. I would also like to investigate moving the useful bits of the SummaryInfo class into methods of the main alignment class. Testing would be very welcome! Peter From biopython at maubp.freeserve.co.uk Wed Mar 3 17:51:41 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 3 Mar 2010 17:51:41 +0000 Subject: [Biopython-dev] Alignment object In-Reply-To: <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> Message-ID: <320fb6e01003030951n261c124bq31578bc9cc5814c9@mail.gmail.com> On Wed, Mar 3, 2010 at 3:57 PM, Peter wrote: > > OK then - I've been wanting to "clean this up" for some time. 
> I'll make time to merge what I have so far (which shouldn't be > controversial) and update the tutorial. The merge is done, updates to the tutorial to show how to use the new object pending (but already in the doctests). Peter From bioinformed at gmail.com Wed Mar 3 18:30:49 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 3 Mar 2010 13:30:49 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> Message-ID: <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> On Wed, Mar 3, 2010 at 10:57 AM, Peter wrote: > I agree that we should work towards supporting SAM (and perhaps > also BAM) in Biopython, and other projects APIs can be very > useful for inspiration or guidance. > > Honestly, the SAM/BAM format specification is pretty dodgy. Thankfully between samtools and Picard source code, I've been able to work out most of the tricky bits. I'm glad to know that the R folks are also working on this, since they're usually very good about generating clear documentation. > I was aware of pysam but am concerned about the dependencies: > pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools > itself - which may all be fine on Linux, but will likely be trouble for > us on other platforms (especially Windows). > > Is anyone aware of any other SAM/BAM parser in Python? 
Parsing SAM is pretty simple and I can certainly help with gluing it into Biopython (with some help on the Biopython side, since I'm still a newb). I'm about half-way to having a BAM reader and writer for my own purposes. I'm coding the time-critical parts in Cython with a fallback to pure Python, so it may not be ideal for use in Biopython. -Kevin From chapmanb at 50mail.com Thu Mar 4 13:13:52 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 4 Mar 2010 08:13:52 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> Message-ID: <20100304131352.GB19053@sobchak.mgh.harvard.edu> Kevin and Peter; > I was aware of pysam but am concerned about the dependencies: > pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools > itself - which may all be fine on Linux, but will likely be trouble for > us on other platforms (especially Windows). I believe you can remove the pyrex requirement by shipping the generated C file with the distribution. Samtools itself may be an issue; however, right now it is probably a practical need for dealing with SAM/BAM since it implements a lot of BAM generation, sorting, merging and indexing you need in workflows. Also, the C code is included with the distribution so it is more a matter of getting it compiled than introducing extra dependencies. 
The bioconductor work appears to do the same thing. > > I agree that we should work towards supporting SAM (and perhaps > > also BAM) in Biopython, and other projects APIs can be very > > useful for inspiration or guidance. All of my work converts SAM directly into sorted and indexed BAM, and then build from that. For me, direct SAM parsing wouldn't be as useful as BAM. > Honestly, the SAM/BAM format specification is pretty dodgy. Thankfully > between samtools and Picard source code, I've been able to work out most of > the tricky bits. I'm glad to know that the R folks are also working on > this, since they're usually very good about generating clear documentation. Agreed, but at least we are converging on something instead of having to write a parser every time you use a new aligner. The bioconductor SVN is here: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/Rsamtools/ (user: readonly, pass: readonly) I think the pysam API does a decent job for reading and exposing this. The higher level things that would be nice to add are: - Converting the CIGAR string into something more useful. - Smartly dealing with the X? fields from various aligners. These often contain very useful information missing from the SAM specification. Where the data actually is will be aligner specific. - More generally easing dealing with the optional fields. > Parsing SAM is pretty simple and I can certainly help with gluing it into > Biopython (with some help on the Biopython side, since I'm still a newb). > I'm about half-way to having a BAM reader and writer for my own purposes. > I'm coding the time-critical parts in Cython with a fallback to pure > Python, so it may not be ideal for use in Biopython. Cool. Does the BAM reader require samtools C code or is it independent of that? 
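On the CIGAR point above: a minimal sketch, not pysam's API, of turning a CIGAR string into (length, operation) pairs and counting the reference bases it covers. The names parse_cigar and reference_span are hypothetical, and the operation letters are assumed to follow the current SAM specification (MIDNSHP=X).

```python
import re

# One CIGAR operation is a run length followed by a single operation letter.
# The letter set is an assumption taken from the current SAM specification.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a SAM CIGAR string into a list of (length, op) tuples.

    Hypothetical helper for illustration; "*" (no CIGAR available)
    gives an empty list.
    """
    if cigar == "*":
        return []
    ops = []
    pos = 0
    for match in _CIGAR_OP.finditer(cigar):
        # Any gap between consecutive matches means unparseable text.
        if match.start() != pos:
            raise ValueError("Malformed CIGAR string: %r" % cigar)
        ops.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    if pos != len(cigar):
        raise ValueError("Malformed CIGAR string: %r" % cigar)
    return ops


def reference_span(ops):
    """Count reference bases covered: M, D, N, = and X consume the reference."""
    return sum(length for length, op in ops if op in "MDN=X")
```

Something like reference_span(parse_cigar(read_cigar)) would then give the alignment length on the reference without going through the samtools C code.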
Brad From aaronquinlan at gmail.com Thu Mar 4 13:33:40 2010 From: aaronquinlan at gmail.com (Aaron Quinlan) Date: Thu, 4 Mar 2010 08:33:40 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <20100304131352.GB19053@sobchak.mgh.harvard.edu> References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> <20100304131352.GB19053@sobchak.mgh.harvard.edu> Message-ID: Just an FYI for those interested in developing tools to work with BAM: it may also be worth looking into the BamTools C++ API developed by Derek Barnett at Boston College (http://sourceforge.net/projects/bamtools/). The API is quite nice and has much of the necessary functionality for iterators, getters/setters, etc. I added BAM support for my BEDTools package (http://code.google.com/p/bedtools/) using the BAMTools libraries. Save for a few minor bugs along the way, it was rather straightforward to include. Aaron Aaron Quinlan, Ph.D. NRSA Postdoctoral Fellow Hall Laboratory University of Virginia Biochem. & Mol. Genetics aaronquinlan at gmail.com On Mar 4, 2010, at 8:13 AM, Brad Chapman wrote: > Kevin and Peter; > >> I was aware of pysam but am concerned about the dependencies: >> pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools >> itself - which may all be fine on Linux, but will likely be trouble for >> us on other platforms (especially Windows). > > I believe you can remove the pyrex requirement by shipping the > generated C file with the distribution. 
Samtools itself may be an > issue; however, right now it is probably a practical need for dealing > with SAM/BAM since it implements a lot of BAM generation, sorting, > merging and indexing you need in workflows. Also, the C code is > included with the distribution so it is more a matter of getting it > compiled than introducing extra dependencies. The bioconductor work > appears to do the same thing. > >>> I agree that we should work towards supporting SAM (and perhaps >>> also BAM) in Biopython, and other projects APIs can be very >>> useful for inspiration or guidance. > > All of my work converts SAM directly into sorted and indexed BAM, > and then build from that. For me, direct SAM parsing wouldn't be as > useful as BAM. > >> Honestly, the SAM/BAM format specification is pretty dodgy. Thankfully >> between samtools and Picard source code, I've been able to work out most of >> the tricky bits. I'm glad to know that the R folks are also working on >> this, since they're usually very good about generating clear documentation. > > Agreed, but at least we are converging on something instead of > having to write a parser every time you use a new aligner. The > bioconductor SVN is here: > > https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/Rsamtools/ > (user: readonly, pass: readonly) > > I think the pysam API does a decent job for reading and exposing > this. The higher level things that would be nice to add are: > > - Converting the CIGAR string into something more useful. > - Smartly dealing with the X? fields from various aligners. These > often contain very useful information missing from the SAM > specification. Where the data actually is will be aligner > specific. > - More generally easing dealing with the optional fields. > >> Parsing SAM is pretty simple and I can certainly help with gluing it into >> Biopython (with some help on the Biopython side, since I'm still a newb). 
>> I'm about half-way to having a BAM reader and writer for my own purposes. >> I'm coding the time-critical parts in Cython with a fallback to pure >> Python, so it may not be ideal for use in Biopython. > > Cool. Does the BAM reader require samtools C code or is it > independent of that? > > Brad > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From bioinformed at gmail.com Thu Mar 4 13:44:39 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Thu, 4 Mar 2010 08:44:39 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <20100304131352.GB19053@sobchak.mgh.harvard.edu> References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> <20100304131352.GB19053@sobchak.mgh.harvard.edu> Message-ID: <2e1434c11003040544j278ffb0fya984cd2668a6d278@mail.gmail.com> On Thu, Mar 4, 2010 at 8:13 AM, Brad Chapman wrote: > All of my work converts SAM directly into sorted and indexed BAM, > and then build from that. For me, direct SAM parsing wouldn't be as > useful as BAM. Same here-- I construct and unserialize alignment data into SAM-like records, but it would be foolish to actually store them natively to disk. > > > Parsing SAM is pretty simple and I can certainly help with gluing it into > > Biopython (with some help on the Biopython side, since I'm still a newb). > > I'm about half-way to having a BAM reader and writer for my own purposes. 
> > I'm coding the time-critical parts in Cython with a fallback to pure > > Python, so it may not be ideal for use in Biopython. > > Cool. Does the BAM reader require samtools C code or is it > independent of that? > It is intended to be independent of the samtools distribution, though some of the C code is currently duplicated (e.g., bgzf). Of course, a Cython/Python re-write would be simple enough, though obviously extra work. -Kevin From bioinformed at gmail.com Thu Mar 4 13:52:33 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Thu, 4 Mar 2010 08:52:33 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> <20100304131352.GB19053@sobchak.mgh.harvard.edu> Message-ID: <2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com> On Thu, Mar 4, 2010 at 8:33 AM, Aaron Quinlan wrote: > Just an FYI for those interested in developing tools to work with BAM: it > may also be worth looking into the BamTools C++ API developed by Derek > Barnett at Boston College (http://sourceforge.net/projects/bamtools/). > The API is quite nice and has much of the necessary functionality for > iterators, getters/setters, etc. > > I added BAM support for my BEDTools package ( > http://code.google.com/p/bedtools/) using the BAMTools libraries. Save > for a few minor bugs along the way, it was rather straightforward to > include. > Thanks for the tip, Aaron. I was unaware of both bamtools and bedtools. The bamtools code looks well designed and quite similar to my emerging Cython/Python rendition. 
-Kevin From bioinformed at gmail.com Thu Mar 4 14:07:03 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Thu, 4 Mar 2010 09:07:03 -0500 Subject: [Biopython-dev] Alignment object In-Reply-To: <2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com> References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com> <20100302150308.GP98028@sobchak.mgh.harvard.edu> <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com> <20100303141215.GZ98028@sobchak.mgh.harvard.edu> <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com> <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com> <20100304131352.GB19053@sobchak.mgh.harvard.edu> <2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com> Message-ID: <2e1434c11003040607i68904329rc122e3acad9cdbe3@mail.gmail.com> On Thu, Mar 4, 2010 at 8:52 AM, Kevin Jacobs < bioinformed at gmail.com> wrote: > On Thu, Mar 4, 2010 at 8:33 AM, Aaron Quinlan wrote: > >> Just an FYI for those interested in developing tools to work with BAM: it >> may also be worth looking into the BamTools C++ API developed by Derek >> Barnett at Boston College (http://sourceforge.net/projects/bamtools/). >> The API is quite nice and has much of the necessary functionality for >> iterators, getters/setters, etc. >> >> I added BAM support for my BEDTools package ( >> http://code.google.com/p/bedtools/) using the BAMTools libraries. Save >> for a few minor bugs along the way, it was rather straightforward to >> include. >> > > Thanks for the tip, Aaron. I was unaware of both bamtools and bedtools. > The bamtools code looks well designed and quite similar to my emerging > Cython/Python rendition. > > Ouch-- never mind. The bamtools code isn't endian-clean -- it will only work correctly on native little-endian architectures. 
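For what it is worth, the BAM specification stores its multi-byte integers little-endian, so a pure Python reader can stay endian-clean by always giving struct an explicit "<" byte order rather than relying on the native one. A minimal sketch, assuming the BGZF layer has already been decompressed; the function name read_bam_header_text_length is hypothetical:

```python
import struct


def read_bam_header_text_length(data):
    """Return l_text (the header text length) from the start of a BAM stream.

    Assumes `data` holds the already decompressed bytes. The "<" prefix
    forces little-endian decoding on every platform, including
    big-endian hosts.
    """
    magic, l_text = struct.unpack("<4si", data[:8])
    if magic != b"BAM\x01":
        raise ValueError("Not a BAM stream (bad magic bytes)")
    return l_text
```

The same "<" convention would apply to every other fixed-width field read from the file.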
-Kevin From bugzilla-daemon at portal.open-bio.org Fri Mar 5 10:47:36 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 05:47:36 -0500 Subject: [Biopython-dev] [Bug 2551] Adding advanced __getitem__ to generic alignment, e.g. align[1:2, 5:-5] In-Reply-To: Message-ID: <201003051047.o25Ala5W006656@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2551 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:47 EST ------- Git branch merged to trunk as discussed on the dev mailing list, marking this enhancement as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Mar 5 10:48:18 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 05:48:18 -0500 Subject: [Biopython-dev] [Bug 2552] Adding alignments In-Reply-To: Message-ID: <201003051048.o25AmIoF006689@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2552 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:48 EST ------- Git branch merged to trunk as discussed on the dev mailing list, marking this enhancement as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 10:48:34 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 05:48:34 -0500 Subject: [Biopython-dev] [Bug 2553] Adding SeqRecord objects to an alignment (append or extend) In-Reply-To: Message-ID: <201003051048.o25AmYYH006723@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2553 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:48 EST ------- Git branch merged to trunk as discussed on the dev mailing list, marking this enhancement as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Mar 5 10:48:36 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 05:48:36 -0500 Subject: [Biopython-dev] [Bug 2554] Creating an Alignment from a list of SeqRecord objects In-Reply-To: Message-ID: <201003051048.o25AmaIn006735@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2554 Bug 2554 depends on bug 2553, which changed state. Bug 2553 Summary: Adding SeqRecord objects to an alignment (append or extend) http://bugzilla.open-bio.org/show_bug.cgi?id=2553 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 10:48:50 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 05:48:50 -0500 Subject: [Biopython-dev] [Bug 2554] Creating an Alignment from a list of SeqRecord objects In-Reply-To: Message-ID: <201003051048.o25AmoWN006761@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2554 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:48 EST ------- Git branch merged to trunk as discussed on the dev mailing list, marking this enhancement as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Mar 5 10:50:45 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Mar 2010 05:50:45 -0500 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: Message-ID: <201003051050.o25Aojkg006835@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2905 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Short read alignment format |Short read alignment format | |SAM / BAM ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:50 EST ------- Updating summary to include SAM and BAM keywords. 
See also recent mailing list discussions such as this thread:
http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007397.html

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Mar 5 11:40:05 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 06:40:05 -0500
Subject: [Biopython-dev] [Bug 3010] Bio.KDTree is leaking memory
In-Reply-To:
Message-ID: <201003051140.o25Be532008197@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3010

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 06:40 EST -------
I suspect any memory leak is within KDTree.c function KDTree_set_data.
Looking at this I wondered how the memory allocated by KDTree_add_point
gets freed. The following *might* help, but even if I am right, this is
at best only a partial fix:

diff --git a/Bio/KDTree/KDTree.c b/Bio/KDTree/KDTree.c
index d074f26..07cdc1f 100644
--- a/Bio/KDTree/KDTree.c
+++ b/Bio/KDTree/KDTree.c
@@ -621,9 +621,14 @@ int KDTree_set_data(struct KDTree* tree, float *coords, long
         tree->_radius_list = NULL;
     }
     tree->_count=0;
+    if (tree->_data_point_list) {
+        free(tree->_data_point_list);
+        tree->_data_point_list = NULL;
+        tree->_data_point_list_size = 0;
+    }
     /* keep pointer to coords to delete it */
     tree->_coords=coords;

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From p.j.a.cock at googlemail.com Wed Mar 10 14:30:57 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 10 Mar 2010 14:30:57 +0000 Subject: [Biopython-dev] Biopython & Google Summer of Code 2010 (GSoc) Message-ID: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com> Dear Biopythoneers, The Open Bioinformatics Foundation (the Bio* umbrella organisation) is preparing an application for the 2010 Google Summer of Code (GSoC). http://code.google.com/soc/ If you are interested in becoming a mentor for a Biopython related project, you can join us in the application. If you are a student and are interested in a project (or would like to propose one), please take a look at these pages: http://www.open-bio.org/wiki/Google_Summer_of_Code http://biopython.org/wiki/Google_Summer_of_Code Regards, Brad & Peter From biopython at maubp.freeserve.co.uk Thu Mar 11 11:21:50 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 11 Mar 2010 11:21:50 +0000 Subject: [Biopython-dev] Bio.Phylo.Applications? Message-ID: <320fb6e01003110321u6ac77a89uce77306d332e675c@mail.gmail.com> Hi Eric et al, We have started a collection of command line tool wrappers for multiple sequence alignments under Bio.Align.Applications, so I was thinking about where to put wrappers for phylogenetic tree command line tools. How does Bio.Phylo.Applications sound (following the same structure as the Bio.Align.Applications module). The kind of things I am thinking about include: QuickTree (neighbour joining, NJ) http://www.sanger.ac.uk/resources/software/quicktree/ QuickJoin (NJ) http://www.daimi.au.dk/~mailund/quick-join.html RaxML (maximum likelihood, ML), http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm [We should talk to Biopython contributor Frank Kauff as he uses this with Python] And so on. Plus pointers in the documentation to the EMBOSS module for PHYLIP tools. 
Peter

From biopython at maubp.freeserve.co.uk Thu Mar 11 11:30:04 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Mar 2010 11:30:04 +0000
Subject: [Biopython-dev] Adding format method to phylo tree object?
Message-ID: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com>

Hi Eric (et al),

Are you familiar with the format method of the SeqRecord and
alignment object (plus the __format__ method which does the
same thing aiming to work nicely with the Python 2.6 built in
function format)? This allows the user to turn their data into
a string in a specified output format. Internally the method
calls Bio.SeqIO.write (or AlignIO) with a StringIO handle.

Do you think it would make sense to have this for the tree
objects in Bio.Phylo, allowing easy access to the object as a
Newick tree format etc?

For people using IPython, the __pretty__ method looks related.
I know the Bio.Nexus tree has a "pretty print" method which might
be exposed like this. I wonder if this convention will become
more widespread?
http://ipython.scipy.org/doc/stable/html/api/generated/IPython.external.pretty.html

Peter

From p.j.a.cock at googlemail.com Thu Mar 11 15:34:07 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Mar 2010 15:34:07 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
Message-ID: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>

Hi all,

It is probably time to start getting ready for Biopython 1.54,
perhaps aiming to release within about a month's time? This means
not landing any major additions to the trunk for now (keep things
like GFF and Geography on branches for now).

Other than finishing up any documentation for new stuff (especially
the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon,
are there any important issues we should address before the release?
Regards,

Peter

From tiagoantao at gmail.com Thu Mar 11 15:42:21 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 11 Mar 2010 15:42:21 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Message-ID: <6d941f121003110742p524a2d86wbe111ccf880d8bb@mail.gmail.com>

On Thu, Mar 11, 2010 at 3:34 PM, Peter Cock wrote:
> Other than finishing up any documentation for new stuff (especially
> the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon,
> are there any important issues we should address before the release?

I think I will be able to commit my code around the 20th. Currently I
need to address the issue of supporting thousands of markers in the
genepop parser as people do complain about that (like a couple of
times a month or so, not more).

--
"Heavier than air flying machines are impossible"
Lord Kelvin, President, Royal Society, c. 1895

From andrea at biocomp.unibo.it Thu Mar 11 17:11:00 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 11 Mar 2010 18:11:00 +0100 (CET)
Subject: [Biopython-dev] Planning for Biopython 1.54
Message-ID: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>

What about the Uniprot XML format parser?
The code is functional, and was reviewed, but it would be nice to have some
beta testing.
The only remaining "issue" is where to save the comment fields.
The actual implementation will work for biosql schema, and store most
of the data in the comment fields.
Andrea

From p.j.a.cock at googlemail.com Thu Mar 11 17:31:08 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Mar 2010 17:31:08 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>
References: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com>

On Thu, Mar 11, 2010 at 5:11 PM, Andrea Pierleoni wrote:
> What about the Uniprot XML format parser?
> The code is functional, and was reviewd, but it would be nice to have some
> beta testing.
> The only remaining "issue" is where to save the comment fields.
> The actual implementation will work for biosql schema, and store most
> of the data in the comment fields.
>
> Andrea

Hi Andrea,

Your UniProt XML parser was one of the things I thought we should delay until after getting Biopython 1.54 out the door, but I would expect it to be included in Biopython 1.55.

There are at least two remaining issues, (1) where to save the comment fields, and (2) what to call the format in SeqIO. Both of these should ideally be run by BioPerl and EMBOSS on the openbio-l mailing list to ensure the OBF projects which use simple strings for file formats are consistent. Would you like me to start a discussion there regarding the format name? e.g. Should it be "uniprot", "uniprot-xml", or maybe even "uniprotxml". Personally, "uniprot" seems fine provided this is going to be the primary file format for UniProt records in the short to medium term.

Also I don't think any of the current Biopython developers have sat down to review the code. As the Bio.SeqIO maintainer, I will do this, but right now I think getting Biopython 1.54 out should be prioritised. From a very quick look just now, the recent merging of the SFF support to the trunk will require a few tweaks in test_SeqIO.py (e.g. an empty file is not valid for SFF files as well as the UniProt XML).
Also including a UniProt XML file in test_BioSQL_SeqIO.py would be worthwhile. Regards, Peter From andrea at biocomp.unibo.it Thu Mar 11 17:43:13 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Thu, 11 Mar 2010 18:43:13 +0100 (CET) Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com> References: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it> <320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com> Message-ID: <4ee0d56a0ed98ff87b2dcf00b2c0d6e8.squirrel@lipid.biocomp.unibo.it> > > Hi Andrea, > > Your UnitProt XML parser was one of the things I thought we should > delay until after getting Biopython 1.54 out the door, but I would > expect it to be included in Biopython 1.55. > > There are at least two remaining issues, (1) where to save the comment > fields, and (2) what to call the format in SeqIO. Both of these should > ideally be run by BioPerl and EMBOSS on the openbio-l mailing list to > ensure the OBF projects which use simple strings for file formats are > consistent. Would you like me to start a discussion there regarding > the format name? e.g. Should it be "uniprot", "uniprot-xml", or maybe > even "unitprotxml". Personally, "uniprot" seems fine provided this is > going to be the primary file format for UniProt records in the short > to medium term. > Of course you are free to open a discussion. I used 'uniprot' for sake of simplicity, but then I noticed that the format is called 'uniprotxml' in EBI REST web services. A common name will easier for everybody. > Also I don't think any of the current Biopython developers have sat > down to review the code. The code was reviewed by Mauro Amico, I don't know if he is one of the "current Biopython developers", anyhow any additional review is welcome. > As the Bio.SeqIO maintainer, I will do this, > but right now I think getting Biopython 1.54 out should be > prioritised. 
From a very quick look just now, the recent merging of > the SFF support to the trunk will require a few tweaks in > test_SeqIO.py (e.g. an empty file is not valid for SFF files as well > as the UniProt XML). Also including a UniProt XML file in > test_BioSQL_SeqIO.py would be worthwhile. > Mauro also added some unit testing that should be useful for this. Let me know if you need any help/info. Bests, Andrea From p.j.a.cock at googlemail.com Thu Mar 11 17:49:50 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 11 Mar 2010 17:49:50 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <4ee0d56a0ed98ff87b2dcf00b2c0d6e8.squirrel@lipid.biocomp.unibo.it> References: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it> <320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com> <4ee0d56a0ed98ff87b2dcf00b2c0d6e8.squirrel@lipid.biocomp.unibo.it> Message-ID: <320fb6e01003110949v206a1868g6360002198a41ddd@mail.gmail.com> On Thu, Mar 11, 2010 at 5:43 PM, Andrea Pierleoni wrote: > >> >> Hi Andrea, >> >> Your UnitProt XML parser was one of the things I thought we should >> delay until after getting Biopython 1.54 out the door, but I would >> expect it to be included in Biopython 1.55. >> >> There are at least two remaining issues, (1) where to save the comment >> fields, and (2) what to call the format in SeqIO. Both of these should >> ideally be run by BioPerl and EMBOSS on the openbio-l mailing list to >> ensure the OBF projects which use simple strings for file formats are >> consistent. Would you like me to start a discussion there regarding >> the format name? e.g. Should it be "uniprot", "uniprot-xml", or maybe >> even "unitprotxml". Personally, "uniprot" seems fine provided this is >> going to be the primary file format for UniProt records in the short >> to medium term. > > Of course you are free to open a discussion. 
I used 'uniprot' for sake of > simplicity, but then I noticed that the format is called 'uniprotxml' in > EBI REST web services. A common name will easier for everybody. In that case, given the EBI REST convention, uniprotxml may be wise. >> Also I don't think any of the current Biopython developers have sat >> down to review the code. > > The code was reviewed by Mauro Amico, I don't know if he is one of the > "current Biopython developers", anyhow any additional review is welcome. I don't recall Mauro Amico contributing to Biopython in the past, but as you say, the more eyes on the code the better :) Peter From eric.talevich at gmail.com Thu Mar 11 22:54:38 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 11 Mar 2010 17:54:38 -0500 Subject: [Biopython-dev] Adding format method to phylo tree object? In-Reply-To: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com> References: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com> Message-ID: <3f6baf361003111454l6d1f0409pcb732e006a8b8f67@mail.gmail.com> On Thu, Mar 11, 2010 at 6:30 AM, Peter wrote: > Hi Eric (et al), > > Are you familiar with the format method of the SeqRecord and alignment > object (plus the __format__ method which does the same thing aiming to > work nicely with the Python 2.6 built in function format)? This allows > the user to turn their data into a string in a specified output > format. Internally the method calls Bio.SeqIO.write (or AlignIO) with > a StringIO handle. > > Do you think it would it make sense to have this for the tree objects > in Bio.Phylo, allowing easy access to the object as a Newick tree > format etc? > Sure, I could do that. It makes a lot of sense for Newick trees, and could be useful with the XML formats for debugging. > > For people using IPython, the __pretty__ method looks related. I know > the Bio.Nexus tree has a "prity print" method which might be exposed > like this. I wonder if this convention will become more widespread? 
> > http://ipython.scipy.org/doc/stable/html/api/generated/IPython.external.pretty.html > I didn't know about that. I also have a pretty_print method in Bio.Phylo which does something much different from the Bio.Nexus printer -- the Nexus one looks more like it's more useful for debugging the Tree object's internal structure in terms of references, so (highly biased judgment) I'm inclined to use the code from Bio.Phylo._utils.pretty_print to implement __pretty__ for IPython. But I'll play with this IPython feature to see how it's supposed to behave in general. -Eric From eric.talevich at gmail.com Thu Mar 11 23:03:59 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 11 Mar 2010 18:03:59 -0500 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> Message-ID: <3f6baf361003111503m4656258av721852264516818f@mail.gmail.com> On Thu, Mar 11, 2010 at 10:34 AM, Peter Cock wrote: > Hi all, > > It is probably time to starting getting ready for Biopython 1.54, > perhaps aiming to release within about a months time? > > This means not landing any major additions to the trunk for now (keep > things like GFF and Geography on branches for now). > > Other than finishing up any documentation for new stuff (especially > the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon, > are there any important issues we should address before the release? > Is it all right to leave the documentation for Bio.Phylo on the wiki for now, or should I try to add something to the main tutorial? 
-Eric From p.j.a.cock at googlemail.com Thu Mar 11 23:18:18 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 11 Mar 2010 23:18:18 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <3f6baf361003111503m4656258av721852264516818f@mail.gmail.com> References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> <3f6baf361003111503m4656258av721852264516818f@mail.gmail.com> Message-ID: <320fb6e01003111518o3f50b95bw6b2446611fbb9bf5@mail.gmail.com> On Thu, Mar 11, 2010 at 11:03 PM, Eric Talevich wrote: >> Other than finishing up any documentation for new stuff (especially >> the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon, >> are there any important issues we should address before the release? > > Is it all right to leave the documentation for Bio.Phylo on the wiki > for now, or should I try to add something to the main tutorial? I would like at least a short section in the tutorial mentioning the new module with a link to the wiki. That way people just browsing the tutorial to get an idea of what Biopython covers will be made aware of it. In the long term I think the module deserves a chapter (which can be based on the wiki text). Are you familiar with LaTeX? (The mark up language the tutorial is written in). Also, I think it would be great to have a post on the news server (which we can link to in the release announcement) talking about what Bio.Phylo adds (and thank GSoC and NESCent etc). A little advertising ;) How does that sound? Regards, Peter From biopython at maubp.freeserve.co.uk Thu Mar 11 23:23:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 11 Mar 2010 23:23:54 +0000 Subject: [Biopython-dev] Adding format method to phylo tree object? 
In-Reply-To: <3f6baf361003111454l6d1f0409pcb732e006a8b8f67@mail.gmail.com> References: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com> <3f6baf361003111454l6d1f0409pcb732e006a8b8f67@mail.gmail.com> Message-ID: <320fb6e01003111523r4fe5f4c7va9f77e089385ba0c@mail.gmail.com> On Thu, Mar 11, 2010 at 10:54 PM, Eric Talevich wrote: > On Thu, Mar 11, 2010 at 6:30 AM, Peter wrote: > >> Hi Eric (et al), >> >> Are you familiar with the format method of the SeqRecord and alignment >> object (plus the __format__ method which does the same thing aiming to >> work nicely with the Python 2.6 built in function format)? This allows >> the user to turn their data into a string in a specified output >> format. Internally the method calls Bio.SeqIO.write (or AlignIO) with >> a StringIO handle. >> >> Do you think it would it make sense to have this for the tree objects >> in Bio.Phylo, allowing easy access to the object as a Newick tree >> format etc? >> > > Sure, I could do that. It makes a lot of sense for Newick trees, and could > be useful with the XML formats for debugging. > Great. >> For people using IPython, the __pretty__ method looks related. I know >> the Bio.Nexus tree has a "pretty print" method which might be exposed >> like this. I wonder if this convention will become more widespread? >> >> http://ipython.scipy.org/doc/stable/html/api/generated/IPython.external.pretty.html >> > > I didn't know about that. I only read about it recently myself - it may not be worth doing. (I'm not trying to invent work here *grin*, just looking for things we can polish before your code gets its first proper release.) 
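As a rough illustration of the format method idea discussed in this thread (write the object to a StringIO handle and return the resulting string), here is a minimal sketch. The Tree class, the write function and the bare-bones Newick output are all hypothetical stand-ins invented for this example; they are not the Bio.Phylo or Bio.SeqIO API.

```python
from io import StringIO


class Tree:
    """Toy stand-in for a tree object (hypothetical, not Bio.Phylo)."""

    def __init__(self, name="", children=()):
        self.name = name
        self.children = list(children)

    def format(self, fmt):
        """Return the tree as a string in the requested file format."""
        handle = StringIO()
        write(self, handle, fmt)  # reuse the file writer on an in-memory handle
        return handle.getvalue()

    def __format__(self, format_spec):
        # Lets the built-in format(tree, "newick") delegate to .format();
        # an empty spec falls back to str(), as for most Python objects.
        if not format_spec:
            return str(self)
        return self.format(format_spec)


def _newick(node):
    """Serialise one node (and its subtree) without branch lengths."""
    if not node.children:
        return node.name
    return "(" + ",".join(_newick(child) for child in node.children) + ")" + node.name


def write(tree, handle, fmt):
    """Hypothetical writer; only a minimal Newick output is sketched."""
    if fmt != "newick":
        raise ValueError("Unknown format %r" % fmt)
    handle.write(_newick(tree) + ";\n")
    return 1  # number of trees written


tree = Tree(children=[Tree("A"), Tree(children=[Tree("B"), Tree("C")])])
print(tree.format("newick"), end="")
print(format(tree, "newick"), end="")
```

The point of the design is that the string conversion adds no new serialisation code: whatever the file writer supports becomes available through format() automatically.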
Thanks, Peter From lpritc at scri.ac.uk Fri Mar 12 08:18:09 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Fri, 12 Mar 2010 08:18:09 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> Message-ID: On 11/03/2010 Thursday, March 11, 15:34, "Peter Cock" wrote: > Other than finishing up any documentation for new stuff (especially > the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon, > are there any important issues we should address before the release? There are those updates to ePrimer3/PrimerSearch EMBOSS interaction (that you'll need for that differential primer script, BTW...) Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. 
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________

From biopython at maubp.freeserve.co.uk Fri Mar 12 13:22:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Mar 2010 13:22:55 +0000
Subject: [Biopython-dev] Daily builds of the Tutorial (PDF and HTML)
Message-ID: <320fb6e01003120522q22377f52nc0769ceb4e3add13@mail.gmail.com>

Hi all,

Back in November I set up a simple pair of cron jobs to update the code snapshot on http://biopython.open-bio.org/SRC/biopython/ every hour:
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007002.html

I've just added another job which takes the latest Tutorial.tex file and compiles it with pdflatex (already installed) and hevea (installed from source under my user account) to make the PDF and HTML files. These are then copied to the webserver and published as:

http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf

These are currently updated once a day (at 2:40am, which shouldn't be too busy whichever US time zone the server uses). That is assuming I got my crontab settings right - in the short term I'll keep an eye on it to check ;)

In comparison the "official" versions at the following URLs are generally updated only for releases:

http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

I know that not everyone has latex or hevea installed (installing hevea from source is a bit of a hassle even on Linux), and furthermore proofreading the raw markup in Tutorial.tex isn't that easy.
So, the point of all this effort is now anyone can help proofread the latest version of the tutorial - this should also be of use to those users/contributors actually running the latest code from git rather than the official releases. Regards, Peter From biopython at maubp.freeserve.co.uk Fri Mar 12 13:32:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 12 Mar 2010 13:32:32 +0000 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com> References: <200911250945.20870.jblanca@btc.upv.es> <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com> <200911251220.53881.jblanca@btc.upv.es> <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com> <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com> <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com> <3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com> <320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com> <320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com> Message-ID: <320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com> Hi all, I'd like to proceed as outlined below for Biopython 1.54, i.e. don't change the current Seq equality but add a warning that we plan to change it. Should we have a discussion on the main list first? Peter On Mon, Feb 22, 2010 at 2:48 PM, Peter wrote: > Hi all, > > I've just got back from Japan - Brad and I were fortunate to be > able to attend the DBCLS BioHackathon 2010 held in Tokyo, > http://hackathon3.dbcls.jp/ > > As Brad already mentioned in passing, we also managed to have > dinner one evening with Michiel, and had an informal chat about > Biopython plans. Expect a few more emails on other topics to > follow. > > One of the short term aims we agreed on was to press ahead > with the Seq equality changes outlined on this thread late last > year. 
Mailing list archive link:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007021.html
>
> To recap, the agreed best behaviour was to make Seq equality
> act like string equality, but to raise a Python warning when
> incompatible alphabets are compared (e.g. DNA to Protein).
> This also applies to all the other comparison operators:
> not equal, less than, greater than, less than or equal, and
> greater than or equal.
>
> This is my outline plan for the change:
>
> For Biopython up to 1.53, Seq class uses object equality,
> seq1==seq2 acts as id(seq1)==id(seq2)
>
> For Biopython 1.54 (and perhaps a few more releases),
> the Seq classes will still use object equality but will trigger
> a warning suggesting explicit use of id(seq1)==id(seq2)
> or str(seq1)==str(seq2) as appropriate.
>
> For Biopython 1.xx (maybe 1.55 or 1.56?) the Seq classes
> will switch to using string equality (with an alphabet aware
> warning for comparing DNA to RNA etc), but will also trigger
> a warning that this is a change from previous releases, and
> suggest in the short term the continued explicit use of either
> id(seq1)==id(seq2) for object identity or str(seq1)==str(seq2)
> for string identity.
>
> For Biopython 1.yy (maybe 1.57?) the Seq classes will
> use string equality (with an alphabet aware warning for
> comparing DNA to RNA etc), without any warning about
> this being a change from historic behaviour.
>
> These warning messages could also point at a wiki page,
> and we'd need a FAQ entry in the tutorial as well. The
> aim of this slightly drawn out switch is to try and make
> sure all users are aware of the change, even if they
> only update their copy of Biopython every few releases.
>
> Does that all sound sensible? If so, we should probably
> have an announcement on the main mailing list, in case
> there are any other views.
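To make the difference between the two explicit comparisons concrete, here is a minimal sketch of the proposed end state (string equality plus a warning for mismatched alphabets). The Seq class below is a toy written for this illustration and is an assumption throughout; it is not Bio.Seq.Seq, and the plain string alphabet labels are invented for the example.

```python
import warnings


class Seq:
    """Toy sequence class (hypothetical; not Bio.Seq.Seq)."""

    def __init__(self, data, alphabet="generic"):
        self.data = data
        self.alphabet = alphabet  # e.g. "DNA", "RNA", "protein" or "generic"

    def __str__(self):
        return self.data

    def __eq__(self, other):
        # Proposed end state: compare like strings, but warn when two
        # specific and different alphabets are being compared.
        other_alphabet = getattr(other, "alphabet", self.alphabet)
        if "generic" not in (self.alphabet, other_alphabet) \
                and self.alphabet != other_alphabet:
            warnings.warn("Incompatible alphabets %s and %s"
                          % (self.alphabet, other_alphabet))
        return str(self) == str(other)

    def __hash__(self):
        # Keep hashing consistent with string-based equality.
        return hash(str(self))


a = Seq("ACGT", "DNA")
b = Seq("ACGT", "DNA")

print(id(a) == id(b))    # explicit object identity (historic == behaviour)
print(str(a) == str(b))  # explicit string equality
print(a == b)            # string-like equality, as proposed for later releases
```

Code that already spells out id(seq1)==id(seq2) or str(seq1)==str(seq2) behaves the same at every stage of the transition, which is why the plan recommends the explicit forms in the meantime.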
> > Other more complex options include a flag for switching > between the modes - but that complexity doesn't seem > such a good idea to me. All my own code and most of > the unit tests use str(seq1)==str(seq2) explicitly anyway. > The only exception is some of the genetic algorithm unit > tests which do seem to want explicit object identity. > > Regards, > > Peter > From kellrott at gmail.com Fri Mar 12 18:00:45 2010 From: kellrott at gmail.com (Kyle) Date: Fri, 12 Mar 2010 10:00:45 -0800 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> Message-ID: > > > It is probably time to starting getting ready for Biopython 1.54, > perhaps aiming to release within about a months time? > > This means not landing any major additions to the trunk for now (keep > things like GFF and Geography on branches for now). > I think the zxJDBC support (Jython MySQL for BioSQL) was almost done. I don't think it counts as a major addition. I think to finish it off, we just needed to finalize the driver names. For post 1.54 stuff, I have some HMMER3, Pfam, and GO parsing code (Chris Lasher has a GO fork as well). But I need some community feedback to fill in the interface holes. Kyle From p.j.a.cock at googlemail.com Fri Mar 12 18:09:39 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 12 Mar 2010 18:09:39 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> Message-ID: <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com> On Fri, Mar 12, 2010 at 6:00 PM, Kyle wrote: >> >> It is probably time to starting getting ready for Biopython 1.54, >> perhaps aiming to release within about a months time? >> >> This means not landing any major additions to the trunk for now (keep >> things like GFF and Geography on branches for now). 
>
> I think the zxJDBC support (Jython MySQL for BioSQL) was almost done. I
> don't think it counts as a major addition. I think to finish it off, we
> just needed to finalize the driver names.

Oh yeah - I confess I'd forgotten about that. Has there been any news on the Jython front about SQLite support?

> For post 1.54 stuff, I have some HMMER3, Pfam, and GO parsing code (Chris
> Lasher has a GO fork as well). But I need some community feedback to fill in
> the interface holes.
> Kyle

Lots of exciting stuff to come then :)

Peter

From kellrott at gmail.com Fri Mar 12 18:28:45 2010
From: kellrott at gmail.com (Kyle)
Date: Fri, 12 Mar 2010 10:28:45 -0800
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
Message-ID:

>
>
> Oh yeah - I confess I'd forgotten about that. Has there been any news
> on the Jython front about SQLite support?
>

There is no official support, but you can always work through existing Java packages ( http://old.nabble.com/SQLite-%2B-JDBC-%2B-Jython.-Example-td13322270.html ).

Kyle

From eric.talevich at gmail.com Fri Mar 12 19:14:51 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 12 Mar 2010 14:14:51 -0500
Subject: [Biopython-dev] Bio.Phylo.Applications?
In-Reply-To: <320fb6e01003110321u6ac77a89uce77306d332e675c@mail.gmail.com>
References: <320fb6e01003110321u6ac77a89uce77306d332e675c@mail.gmail.com>
Message-ID: <3f6baf361003121114v36b8a311i5b4dc9cee27961c2@mail.gmail.com>

On Thu, Mar 11, 2010 at 6:21 AM, Peter wrote:
> Hi Eric et al,
>
> We have started a collection of command line tool wrappers for
> multiple sequence alignments under Bio.Align.Applications, so I was
> thinking about where to put wrappers for phylogenetic tree command
> line tools.
How does Bio.Phylo.Applications sound (following the same > structure as the Bio.Align.Applications module). > Sounds great to me! I don't have any code that would go there yet, but feel free to add the directory and any new code you have. -Eric From bugzilla-daemon at portal.open-bio.org Fri Mar 12 21:57:53 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 12 Mar 2010 16:57:53 -0500 Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry? In-Reply-To: Message-ID: <201003122157.o2CLvrtP008861@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3000 ------- Comment #4 from mmokrejs at ribosome.natur.cuni.cz 2010-03-12 16:57 EST ------- Hi Peter, I finally got back to this. Thank your for all your work. I would be glad if one could use the accession without the trailing ".1", etc for get_raw() and get(). I think just any version of the record should be returned, and maybe a list if there were multiple versions of the same. >>> print data.get_raw("BC035166") Traceback (most recent call last): File "", line 1, in File "Bio/SeqIO/_index.py", line 280, in get_raw handle.seek(dict.__getitem__(self, key)) KeyError: 'BC035166' >>> Similarly, if I loop over the entries I have to do: >>> mylist = ['ACC1', 'ACC2', 'ACC3'] >>> sequences = [] >>> for acc in data.keys(): ... if data.get(acc).id.split('.')[0] in mylist: ... 
        sequences.append(data.get(acc))

Oh no, this is not what I wanted, in full:

from Bio import SeqIO
data = SeqIO.index("full.gb", "gb")
mylist = ['AC11111.1', 'AC2222.2', 'AC3333.3']
sequences = []
for acc in mylist:
    if acc in map(lambda x: x.split('.')[0], data.keys()):
        print "Found %s" % acc
        if data.get(acc + '.1'):
            sequences.append(data.get(acc + '.1'))
        else:
            if data.get(acc + '.2'):
                sequences.append(data.get(acc + '.2'))
            else:
                sequences.append(data.get(acc + '.3'))
    else:
        print "Missing %s" % acc
output_handle = open("filtered.gb", "w")
SeqIO.write(sequences, output_handle, "genbank")

There was already a discussion on the user mailing list; I do not think forcing uppercase letters for GenBank files is a good idea. Just stick with what was supplied. I typically use mixed case myself to emphasize ORFs, and sometimes use lower case for low-quality regions. Anyway, I provided an original NCBI-web GenBank file of an EST in which the DNA sequence was in lowercase, and Biopython returned uppercase. In turn, the diff(1) command returns too many changed lines, unnecessarily. I suggest giving users an opportunity to specify on input parsing whether to keep mixed-case/lower-case or force uppercase. Also, I have often seen protein sequences in lower-case, which is ugly to my eyes, btw.

Finally, the remaining differences are here (probably the first is in bug #2578):

--- /tmp/orig.gb 2010-03-12 21:09:24.000000000 +0100
+++ /tmp/new.gb 2010-03-12 21:09:38.000000000 +0100
@@ -1,4 +1,4 @@
-LOCUS CR603932 1625 bp mRNA linear HTC 16-OCT-2008
+LOCUS CR603932 1625 bp DNA HTC 16-OCT-2008
 DEFINITION full-length cDNA clone CS0DK007YH24 of HeLa cells Cot 25-normalized
 of Homo sapiens (human).
 ACCESSION CR603932
@@ -29,39 +29,39 @@
 division of Invitrogen.
FEATURES Location/Qualifiers source 1..1625 - /organism="Homo sapiens" /mol_type="mRNA" - /db_xref="taxon:9606" /clone="CS0DK007YH24" + /db_xref="taxon:9606" /tissue_type="HeLa cells Cot 25-normalized" /plasmid="pCMVSPORT_6" + /organism="Homo sapiens" ORIGIN Thanks for all you work on this, it is a great service. ;-) Next, I will try to filter by .features['tissue_type'] but sadly will have to search for the very same string through COMMENT string as well. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Mar 12 22:05:39 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 12 Mar 2010 17:05:39 -0500 Subject: [Biopython-dev] [Bug 3026] New: Bio.SeqIO.InsdcIO._split_multi_line(): Your description cannot be broken into nice lines! Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3026 Summary: Bio.SeqIO.InsdcIO._split_multi_line(): Your description cannot be broken into nice lines! 
Product: Biopython
Version: 1.53
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: mmokrejs at ribosome.natur.cuni.cz

Traceback (most recent call last):
  File "/home/mmokrejs/bin/filter-accessions.py", line 22, in <module>
    SeqIO.write(sequences, output_handle, "genbank")
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", line 363, in write
    count = writer_class(handle).write_file(sequences)
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/Interfaces.py", line 271, in write_file
    count = self.write_records(records)
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/Interfaces.py", line 256, in write_records
    self.write_record(record)
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 691, in write_record
    self._write_comment(record)
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 579, in _write_comment
    self._write_multi_line("", line)
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 335, in _write_multi_line
    lines = self._split_multi_line(text, max_len)
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 279, in _split_multi_line
    "Your description cannot be broken into nice lines!"
AssertionError: Your description cannot be broken into nice lines!

Please fix the message so it prints out the accession/version number. ;-)

LOCUS       BF378302 501 bp mRNA linear EST 27-NOV-2000
DEFINITION  CM0-UM0001-060300-270-g07 UM0001 Homo sapiens cDNA, mRNA sequence.
ACCESSION   BF378302
VERSION     BF378302.1 GI:11367336
KEYWORDS    EST.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to 501)
  AUTHORS   Dias Neto,E., Garcia Correa,R., Verjovski-Almeida,S., Briones,M.R.,
            Nagai,M.A., da Silva,W.
Jr., Zago,M.A., Bordin,S., Costa,F.F., Goldman,G.H., Carvalho,A.F., Matsukuma,A., Baia,G.S., Simpson,D.H., Brunstein,A., deOliveira,P.S., Bucher,P., Jongeneel,C.V., O'Hare ,M.J., Soares,F., Brentani,R.R., Reis,L.F., de Souza,S.J. and Simpson,A.J. TITLE Shotgun sequencing of the human transcriptome with ORF expressed sequence tags JOURNAL Proc. Natl. Acad. Sci. U.S.A. 97 (7), 3491-3496 (2000) PUBMED 10737800 COMMENT Contact: Simpson A.J.G. Laboratory of Cancer Genetics Ludwig Institute for Cancer Research Rua Prof. Antonio Prudente 109, 4 andar, 01509-010, Sao Paulo-SP, Brazil Tel: +55-11-2704922 Fax: +55-11-2707001 Email: asimpson at ludwig.org.br This sequence was derived from the FAPESP/LICR Human Cancer Genome Project. This entry can be seen in the following URL (http://www.ludwig.org.br/scripts/gethtml2.pl?t1=CM0&t2=CM0-UM0001-060300-270-g07&t3=2000-03-06&t4=1 ) Seq primer: puc 18 forward. FEATURES Location/Qualifiers [cut] I have few more example slike this from some dbEST data, I think all from a same project, though. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sat Mar 13 13:43:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 13 Mar 2010 13:43:53 +0000 Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry? In-Reply-To: <4B9AA432.2050407@fold.natur.cuni.cz> References: <201002021840.o12Ie88i015906@portal.open-bio.org> <4B6995D0.3030405@fold.natur.cuni.cz> <320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com> <4B9AA432.2050407@fold.natur.cuni.cz> Message-ID: <320fb6e01003130543p10a43e32kdb073879dc406e11@mail.gmail.com> On Fri, Mar 12, 2010 at 8:29 PM, Martin MOKREJ? wrote: > > I finally got back to this. Thank your for all your work. 
> I would be glad if one could use the accession without > the trailing ".1", etc for get_raw() and get(). I think > just any version of the record should be returned, > and maybe a list if there were multiple versions of > the same. This is just a quick reply to answer this part of your email. It would be unwise to try and be clever with the key matching - in this case yes, for GenBank files we know what the names mean, accession.version - but this is not true in general. In this case the answer for your needs would be to use the Bio.SeqIO.index optional argument to specify the keys. e.g. something like this:

from Bio import SeqIO
def strip_version(identifier):
    return identifier.rsplit(".",1)[0]
my_dict = SeqIO.index(filename, "gb", key_function=strip_version)

That way all the keys will have just the accession without the version (assuming there are no clashes which I think will raise an error). Peter From sbassi at clubdelarazon.org Sun Mar 14 07:16:25 2010 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Sun, 14 Mar 2010 04:16:25 -0300 Subject: [Biopython-dev] Biopython & Google Summer of Code 2010 (GSoc) In-Reply-To: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com> References: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com> Message-ID: <9e2f512b1003132316j55a95ca7u6a87191ff877898d@mail.gmail.com> On Wed, Mar 10, 2010 at 11:30 AM, Peter Cock wrote: > related project, you can join us in the application. If you are > a student and are interested in a project (or would like to > propose one), please take a look at these pages: > http://www.open-bio.org/wiki/Google_Summer_of_Code > http://biopython.org/wiki/Google_Summer_of_Code Regarding GSoC call in Biopython, I found the PDB-Tidy task pretty interesting. I will study the proposal and write back to you. I am working currently with microRNA but I use Bio.PDB a lot to help my wife who does antigen structure prediction and works with modeller, PyMol and PDB files.
A tool like the proposed PDB-Tidy could come handily. Best, SB. From biopython at maubp.freeserve.co.uk Sun Mar 14 13:50:52 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 14 Mar 2010 13:50:52 +0000 Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry? In-Reply-To: <4B9BB1F6.9000505@fold.natur.cuni.cz> References: <201002021840.o12Ie88i015906@portal.open-bio.org> <4B6995D0.3030405@fold.natur.cuni.cz> <320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com> <4B9AA432.2050407@fold.natur.cuni.cz> <320fb6e01003130543p10a43e32kdb073879dc406e11@mail.gmail.com> <4B9BB1F6.9000505@fold.natur.cuni.cz> Message-ID: <320fb6e01003140650o54a8eea2h66ea87abc42c754@mail.gmail.com> On Sat, Mar 13, 2010 at 3:40 PM, Martin MOKREJ? wrote: > > Thanks Peter, > ?yes, that is what I already ended-up with in a more awkward way. ;-) > But basically I have the same workaround. > Best, > M. So does using the Bio.SeqIO.index() function's key_function argument seem like a good solution to your key problem? Peter From biopython at maubp.freeserve.co.uk Sun Mar 14 20:30:45 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 14 Mar 2010 20:30:45 +0000 Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry? In-Reply-To: <4B9AA432.2050407@fold.natur.cuni.cz> References: <201002021840.o12Ie88i015906@portal.open-bio.org> <4B6995D0.3030405@fold.natur.cuni.cz> <320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com> <4B9AA432.2050407@fold.natur.cuni.cz> Message-ID: <320fb6e01003141330t199bbbcfm6bf32c5357b9fd77@mail.gmail.com> On Fri, Mar 12, 2010 at 8:29 PM, Martin MOKREJ? wrote: > > Finally, the remaining differences are here (probably the first is in bug #2578): > > --- /tmp/orig.gb ? ? ? ?2010-03-12 21:09:24.000000000 +0100 > +++ /tmp/new.gb 2010-03-12 21:09:38.000000000 +0100 > @@ -1,4 +1,4 @@ > -LOCUS ? ? ? CR603932 ? ? ? ? ? ? ? ?1625 bp ? ?mRNA ? ?linear ? 
HTC 16-OCT-2008 > +LOCUS ? ? ? CR603932 ? ? ? ? ? ? ? ?1625 bp ? ?DNA ? ? ? ? ? ? ?HTC 16-OCT-2008 > ?DEFINITION ?full-length cDNA clone CS0DK007YH24 of HeLa cells Cot 25-normalized > ? ? ? ? ? ? of Homo sapiens (human). > ?ACCESSION ? CR603932 > @@ -29,39 +29,39 @@ > ? ? ? ? ? ? division of Invitrogen. > ?FEATURES ? ? ? ? ? ? Location/Qualifiers > ? ? ?source ? ? ? ? ?1..1625 > - ? ? ? ? ? ? ? ? ? ? /organism="Homo sapiens" > ? ? ? ? ? ? ? ? ? ? ?/mol_type="mRNA" > - ? ? ? ? ? ? ? ? ? ? /db_xref="taxon:9606" > ? ? ? ? ? ? ? ? ? ? ?/clone="CS0DK007YH24" > + ? ? ? ? ? ? ? ? ? ? /db_xref="taxon:9606" > ? ? ? ? ? ? ? ? ? ? ?/tissue_type="HeLa cells Cot 25-normalized" > ? ? ? ? ? ? ? ? ? ? ?/plasmid="pCMVSPORT_6" > + ? ? ? ? ? ? ? ? ? ? /organism="Homo sapiens" > ?ORIGIN > Yes, the LOCUS line issue would be part of Bug 2578. As to the order of the feature qualifiers, these are stored in a Python dictionary which does not preserve the order. I personally don't think the order of the qualifiers is important and thus don't care that is can change like this. Assuming the NCBI have a defined sort order for the qualifiers (I'm not aware one), then we could sort the feature qualifiers on output. Another option would be to store the qualifiers in an ordered-dictionary. Or just leave it as it is ;) Peter From bugzilla-daemon at portal.open-bio.org Sun Mar 14 23:31:51 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 14 Mar 2010 19:31:51 -0400 Subject: [Biopython-dev] [Bug 3026] Bio.SeqIO.InsdcIO._split_multi_line(): Your description cannot be broken into nice lines! In-Reply-To: Message-ID: <201003142331.o2ENVp3v015452@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3026 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-14 19:31 EST ------- I just used the Entrez web interface, and it comes with the URL split already to meet the 80 column limit. 
Also doing it via the API: >>> from Bio import Entrez >>> data = Entrez.efetch("nucest", id="BF378302", rettype="gb").read() >>> print data[1095:1800] PUBMED 10737800 COMMENT Contact: Simpson A.J.G. Laboratory of Cancer Genetics Ludwig Institute for Cancer Research Rua Prof. Antonio Prudente 109, 4 andar, 01509-010, Sao Paulo-SP, Brazil Tel: +55-11-2704922 Fax: +55-11-2707001 Email: asimpson at ludwig.org.br This sequence was derived from the FAPESP/LICR Human Cancer Genome Project. This entry can be seen in the following URL (http://www.ludwig.org.br/scripts/gethtml2.pl?t1=CM0&t2=CM0-UM0001- 060300-270-g07&t3=2000-03-06&t4=1) Seq primer: puc 18 forward. FEATURES Location/Qualifiers In this particular case, it looks like splitting the string on a hyphen would be a reasonable option (i.e. copy what the NCBI seems to be doing). Did you just cut and paste it from the NCBI's HTML page where it does seem to be shown with the URL is shown unbroken (giving a line more than 80 characters)? Or can we download a "broken" GenBank file from the NCBI somewhere (maybe the FTP site)? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Mar 15 00:44:59 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 14 Mar 2010 20:44:59 -0400 Subject: [Biopython-dev] [Bug 3026] Bio.SeqIO.InsdcIO._split_multi_line(): Your description cannot be broken into nice lines! In-Reply-To: Message-ID: <201003150044.o2F0ixwP017517@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3026 ------- Comment #2 from mmokrejs at ribosome.natur.cuni.cz 2010-03-14 20:44 EST ------- Most I copy&pasted from their web, so this is probably the case. 
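A rough sketch of the hyphen-splitting idea from comment #1, for illustration only. The function name split_long_text is hypothetical and this is not the code in Bio/SeqIO/InsdcIO.py; it assumes the text should be broken on spaces where possible, and that a word too long for one line may be cut just after a hyphen (as the NCBI output above appears to do), with a hard wrap as the last resort:

```python
def split_long_text(text, max_len):
    """Split text into lines of at most max_len characters (sketch).

    Breaks on spaces where possible. A single word longer than max_len
    is cut after the last hyphen that fits, or hard-wrapped if none does.
    """
    lines = []
    current = ""
    for word in text.split():
        while True:
            candidate = word if not current else current + " " + word
            if len(candidate) <= max_len:
                current = candidate
                break
            if current:
                # Flush the pending line and retry the word on its own.
                lines.append(current)
                current = ""
                continue
            # The word alone is too long: cut after the last hyphen in range.
            cut = word.rfind("-", 0, max_len) + 1
            if cut <= 0:
                cut = max_len  # no usable hyphen, hard wrap
            lines.append(word[:cut])
            word = word[cut:]
    if current:
        lines.append(current)
    return lines
```

With this approach a long URL such as the one in the COMMENT block would no longer trigger an assertion; whether a hard wrap is acceptable when no hyphen is available is a separate design question.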
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Mar 15 15:40:20 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 15 Mar 2010 15:40:20 +0000 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? Message-ID: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> Hi all (especially Eric), As recently discussed SeqIO and AlignIO will now take filenames as well as handles. This matches the existing behaviour of Bio.Nexus, Eric's Bio.Phylo, and several big 3rd party libraries like ReportLab. http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007352.html I've updated most of the tutorial to take advantage of this, and quickly got used to less typing when working at the Python prompt. It does make things easier, and I probably should have conceded this earlier. It made me wonder about relaxing another restraint of the SeqIO and AlignIO write functions - they currently insist on a list or iterator of records or alignments. Giving a single object raises an error, but we could handle this unambiguously. Amusingly Eric just updated Bio.Phylo to match this strict behaviour - one reason I sat down and wrote this email. So, should we continue to insist on:

record = SeqRecord(...)
SeqIO.write([record], filename, format)

or should we relax a little more and allow this too?:

record = SeqRecord(...)
SeqIO.write(record, filename, format)

For SeqIO and AlignIO we can do a simple isinstance check for a SeqRecord or alignment object - there isn't really a problem with ambiguity here. Probably also try for Phylo? What's the general consensus on the dev list?
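A minimal sketch of the isinstance idea, for illustration only. The Record class and write_fasta function here are hypothetical stand-ins (not Biopython's SeqRecord or the real SeqIO.write); the point is just that a single record can be wrapped in a list before the usual loop, so both call styles behave the same:

```python
import io


class Record(object):
    """Hypothetical stand-in for a SeqRecord (not the Biopython class)."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq


def write_fasta(records, handle):
    """Write one record or an iterable of records; return the count."""
    # A single record is unambiguous, so wrap it in a list.
    if isinstance(records, Record):
        records = [records]
    count = 0
    for record in records:
        handle.write(">%s\n%s\n" % (record.id, record.seq))
        count += 1
    return count


# Both call styles produce the same output.
single = io.StringIO()
listed = io.StringIO()
rec = Record("example", "ACGT")
write_fasta(rec, single)
write_fasta([rec], listed)
```

Checking against the record class, rather than testing whether the argument is iterable, avoids any ambiguity with record types that happen to be iterable themselves.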
Peter From updates at feedmyinbox.com Tue Mar 16 06:16:42 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Tue, 16 Mar 2010 02:16:42 -0400 Subject: [Biopython-dev] 3/16 BioStar - Biopython Questions Message-ID: <0ef45bfc18dff2fe627af99c71f3b412@74.63.51.88> ================================================== 1. Compare two protein sequences using local BLAST ================================================== March 15, 2010 at 7:24 PM Hi, I have been given a task to compare the all the protein sequences of a strain of campylobacter with a strain of E.coli. I would like to do this locally using Biopython and the inbuilt Blast tools. However, I'm stuck on how to program this and what tools I should be using. If anybody could point me in the right direction, I would be thankful! Cheers http://biostar.stackexchange.com/questions/302/compare-two-protein-sequences-using-local-blast -------------------------------------------------- =========================================================== Source: http://biostar.stackexchange.com/questions/tagged/biopython This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com. 230 Franklin Road Suite 814 Franklin, TN 37064 From mhampton at d.umn.edu Tue Mar 16 16:01:41 2010 From: mhampton at d.umn.edu (Marshall Hampton) Date: Tue, 16 Mar 2010 11:01:41 -0500 (CDT) Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? In-Reply-To: References: Message-ID: I'm strongly in favor of such relaxations. It would also be convenient if SeqRecords had a write function. -Marshall Hampton >So, should we continue to insist on: > >record = SeqRecord(...) 
>SeqIO.write([record], filename, format) >or should be relax a little more and allow this too?: >record = SeqRecord(...) >SeqIO.write(record, filename, format) >For SeqIO and AlignIO we can do a simple isinstance check >for a SeqRecord or alignment object - there isn't really a >problem with ambiguity here. Probably also try for Phylo? >What's the general consensus on the dev list? From rodrigo_faccioli at uol.com.br Tue Mar 16 19:24:58 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Tue, 16 Mar 2010 16:24:58 -0300 Subject: [Biopython-dev] Primary Sequence of all protein (help) Message-ID: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com> Hi all, I want to know the primary sequence (fasta file) of all proteins. In other the words, I would like a database which contain the fasta files of all proteins. I'm a computer scientist and I don't know how hard it is. However, we have worked with SEQRES section of PDB files and BioPython. So, we want to work with fasta files and BioPython to check our results. I searched the NCBI web-site where I found a lot of databases. I confess I'm lost with them :) Sorry if my email is a basic question. But, I'm very lost. 
Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 From biopython at maubp.freeserve.co.uk Tue Mar 16 19:42:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Mar 2010 19:42:43 +0000 Subject: [Biopython-dev] Primary Sequence of all protein (help) In-Reply-To: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com> References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com> Message-ID: <320fb6e01003161242w2f111653y6dceb9853412c649@mail.gmail.com> On Tue, Mar 16, 2010 at 7:24 PM, Rodrigo Faccioli wrote: > > Hi all, > > I want to know the primary sequence (fasta file) of all proteins. In other > the words, I would like a database which contain the fasta files of all > proteins. > > I'm a computer scientist and I don't know how hard it is. However, we have > worked with SEQRES section of PDB files and BioPython. So, we want to work > with fasta files and BioPython to check our results. A single FASTA file of all know proteins would be enormous. Even the non-redundant ("nr") dataset used by the NCBI for their hugely popular BLAST search is pretty big. It sounds like many all you need is a FASTA file containing all the sequences with structures in the PDB - something you may be able to download directly from the PDB FTP site. 
Peter From rodrigo_faccioli at uol.com.br Wed Mar 17 01:01:01 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Tue, 16 Mar 2010 22:01:01 -0300 Subject: [Biopython-dev] Primary Sequence of all protein (help) In-Reply-To: <320fb6e01003161242w2f111653y6dceb9853412c649@mail.gmail.com> References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com> <320fb6e01003161242w2f111653y6dceb9853412c649@mail.gmail.com> Message-ID: <3715adb71003161801n294d15ccwb3a52f6d5ea83c23@mail.gmail.com> Peter, Thank you for your reply. Actually, we want to store the sequence of the fasta files in a relational database which has been developed by my research group. So, we have developed some calculations with primary sequence of proteins.
We did not download the PDB database because our computation of protein properties are based on their primary sequence. Therefore, our idea is to work with the primary sequence of all proteins. My understanding is the PDB database contains the proteins which is known their tearty structure. The others are in other database. Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Tue, Mar 16, 2010 at 4:42 PM, Peter wrote: > On Tue, Mar 16, 2010 at 7:24 PM, Rodrigo Faccioli > wrote: > > > > Hi all, > > > > I want to know the primary sequence (fasta file) of all proteins. In > other > > the words, I would like a database which contain the fasta files of all > > proteins. > > > > I'm a computer scientist and I don't know how hard it is. However, we > have > > worked with SEQRES section of PDB files and BioPython. So, we want to > work > > with fasta files and BioPython to check our results. > > A single FASTA file of all know proteins would be enormous. Even the > non-redundant ("nr") dataset used by the NCBI for their hugely popular > BLAST search is pretty big. > > It sounds like many all you need is a FASTA file containing all the > sequences with structures in the PDB - something you may be > able to download directly from the PDB FTP site. 
> > Peter > From bugzilla-daemon at portal.open-bio.org Wed Mar 17 11:33:09 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 17 Mar 2010 07:33:09 -0400 Subject: [Biopython-dev] [Bug 2966] Primer3Commandline does not use EMBOSS 6.1.0 arguments In-Reply-To: Message-ID: <201003171133.o2HBX9kO004765@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2966 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-17 07:33 EST ------- (In reply to comment #2) > I also found an issue with the PrimerSearchCommandline. The command line > options -sequences and -primers do not appear to be used in EMBOSS6.1.0, having > been replaced by -seqall and -infile, respectively. I changed the options > accordingly, and the modified files are available at > http://github.com/widdowquinn/biopython/tree/emboss-branch. I've merged that fix on the master, http://github.com/biopython/biopython/commit/39708be130eb771eacccf96eed3e8ce0a44ea4f0 Will have a look at eprimer3 next. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Mar 17 12:13:46 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 17 Mar 2010 08:13:46 -0400 Subject: [Biopython-dev] [Bug 2966] Primer3Commandline does not use EMBOSS 6.1.0 arguments In-Reply-To: Message-ID: <201003171213.o2HCDkf4006396@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2966 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-17 08:13 EST ------- (In reply to comment #1) > I have made changes to Primer3Commandline that involve adding the EMBOSS 6.1.0 > arguments, and docstrings to each argument. I have also added doctests. > > The proposed code can be inspected at my GitHub repository: > > http://github.com/widdowquinn/biopython/commit/9c0643e333b0cafb4e356426fb4902e0e9d2385c > Cherry picked to merge to the trunk. Marking bug as fixed - thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sbassi at clubdelarazon.org Wed Mar 17 18:32:17 2010 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 17 Mar 2010 15:32:17 -0300 Subject: [Biopython-dev] Primary Sequence of all protein (help) In-Reply-To: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com> References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com> Message-ID: <9e2f512b1003171132k680a52e5ob052e84d89e68c0b@mail.gmail.com> On Tue, Mar 16, 2010 at 4:24 PM, Rodrigo Faccioli wrote: > I want to know the primary sequence (fasta file) of all proteins. In other > the words, I would like a database which contain the fasta files of all > proteins. You don't need Biopython to get this file. 
Just download NR database y use "fastacmd", a program found in the blast suite. BLAST FTP is not working for me right now so I can't give you the exact URL to download, but look from here: ftp://ftp.ncbi.nih.gov/blast/ Here is how to use fastacmd to retrieve sequences from NR database: http://pwet.fr/man/linux/commandes/fastacmd From kellrott at gmail.com Wed Mar 17 22:14:25 2010 From: kellrott at gmail.com (Kyle) Date: Wed, 17 Mar 2010 15:14:25 -0700 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com> References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com> Message-ID: > > > > I think the zxJDBC support (Jython MySQL for BioSQL) was almost done. I > > don't think it counts as a major addition. I think to finish it off, we > > just needed to finalize the driver names. > > Oh yeah - I confess I'd forgotten about that. > I've posted a fork from the master branch on github ( http://github.com/kellrott/biopython/tree/zxjdbc ) with only the changes related to zxjdbc. I've added two driver requests, "MySQL" and "PostgreSQL", that select the appropriate driver based on the platform. Kyle From tiagoantao at gmail.com Wed Mar 17 22:28:36 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 17 Mar 2010 22:28:36 +0000 Subject: [Biopython-dev] Planning for Biopython 1.54 In-Reply-To: <6d941f121003110742p524a2d86wbe111ccf880d8bb@mail.gmail.com> References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com> <6d941f121003110742p524a2d86wbe111ccf880d8bb@mail.gmail.com> Message-ID: <6d941f121003171528p1e60fbb8q419485f6c6f171c2@mail.gmail.com> Hi, 2010/3/11 Tiago Ant?o : > I think I will be able to commit my code around the 20th. 
Currently I > need to address the issue of supporting thousands of markers in the > genepop parser as people do complain about that (like a couple of > times a month or so, not more). I am going to add this and support for haploid markers also. I would like to ask, when its done (soon!) a code review on the part of support of thousands of markers (The parser will change in nature, and files will be maintained open during the whole existence of the parser object). No need for domain knowledge, just comments on code quality. Also some help with merging with the main trunk would be appreciated, as I don' t use github for my stuff (bazaar fan here ;) ). Thanks, Tiago -- "Heavier than air flying machines are impossible" Lord Kelvin, President, Royal Society, c. 1895 From rodrigo_faccioli at uol.com.br Thu Mar 18 00:59:49 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Wed, 17 Mar 2010 21:59:49 -0300 Subject: [Biopython-dev] Primary Sequence of all protein (help) In-Reply-To: <9e2f512b1003171132k680a52e5ob052e84d89e68c0b@mail.gmail.com> References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com> <9e2f512b1003171132k680a52e5ob052e84d89e68c0b@mail.gmail.com> Message-ID: <3715adb71003171759p7107f2cbod85339a5335374d5@mail.gmail.com> Sebastian, Thank you for your reply. I'll study it. -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Wed, Mar 17, 2010 at 3:32 PM, Sebastian Bassi wrote: > On Tue, Mar 16, 2010 at 4:24 PM, Rodrigo Faccioli > wrote: > > I want to know the primary sequence (fasta file) of all proteins. In > other > > the words, I would like a database which contain the fasta files of all > > proteins. 
> > You don't need Biopython to get this file. Just download NR database and > use "fastacmd", a program found in the blast suite. > BLAST FTP is not working for me right now so I can't give you the > exact URL to download, but look from here: > ftp://ftp.ncbi.nih.gov/blast/ > Here is how to use fastacmd to retrieve sequences from NR database: > http://pwet.fr/man/linux/commandes/fastacmd > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Thu Mar 18 11:19:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Mar 2010 11:19:03 +0000 Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54 Message-ID: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> On Wed, Mar 17, 2010 at 10:14 PM, Kyle wrote: >> >>> I think the zxJDBC support (Jython MySQL for BioSQL) was almost >>> done. I don't think it counts as a major addition. I think to finish it off, >>> we just needed to finalize the driver names. >> >> Oh yeah - I confess I'd forgotten about that. > > I've posted a fork from the master branch on github ( > http://github.com/kellrott/biopython/tree/zxjdbc ) with only the changes > related to zxjdbc. I've added two driver requests, "MySQL" and > "PostgreSQL", that select the appropriate driver based on the platform. > Kyle Hmm. I think it might be cleaner to have a new optional argument like database back end (MySQL, PostgreSQL, SQLite3). If the back end is specified without the driver (which would be the encouraged usage) then we will pick the driver at run time (based on if in Jython, or for PostgreSQL which drivers are installed). Existing scripts can continue to specify the driver directly (but we can eventually deprecate this?).
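A small sketch of how such a back end argument might choose a driver at run time. Everything here is an assumption for illustration: pick_driver is a hypothetical helper, not part of BioSQL; MySQLdb, psycopg2 and sqlite3 are used as typical CPython module names, and "zxJDBC" is only a placeholder label for the Jython case. It does not check which PostgreSQL drivers are actually installed:

```python
import sys

# Assumed driver names per back end on CPython (illustrative only).
CPYTHON_DRIVERS = {
    "mysql": "MySQLdb",
    "postgresql": "psycopg2",
    "sqlite3": "sqlite3",
}
# Placeholder label for the Jython case (illustrative only).
JYTHON_DRIVERS = {
    "mysql": "zxJDBC",
    "postgresql": "zxJDBC",
}


def pick_driver(backend, driver=None, platform=None):
    """Return a driver name for the given back end (hypothetical helper)."""
    if driver is not None:
        # Existing scripts that name the driver directly keep working.
        return driver
    if platform is None:
        platform = sys.platform
    # Jython reports a platform string starting with "java".
    if platform.startswith("java"):
        table = JYTHON_DRIVERS
    else:
        table = CPYTHON_DRIVERS
    try:
        return table[backend.lower()]
    except KeyError:
        raise ValueError("No known driver for back end %r" % backend)
```

For example, pick_driver("MySQL") would give "MySQLdb" under CPython, while passing driver="psycopg2" explicitly bypasses the lookup entirely.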
Peter From anaryin at gmail.com Thu Mar 18 11:33:05 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 18 Mar 2010 04:33:05 -0700 Subject: [Biopython-dev] Small Typo in PDBParser Message-ID: Hello All, There's a small typo in the Bio.PDB PDBParser module. Line 159: "PDBContructionError" should be "PDBConstructionError" So that I learn, how do I submit a bug and a patch to the project, such as in this case? Best! Jo?o [...] Rodrigues @ http://stanford.edu/~joaor/ From anaryin at gmail.com Thu Mar 18 11:36:15 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 18 Mar 2010 04:36:15 -0700 Subject: [Biopython-dev] Small Typo in PDBParser In-Reply-To: References: Message-ID: Well, actually, PDBConstructionError is not even defined.. It should likely be PDBConstructionException. Jo?o [...] Rodrigues @ http://stanford.edu/~joaor/ On Thu, Mar 18, 2010 at 4:33 AM, Jo?o Rodrigues wrote: > Hello All, > > There's a small typo in the Bio.PDB PDBParser module. Line 159: > > "PDBContructionError" should be "PDBConstructionError" > > So that I learn, how do I submit a bug and a patch to the project, such as > in this case? > > Best! > > Jo?o [...] Rodrigues > @ http://stanford.edu/~joaor/ > > From biopython at maubp.freeserve.co.uk Thu Mar 18 12:02:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Mar 2010 12:02:32 +0000 Subject: [Biopython-dev] Small Typo in PDBParser In-Reply-To: References: Message-ID: <320fb6e01003180502w573baa84od9924f4b8e2486c8@mail.gmail.com> On Thu, Mar 18, 2010 at 11:33 AM, Jo?o Rodrigues wrote: > Hello All, > > There's a small typo in the Bio.PDB PDBParser module. Line 159: > > "PDBContructionError" should be "PDBConstructionError" > > So that I learn, how do I submit a bug and a patch to the project, such as > in this case? > > Best! 
Hi João, If you've found a bug in a release, and worked out how to fix it, one of the first steps would be to try the latest code from the repository to see if the bug is still there (and if your fix would need changing). In this case the problem has already been fixed (February 23, 2010), see: http://github.com/biopython/biopython/commits/master/Bio/PDB/PDBParser.py For a simple change like this, you can use the command line tool diff to generate a patch file (see "man diff" for details), which you can then attach to a bug report on our bugzilla. The basic diff usage would be:

diff original_file.py fixed_file.py > bug_fix.patch

For more complex changes, I would suggest you look at learning git. If you make a change locally you can get a patch file with this:

git diff > bug_fix.patch

Or, publish the fix to a public copy of the repository (e.g. on github). See also http://biopython.org/wiki/GitUsage I hope that helps, and that you'll have more patches for us in future :) Peter From biopython at maubp.freeserve.co.uk Thu Mar 18 19:01:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Mar 2010 19:01:32 +0000 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? In-Reply-To: <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> Message-ID: <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> On Mon, Mar 15, 2010 at 5:26 PM, Eric Talevich wrote: > On Mon, Mar 15, 2010 at 11:40 AM, Peter wrote: >> >> So, should we continue to insist on: >> >> record = SeqRecord(...) >> SeqIO.write([record], filename, format) >> >> or should we relax a little more and allow this too?: >> >> record = SeqRecord(...)
>> SeqIO.write(record, filename, format) >> >> For SeqIO and AlignIO we can do a simple isinstance check >> for a SeqRecord or alignment object - there isn't really a >> problem with ambiguity here. Probably also try for Phylo? >> >> What's the general consensus on the dev list? > > Sounds good to me! The code I just deleted from Bio.Phylo._io > was doing something foolish anyway (testing whether the > argument is iterable) -- now that Bio.Phylo has a single legitimate > base class, I can restore the feature with an isinstance(trees, > BaseTree.Tree) check if we have a consensus here. > > -Eric There was another +1 vote from Marshall Hampton, and no comments against (so far). Let's leave it a few days, but unless anyone speaks out in favour of the status-quo (keep the current strict check in the write function), then make the change. Peter From biopython at maubp.freeserve.co.uk Thu Mar 18 19:04:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Mar 2010 19:04:10 +0000 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com> References: <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com> <200911251220.53881.jblanca@btc.upv.es> <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com> <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com> <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com> <3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com> <320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com> <320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com> <320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com> Message-ID: <320fb6e01003181204l5902cf37yc0cf9387b74994fd@mail.gmail.com> On Fri, Mar 12, 2010 at 1:32 PM, Peter wrote: > Hi all, > > I'd like to proceed as outlined below for Biopython 1.54, > i.e. don't change the current Seq equality but add a warning > that we plan to change it. 
I've done that to Bio/Seq.py on the trunk (added two FutureWarnings and docstring explanation). Assuming this doesn't trigger any regressions, we'd need to work on the documentation (in particular the tutorial, but also perhaps a news post?) and fix the GA unit test before the release. If anyone on the dev list thinks this is a bad idea, please speak up (sooner rather than later). Thanks, Peter From kellrott at gmail.com Thu Mar 18 19:28:58 2010 From: kellrott at gmail.com (Kyle) Date: Thu, 18 Mar 2010 12:28:58 -0700 Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54 In-Reply-To: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> Message-ID: What should the parameter by called? Possibilities: 'backend', 'dbtype', ... ideas anyone? Kyle On Thu, Mar 18, 2010 at 4:19 AM, Peter wrote: > Hmm. I think it might be cleaner to have a new optional argument like > batabase back end (MySQL, PostgreSQL, SQLite3). If the back end > is specified without the driver (which would be the encouraged usage) > then we will pick the driver at run time (based on if in Jython, or for > PostgreSQL which drivers are installed). Existing scripts can continue > to specify the driver directly (but we can eventually deprecated this?). > > Peter > From biopython at maubp.freeserve.co.uk Thu Mar 18 19:34:39 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Mar 2010 19:34:39 +0000 Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54 In-Reply-To: References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> Message-ID: <320fb6e01003181234m71cc777bxaf5f29f2fbe1f21f@mail.gmail.com> On Thu, Mar 18, 2010 at 7:28 PM, Kyle wrote: > What should the parameter be called? Possibilities: > 'backend', 'dbtype', ... ideas anyone? Just database would be too vague. I quite like backend. 
Peter From sbassi at clubdelarazon.org Thu Mar 18 19:39:40 2010 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Thu, 18 Mar 2010 16:39:40 -0300 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? In-Reply-To: <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> Message-ID: <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> On Thu, Mar 18, 2010 at 4:01 PM, Peter wrote: > There was another +1 vote from Marshall Hampton, and no > comments against (so far). Let's leave it a few days, but unless > anyone speaks out in favour of the status-quo (keep the > current strict check in the write function), then make the change. If we are going to change this, why not setting "fasta" as default input/output format? This would also results in less typing when processing fasta files (most of the time in my workflow at least). From bugzilla-daemon at portal.open-bio.org Thu Mar 18 21:27:48 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 18 Mar 2010 17:27:48 -0400 Subject: [Biopython-dev] [Bug 3029] New: PhyloXML.Phylogeny.is_preterminal() fails Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3029 Summary: PhyloXML.Phylogeny.is_preterminal() fails Product: Biopython Version: Not Applicable Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: joelb at lanl.gov tree.is_preterminal() raises an AttributeError "'Phylogeny' object has no attribute 'clades'" File BaseTree.py line 442. git fetch on Feb. 22. 
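For illustration only, this kind of failure can be reproduced with a toy sketch. The classes and function names below are simplified stand-ins invented for this example, not the real Bio.Phylo code. The assumption is that a tree object keeps its clades under its root, so a shared method that reads .clades directly works on a clade but raises AttributeError on the tree, while going through .root works on both.

```python
class Clade(object):
    """Toy clade: a node holding a list of child clades."""

    def __init__(self, clades=None):
        self.clades = clades or []

    @property
    def root(self):
        # A clade acts as the root of its own subtree.
        return self

    def is_terminal(self):
        return not self.clades


class Tree(object):
    """Toy tree: it has a root clade but no 'clades' attribute of its own."""

    def __init__(self, root):
        self.root = root


def is_preterminal_buggy(obj):
    # Reads .clades directly: fine for a Clade, AttributeError for a Tree.
    return all(child.is_terminal() for child in obj.clades)


def is_preterminal_fixed(obj):
    # Goes through .root, which both Tree and Clade provide.
    root = obj.root
    if root.is_terminal():
        return False
    return all(child.is_terminal() for child in root.clades)


tree = Tree(Clade([Clade(), Clade()]))
try:
    is_preterminal_buggy(tree)
except AttributeError as err:
    print("buggy version failed: %s" % err)
print(is_preterminal_fixed(tree))
```

The same function then works whether it is handed the whole tree or one of its clades.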
From p.j.a.cock at googlemail.com Thu Mar 18 22:03:09 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 18 Mar 2010 22:03:09 +0000
Subject: [Biopython-dev] Google Summer of Code is *ON* for OBF projects!
In-Reply-To: <4BA29706.8040606@cornell.edu>
References: <4BA29706.8040606@cornell.edu>
Message-ID: <320fb6e01003181503j7e3030aao7bce7ebf4d8be06@mail.gmail.com>

Good news for GSoC 2010 :)

---------- Forwarded message ----------
From: Robert Buels
Date: Thu, Mar 18, 2010 at 9:11 PM
Subject: Google Summer of Code is *ON* for OBF projects!

Hi all,

Great news: Google announced today that the Open Bioinformatics
Foundation has been accepted as a mentoring organization for this
summer's Google Summer of Code!

GSoC is a Google-sponsored student internship program for open-source
projects, open to students from around the world (not just US
residents). Students are paid a $5000 USD stipend to work as a
developer on an open-source project for the summer. For more on GSoC,
see GSoC 2010 FAQ at http://tinyurl.com/yzemdfo

Student applications are due April 9, 2010 at 19:00 UTC. Students who
are interested in participating should look at the OBF's GSoC page at
http://open-bio.org/wiki/Google_Summer_of_Code, which lists project
ideas, and who to contact about applying.

For current developers on OBF projects, please consider volunteering
to be a mentor if you have not already, and contribute project ideas.
Just list your name and project ideas on OBF wiki and on the relevant
project's GSoC wiki page.

Thanks to all who helped make OBF's application to GSoC a success, and
let's have a great, productive summer of code!

Rob Buels OBF GSoC 2010 Administrator From biopython at maubp.freeserve.co.uk Fri Mar 19 10:45:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Mar 2010 10:45:55 +0000 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? In-Reply-To: <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> Message-ID: <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> Hi Sebastian, On Thu, Mar 18, 2010 at 7:39 PM, Sebastian Bassi wrote: > On Thu, Mar 18, 2010 at 4:01 PM, Peter wrote: >> There was another +1 vote from Marshall Hampton, and no >> comments against (so far). Let's leave it a few days, but unless >> anyone speaks out in favour of the status-quo (keep the >> current strict check in the write function), then make the change. > > If we are going to change this, why not setting "fasta" as default > input/output format? This would also results in less typing when > processing fasta files (most of the time in my workflow at least). Give an inch and they'll take a mile ;) I agree that FASTA is likely to be the most common file format for most users, but I don't think we should make it the default. One specific reason is because the FASTA parser will allow and ignore a header comment, you will get confusing results if the file is not actually a FASTA file (typically it will parse other text files like GenBank, EMBL or FASTQ with no errors, but will return no records). I am worried that people will assume that if they don't specify the format that Biopython will determine it automatically - which it won't. [Yes, I'm talking about the read/parse functions here, but it would be odd if the write function defaulted to FASTA but they did not.] 
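To make that failure mode concrete, here is a toy sketch. The parser below is a deliberately simplified stand-in written for this example (it is not the Bio.SeqIO code, and the function name is made up). Like a lenient FASTA reader, it skips anything before the first '>' line, so GenBank-style text raises no error and simply yields no records.

```python
def toy_fasta_records(text):
    """Yield (title, sequence) pairs; silently skip text before the first '>'."""
    title, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append(line.strip())
        # Lines before the first '>' are treated as a header comment and ignored.
    if title is not None:
        yield title, "".join(chunks)


fasta_text = ">seq1 example\nACGT\nACGT\n"
genbank_like_text = "LOCUS       EXAMPLE   8 bp    DNA\nORIGIN\n        1 acgtacgt\n//\n"

print(list(toy_fasta_records(fasta_text)))         # one record
print(list(toy_fasta_records(genbank_like_text)))  # no records, and no error
```

With an explicit format argument the caller at least states what they expect, which is the point being made above.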
Also, could you clarify if you are in favour of relaxing the requirement that the write function takes a list/iterator of records/alignments to allow a single SeqRecord or alignment? Thanks, Peter From bugzilla-daemon at portal.open-bio.org Fri Mar 19 13:22:51 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Mar 2010 09:22:51 -0400 Subject: [Biopython-dev] [Bug 3029] PhyloXML.Phylogeny.is_preterminal() fails In-Reply-To: Message-ID: <201003191322.o2JDMpYW015069@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3029 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from eric.talevich at gmail.com 2010-03-19 09:22 EST ------- (In reply to comment #0) > tree.is_preterminal() > > raises an AttributeError > "'Phylogeny' object has no attribute 'clades'" > File BaseTree.py line 442. > > git fetch on Feb. 22. > Thanks for catching this. It's fixed on the trunk now. I also checked the rest of TreeMixin for other occurrences of the same problem (accessing self.clades directly instead of going through self.root.clades) and found none, so it shouldn't happen again. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sbassi at clubdelarazon.org Fri Mar 19 22:08:17 2010 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Fri, 19 Mar 2010 19:08:17 -0300 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? 
In-Reply-To: <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> Message-ID: <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com> On Fri, Mar 19, 2010 at 7:45 AM, Peter wrote: > Give an inch and they'll take a mile ;) In Spanish we say: Give a hand and they'll take the whole arm :) > that if they don't specify the format that Biopython will > determine it automatically - which it won't. In this respect, Python zen favours being explicit,so I see your point. > Also, could you clarify if you are in favour of relaxing the > requirement that the write function takes a list/iterator of > records/alignments to allow a single SeqRecord or alignment? Is OK for me to allow a single record instead of a iterable, this change will not break any existing code so it is OK for me. Best, SB. From bugzilla-daemon at portal.open-bio.org Sat Mar 20 06:24:39 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 20 Mar 2010 02:24:39 -0400 Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling In-Reply-To: Message-ID: <201003200624.o2K6OdCd010209@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2948 ------- Comment #1 from crosvera at gmail.com 2010-03-20 02:24 EST ------- Created an attachment (id=1463) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1463&action=view) propose patch bug2948.patch -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sat Mar 20 06:26:10 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 20 Mar 2010 02:26:10 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling
In-Reply-To:
Message-ID: <201003200626.o2K6QAoV010279@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2948

crosvera at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crosvera at gmail.com

------- Comment #2 from crosvera at gmail.com 2010-03-20 02:26 EST -------
Here is an example of what Paul describes:

bash-4.0$ python
Python 2.6.4 (r264:75706, Mar 10 2010, 15:54:34)
[GCC 4.4.1 (CRUX)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.PDB import *
>>> parser=PDBParser()
>>> structure = parser.get_structure("2beg", "2BEG.pdb")
>>> structure.header.keys()
['structure_method', 'head', 'journal', 'journal_reference', 'compound',
'keywords', 'name', 'author', 'deposition_date', 'release_date', 'source',
'resolution', 'structure_reference']
>>> structure.header['head']
'protein fibril'
>>> structure.header['name']
" d structure of alzheimer's abeta(1-42) fibrils"

I made a patch which changes the regex.
From: tail=re.sub("\A\w+\s+\d*\s*","",h)
To:   tail=re.sub("\A\w+\s+\d*\s+","",h)

This patch seems to work. The result I got is this:

bash-4.0$ python
Python 2.6.4 (r264:75706, Mar 10 2010, 15:54:34)
[GCC 4.4.1 (CRUX)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.PDB import *
>>> parser=PDBParser()
>>> structure = parser.get_structure("2beg", "2BEG.pdb")
>>> structure.header.keys()
['structure_method', 'head', 'journal', 'journal_reference', 'compound',
'keywords', 'name', 'author', 'deposition_date', 'release_date', 'source',
'resolution', 'structure_reference']
>>> structure.header['head']
'protein fibril'
>>> structure.header['name']
" 3d structure of alzheimer's abeta(1-42) fibrils"
>>>

I propose this patch (my first one).

--
Carlos Ríos V.

From bugzilla-daemon at portal.open-bio.org Sat Mar 20 06:56:53 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 20 Mar 2010 02:56:53 -0400
Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for oldest entry.
In-Reply-To:
Message-ID: <201003200656.o2K6urZa011050@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2949

crosvera at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crosvera at gmail.com

------- Comment #1 from crosvera at gmail.com 2010-03-20 02:56 EST -------
(In reply to comment #0)
> [...]
>     elif key=="REVDAT":
>         #Modified by Paul T. Bathen to get most recent date instead of
> oldest date.
>         #Also added additional dict entries
>         if dict['release_date'] == "1909-01-08": #set in init
>             rr=re.search("\d\d-\w\w\w-\d\d",tail)
>             if rr!=None:
>                 dict['release_date']=_format_date(_nice_case(rr.group()))
>
>         dict['mod_number'] = hh[7:10].strip()
>         dict['mod_id'] = hh[23:28].strip()
>         dict['mod_type'] = hh[31:32].strip()

The Protein Data Bank Contents Guide (Version 3.20,
http://www.wwpdb.org/documentation/format32/sect2.html#REVDAT) says that
modNum uses columns 8-10.
modId uses columns 24-27.
And modType uses column 32.

So the last part of your code should change to:

dict['mod_number'] = hh[7:9]
dict['mod_id'] = hh[23:26]
dict['mod_type'] = hh[31]

Regards.

--
Carlos Rios V.

From bugzilla-daemon at portal.open-bio.org Sat Mar 20 23:02:16 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 20 Mar 2010 19:02:16 -0400
Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for oldest entry.
In-Reply-To:
Message-ID: <201003202302.o2KN2GFb006461@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2949

------- Comment #2 from crosvera at gmail.com 2010-03-20 19:02 EST -------
This is what I currently get with the existing code:

bash-4.0$ python
Python 2.6.4 (r264:75706, Mar 10 2010, 15:54:34)
[GCC 4.4.1 (CRUX)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.PDB import *
>>> parser=PDBParser()
>>> structure = parser.get_structure("2beg", "../2BEG.pdb")
>>> structure.header.keys()
['structure_method', 'head', 'journal', 'journal_reference', 'compound',
'keywords', 'name', 'author', 'deposition_date', 'release_date', 'source',
'resolution', 'structure_reference']
>>> structure.header['release_date']
'2005-11-22'
>>>

but the grep command returns this:

bash-4.0$ grep REVDAT ../2BEG.pdb
REVDAT   3   24-FEB-09 2BEG    1       VERSN
REVDAT   2   20-DEC-05 2BEG    1       JRNL
REVDAT   1   22-NOV-05 2BEG    0

So the existing code is showing the oldest date from REVDAT. I don't know
whether you (the developer) intend 'release_date' to mean the first
revision or the last. But I think, as Paul said, that it should be the
most recent date.
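As a rough sketch of what "most recent" could mean here (this is not the existing Bio.PDB code; the helper names are made up for illustration), the REVDAT lines can be sliced by fixed columns and the entry with the highest modNum selected. The assumptions are that modDate sits in columns 14-22, as I read the format guide, and that the highest modNum is the latest revision. The example lines are rebuilt with a format string so the columns are exact.

```python
def make_revdat_line(mod_number, mod_date, mod_id, mod_type):
    """Build a REVDAT line with fields in their fixed columns (example data only)."""
    return "REVDAT %3d   %s %s    %d" % (mod_number, mod_date, mod_id, mod_type)


def parse_revdat_line(line):
    """Slice one REVDAT line: modNum 8-10, modDate 14-22, modId 24-27, modType 32."""
    return {
        "mod_number": int(line[7:10]),
        "mod_date": line[13:22].strip(),
        "mod_id": line[23:27].strip(),
        "mod_type": line[31:32].strip(),
    }


# The three revisions shown by grep above, rebuilt column by column.
lines = [
    make_revdat_line(3, "24-FEB-09", "2BEG", 1),
    make_revdat_line(2, "20-DEC-05", "2BEG", 1),
    make_revdat_line(1, "22-NOV-05", "2BEG", 0),
]
revisions = [parse_revdat_line(line) for line in lines]

# Assumption: the highest modNum is the most recent revision.
latest = max(revisions, key=lambda rev: rev["mod_number"])
first = min(revisions, key=lambda rev: rev["mod_number"])
print(latest["mod_date"])  # 24-FEB-09
print(first["mod_date"])   # 22-NOV-05
```

Keeping every parsed revision in a list like this would also leave room to report both the first and the latest date.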
By the way, my previous comment I said that the last part of the code pasted by Paul should be: dict['mod_number'] = hh[7:9] dict['mod_id'] = hh[23:26] dict['mod_type'] = hh[31] But it has to be: dict['mod_number'] = hh[7:10] dict['mod_id'] = hh[23:27] dict['mod_type'] = hh[31] Other thing, instead to add 'mod_number', 'mod_id', 'mod_type' directly into dict. I think that these keys should be inside a 'release_data' key: dict={'name':"", [...] 'release_date' : "1909-01-08", 'release_data' : {'mod_number' : "", 'mod_id' : "", 'mod_type' : ""}, 'structure_method' : "unknown", [...] } Please comment :) Regards. -- Carlos Rios V. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From k.okonechnikov at gmail.com Sun Mar 21 04:29:30 2010 From: k.okonechnikov at gmail.com (Konstantin Okonechnikov) Date: Sun, 21 Mar 2010 10:29:30 +0600 Subject: [Biopython-dev] GSOC 2010 Tiny-PDB project Message-ID: <1408c93c1003202129l64a4c674t8f14185845ce6e25@mail.gmail.com> Dear BioPython developers, my name is Konstantin. I am first-year master's student at Novosibirsk State University, Russia. The subject of my bachelor diploma work was development of 3D biological macromolecular structure visualization tool for open-source bioinformatics project called UGENE . This work was successfully finished about a year ago. The task included a lot of work with PDB format: parsing, correctness testing etc. For testing purposes even whole PDB database was downloaded and tested for simple assertions. Such stress testing revealed a lot of problems and helped to improve code significantly. So, one may say, I have some experience with PDB format :) I used BioPython when I was studying bioinformatics basics and really liked it. I would like to contribute to the project by improving Bio.PDB module and implementing a set of convenient tools to work with PDB files. 
Best regards, Konstantin From tiagoantao at gmail.com Sun Mar 21 12:59:31 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sun, 21 Mar 2010 12:59:31 +0000 Subject: [Biopython-dev] Changes to the main repo Message-ID: <6d941f121003210559o506b853ci381927fed3aa836f@mail.gmail.com> Hi, I've made some changes in the main repository (my first changes with github), some comments: 1. Many thanks for the GitUsage wiki page. REALLY useful. 2. That being said, if I did any mistakes, they are my own fault. 3. I've added support for big genepop files, something I tend do be asked quite a lot 4. And support for haploid data (nobody really asked this) 5. I remember Peter sending an email about needed corrections to the code. I am afraid I've lost that email :( . If you send it to me, I will do them ASAP 6. New test cases and test data files 7. I might add support, in the future, to Arlequin (file format and application). Allowing for statistics over sequences and other goodies with sequence data. Regards, Tiago -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From eric.talevich at gmail.com Mon Mar 22 01:54:27 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 21 Mar 2010 21:54:27 -0400 Subject: [Biopython-dev] GSoC: Refining the PDB-Tidy project idea Message-ID: <3f6baf361003211854g41a4d358pc7fc49c156dcbb7b@mail.gmail.com> Hi GSoC'ers, The PDB-Tidy idea on Biopython's Summer of Code page seems to have attracted interest from a number of highly qualified students: http://biopython.org/wiki/Google_Summer_of_Code#PDB-Tidy:_command-line_tools_for_manipulating_PDB_files Please, don't let this deter you from applying! Google allocates student slots to each organization based on the number of applications received, so if OBF receives more applications, we can accept more students. However, I'm also concerned that I've made the project description too general. (Or is it too specific?) 
This article describes the characteristics of a well-defined GSoC project idea: http://en.flossmanuals.net/GSoCMentoring/SelectingProjects In the interest of improving the opportunities for each student, I'm suggesting that the proposals that are submitted under the PDB-Tidy theme focus on a specific goal beyond the manipulation PDB files. At the risk of being "That Guy", I'll give some examples of what I mean: (a) Improve interoperability with external tools like AutoDock or Modeller; (b) Port some MolProbity-like functionality to Biopython; (c) Improve interoperability and consistency between Bio.PDB and the rest of Biopython; (d) Write a parser for some useful format. Also, would anyone else be interested in co-mentoring one of these projects? It's good for a GSoC project to have a secondary mentor -- not required, but helpful -- and I think some support from a more experienced structural biologist would be valuable here. Thanks & best regards, Eric From bugzilla-daemon at portal.open-bio.org Mon Mar 22 02:50:24 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 21 Mar 2010 22:50:24 -0400 Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling In-Reply-To: Message-ID: <201003220250.o2M2oOoP003409@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2948 ------- Comment #3 from eric.talevich at gmail.com 2010-03-21 22:50 EST ------- (In reply to comment #2) > > I made a patch, which change the regex. > From: tail=re.sub("\A\w+\s+\d*\s*","",h) > TO: tail=re.sub("\A\w+\s+\d*\s+","",h > Seems that this patch works. The result I got is this: > > ... Thanks for triaging this, Carlos. 
However, I think it would be better if the code is a direct reflection of the actual PDB specification: http://www.wwpdb.org/documentation/format32/sect2.html It looks like "continuation" numbers are ignored by this code, so only the text starting in column 11 onward (hh[10:]) is ever used, also dropping leading spaces. Similarly, the key found by regexp is just the first whitespace-delimited word. Can you change your patch to use string methods instead of regular expressions? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Mon Mar 22 03:22:32 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 21 Mar 2010 23:22:32 -0400 Subject: [Biopython-dev] GSOC 2010 Tiny-PDB project In-Reply-To: <1408c93c1003202129l64a4c674t8f14185845ce6e25@mail.gmail.com> References: <1408c93c1003202129l64a4c674t8f14185845ce6e25@mail.gmail.com> Message-ID: <3f6baf361003212022m73aeff9kdedd2a949871d5b@mail.gmail.com> On Sun, Mar 21, 2010 at 12:29 AM, Konstantin Okonechnikov < k.okonechnikov at gmail.com> wrote: > Dear BioPython developers, > my name is Konstantin. I am first-year master's student at Novosibirsk > State > University, Russia. > The subject of my bachelor diploma work was development of 3D biological > macromolecular structure visualization tool for open-source bioinformatics > project called UGENE . This work was successfully > finished about a > year > ago. > The task included a lot of work with PDB format: parsing, correctness > testing etc. For testing purposes even whole PDB database was downloaded > and > tested for simple assertions. Such stress testing revealed a lot of > problems > and helped to improve code significantly. So, one may say, I have some > experience with PDB format :) > I used BioPython when I was studying bioinformatics basics and really liked > it. 
I would like to contribute to the project by improving Bio.PDB module > and implementing a set of convenient tools to work with PDB files. > Best regards, > Konstantin > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > Hi Konstantin, That's really cool. You might also be interested in a project based on this idea from another GSoC organization, NESCent: https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#Visualization_of_Protein_3D_Structure_Evolution (It's OK to apply to more than one GSoC mentoring organization as a student.) I sent an e-mail earlier today describing some possible refinements to the PDB-Tidy project; did any of those interest you? While we're at it, here's a good place to start improving Bio.PDB before Summer of Code begins: http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&bug_status=NEW&bug_status=REOPENED Feel free to e-mail or gchat me with any questions you have. Thanks, Eric From bugzilla-daemon at portal.open-bio.org Mon Mar 22 03:50:47 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 21 Mar 2010 23:50:47 -0400 Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for oldest entry. In-Reply-To: Message-ID: <201003220350.o2M3olGt004920@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2949 ------- Comment #3 from eric.talevich at gmail.com 2010-03-21 23:50 EST ------- (In reply to comment #2) > So, the actual code is showing the oldest date from REVDAT. I don't know if you > (the developer) are trying to say with 'release_date' if is the first version > or the last. But I think, as Paul said, that should be the most current date. It's probably an accidental result of repeatedly setting the same field. 
Surely the most recent revision date is at least as important as the date of the first revision, given that the initial deposition date is recorded separately. I'm not the original developer, but I'd say it would be best to keep a list or dictionary in a new "revisions" attribute, leaving release_date alone or deprecating it in case someone is actually relying on the current behavior. We should discuss this on biopython-dev before implementing it. > Other thing, instead to add 'mod_number', 'mod_id', 'mod_type' directly into > dict. I think that these keys should be inside a 'release_data' key: That name could lead to some typo-related confusion... but yes, a list-of-dicts or dict-of-dicts would be a nice way to store this info. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From k.okonechnikov at gmail.com Mon Mar 22 06:04:50 2010 From: k.okonechnikov at gmail.com (Konstantin Okonechnikov) Date: Mon, 22 Mar 2010 12:04:50 +0600 Subject: [Biopython-dev] GSOC 2010 Tiny-PDB project In-Reply-To: <3f6baf361003212022m73aeff9kdedd2a949871d5b@mail.gmail.com> References: <1408c93c1003202129l64a4c674t8f14185845ce6e25@mail.gmail.com> <3f6baf361003212022m73aeff9kdedd2a949871d5b@mail.gmail.com> Message-ID: <884d1faa1003212304vcdc86d6t4a6931adce8214fc@mail.gmail.com> Hi Eric! Hi Konstantin, > > That's really cool. You might also be interested in a project based on this > idea from another GSoC organization, NESCent: > > https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#Visualization_of_Protein_3D_Structure_Evolution > > This project looks really nice, though it requires some proficiency in Java. Actually I don't like the idea of applying to many organizations, I would better choose one project and concentrate my efforts. > (It's OK to apply to more than one GSoC mentoring organization as a > student.) 
>
> I sent an e-mail earlier today describing some possible refinements to the
> PDB-Tidy project; did any of those interest you?
>
>
I need some time to investigate them. There is one question so far: what
"useful formats" do you have in mind? AFAIK, there are not so many data
formats for storing 3D structures. I know about PDB XML and the NCBI data
format. The latter is an ASN.1 variation; it is used for different kinds of
data (sequences etc.).

> While we're at it, here's a good place to start improving Bio.PDB before
> Summer of Code begins:
>
>
> http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&bug_status=NEW&bug_status=REOPENED
>
>
OK, I will look at it. Fixing a couple of bugs is a good way to get
acquainted with the code :)

>
>
> Feel free to e-mail or gchat me with any questions you have.
>
> Thanks,
> Eric
>

p.s. Sorry for the misprint in the letter subject, I hope that the project
won't be that small :)

--
Best regards,
Okonechnikov Konstantin

From p.j.a.cock at googlemail.com Mon Mar 22 09:19:38 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 22 Mar 2010 09:19:38 +0000
Subject: [Biopython-dev] pylint, was: Changes to the main repo
Message-ID: <320fb6e01003220219u5f2020e1v6826a4e331ceb96d@mail.gmail.com>

2010/3/21 Tiago Antão :
> Hi,
>
> I've made some changes in the main repository (my first changes with
> github), some comments:
> 1. Many thanks for the GitUsage wiki page. REALLY useful.
> 2. That being said, if I did any mistakes, they are my own fault.
> 3. I've added support for big genepop files, something I tend do be
> asked quite a lot
> 4. And support for haploid data (nobody really asked this)
> 5. I remember Peter sending an email about needed corrections to the
> code. I am afraid I've lost that email :( . If you send it to me, I
> will do them ASAP
> 6. New test cases and test data files
> 7. I might add support, in the future, to Arlequin (file format and
> application). Allowing for statistics over sequences and other goodies
> with sequence data.
>
> Regards,
> Tiago

Hi Tiago,

That sounds good.

Regarding point 5, running pylint over the code reported some
possible errors in Bio.PopGen. Have a look at this - they are
all undefined variable issues:

http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007354.html

I just ran it again on the latest code, and the line numbers have
changed a tiny bit but that is all:

$ pylint --disable-msg-cat=CRW --include-ids=y --disable-msg=E1101,E1103,E0102 -r n Bio.PopGen
No config file found, using default configuration
************* Module Bio.PopGen.Async
E0602: 78:Async.get_result: Undefined variable 'done'
E0602: 79:Async.get_result: Undefined variable 'done'
************* Module Bio.PopGen.GenePop
E0602:166:Record.split_in_pops: Undefined variable 'GenePop'
E0602:183:Record.split_in_loci: Undefined variable 'GenePop'
************* Module Bio.PopGen.GenePop.Controller
E0602: 41:_read_allele_freq_table: Undefined variable 'self'
E0602:133:_hw_func: Undefined variable 'self'
E0602:393:GenePopController.test_pop_hw_prob: Undefined variable 'ext'
E0602:458:GenePopController.test_ld.ld_pop_func: Undefined variable 'currrent_pop'
************* Module Bio.PopGen.GenePop.FileParser
E1120:219:FileRecord.remove_locus_by_name: No value passed for parameter 'fw' in function call
************* Module Bio.PopGen.SimCoal.Cache
E0602: 79:SimCoalCache.getSimulation: Undefined variable 'Config'
E0602: 88: Undefined variable 'Cache'
************* Module Bio.PopGen.SimCoal.Controller
E0602: 47:SimCoalController.run_simcoal: Undefined variable 'Config'

Peter

From krother at rubor.de Mon Mar 22 15:27:30 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 22 Mar 2010 16:27:30 +0100
Subject: [Biopython-dev] RNA secondary structure parsing
Message-ID:

Hi,

It took me a while to do some basic clean up in my code - finally managed
to contribute something.
I just added a branch 'rna' with basic RNA 2D format parsers (Vienna, CT, BPSEQ), and a module that can extract 2D structure elements (helices, loops, bulges, junctions). http://github.com/krother/biopython/tree/rna Its all in: Bio.RNA Tests.test_RNA_* Any kind of feedback is welcome. Best Regards, Kristian From biopython at maubp.freeserve.co.uk Mon Mar 22 16:08:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Mar 2010 16:08:27 +0000 Subject: [Biopython-dev] Daily builds of the Tutorial (PDF and HTML) In-Reply-To: <320fb6e01003120522q22377f52nc0769ceb4e3add13@mail.gmail.com> References: <320fb6e01003120522q22377f52nc0769ceb4e3add13@mail.gmail.com> Message-ID: <320fb6e01003220908s264401e6s3dab9aa7f2a3f87b@mail.gmail.com> On Fri, Mar 12, 2010 at 1:22 PM, Peter wrote: > Hi all, > > Back in November I set up a simple pair of cron jobs to update the code > snapshot on http://biopython.open-bio.org/SRC/biopython/ every hour: > http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007002.html > > I've just added another job which takes the latest Tutorial.tex file and > compiles it with pdflatex (already installed) and hevea (installed from > source under my user account) to make the PDF and HTML files. > These are then copied to the webserver and published as: > > http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html > http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf > > These are currently updated once a day (at 2:40am which shouldn't > be too busy whichever USA timezone the server uses). Assuming > I got my crontab settings right - in the short term I'll keep an eye on > it to check ;) It looks like the PDF is working (which happens first in the script), but not the HTML. I'll look into this... 
Peter From biopython at maubp.freeserve.co.uk Mon Mar 22 16:21:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Mar 2010 16:21:16 +0000 Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo Message-ID: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> Hi Eric, I've got a real example of a simple tree manipulation that I would like to handle via your new module. I have a (small) unrooted tree from a gene family in Newick format, which by construction includes an out-group (the same gene but from a more distant organism). I would like to reroot the tree so that this out-group is at the basal level. Can Bio.Phylo help me here? Thanks, Peter P.S. Why is Bio.Phylo.trim_str a public method? From bugzilla-daemon at portal.open-bio.org Mon Mar 22 16:28:24 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 22 Mar 2010 12:28:24 -0400 Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for oldest entry. In-Reply-To: Message-ID: <201003221628.o2MGSOqs027450@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2949 krother at rubor.de changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |krother at rubor.de -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Mar 22 16:39:01 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 22 Mar 2010 12:39:01 -0400 Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for oldest entry. 
In-Reply-To: Message-ID: <201003221639.o2MGd1mk027807@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2949 ------- Comment #4 from krother at rubor.de 2010-03-22 12:39 EST ------- I originally contributed the parse_pdb_header module a long time ago. I think one or two persons added some changes in the meantime. I like Erics idea of adding a separate 'revisions' attribute. When the code does what is needed I think it's time for me to do some cleanup work. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Mar 22 17:24:32 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 22 Mar 2010 13:24:32 -0400 Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling In-Reply-To: Message-ID: <201003221724.o2MHOWY7029072@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2948 crosvera at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1463 is|0 |1 obsolete| | ------- Comment #4 from crosvera at gmail.com 2010-03-22 13:24 EST ------- Created an attachment (id=1464) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1464&action=view) new proposed patch for bug2948 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Mar 22 17:25:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 22 Mar 2010 13:25:14 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling
In-Reply-To:
Message-ID: <201003221725.o2MHPEbj029122@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2948

------- Comment #5 from crosvera at gmail.com 2010-03-22 13:25 EST -------
OK, I made another patch; this one replaces some regexes with string-slice
methods.

What I got:

crosvera at cabernet:~/programming/biopython/Bio$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from PDB import *
>>> parser = PDBParser()
>>> structure = parser.get_structure("2beg", "PDB/2BEG.pdb")
>>> structure.header['name']
" 3d structure of alzheimer's abeta(1-42) fibrils"
>>>

patch file: 0001-modified-parse_pdb_header.py.patch

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
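
For readers unfamiliar with the string-slice idea mentioned in Comment #5, here is a minimal sketch. It is not the attached patch, and parse_title is a hypothetical helper name. It relies only on the fixed-column layout of the PDB format: the record name sits in columns 1-6, the TITLE continuation number in columns 9-10, and the title text in columns 11-80. Unlike the header output shown in the comment, the sketch leaves the case of the text alone.

```python
def parse_title(pdb_lines):
    """Join the text of all TITLE records using fixed-column slices.

    Hypothetical helper for illustration; not the code from the patch.
    """
    parts = []
    for line in pdb_lines:
        # Columns 1-6 hold the record name (padded with spaces).
        if line[0:6] == "TITLE ":
            # Columns 11-80 hold the title text; columns 9-10 (skipped here)
            # hold the continuation number on the second and later lines.
            parts.append(line[10:80].strip())
    return " ".join(parts)


example = [
    "HEADER    PROTEIN FIBRIL",
    "TITLE     3D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS",
]
print(parse_title(example))
```

Stripping each slice before joining avoids the stray leading space visible in the session above, at the cost of assuming that continuation lines break at word boundaries.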
From bugzilla-daemon at portal.open-bio.org Mon Mar 22 18:17:34 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 22 Mar 2010 14:17:34 -0400 Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling In-Reply-To: Message-ID: <201003221817.o2MIHYpm030968@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2948 crosvera at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1464 is|0 |1 obsolete| | ------- Comment #6 from crosvera at gmail.com 2010-03-22 14:17 EST ------- Created an attachment (id=1465) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1465&action=view) new proposed patch for bug2948 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Mon Mar 22 20:28:21 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 22 Mar 2010 16:28:21 -0400 Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo In-Reply-To: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> Message-ID: <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com> On Mon, Mar 22, 2010 at 12:21 PM, Peter wrote: > Hi Eric, > > I've got a real example of a simple tree manipulation that I would like to > handle via your new module. I have a (small) unrooted tree from a gene > family in Newick format, which by construction includes an out-group > (the same gene but from a more distant organism). I would like to reroot > the tree so that this out-group is at the basal level. > > Can Bio.Phylo help me here? > In Bio.Nexus, would you normally have handled this with the method root_with_outgroup? 
I intend to port that method to Bio.Phylo once I understand it, but the
existing code has been kind of hard for me to figure out.

Let's address it here, then. Is there a detailed plain-text description
somewhere of how this operation should work in general?

Given that the outgroup taxon is already somewhere inside the existing
unrooted tree, I would guess something like:

0. Load the tree:

    tree = Phylo.read('example.nwk', 'newick')

1. Locate the outgroup in the tree, remembering the lineage for future
operations:

    outgroup_path = tree.get_path({'name': 'OUTGROUP'})  # or however you can identify it

2. Tracing the outgroup lineage backwards, reattach the subclades to new
locations under a new root (or the old root, repurposed). Picturing the
unrooted tree as an arbitrarily rooted tree, invert everything above the
outgroup in the tree, but keep the descendants of those clades as they are:

    # Untested, hardly even thought through, danger danger!
    root = tree.root
    old_clades = root.clades  # needed?
    root.clades = []
    new_parent = root
    last = outgroup_path[-1]
    for parent in outgroup_path[-2::-1]:
        siblings = [kid for kid in parent.clades if kid != last]
        new_parent.clades = # TODO
        new_parent = last
        last = parent
    tree.rooted = True

Bio.Phylo does no internal bookkeeping, so it's OK (i.e. sometimes required)
to shuffle clades directly.

Is this what "root with outgroup" is supposed to do? What functionality in
Bio.Nexus.Trees.root_with_outgroup is missing here? And, do you happen to
have an example of a tree with edge cases that I could use for testing?

> P.S. Why is Bio.Phylo.trim_str a public method?
>

Oops, I'll fix it.
Thanks, Eric From biopython at maubp.freeserve.co.uk Mon Mar 22 21:48:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Mar 2010 21:48:31 +0000 Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo In-Reply-To: <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com> References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com> Message-ID: <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com> On Mon, Mar 22, 2010 at 8:28 PM, Eric Talevich wrote: > On Mon, Mar 22, 2010 at 12:21 PM, Peter wrote: > >> Hi Eric, >> >> I've got a real example of a simple tree manipulation that I would like to >> handle via your new module. I have a (small) unrooted tree from a gene >> family in Newick format, which by construction includes an out-group >> (the same gene but from a more distant organism). I would like to reroot >> the tree so that this out-group is at the basal level. >> >> Can Bio.Phylo help me here? >> > > In Bio.Nexus, would you normally have handled this with the method > root_with_outgroup? I intend to port that method to Bio.Phylo once I > understand it, but the existing code has been kind of hard for me to figure > out. > > Let's address it here, then. Is there a detailed plain-text description > somewhere of how this operation should work in general? I've just got a quick answer for you now tonight: I've not used Bio.Nexus to try and do this - I'll try to get back to you in more depth tomorrow. 
Thanks,

Peter

From p.j.a.cock at googlemail.com Tue Mar 23 11:50:24 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 23 Mar 2010 11:50:24 +0000
Subject: [Biopython-dev] pylint, was: Changes to the main repo
In-Reply-To: <320fb6e01003220219u5f2020e1v6826a4e331ceb96d@mail.gmail.com>
References: <320fb6e01003220219u5f2020e1v6826a4e331ceb96d@mail.gmail.com>
Message-ID: <320fb6e01003230450h502adce0p27080d3a00ddda23@mail.gmail.com>

2010/3/22 Peter Cock :
> 2010/3/21 Tiago Antão :
>> Hi,
>>
>> I've made some changes in the main repository (my first changes with
>> github), some comments:
>> 1. Many thanks for the GitUsage wiki page. REALLY useful.
>> 2. That being said, if I did any mistakes, they are my own fault.
>> 3. I've added support for big genepop files, something I tend to be
>> asked quite a lot
>> 4. And support for haploid data (nobody really asked this)
>> 5. I remember Peter sending an email about needed corrections to the
>> code. I am afraid I've lost that email :( . If you send it to me, I
>> will do them ASAP
>> 6. New test cases and test data files
>> 7. I might add support, in the future, to Arlequin (file format and
>> application). Allowing for statistics over sequences and other goodies
>> with sequence data.
>>
>> Regards,
>> Tiago
>
> Hi Tiago,
>
> That sounds good. Regarding point 5, running pylint over the
> code reported some possible errors in Bio.PopGen.
Have a
> look at this - they are all undefined variable issues:
> http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007354.html
>
> I just ran it again on the latest code, and the line numbers have
> changed a tiny bit but that is all:
>
> $ pylint --disable-msg-cat=CRW --include-ids=y
> --disable-msg=E1101,E1103,E0102 -r n Bio.PopGen
> No config file found, using default configuration
> ************* Module Bio.PopGen.Async
> E0602: 78:Async.get_result: Undefined variable 'done'
> E0602: 79:Async.get_result: Undefined variable 'done'
> ************* Module Bio.PopGen.GenePop
> E0602:166:Record.split_in_pops: Undefined variable 'GenePop'
> E0602:183:Record.split_in_loci: Undefined variable 'GenePop'
> ************* Module Bio.PopGen.GenePop.Controller
> E0602: 41:_read_allele_freq_table: Undefined variable 'self'
> E0602:133:_hw_func: Undefined variable 'self'
> E0602:393:GenePopController.test_pop_hw_prob: Undefined variable 'ext'
> E0602:458:GenePopController.test_ld.ld_pop_func: Undefined variable
> 'currrent_pop'
> ************* Module Bio.PopGen.GenePop.FileParser
> E1120:219:FileRecord.remove_locus_by_name: No value passed for
> parameter 'fw' in function call
> ************* Module Bio.PopGen.SimCoal.Cache
> E0602: 79:SimCoalCache.getSimulation: Undefined variable 'Config'
> E0602: 88: Undefined variable 'Cache'
> ************* Module Bio.PopGen.SimCoal.Controller
> E0602: 47:SimCoalController.run_simcoal: Undefined variable 'Config'
>
> Peter
>

Hi Tiago,

This is looking much better after your fixes last night - just one left:

$ pylint --disable-msg-cat=CRW --include-ids=y --disable-msg=E1101,E1103,E0102 -r n Bio.PopGen
No config file found, using default configuration
************* Module Bio.PopGen.GenePop.Controller
E0602: 41:_read_allele_freq_table: Undefined variable 'self'

Note that if I stop disabling those particular error messages, which in
other situations I had tentatively tagged as false positives, there could
be a few more issues:

$ pylint
--disable-msg-cat=CRW --include-ids=y -r n Bio.PopGen
No config file found, using default configuration
************* Module Bio.PopGen.Async
E1101: 59:Async.run_program: Instance of 'Async' has no '_run_program' member
************* Module Bio.PopGen.GenePop.Controller
E0602: 41:_read_allele_freq_table: Undefined variable 'self'
************* Module Bio.PopGen.GenePop.EasyController
E1101: 33:EasyController.get_basic_info: Module 'Bio.PopGen.GenePop' has no 'parse' member
E1101: 43:EasyController.test_hw_pop: Instance of 'GenePopController' has no 'test_pop_hz_prob' member
************* Module Bio.PopGen.GenePop.FileParser
E1101:197:FileRecord.remove_population: Instance of 'FileRecord' has no 'populations' member
E1101:206:FileRecord.remove_locus_by_position: Instance of 'FileRecord' has no 'populations' member

Some of these may be harmless, for example the Async class has a
run_program method which calls _run_program, which you expect to be
implemented in any subclass. You could add a dummy method to show the
expected arguments and just raise a NotImplementedError exception with
a comment that the subclass should implement it. e.g.

    def _run_program(self, program, parameters, input_files):
        """Actually run the program, handled by a subclass (PRIVATE).

        This method should be replaced by any derived class to do
        something useful. It will be called by the run_program method.
        """
        raise NotImplementedError("This object should be subclassed")

That particular change is probably worth doing anyway from a code
clarity point of view.
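
To make the pattern concrete, here is a small self-contained sketch of how such a placeholder behaves once a subclass overrides it. The Async class below is a simplified stand-in written for illustration (the real Bio.PopGen.Async class does more), and EchoAsync is a hypothetical subclass that only builds a command line rather than running anything.

```python
class Async(object):
    """Simplified stand-in for the Async base class (illustration only)."""

    def run_program(self, program, parameters, input_files):
        # The public method delegates to the private hook below.
        return self._run_program(program, parameters, input_files)

    def _run_program(self, program, parameters, input_files):
        """Actually run the program, handled by a subclass (PRIVATE)."""
        raise NotImplementedError("This object should be subclassed")


class EchoAsync(Async):
    """Hypothetical subclass that just reports the command it would run."""

    def _run_program(self, program, parameters, input_files):
        return " ".join([program] + list(parameters))


print(EchoAsync().run_program("genepop", ["-x"], {}))

try:
    Async().run_program("genepop", ["-x"], {})
except NotImplementedError as err:
    print("Base class refused: %s" % err)
```

With the placeholder in place, pylint can see that _run_program exists on the base class, and anyone who forgets to override it gets an explicit NotImplementedError instead of an AttributeError.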
Peter From biopython at maubp.freeserve.co.uk Tue Mar 23 15:26:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Mar 2010 15:26:56 +0000 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <320fb6e01003181204l5902cf37yc0cf9387b74994fd@mail.gmail.com> References: <200911251220.53881.jblanca@btc.upv.es> <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com> <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com> <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com> <3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com> <320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com> <320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com> <320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com> <320fb6e01003181204l5902cf37yc0cf9387b74994fd@mail.gmail.com> Message-ID: <320fb6e01003230826r6080746el3327f05079f2651a@mail.gmail.com> On Thu, Mar 18, 2010 at 7:04 PM, Peter wrote: > > I've done that to Bio/Seq.py on the trunk (added two > FutureWarnings and docstring explanation). Assuming > this doesn't trigger any regressions, we'd need to work > on the documentation (in particular the tutorial, but also > perhaps a news post?) and fix the GA unit test before > the release. > I've fixed the GA unit tests, generally by explicit use of string comparison when working with sequence objects. In the case of test_GAQueens.py, this required me to "correct" the "abuse" of the alphabet object (letters was a list of integers, not a string) and thus indirectly the way the MutableSeq was being created. This has always struck me as a very odd example - but should perhaps be kept in mind for more complex sequence like objects (e.g. sequences with 3-letter protein codes). 
Peter

From tiagoantao at gmail.com Wed Mar 24 10:39:09 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 24 Mar 2010 10:39:09 +0000
Subject: [Biopython-dev] Spam on wiki
Message-ID: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>

Hi,

I think we are being attacked, spam wise. The popgen_dev page was full
with external links.
I am clearing that page, but others might have the same problem. Maybe
there is some way to automate the deletion of contributions from spam
authors?

Tiago

--
"If you want to get laid, go to college. If you want an education, go
to the library." - Frank Zappa

From tiagoantao at gmail.com Wed Mar 24 10:47:43 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 24 Mar 2010 10:47:43 +0000
Subject: [Biopython-dev] Spam on wiki
In-Reply-To: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
References: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
Message-ID: <6d941f121003240347v1ff3f9d4uf46d6a6e4f4254ff@mail.gmail.com>

2010/3/24 Tiago Antão :
> I think we are being attacked, spam wise. The popgen_dev page was full
> with external links.
> I am clearing that page, but others might have the same problem. Maybe
> there is some way to automate the deletion of contributions from spam
> authors?

I am clearing this
http://www.biopython.org/wiki/Special:Contributions/Wiki0808
by hand, not much.

From biopython at maubp.freeserve.co.uk Wed Mar 24 10:47:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 10:47:43 +0000
Subject: [Biopython-dev] Spam on wiki
In-Reply-To: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
References: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
Message-ID: <320fb6e01003240347x5d10a1f3sa4c2c84fa9edcfbe@mail.gmail.com>

2010/3/24 Tiago Antão :
> Hi,
>
> I think we are being attacked, spam wise. The popgen_dev page was full
> with external links.
> I am clearing that page, but others might have the same problem. Maybe
> there is some way to automate the deletion of contributions from spam
> authors?
>
> Tiago

Hi,

I'm subscribed to the wiki RSS feed, but this happened overnight
so I hadn't seen it yet. This seems to happen about once a month
or so - I haven't noticed a big rise in attacks or anything. This guy
did about ten pages - normally only one or two get abused.

Dealing with it is fairly easy - you click on the page history, and
rollback to the last good page, and ban the user. Tell me your
wiki username and I should be able to give you the rights needed
to ban people.

Peter

From biopython at maubp.freeserve.co.uk Wed Mar 24 11:13:12 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 11:13:12 +0000
Subject: [Biopython-dev] Spam on wiki
In-Reply-To: <6d941f121003240347v1ff3f9d4uf46d6a6e4f4254ff@mail.gmail.com>
References: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
    <6d941f121003240347v1ff3f9d4uf46d6a6e4f4254ff@mail.gmail.com>
Message-ID: <320fb6e01003240413g27f3d87dp4762bd9f8c32befe@mail.gmail.com>

2010/3/24 Tiago Antão :
> 2010/3/24 Tiago Antão :
>> I think we are being attacked, spam wise. The popgen_dev page was full
>> with external links.
>> I am clearing that page, but others might have the same problem. Maybe
>> there is some way to automate the deletion of contributions from spam
>> authors?
>
> I am clearing this
> http://www.biopython.org/wiki/Special:Contributions/Wiki0808
> by hand, not much.
>

The "rollback" link on the page history is the simplest route (you are
now an administrator on the wiki so should be able to do this, and ban
spammers). I don't know if there is a shortcut to revert all a user's
recent changes.

I think between us we have fixed all the pages now.
Thanks,

Peter

From peter at maubp.freeserve.co.uk Wed Mar 24 14:08:26 2010
From: peter at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 14:08:26 +0000
Subject: [Biopython-dev] Fwd: [Utilities-announce] NCBI Revised E-utility Usage Policy
In-Reply-To:
References:
Message-ID: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com>

Hi,

This is probably of interest to all the Bio* projects offering access
to the NCBI Entrez utilities. See forwarded message below.

I *think* the new guidelines basically say that the email & tool parameters are
optional BUT if your IP address ever gets banned for excessive use you then
have to register an email & tool combination.

Regarding the email address, the NCBI say to use the email of the developer
(not the end user). However, they do not distinguish between the developers
of a library (like us), and the developers of an application or script using a
library (who may also be the end user).

Currently we (Biopython) and I think BioPerl ask developers using our libraries
to populate the email address themselves. I *think* this is still the
right action.

Peter

---------- Forwarded message ----------
From:
Date: Wed, Mar 24, 2010 at 1:53 PM
Subject: [Utilities-announce] NCBI Revised E-utility Usage Policy
To: NLM/NCBI List utilities-announce

New E-utility documentation now on the NCBI Bookshelf

The Entrez Programming Utilities (E-Utilities) Help documentation has
been added to the NCBI Bookshelf, and so is now fully integrated with
the Entrez search and retrieval system as a part of the Bookshelf
database. This help document has been divided into chapters for better
organization and includes several new sample Perl scripts. At present
this book covers the standard URL interface for the E-utilities;
material about the SOAP interface will be added soon and is still
available at the same URL:
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html.
Revised E-utility usage policy

In December, 2009 NCBI announced a change to the usage policy for the
E-utilities that would require all requests to contain non-null values
for both the &email and &tool parameters. After several consultations
with our users and developers, we have decided to revise this policy
change, and the revised policy is described in detail at the following
link:

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpeutils&part=chapter2#chapter2.Usage_Guidelines_and_Requiremen

Please let us know if you have any questions or concerns about this
policy change.

Thank you,
The E-Utilities Team
NIH/NLM/NCBI
eutilities at ncbi.nlm.nih.gov.

_______________________________________________
Utilities-announce mailing list
http://www.ncbi.nlm.nih.gov/mailman/listinfo/utilities-announce

From biopython at maubp.freeserve.co.uk Wed Mar 24 14:51:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 14:51:46 +0000
Subject: [Biopython-dev] [Bioperl-l] Fwd: [Utilities-announce] NCBI Revised E-utility Usage Policy
In-Reply-To: <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu>
References: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com>
    <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu>
Message-ID: <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com>

On Wed, Mar 24, 2010 at 2:37 PM, Chris Fields wrote:
>
> On Mar 24, 2010, at 9:08 AM, Peter wrote:
>
>> Hi,
>>
>> This is probably of interest to all the Bio* projects offering access
>> to the NCBI Entrez utilities. See forwarded message below.
>>
>> I *think* the new guidelines basically say that the email & tool parameters are
>> optional BUT if your IP address ever gets banned for excessive use you then
>> have to register an email & tool combination.
>>
>> Regarding the email address, the NCBI say to use the email of the developer
>> (not the end user). However, they do not distinguish between the developers
>> of a library (like us), and the developers of an application or script using a
>> library (who may also be the end user).
>>
>> Currently we (Biopython) and I think BioPerl ask developers using our libraries
>> to populate the email address themselves. I *think* this is still the
>> right action.
>>
>> Peter
>
>
> Basically, that's the same tactic I'm going with for Bio::DB::EUtilities (and I
> think with the SOAP-based ones as well). We're providing a specific set of
> tools for user to write up their own applications end applications. I can try
> contacting them regarding this to get an official response to clarify this
> somewhat.

Please give the NCBI an email - you can CC me too if you like.

> Re: the tool parameter, we currently set the tool itself to 'BioPerl' as a
> default, but always leave the email blank and issue a warning if it isn't
> set. We could just as easily leave both blank and issue warnings for both.

We currently leave out the email and set the tool parameter to "Biopython"
by default but this can be overridden. Currently leaving out the email does
cause Biopython to give a warning.

Peter

From biopython at maubp.freeserve.co.uk Wed Mar 24 15:16:51 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 15:16:51 +0000
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com>
References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com>
    <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com>
    <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com>
Message-ID: <320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com>

On Mon, Mar 22, 2010 at 9:48 PM, Peter wrote:
>> In Bio.Nexus, would you normally have handled this with the method
>> root_with_outgroup?
I intend to port that method to Bio.Phylo once I
>> understand it, but the existing code has been kind of hard for me to figure
>> out.
>
> I've just got a quick answer for you now tonight: I've not used Bio.Nexus
> to try and do this - I'll try to get back to you in more depth tomorrow.

Here is an example using Bio.Nexus.Trees to reroot with an outgroup.

#I have encoded the tree here as a string:
tree_string = """((gi|6273291|gb|AF191665.1|AF191:0.00418,
(gi|6273290|gb|AF191664.1|AF191:0.00189,
gi|6273289|gb|AF191663.1|AF191:0.00145)
:0.00083):0.00770,
(gi|6273287|gb|AF191661.1|AF191:0.00489,
gi|6273286|gb|AF191660.1|AF191:0.00295)
:0.00014,
(gi|6273285|gb|AF191659.1|AF191:0.00094,
gi|6273284|gb|AF191658.1|AF191:0.00018)
:0.00125);"""

from Bio.Nexus import Trees
tree = Trees.Tree(tree_string)
print "Old"
print tree
print tree.display()
print
print "New"
#This acts in situ:
tree.root_with_outgroup(["gi|6273289|gb|AF191663.1|AF191"])
print tree
print tree.display()

Old
tree a_tree = ((gi|6273291|gb|AF191665.1|AF191,(gi|6273290|gb|AF191664.1|AF191,gi|6273289|gb|AF191663.1|AF191)),(gi|6273287|gb|AF191661.1|AF191,gi|6273286|gb|AF191660.1|AF191),(gi|6273285|gb|AF191659.1|AF191,gi|6273284|gb|AF191658.1|AF191));
#   taxon                           prev  succ       brlen    blen (sum)  support  comment
0   -                               None  [1, 6, 9]  0.0      0.0         -        -
1   -                               0     [2, 3]     0.0077   0.0077      -        -
2   gi|6273291|gb|AF191665.1|AF191  1     []         0.00418  0.01188     -        -
3   -                               1     [4, 5]     0.00083  0.00853     -        -
4   gi|6273290|gb|AF191664.1|AF191  3     []         0.00189  0.01042     -        -
5   gi|6273289|gb|AF191663.1|AF191  3     []         0.00145  0.00998     -        -
6   -                               0     [7, 8]     0.00014  0.00014     -        -
7   gi|6273287|gb|AF191661.1|AF191  6     []         0.00489  0.00503     -        -
8   gi|6273286|gb|AF191660.1|AF191  6     []         0.00295  0.00309     -        -
9   -                               0     [10, 11]   0.00125  0.00125     -        -
10  gi|6273285|gb|AF191659.1|AF191  9     []         0.00094  0.00219     -        -
11  gi|6273284|gb|AF191658.1|AF191  9     []         0.00018  0.00143     -        -
Root: 0
None

New
tree a_tree =
(((((gi|6273287|gb|AF191661.1|AF191,gi|6273286|gb|AF191660.1|AF191),(gi|6273285|gb|AF191659.1|AF191,gi|6273284|gb|AF191658.1|AF191)),gi|6273291|gb|AF191665.1|AF191),gi|6273290|gb|AF191664.1|AF191),gi|6273289|gb|AF191663.1|AF191);
#   taxon                           prev  succ       brlen    blen (sum)  support  comment
0   -                               1     [6, 9]     0.0077   0.00998     -        -
1   -                               3     [0, 2]     0.00083  0.00228     -        -
2   gi|6273291|gb|AF191665.1|AF191  1     []         0.00418  0.00646     -        -
3   -                               12    [1, 4]     0.00145  0.00145     -        -
4   gi|6273290|gb|AF191664.1|AF191  3     []         0.00189  0.00334     -        -
5   gi|6273289|gb|AF191663.1|AF191  12    []         0.0      0.0         0.0      -
6   -                               0     [7, 8]     0.00014  0.01012     -        -
7   gi|6273287|gb|AF191661.1|AF191  6     []         0.00489  0.01501     -        -
8   gi|6273286|gb|AF191660.1|AF191  6     []         0.00295  0.01307     -        -
9   -                               0     [10, 11]   0.00125  0.01123     -        -
10  gi|6273285|gb|AF191659.1|AF191  9     []         0.00094  0.01217     -        -
11  gi|6273284|gb|AF191658.1|AF191  9     []         0.00018  0.01141     -        -
12  -                               None  [3, 5]     0.0      0.0         -        -
Root: 12
None

Here the root_with_outgroup method acts in situ, and returns the
new root ID number (not applicable to Bio.Phylo). The outgroup
argument seems to be a list of taxon names (here just one).

In my example, the outgroup originally has a branch length of
0.00145. A new root node was created (here #12) with two
children, one with a branch length of zero (#5, the outgroup)
and one with the full length (#3, branch length 0.00145).
Essentially this new root node (#12) and the outgroup (#5)
are now both right at the base of the tree.

There is more than one way to do this though. For example
FigTree seems to introduce a new root node half way along
the outgroup branch (replacing the edge with two edges of
half its length). This way the new root node represents the
last common ancestor of the outgroup and the ingroup
(everything else), although putting it at the mid point is
perhaps a little arbitrary.
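
The two conventions described above differ only in how the outgroup branch is split around the new root. The following pure-Python sketch illustrates that idea on the node numbers and branch lengths from the tables; it is neither the Bio.Nexus implementation nor a proposed Bio.Phylo API, and reroot_on_leaf, depth and the fraction parameter are hypothetical names chosen for this example.

```python
from collections import defaultdict

# (node, node, branch length) for the unrooted example tree, using the
# node numbers from the "Old" table.
EDGES = [
    (0, 1, 0.0077), (1, 2, 0.00418), (1, 3, 0.00083),
    (3, 4, 0.00189), (3, 5, 0.00145),
    (0, 6, 0.00014), (6, 7, 0.00489), (6, 8, 0.00295),
    (0, 9, 0.00125), (9, 10, 0.00094), (9, 11, 0.00018),
]


def reroot_on_leaf(edges, outgroup, fraction=0.0, new_root="root"):
    """Return {node: (parent, branch_length)} rooted on the outgroup branch.

    fraction is the share of the outgroup branch kept on the outgroup
    side: 0.0 puts the root right next to the outgroup, 0.5 puts it
    at the midpoint of that branch.
    """
    adj = defaultdict(dict)
    for a, b, length in edges:
        adj[a][b] = length
        adj[b][a] = length

    # A leaf has exactly one neighbour; the unpacking fails loudly otherwise.
    (neighbour, length), = adj[outgroup].items()
    del adj[outgroup][neighbour]
    del adj[neighbour][outgroup]

    # Insert the new root on the outgroup branch, splitting its length.
    for node, part in ((outgroup, length * fraction),
                       (neighbour, length * (1.0 - fraction))):
        adj[new_root][node] = part
        adj[node][new_root] = part

    # Orient every edge away from the new root.
    parents = {new_root: (None, 0.0)}
    stack = [new_root]
    while stack:
        node = stack.pop()
        for nxt, part in adj[node].items():
            if nxt not in parents:
                parents[nxt] = (node, part)
                stack.append(nxt)
    return parents


def depth(parents, node):
    """Sum the branch lengths from node up to the root."""
    total = 0.0
    while parents[node][0] is not None:
        parent, length = parents[node]
        total += length
        node = parent
    return total


at_outgroup = reroot_on_leaf(EDGES, outgroup=5, new_root=12)
midpoint = reroot_on_leaf(EDGES, outgroup=5, fraction=0.5, new_root=12)
print("node 7 depth, root at the outgroup: %.5f" % depth(at_outgroup, 7))
print("outgroup branch, midpoint rooting: %.6f" % midpoint[5][1])
```

With fraction=0.0 the root-to-node sums should agree with the "blen (sum)" column of the "New" table (for instance 0.00334 for node 4), while fraction=0.5 gives the midpoint placement attributed to FigTree.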
Peter

From nicolas.rapin at bric.ku.dk Thu Mar 25 11:58:53 2010
From: nicolas.rapin at bric.ku.dk (Nicolas Rapin)
Date: Thu, 25 Mar 2010 12:58:53 +0100
Subject: [Biopython-dev] GEO database and bio-python
Message-ID:

Dear all,

I just started python, and use biopython quite a lot lately. It's a nice
package, and is very convenient. Oh, and I'm also new on the mailing list...

I need to get access to a lot of data from GEO, and I noticed that it might
be a good idea to have the database locally, which led me to write a little
class that can download the compressed files from NCBI (the
GSE/GPLxxx_family.tgz files), and parse the MINiML sort of XML they have in
there together with the actual data that is in the compressed files. In the
end I have a nicely organized hdf5 file, that I can use to do data mining.

I wondered if that was for Biopython.

If yes, how do I contribute ?

best,

Nico

From biopython at maubp.freeserve.co.uk Thu Mar 25 12:22:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 25 Mar 2010 12:22:40 +0000
Subject: [Biopython-dev] GEO database and bio-python
In-Reply-To:
References:
Message-ID: <320fb6e01003250522l3c730081y143cc4799f038754@mail.gmail.com>

On Thu, Mar 25, 2010 at 11:58 AM, Nicolas Rapin wrote:
> Dear all,
>
> I just started python, and use biopython quite a lot lately. It's a nice package,
> and is very convenient. Oh, and I'm also new on the mailing list...

Great, and welcome :)

> I need to get access to a lot of data from GEO, and I noticed that it might be
> a good idea to have the database locally, which led me to write a little class
> that can download the compressed files from NCBI (the GSE/GPLxxx_family.tgz
> files), and parse the MINiML sort of XML they have in there together with the
> actual data that is in the compressed files. In the end I have a nicely organized
> hdf5 file, that I can use to do data mining.

Have you looked at the existing Bio.Geo module?
It hasn't got an active maintainer at the moment, and in some ways it is rather simplistic. I found that Sean Davis' GEOquery package for R/Bioconductor was much more complete.

> I wondered if that was for Biopython.

This sounds like a useful addition.

> If yes, how do I contribute ?

First of all we use the public mailing lists to discuss things. In terms of code, starting a branch on github would let you show us what you are working on and makes it easier to eventually merge things. See http://biopython.org/wiki/GitUsage

Peter

From sdavis2 at mail.nih.gov Thu Mar 25 12:29:52 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 25 Mar 2010 08:29:52 -0400
Subject: [Biopython-dev] GEO database and bio-python
In-Reply-To:
References:
Message-ID: <264855a01003250529p7d2290f3qc441228c34f5e720@mail.gmail.com>

On Thu, Mar 25, 2010 at 7:58 AM, Nicolas Rapin wrote:
> Dear all,
>
> I just started python, and use biopython quite a lot lately. It's a nice package, and is very convenient. Oh, and I'm also new on the mailing list...
>
> I need to get access to a lot of data from GEO, and I noticed that it might be a good idea to have the database locally, which led me to write a little class that can download the compressed files from NCBI (the GSE/GPLxxx_family.tgz files), and parse the MINiML sort of XML they have in there together with the actual data that is in the compressed files. In the end I have a nicely organized hdf5 file, that I can use to do data mining.
>
> I wondered if that was for Biopython.

Hi, Nico.

Not a direct answer to your question, but have a look at the Bioconductor package GEOmetadb. (There is also an online version.) We have parsed all of GEO metadata into a SQLite database and made it available within R. However, the SQLite database can be used standalone and python has built in support for SQLite, as of late.
http://gbnci.abcc.ncifcrf.gov/geo/ http://gbnci.abcc.ncifcrf.gov/geo/GEOmetadb.sqlite.gz http://watson.nci.nih.gov/bioc_mirror/packages/2.6/bioc/html/GEOmetadb.html Also, as for the data, if you are inclined to use R for anything (or rpy2), the GEOquery package can download and parse all the record types in GEO into objects within R and the number of tools for data analysis of microarray data in R/Bioconductor is enormous. http://watson.nci.nih.gov/bioc_mirror/packages/2.6/bioc/html/GEOquery.html Sorry for the advertisement-like email.... Sean > If yes, how do I contribute ? > > > best, > > Nico > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Thu Mar 25 16:25:01 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 25 Mar 2010 16:25:01 +0000 Subject: [Biopython-dev] NCBI E-utility 100 requests rule in Bio.Entrez? Message-ID: <320fb6e01003250925h76fc91d3r541092eb540af112@mail.gmail.com> Hi all, The NCBI recently announced revised guidlines for the Entrez utilities, which we've started discussing on the OBF mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007499.html http://lists.open-bio.org/pipermail/open-bio-l/2010-March/000623.html As part of this I decided to look at the peak hour rules: http://lists.open-bio.org/pipermail/open-bio-l/2010-March/000644.html The old guideline was: http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements "Run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays for any series of more than 100 requests." This doesn't define a series - for example, would it be OK to run a script making 75 requests every two hours? This could be regarded as multiple separate series each under 100 requests, but the cumulative count over the 8 peak hours is 600 requests. 
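As a rough check on the arithmetic above: read literally, the old guideline's off-peak window (9 pm to 5 am) is 8 hours long, which leaves a 16-hour weekday peak window. The sketch below assumes that reading. Under it, 75 requests every two hours means eight series and 600 requests, whereas an 8-hour window would give 300. The function names are illustrative only.

```python
def peak_window_hours(off_peak_start, off_peak_end):
    """Length in hours of the peak window, given the off-peak window.

    Hours are on a 24-hour clock; the off-peak window may wrap past
    midnight (e.g. 21 -> 5 for "9 pm to 5 am").
    """
    off_peak_length = (off_peak_end - off_peak_start) % 24
    return 24 - off_peak_length


def cumulative_requests(per_series, hours_between_series, window_hours):
    """Total requests if one series is started every few hours."""
    series_count = window_hours // hours_between_series
    return per_series * series_count


peak = peak_window_hours(21, 5)              # off-peak is 9 pm to 5 am
total = cumulative_requests(75, 2, peak)     # 75 requests every two hours
print(peak, total)
```

Each individual series stays under 100 requests, which is why the wording "any series of more than 100 requests" leaves the cumulative case open.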
Sadly the new guidelines are even more vague:

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpeutils&part=chapter2#chapter2.Usage_Guidelines_and_Requiremen
"... and limit large jobs to either weekends or between 9:00 PM
and 5:00 AM Eastern time during weekdays."

Not very helpful.

Also neither version raises the issue of summer/winter time
(daylight saving time) but simply gives Eastern Time (EST).

While we may get clarification from the NCBI, the following patch
to Bio.Entrez may be worth considering. It simply counts the
number of Entrez requests during peak hours, and issues a
warning if this exceeds 100 (based on a strict interpretation of
the older guidelines).

Does this seem worth checking in, or should we try to get some
clarification from the NCBI first?

Peter

diff --git a/Bio/Entrez/__init__.py b/Bio/Entrez/__init__.py
index 33d8d14..f670354 100644
--- a/Bio/Entrez/__init__.py
+++ b/Bio/Entrez/__init__.py
@@ -285,6 +285,26 @@ def _open(cgi, params={}, post=False):
         _open.previous = current + wait
     else:
         _open.previous = current
+
+    # Max 100 requests from 09:00 to 17:00 Eastern Time (EST), which is
+    # 5 hours behind Coordinated Universal Time (UTC) aka Greenwich
+    # Mean Time (GMT), thus 14:00 to 22:00 UTC/GMT. The NCBI don't
+    # mention summer/winter time (daylight saving time), so ignore that.
+    if 14 <= time.gmtime(current).tm_hour < 22 \
+    and time.gmtime(current).tm_wday <= 4:
+        # Peak time (Monday = 0, Friday = 4)
+        _open.peak_requests += 1
+        if _open.peak_requests > 100:
+            import warnings
+            warnings.warn("The NCBI request you make at most 100 Entrez "
+                          "requests during the peak time 9AM to 5PM EST "
+                          "(which is 14:00 to 22:00 UTC/GMT). "
+                          "You have exceeded this limit.")
+    else:
+        # Off peak
+        # Reset the counter (in case this is a long running script)
+        _open.peak_requests = 0
+
     # Remove None values from the parameters
     for key, value in params.items():
         if value is None:
@@ -368,3 +388,4 @@ E-utilities.""", UserWarning)
     return uhandle

 _open.previous = 0
+_open.peak_requests = 0

From eric.talevich at gmail.com Thu Mar 25 20:27:23 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 25 Mar 2010 16:27:23 -0400
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com>
References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com> <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com> <320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com>
Message-ID: <3f6baf361003251327o20cdda2bkeac3c3a6a87b468a@mail.gmail.com>

On Wed, Mar 24, 2010 at 11:16 AM, Peter wrote:

> On Mon, Mar 22, 2010 at 9:48 PM, Peter wrote:
> >> In Bio.Nexus, would you normally have handled this with the method
> >> root_with_outgroup? I intend to port that method to Bio.Phylo once I
> >> understand it, but the existing code has been kind of hard for me to figure
> >> out.
> >
> > I've just got a quick answer for you now tonight: I've not used Bio.Nexus
> > to try and do this - I'll try to get back to you in more depth tomorrow.
>
> Here is an example using Bio.Nexus.Trees to reroot with an outgroup.
>
> [...]
>
> In my example, the outgroup originally has a branch length of 0.00145.
> A new root node was created (here #12) with two children, one with a
> branch length of zero (#5, the outgroup) and one with the full length
> (#3, branch length 0.00145). Essentially this new root node (#12) and
> the outgroup (#5) are now both right at the base of the tree.
>
> There is more than one way to do this though. For example FigTree
> seems to introduce a new root node half way along the outgroup branch
> (replacing the edge with two edges of half its length).
This way the > new root node represents the last common ancestor of the outgroup and > the ingroup (everything else), although putting it at the mid point is > perhaps a little arbitrary. > > Peter > I looked up this section in *Inferring Phylogenies* and found no decisive statement on how it should be done. I gathered: 1. The new root can be placed anywhere along the branch between the outgroup and its ancestor. 2. Another way to root a tree is by assuming a molecular clock -- place the root so that the distances to all the tips are roughly equal. So FigTree and Bio.Nexus are both doing reasonable things. (PyCogent doesn't seem to support this operation, as far as I can tell.) Thinking of this operation as extending the tree further back in time, where the (monophyletic) tree without the outgroup is a sub-clade of the larger rooted tree we're introducing -- it makes sense to me that the branch length of the outgroup should represent the total evolutionary distance from the root of the monophyletic sub-clade to the outgroup. Based on that, I'm tempted to do the opposite of Bio.Nexus, letting the outgroup keep its original branch length, and assigning a length of 0 to the branch leading to the remaining sub-clade. Then by default we get something resembling a trifucating root, and the user can shift the actual location of the root further back without too much difficulty. Alternatives: - Take a hint from the molecular clock, and try to equalize the distance from the root to the outgroup and the farthest tip of the main subclade. Problem: in your example the outgroup is not the longest branch, so this would be equivalent to the version I proposed above. The root->subclade branch would only be nonzero sometimes, and it might surprise you when that happens. - Offer a separate method, root_by_clock, which does the expected thing, and can be used to determine good branch lengths at the root after the outgroup operation, if desired. 
- Combine: add a keyword argument to root_with_outgroup (like molecular_clock=False) which triggers Alternative #1. I'll play with this some more and post an example implementation for you to review. Thanks for your help, Eric From mjldehoon at yahoo.com Fri Mar 26 01:37:26 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 25 Mar 2010 18:37:26 -0700 (PDT) Subject: [Biopython-dev] NCBI E-utility 100 requests rule in Bio.Entrez? In-Reply-To: <320fb6e01003250925h76fc91d3r541092eb540af112@mail.gmail.com> Message-ID: <996381.46391.qm@web62407.mail.re1.yahoo.com> I have no objections, but basically I think that this can be left to the responsibility of the end user. --Michiel. --- On Thu, 3/25/10, Peter wrote: > From: Peter > Subject: NCBI E-utility 100 requests rule in Bio.Entrez? > To: "Biopython-Dev Mailing List" , "Michiel de Hoon" > Date: Thursday, March 25, 2010, 12:25 PM > Hi all, > > The NCBI recently announced revised guidlines for the > Entrez > utilities, which we've started discussing on the OBF > mailing list: > http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007499.html > http://lists.open-bio.org/pipermail/open-bio-l/2010-March/000623.html > > As part of this I decided to look at the peak hour rules: > http://lists.open-bio.org/pipermail/open-bio-l/2010-March/000644.html > > The old guideline was: > > http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements > "Run retrieval scripts on weekends or between 9 pm and 5 > am > Eastern Time weekdays for any series of more than 100 > requests." > > This doesn't define a series - for example, would it be OK > to run > a script making 75 requests every two hours? This could be > regarded > as multiple separate series each under 100 requests, but > the > cumulative count over the 8 peak hours is 600 requests. 
>
> Sadly the new guidelines are even more vague:
>
> http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpeutils&part=chapter2#chapter2.Usage_Guidelines_and_Requiremen
> "... and limit large jobs to either weekends or between 9:00 PM
> and 5:00 AM Eastern time during weekdays."
>
> Not very helpful.
>
> Also neither version raises the issue of summer/winter time
> (daylight saving time) but simply gives Eastern Time (EST).
>
> While we may get clarification from the NCBI, the following patch
> to Bio.Entrez may be worth considering. It simply counts the
> number of Entrez requests during peak hours, and issues a
> warning if this exceeds 100 (based on a strict interpretation of
> the older guidelines).
>
> Does this seem worth checking in, or should we try to get some
> clarification from the NCBI first?
>
> Peter
>
> diff --git a/Bio/Entrez/__init__.py b/Bio/Entrez/__init__.py
> index 33d8d14..f670354 100644
> --- a/Bio/Entrez/__init__.py
> +++ b/Bio/Entrez/__init__.py
> @@ -285,6 +285,26 @@ def _open(cgi, params={}, post=False):
>          _open.previous = current + wait
>      else:
>          _open.previous = current
> +
> +    # Max 100 requests from 09:00 to 17:00 Eastern Time (EST), which is
> +    # 5 hours behind Coordinated Universal Time (UTC) aka Greenwich
> +    # Mean Time (GMT), thus 14:00 to 22:00 UTC/GMT. The NCBI don't
> +    # mention summer/winter time (daylight saving time), so ignore that.
> +    if 14 <= time.gmtime(current).tm_hour < 22 \
> +    and time.gmtime(current).tm_wday <= 4:
> +        # Peak time (Monday = 0, Friday = 4)
> +        _open.peak_requests += 1
> +        if _open.peak_requests > 100:
> +            import warnings
> +            warnings.warn("The NCBI request you make at most 100 Entrez "
> +                          "requests during the peak time 9AM to 5PM EST "
> +                          "(which is 14:00 to 22:00 UTC/GMT). "
> +                          "You have exceeded this limit.")
> +    else:
> +        # Off peak
> +        # Reset the counter (in case this is a long running script)
> +        _open.peak_requests = 0
> +
>      # Remove None values from the parameters
>      for key, value in params.items():
>          if value is None:
> @@ -368,3 +388,4 @@ E-utilities.""", UserWarning)
>      return uhandle
>
>  _open.previous = 0
> +_open.peak_requests = 0
>

From cy at cymon.org Fri Mar 26 11:38:56 2010
From: cy at cymon.org (Cymon Cox)
Date: Fri, 26 Mar 2010 11:38:56 +0000
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <3f6baf361003251327o20cdda2bkeac3c3a6a87b468a@mail.gmail.com>
References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com> <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com> <320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com> <3f6baf361003251327o20cdda2bkeac3c3a6a87b468a@mail.gmail.com>
Message-ID: <7265d4f1003260438x100cc73nf80cc6b5a992691c@mail.gmail.com>

Hi Eric and Peter,

On 25 March 2010 20:27, Eric Talevich wrote:

> On Wed, Mar 24, 2010 at 11:16 AM, Peter wrote:
>
> > On Mon, Mar 22, 2010 at 9:48 PM, Peter wrote:
> > >> In Bio.Nexus, would you normally have handled this with the method
> > >> root_with_outgroup? I intend to port that method to Bio.Phylo once I
> > >> understand it, but the existing code has been kind of hard for me to figure
> > >> out.
> > >
> > > I've just got a quick answer for you now tonight: I've not used Bio.Nexus
> > > to try and do this - I'll try to get back to you in more depth tomorrow.
> >
> > Here is an example using Bio.Nexus.Trees to reroot with an outgroup.
> >
> > [...]
> >
> > In my example, the outgroup originally has a branch length of 0.00145.
> > A new root node was created (here #12) with two children, one with a > > branch length of zero (#5, the outgroup) and one with the full length > > (#3, branch length 0.00145). Essentially this new root node (#12) and > > the outgroup (#5) are now both right at the base of the tree. > > > > There is more than one what to do this though. For example FigTree > > seems to introduce a new root node half way along the outgroup branch > > (replacing the edge with two edges of half its length). This way the > > new root node represents the last common ancestor of the outgroup and > > the ingroup (everything else), although putting it at the mid point is > > perhaps a little arbitrary. > Yes, what FigTree is doing is arbitrary, it introduces information into the displayed tree that is not present, and is open to misinterpretation. But it's doing so purely for the graphical presentation because you are trying to root on a terminal branch. Thankfully, if you save this tree in FigTree it writes the original trifurcating tree. > I looked up this section in *Inferring Phylogenies* and found no decisive > statement on how it should be done. I gathered: > > 1. The new root can be placed anywhere along the branch between the > outgroup > and its ancestor. > The root may in biological reality be anywhere along that branch but, in the absence of further information, the question is where do you place it in this situation ie, rooting (making a bifurcating root node) on that terminal branch. > 2. Another way to root a tree is by assuming a molecular clock -- place the > root so that the distances to all the tips are roughly equal. > > So FigTree and Bio.Nexus are both doing reasonable things. (PyCogent > doesn't > seem to support this operation, as far as I can tell.) 
> > Thinking of this operation as extending the tree further back in time, > where > the (monophyletic) tree without the outgroup is a sub-clade of the larger > rooted tree we're introducing -- it makes sense to me that the branch > length > of the outgroup should represent the total evolutionary distance from the > root of the monophyletic sub-clade to the outgroup. Yes, the outgroup taxa are included in analyses to orientate the relationships (including br lens) of the ingroup. In this case, with a single outgroup taxon you do not a very good estimate of the ingroup br len (its presumably not the immediate ancestor of the ingroup), but its all you've got given the way the experiment was set up - including more outgroups would have been a good idea. Based on that, I'm > tempted to do the opposite of Bio.Nexus, Curious, because given that I think Bio.Nexus is doing the right thing ;) By using this function you are rooting (making a dichotomous root node) using an outgroup (1 taxon in this case), and the biological interpretation is that the length belongs to the ingroup. letting the outgroup keep its > original branch length, and assigning a length of 0 to the branch leading > to > the remaining sub-clade. Then by default we get something resembling a > trifucating root, and the user can shift the actual location of the root > further back without too much difficulty. > I dont understand what you are getting at here... Other points: They way that FigTree displays the rooted tree from root_with_outgroup() is how I would expect the tree to be presented if you only had a single outgroup taxon. There is a case to be made for not making a dichotomous root, but making the nearest trifurcating node to the designated outgroup the root node - this is what PAUP does (it wont write at dichotomously rooted tree even if you tell it to root it). 
I think the whole problem stems from only having a single outgroup (which when you root to it ends up 'looking' like the immediate ancestor of the ingroup). Typically, you would include multiple ougroups and present/display the tree with a trifurcating root node, one of which lineages is the ingroup - unless you are using a non-reversible model you dont need dichotomously rooted trees. Cheers, C. -- From bugzilla-daemon at portal.open-bio.org Fri Mar 26 22:28:10 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 26 Mar 2010 18:28:10 -0400 Subject: [Biopython-dev] [Bug 3036] New: PhyloXML cannot read node colors created by PhyloXML Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3036 Summary: PhyloXML cannot read node colors created by PhyloXML Product: Biopython Version: Not Applicable Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: joelb at lanl.gov Using a simple example file provided: >>> tree = Phylo.read('bcl_2.xml','phyloxml') >>> tree.clade[0].color = Phylo.PhyloXML.BranchColor(255,0,255) >>> tree.clade[0].color BranchColor(blue='255', green='0', red='255') Phylo.write(tree,'colored.phyloxml','phyloxml') 1 >>> tree2=Phylo.read('colored.phyloxml','phyloxml') Traceback (innermost last): File "", line 1, in File "/usr/lib64/python2.6/site-packages/Bio/Phylo/_io.py", line 57, in read tree = tree_gen.next() File "/usr/lib64/python2.6/site-packages/Bio/Phylo/_io.py", line 42, in parse for tree in getattr(supported_formats[format], 'parse')(file): File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 317, in parse yield self._parse_phylogeny(elem) File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 342, in _parse_phylogeny phylogeny.root = self._parse_clade(elem) File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 388, in _parse_clade 
clade.clades.append(self._parse_clade(elem)) File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 410, in _parse_clade setattr(clade, tag, getattr(self, tag)(elem)) File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 518, in color return PX.BranchColor(red, green, blue) File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXML.py", line 432, in __init__ ), "Color values must be integers between 0 and 255." AssertionError: Color values must be integers between 0 and 255. This is not a problem with an example file not written by biopython: >>> tree = Phylo.parse('made_up.xml','phyloxml').next() >>> tree.clade[0].color BranchColor(blue='28', green='220', red='128') Also, forester/archaeoptryx is able to correctly read colors written by biopython. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Mar 26 22:30:36 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 26 Mar 2010 18:30:36 -0400 Subject: [Biopython-dev] [Bug 3037] New: PhyloXMLIO creates extremely ugly xml Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3037 Summary: PhyloXMLIO creates extremely ugly xml Product: Biopython Version: Not Applicable Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P3 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: joelb at lanl.gov This is a request for an enhancement. The xml code PhyloXMLIO creates has no linefeeds. It would be very helpful for debugging if the XML is prettyprinted. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
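For the pretty-printing request in Bug 3037, here is a minimal sketch using only the Python standard library: serialise the ElementTree element, then re-parse it with xml.dom.minidom and use toprettyxml. The helper name pretty_xml and the tiny example tree are made up for illustration; this is not what PhyloXMLIO does, and real phyloXML output also carries a namespace.

```python
from xml.dom import minidom
from xml.etree import ElementTree


def pretty_xml(element, indent="  "):
    """Return an indented, multi-line string for an ElementTree element.

    Hypothetical helper: serialises the element, then lets minidom
    add the line breaks and indentation.
    """
    raw = ElementTree.tostring(element)
    return minidom.parseString(raw).toprettyxml(indent=indent)


# Tiny stand-in tree, for illustration only.
root = ElementTree.Element("phylogeny")
clade = ElementTree.SubElement(root, "clade")
ElementTree.SubElement(clade, "name").text = "A"

print(pretty_xml(root))
```

Re-parsing the whole document costs extra time and memory on large trees, so this is mostly useful for debugging output.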
From biopython at maubp.freeserve.co.uk Sat Mar 27 12:45:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 27 Mar 2010 12:45:28 +0000 Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo In-Reply-To: <7265d4f1003260438x100cc73nf80cc6b5a992691c@mail.gmail.com> References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com> <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com> <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com> <320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com> <3f6baf361003251327o20cdda2bkeac3c3a6a87b468a@mail.gmail.com> <7265d4f1003260438x100cc73nf80cc6b5a992691c@mail.gmail.com> Message-ID: <320fb6e01003270545l7e43c3bbu7a1174397a45ce99@mail.gmail.com> On Fri, Mar 26, 2010 at 11:38 AM, Cymon Cox wrote: > > I think the whole problem stems from only having a single outgroup (which > when you root to it ends up 'looking' like the immediate ancestor of the > ingroup). Typically, you would include multiple ougroups and present/display > the tree with a trifurcating root node, one of which lineages is the ingroup > - unless you are using a non-reversible model you dont need dichotomously > rooted trees. > And I thought it would be simpler from this example to use a single out group ;) Thanks for the comments both of you. Peter From bugzilla-daemon at portal.open-bio.org Sun Mar 28 02:58:59 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 27 Mar 2010 22:58:59 -0400 Subject: [Biopython-dev] [Bug 3037] PhyloXMLIO creates extremely ugly xml In-Reply-To: Message-ID: <201003280258.o2S2wxut000915@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3037 ------- Comment #1 from eric.talevich at gmail.com 2010-03-27 22:58 EST ------- (In reply to comment #0) > This is a request for an enhancement. > > The xml code PhyloXMLIO creates has no linefeeds. It would be very helpful for > debugging if the XML is prettyprinted. 
> This is a shortcoming of the ElementTree module in the Python standard library -- the writer doesn't have an option for setting whitespace. But I agree it would be nice to have this feature, so I'll leave the bug open as a reminder to look for other ways to do this. In the meantime I recommend using some external tool to reformat the XML if you want to look at the raw data. XML Starlet can do this: http://xmlstar.sourceforge.net/ -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Mar 28 16:35:34 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 28 Mar 2010 12:35:34 -0400 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <201003281635.o2SGZYuv009361@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 k.okonechnikov at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|biopython-dev at biopython.org |k.okonechnikov at gmail.com Status|NEW |ASSIGNED ------- Comment #6 from k.okonechnikov at gmail.com 2010-03-28 12:35 EST ------- Created an attachment (id=1468) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1468&action=view) This patch solves this issue and also Bug 2951 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sun Mar 28 18:03:21 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 28 Mar 2010 14:03:21 -0400 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <201003281803.o2SI3LhC011261@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |biopython-dev at biopython.org -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at portal.open-bio.org Sun Mar 28 18:18:24 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 28 Mar 2010 14:18:24 -0400 Subject: [Biopython-dev] [Bug 3037] PhyloXMLIO creates extremely ugly xml In-Reply-To: Message-ID: <201003281818.o2SIIOLM011602@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3037 ------- Comment #2 from chapmanb at 50mail.com 2010-03-28 14:18 EST ------- Eric, check out Fredrik Lundh's indent function for ElementTree. I'm not sure this ever made it into the source, but it's small enough to copy/paste: http://effbot.org/zone/element-lib.htm#prettyprint -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Mar 29 10:58:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Mar 2010 11:58:28 +0100 Subject: [Biopython-dev] NCBI E-utility 100 requests rule in Bio.Entrez? 
In-Reply-To: <996381.46391.qm@web62407.mail.re1.yahoo.com> References: <320fb6e01003250925h76fc91d3r541092eb540af112@mail.gmail.com> <996381.46391.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e01003290358y1e30fc6eme2028a126a36cdb@mail.gmail.com> On Fri, Mar 26, 2010 at 2:37 AM, Michiel de Hoon wrote: > I have no objections, but basically I think that this can be left to the responsibility of the end user. > > --Michiel. OK, unless the NCBI decide to clarify what exactly they mean, then let's just leave this as is it (the responsibility of the end user). Peter From biopython at maubp.freeserve.co.uk Mon Mar 29 11:05:25 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Mar 2010 12:05:25 +0100 Subject: [Biopython-dev] Setting the NCBI Entrez tool parameter globally Message-ID: <320fb6e01003290405s2c76f875ucf39c077277f1916@mail.gmail.com> Hi Michiel et al, The NCBI looks to be encouraging more use of the email and tool parameters in their revised guidelines. To make this easy to use we have a global setting for the email - I think we should do the same for the tool (for when users are building their own application or script on top of Biopython). Something like this patch? What do you think? Peter ------------------------ diff --git a/Bio/Entrez/__init__.py b/Bio/Entrez/__init__.py index 33d8d14..f64015c 100644 --- a/Bio/Entrez/__init__.py +++ b/Bio/Entrez/__init__.py @@ -12,6 +12,9 @@ http://www.ncbi.nlm.nih.gov/Entrez/ A list of the Entrez utilities is available at: http://www.ncbi.nlm.nih.gov/entrez/utils/utils_index.html +Variables: +email Set the Entrez email parameter globally (default is not set). +tool Set the Entrez tool parameter globally (defaults to biopython). Functions: efetch Retrieves records in the requested format from a list of one or @@ -50,7 +53,7 @@ from Bio import File email = None - +tool = "biopython" # XXX retmode? 
def epost(db, **keywds): @@ -275,6 +278,7 @@ def _open(cgi, params={}, post=False): This function also enforces the "up to three queries per second rule" to avoid abusing the NCBI servers. """ + global tool, email # NCBI requirement: At most three queries per second. # Equivalently, at least a third of second between queries delay = 0.333333334 @@ -291,7 +295,7 @@ def _open(cgi, params={}, post=False): del params[key] # Tell Entrez that we are using Biopython if not "tool" in params: - params["tool"] = "biopython" + params["tool"] = tool # Tell Entrez who we are if not "email" in params: if email!=None: From biopython at maubp.freeserve.co.uk Mon Mar 29 12:36:19 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Mar 2010 13:36:19 +0100 Subject: [Biopython-dev] Deprecating PropertyManager, Encodings and Bio.utils? Message-ID: <320fb6e01003290536p4c61e1d1u8bc6e2ad83f1c9e5@mail.gmail.com> Hi all, I think we've done pretty well at carefully removing, fixing or replacing most of the dusty bits of code Biopython had acquired over the years. There are still things to clean up though... in particular modules Bio.PropertyManager and Bio.Encodings seem rather unnecessary. Bio.Encodings is tied into the old (and now deprecated) Bio.Translate and Bio.Transcribe code. Once they are removed (after the next release) we can at least cut a lot of Bio.Encodings. Bio.PropertyManager and Bio.Encodings only seem to be used by Bio.utils, which I would also like to deprecate. This is an undocumented module with no unit tests. It offers a few bits of sequence related functionality which would be better off in Bio.Seq or Bio.SeqUtils, and some fairly trivial functions we could just deprecate. These strike me as the only bits of functionality worth keeping in Bio.utils: Function verify_alphabet (which is being used by the code in Bio.NeuralNetwork.Gene) just checks a Seq object's sequence obeys the alphabet letters. 
This essentially is something I think the Seq object should do itself, during initialisation (Bug 2597). With that done, then Bio.utils.verify_alphabet could be deprecated. There are a few functions for getting molecular weights via the IUPAC alphabet objects. These could be reimplemented by using weight tables belonging to the IUPAC alphabet classes explicitly, perhaps exposed as new functions under Bio.SeqUtils. It would be interesting to look at refinements like handling the start/end of the sequence explicitly (i.e. the 5' and 3' ends of a nucleotide sequence, or the N and C terminals of a peptide). Function reduce_sequence (linked to Bio.Alphabet.Reduced) is for things like mapping a protein sequence to a simplified sequence using the Murphy alphabet (e.g. using a single letter for all the aliphatics: I,L,V). This is perhaps interesting enough to retain - again perhaps under Bio.SeqUtils. It does need documentation and unit tests though. Is anyone interested in updating, documenting and then testing the molecular weight and reduced alphabet code? [I suggest starting a new thread if you are.] If not, should we consider just deprecating Bio.utils, Bio.PropertyManager and Bio.Encodings in the next release? Peter From mjldehoon at yahoo.com Mon Mar 29 13:54:01 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 29 Mar 2010 06:54:01 -0700 (PDT) Subject: [Biopython-dev] Setting the NCBI Entrez tool parameter globally In-Reply-To: <320fb6e01003290405s2c76f875ucf39c077277f1916@mail.gmail.com> Message-ID: <614002.50617.qm@web62406.mail.re1.yahoo.com> Basically I think that this patch is OK, but why do tool and email need to be global inside the _open function? 
--Michiel --- On Mon, 3/29/10, Peter wrote: > From: Peter > Subject: Setting the NCBI Entrez tool parameter globally > To: "Michiel de Hoon" , "Biopython-Dev Mailing List" > Date: Monday, March 29, 2010, 7:05 AM > Hi Michiel et al, > > The NCBI looks to be encouraging more use of the email and > tool > parameters in their revised guidelines. To make this easy > to use > we have a global setting for the email - I think we should > do the > same for the tool (for when users are building their own > application > or script on top of Biopython). Something like this patch? > > What do you think? > > Peter > > ------------------------ > > diff --git a/Bio/Entrez/__init__.py > b/Bio/Entrez/__init__.py > index 33d8d14..f64015c 100644 > --- a/Bio/Entrez/__init__.py > +++ b/Bio/Entrez/__init__.py > @@ -12,6 +12,9 @@ http://www.ncbi.nlm.nih.gov/Entrez/ > A list of the Entrez utilities is available at: > http://www.ncbi.nlm.nih.gov/entrez/utils/utils_index.html > > +Variables: > +email Set the Entrez email > parameter globally (default is not set). > +tool Set the Entrez > tool parameter globally (defaults to biopython). > > Functions: > efetch Retrieves records in > the requested format from a list of one or > @@ -50,7 +53,7 @@ from Bio import File > > > email = None > - > +tool = "biopython" > > # XXX retmode? > def epost(db, **keywds): > @@ -275,6 +278,7 @@ def _open(cgi, params={}, post=False): > This function also enforces the > "up to three queries per second rule" > to avoid abusing the NCBI > servers. > """ > + global tool, email > # NCBI requirement: At most three > queries per second. > # Equivalently, at least a third > of second between queries > delay = 0.333333334 > @@ -291,7 +295,7 @@ def _open(cgi, params={}, post=False): > del > params[key] > # Tell Entrez that we are using > Biopython > if not "tool" in params: > - params["tool"] = "biopython" > +
params["tool"] = tool > # Tell Entrez who we are > if not "email" in params: > if email!=None: > From chapmanb at 50mail.com Mon Mar 29 13:50:23 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 29 Mar 2010 09:50:23 -0400 Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo Message-ID: <20100329135023.GF42657@sobchak.mgh.harvard.edu> Eric, Peter and Cymon; > I've got a real example of a simple tree manipulation that I would > like to handle via your new module. I have a (small) unrooted tree from a > gene family in Newick format, which by construction includes an > out-group (the same gene but from a more distant organism). I would like to > reroot the tree so that this out-group is at the basal level. Really enjoying the discussion on this. It's a bit outside my area of expertise but I stumbled across DendroPy this weekend: http://packages.python.org/DendroPy/index.html which has a reroot_at function that might be worth looking into: http://github.com/jeetsukumaran/DendroPy/blob/master/dendropy/dataobject/tree.py Hope this helps, Brad From biopython at maubp.freeserve.co.uk Mon Mar 29 15:22:11 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Mar 2010 16:22:11 +0100 Subject: [Biopython-dev] Setting the NCBI Entrez tool parameter globally In-Reply-To: <614002.50617.qm@web62406.mail.re1.yahoo.com> References: <320fb6e01003290405s2c76f875ucf39c077277f1916@mail.gmail.com> <614002.50617.qm@web62406.mail.re1.yahoo.com> Message-ID: <320fb6e01003290822u62190365j9c147df71a9de46e@mail.gmail.com> On Mon, Mar 29, 2010 at 2:54 PM, Michiel de Hoon wrote: > Basically I think that this patch is OK, but why do tool and email > need to be global inside the _open function? I just thought it was clearer than implicit scope rules, I'll omit that line and commit the rest. 
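For reference, a minimal sketch of the scope rule in question. This is not the Bio.Entrez code: tool and email here are stand-in module-level settings, and build_params and set_tool are hypothetical helper names used only for illustration. Reading a module-level name inside a function needs no global statement; it is only required when the function rebinds the name.

```python
# Illustrative stand-ins for module-level settings (hypothetical, not Bio.Entrez itself).
tool = "biopython"
email = None


def build_params(params=None):
    """Fill in default tool/email values without any 'global' statement."""
    params = dict(params or {})
    # Reading module-level names works through the normal scope lookup.
    if "tool" not in params:
        params["tool"] = tool
    if "email" not in params and email is not None:
        params["email"] = email
    return params


def set_tool(name):
    """Rebinding a module-level name is the case that does need 'global'."""
    global tool
    tool = name
```

A caller who sets the module-level tool value before making a request would see it picked up by build_params, while an explicit "tool" entry in params still takes precedence.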
Peter From biopython at maubp.freeserve.co.uk Mon Mar 29 15:35:01 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Mar 2010 16:35:01 +0100 Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo In-Reply-To: <20100329135023.GF42657@sobchak.mgh.harvard.edu> References: <20100329135023.GF42657@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01003290835n25dd2a35kcc8c40587dd10c05@mail.gmail.com> On Mon, Mar 29, 2010 at 2:50 PM, Brad Chapman wrote: > Eric, Peter and Cymon; > >> I've got a real example of a simple tree manipulation that I would >> like to handle via your new module. I have a (small) unrooted tree from a >> gene family in Newick format, which by construction includes an >> out-group (the same gene but from a more distant organism). I would like to >> reroot the tree so that this out-group is at the basal level. > > Really enjoying the discussion on this. It's a bit outside my area > of expertise but I stumbled across DendroPy this weekend: > > http://packages.python.org/DendroPy/index.html > > which has a reroot_at function that might be worth looking into: > > http://github.com/jeetsukumaran/DendroPy/blob/master/dendropy/dataobject/tree.py > > Hope this helps, > Brad Hey Brad, I also spotted DendroPy recently (via a blog post or something), but hadn't yet looked to see how they handled this. It looks like their reroot_at function takes an *internal* node as the argument to specify the new root. This neatly avoids the problem about having to introduce a new node when rerooting with a given terminal node (taxon) as the out group. 
Peter From bugzilla-daemon at portal.open-bio.org Mon Mar 29 16:07:12 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Mar 2010 12:07:12 -0400 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <201003291607.o2TG7CDf013375@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |biopython- | |bugzilla at maubp.freeserve.co. | |uk ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-29 12:07 EST ------- (In reply to comment #6) > Created an attachment (id=1468) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1468&action=view) [details] > This patch solves this issue and also Bug 2951 > Just by eye there is something wrong with your indentation in that patch. Maybe you have mixed tabs and spaces? Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at portal.open-bio.org Mon Mar 29 17:28:18 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Mar 2010 13:28:18 -0400 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <201003291728.o2THSIAf015768@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #8 from k.okonechnikov at gmail.com 2010-03-29 13:28 EST ------- Created an attachment (id=1469) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1469&action=view) Improved version of the patch Added default value for serial_num in Model constructor, fixed indentation issues. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at portal.open-bio.org Mon Mar 29 17:39:14 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Mar 2010 13:39:14 -0400 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <201003291739.o2THdEoU016014@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #9 from k.okonechnikov at gmail.com 2010-03-29 13:39 EST ------- Created an attachment (id=1470) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1470&action=view) Simple test script It downloads NMR structure, checks model serial numbers and writes structure to file. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From biopython at maubp.freeserve.co.uk Mon Mar 29 21:41:14 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Mar 2010 22:41:14 +0100 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? 
In-Reply-To: <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com> Message-ID: <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com> On Fri, Mar 19, 2010 at 11:08 PM, Sebastian Bassi wrote: > On Fri, Mar 19, 2010 at 7:45 AM, Peter wrote: >> Give an inch and they'll take a mile ;) > > In Spanish we say: Give a hand and they'll take the whole arm :) I think I like that version more :) >> that if they don't specify the format that Biopython will >> determine it automatically - which it won't. > > In this respect, Python zen favours being explicit, so I see your point. > >> Also, could you clarify if you are in favour of relaxing the >> requirement that the write function takes a list/iterator of >> records/alignments to allow a single SeqRecord or alignment? > > It is OK for me to allow a single record instead of an iterable, this > change will not break any existing code so it is OK for me. That sounds like you don't object, but are not strongly in favour either. No-one else has commented (other than Eric and Marshall who were in favour). Maybe it would be prudent to leave it? [Will this suggestion provoke any further comments I wonder?] Peter From eric.talevich at gmail.com Tue Mar 30 03:05:51 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 29 Mar 2010 23:05:51 -0400 Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions? 
In-Reply-To: <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com> References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com> <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com> <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com> <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com> <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com> <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com> <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com> Message-ID: <3f6baf361003292005y5075df0ch1fa929304f3c501a@mail.gmail.com> On Mon, Mar 29, 2010 at 5:41 PM, Peter wrote: > On Fri, Mar 19, 2010 at 11:08 PM, Sebastian Bassi wrote: > > On Fri, Mar 19, 2010 at 7:45 AM, Peter > wrote: > >> Also, could you clarify if you are in favour of relaxing the > >> requirement that the write function takes a list/iterator of > >> records/alignments to allow a single SeqRecord or alignment? > > > > It is OK for me to allow a single record instead of an iterable, this > > change will not break any existing code so it is OK for me. > > That sounds like you don't object, but are not strongly in > favour either. > > No-one else has commented (other than Eric and Marshall > who were in favour). > > Maybe it would be prudent to leave it? [Will this suggestion > provoke any further comments I wonder?] I know I've already voted, but here's another thought: if we're going to make this change eventually, it would be nice if the very first release of Bio.Phylo had the right behavior and retained the same behavior through later releases. Otherwise we'd have one or more isolated releases where Phylo.write doesn't handle single trees directly, and when documentation is updated to track later releases that do handle single trees, that could cause some confusion for some folks still using Biopython 1.54. 
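To make the proposal concrete, here is a rough sketch of how a write function might accept either a single object or an iterable of them. The names Record, as_iterable and write are made up for illustration and are not the Bio.SeqIO, Bio.AlignIO or Bio.Phylo code; the FASTA-like output is likewise only a placeholder.

```python
import io


class Record(object):
    """Hypothetical stand-in for a SeqRecord, alignment or tree."""

    def __init__(self, name, data):
        self.name = name
        self.data = data


def as_iterable(obj, single_type):
    """Wrap a lone object in a list; pass lists/iterators through unchanged."""
    if isinstance(obj, single_type):
        return [obj]
    return obj


def write(records, handle):
    """Write one record or many in a FASTA-like layout; return the count written."""
    count = 0
    for record in as_iterable(records, Record):
        handle.write(">%s\n%s\n" % (record.name, record.data))
        count += 1
    return count


handle = io.StringIO()
single = write(Record("alpha", "ACGT"), handle)
several = write([Record("beta", "GGCC"), Record("gamma", "TTAA")], handle)
```

The sketch checks against the known single-object type rather than testing whether the argument can be iterated, since an object such as an alignment can itself be iterated over and would otherwise be mistaken for a list.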
-Eric > From bugzilla-daemon at portal.open-bio.org Tue Mar 30 05:17:44 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 30 Mar 2010 01:17:44 -0400 Subject: [Biopython-dev] [Bug 3036] PhyloXML cannot read node colors created by PhyloXML In-Reply-To: Message-ID: <201003300517.o2U5HiVN001772@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3036 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from eric.talevich at gmail.com 2010-03-30 01:17 EST ------- (In reply to comment #0) > Using a simple example file provided: > ... > This is not a problem with an example file not written by biopython: > >>> tree = Phylo.parse('made_up.xml','phyloxml').next() > >>> tree.clade[0].color > BranchColor(blue='28', green='220', red='128') Thanks for catching this! I pushed a fix to GitHub: http://github.com/biopython/biopython/commit/6e2eac9612f600507491c3bb45fc19ffdc987169 The problem was occurring for color values of 0 -- PhyloXMLIO was using an inline and-or test instead of if-else (Py2.4 compatibility hack) to check and convert the node text to an integer. Since 0 evaluates as boolean False, the expression was returning None instead of integer 0, causing the BranchColor constructor to fail. > Also, forester/archaeopteryx is able to correctly read colors written by > biopython. Good to know. Thanks again for testing this. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
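The and-or pitfall described in comment #1 on Bug 3036 can be reproduced in a few lines. This is an illustrative sketch, not the actual PhyloXMLIO code; to_int_and_or and to_int_if_else are hypothetical names.

```python
def to_int_and_or(text):
    """Pre-Python-2.5 style 'cond and a or b' conversion (buggy for falsy results)."""
    # If int(text) is 0, the 'and' part is falsy, so 'or' falls through to None.
    return text is not None and int(text) or None


def to_int_if_else(text):
    """Explicit branch: a converted value of 0 is returned as 0."""
    if text is not None:
        return int(text)
    return None
```

With these definitions, to_int_and_or("0") gives None while to_int_if_else("0") gives 0, which matches the symptom reported for color values of 0.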
From bugzilla-daemon at portal.open-bio.org Tue Mar 30 05:24:09 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 30 Mar 2010 01:24:09 -0400 Subject: [Biopython-dev] [Bug 3037] PhyloXMLIO creates extremely ugly xml In-Reply-To: