From bugzilla-daemon at portal.open-bio.org Mon Mar 1 13:14:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Mar 2010 13:14:45 -0500
Subject: [Biopython-dev] [Bug 2551] Adding advanced __getitem__ to generic
alignment, e.g. align[1:2, 5:-5]
In-Reply-To:
Message-ID: <201003011814.o21IEjcK024496@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2551
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-01 13:14 EST -------
I've started a possible implementation of an improved multiple
sequence alignment object on a github branch:
http://github.com/peterjc/biopython/commits/alignment-obj
This now covers:
Bug 2551 - Adding advanced __getitem__ e.g. align[1:2,5:-5]
Bug 2552 - Adding alignments
Bug 2553 - Adding SeqRecord objects to an alignment (append or extend)
Bug 2554 - Creating an Alignment from a list of SeqRecord objects
From bioinformed at gmail.com Mon Mar 1 18:22:42 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Mon, 1 Mar 2010 18:22:42 -0500
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
Message-ID: <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
On Thu, Feb 11, 2010 at 12:29 AM, Peter wrote:
> On Mon, Jan 11, 2010 at 5:11 PM, Peter
> wrote:
> > I didn't want to rush the SFF support into Biopython 1.53, but it's been
> > waiting "ready" for a while now. Any objections or comments about
> > me merging this now?
>
> There were no objections, and I ran this by Brad and Michiel and
> have just merged this into the master branch. Time for some more
> testing!
>
>
I've tried out the recently landed SFF SeqIO code and am pleased to report
that it works very well. I am parsing gsMapper 454PairAlign.txt output and
converting it to SAM/BAM format to view in IGV (among other things) and
wanted to include per-base quality score information from the SFF files.
The only glitch so far is that the indexed access mode yields sequences
with no alphabet assigned. The solution is to add the following to the
beginning of SffDict.__init__:
    if alphabet is None:
        alphabet = Alphabet.generic_dna
My only other comment is that several file reads and struct.unpacks can be
merged in _sff_read_seq_record. Given the number of records in most 454 SFF
files, I suspect the micro-optimization effort will be worth the slight cost
in code clarity.
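
A minimal sketch of the merging idea, assuming a made-up record layout rather than the real SFF read header. Instead of one handle.read() plus one struct.unpack() per field, the whole fixed-size block is read once and unpacked with a single precompiled struct.Struct. The layout and the names read_fields_separately and read_fields_merged are hypothetical and are not taken from _sff_read_seq_record:

```python
import io
import struct

# Hypothetical fixed-size record: big-endian uint16, uint16, uint32.
# This is NOT the real SFF read header layout, only an illustration.
RECORD = struct.Struct(">HHI")


def read_fields_separately(handle):
    """Three reads and three unpack calls, one per field."""
    a, = struct.unpack(">H", handle.read(2))
    b, = struct.unpack(">H", handle.read(2))
    c, = struct.unpack(">I", handle.read(4))
    return a, b, c


def read_fields_merged(handle):
    """One read and one unpack call for the whole record."""
    return RECORD.unpack(handle.read(RECORD.size))


data = struct.pack(">HHI", 16, 4, 123456)
separate = read_fields_separately(io.BytesIO(data))
merged = read_fields_merged(io.BytesIO(data))
```

Both functions return the same tuple; the merged form simply makes fewer calls per record, which is where any saving over a file with many reads would come from.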
Thanks to Peter and Jose for all of their hard work!
Best regards,
-Kevin
From biopython at maubp.freeserve.co.uk Tue Mar 2 05:08:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 10:08:27 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
Message-ID: <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs
wrote:
> On Thu, Feb 11, 2010 at 12:29 AM, Peter
> wrote:
>>
>> On Mon, Jan 11, 2010 at 5:11 PM, Peter
>> wrote:
>> > I didn't want to rush the SFF support into Biopython 1.53, but it's been
>> > waiting "ready" for a while now. Any objections or comments about
>> > me merging this now?
>>
>> There were no objections, and I ran this by Brad and Michiel and
>> have just merged this into the master branch. Time for some more
>> testing!
>>
>
> I've tried out the recently landed SFF SeqIO code and am pleased to
> report that it works very well.
Great :)
If you have suggestions for the documentation please voice them.
Also did the handling of trimmed reads seem sensible? Until we
release this we can tweak the API.
> I am parsing gsMapper 454PairAlign.txt output and
> converting it to SAM/BAM format to view in IGV (among other things) and
> wanted to include per-base quality score information from the SFF files.
Are you reading and writing SAM/BAM format with Python? Looking
into this is on my (long) todo list.
> The only glitch so far is that the indexed access mode yields sequences
> with no alphabet assigned. The solution is to add the following to the
> beginning of SffDict.__init__:
>     if alphabet is None:
>         alphabet = Alphabet.generic_dna
Thanks - I'll look at that.
> My only other comment is that several file reads and struct.unpacks can be
> merged in _sff_read_seq_record. Given the number of records in most 454 SFF
> files, I suspect the micro-optimization effort will be worth the slight cost
> in code clarity.
I did try and spend some effort on the run time, but it wouldn't
surprise me that there was still room for improvement. I found
that since most of my SFF files were only up to 2GB with under
a million reads, that this wasn't such an issue (compared to
FASTQ files with Solexa data).
I guess you mean the flowgram values, flowgram index, bases
and qualities might be loaded with a single read? That would
be worth trying.
> Thanks to Peter and Jose for all of their hard work!
> Best regards,
> -Kevin
And thanks for the feedback :)
Peter
From biopython at maubp.freeserve.co.uk Tue Mar 2 07:02:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 12:02:53 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
Message-ID: <320fb6e01003020402v6b2fab6j88f4c6fc90da15a9@mail.gmail.com>
On Tue, Mar 2, 2010 at 10:08 AM, Peter wrote:
> On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs wrote:
>> The only glitch so far is that the indexed access mode yields sequences
>> with no alphabet assigned. The solution is to add the following to the
>> beginning of SffDict.__init__:
>>     if alphabet is None:
>>         alphabet = Alphabet.generic_dna
>
> Thanks - I'll look at that.
Yes, that looks sensible - change committed. Would you like to be credited
in our NEWS and CONTRIB file for this little bug fix?
Peter
From biopython at maubp.freeserve.co.uk Tue Mar 2 07:25:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 12:25:05 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20091028121833.GC22395@sobchak.mgh.harvard.edu>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
On Wed, Oct 28, 2009 at 12:18 PM, Brad Chapman wrote:
>Peter wrote:
>> My rough work in progress is on github - at the moment I'm still trying
>> things out, and don't assume anything is set in stone. If you want to
>> have a play with this code, feedback is very welcome - probably best
>> on the dev list rather than here. See:
>>
>> http://github.com/peterjc/biopython/tree/seqrecords
>>
>> (a lot of the alignment things I want to support, like slicing and adding
>> are very closely linked to doing the same operations to SeqRecords)
Here is a new branch implementing a multiple-sequence-alignment
class (living under Bio.Align for now) based on the recent support
for slicing and adding SeqRecord objects:
http://github.com/peterjc/biopython/tree/alignment-obj
This handles most of the basic tasks I want to be able to easily do
with classical alignments, based on previous discussions on the
mailing list and/or bugzilla:
http://bugzilla.open-bio.org/show_bug.cgi?id=2551
http://bugzilla.open-bio.org/show_bug.cgi?id=2552
http://bugzilla.open-bio.org/show_bug.cgi?id=2553
http://bugzilla.open-bio.org/show_bug.cgi?id=2554
At its core, the alignment is still held as a list of SeqRecord objects,
which should mean minimal problems with backwards compatibility.
If anyone would like to try out the code, comments would be very
welcome. There are plenty of doctests in the docstrings which
should explain how I expect things to work.
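
As a rough illustration of the align[1:2, 5:-5] style of indexing from bug 2551, here is a toy sketch. It is not the code on the alignment-obj branch: the class name TinyAlignment is hypothetical, and rows are plain (id, sequence) tuples rather than SeqRecord objects, to keep the example self-contained.

```python
class TinyAlignment(object):
    """Toy alignment: a list of (id, sequence) rows of equal length."""

    def __init__(self, rows):
        rows = list(rows)
        if len(set(len(seq) for _, seq in rows)) > 1:
            raise ValueError("All rows must have the same length")
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        if isinstance(index, int):
            # align[i] gives a single row
            return self.rows[index]
        if isinstance(index, slice):
            # align[i:j] gives a sub-alignment with fewer rows
            return TinyAlignment(self.rows[index])
        if isinstance(index, tuple) and len(index) == 2:
            row_index, col_index = index
            if isinstance(row_index, int):
                # align[i, j] or align[i, j:k] gives letters from one row
                return self.rows[row_index][1][col_index]
            selected = self.rows[row_index]
            if isinstance(col_index, int):
                # align[i:j, k] gives one column as a string
                return "".join(seq[col_index] for _, seq in selected)
            # align[i:j, k:l] gives a sub-alignment in both dimensions
            return TinyAlignment((name, seq[col_index])
                                 for name, seq in selected)
        raise TypeError("Invalid index type")


align = TinyAlignment([("a", "ACGTACGTACGTAC"),
                       ("b", "ACG-ACGTAC-TAC"),
                       ("c", "ACGTAC--ACGTAC")])
sub = align[1:2, 5:-5]
column = align[:, 3]
```

Here sub holds only row "b" trimmed by five columns at each end, and column is the fourth column read top to bottom.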
> The bx-python alignment object is nice and goes to/from MAF
> and AXT formats:
>
> http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/core.py
>
> This supports slicing by alignment coordinates and by reference
> coordinates for a species in the alignment. Some other useful
> features are limiting the alignment to specific species and removing
> all gap columns that can result. The representation is a high level
> Alignment object containing multiple Components.
My code does not (yet) attempt to deal with next-gen sequencing
alignments, which would require padding all the (short) reads with
leading and trailing gaps to ensure all rows of the alignment have
the same length. Doing this in a memory efficient way could be
done with a PaddedSeq object, or a very different alignment object
(holding reads and their offsets in memory). I'm not sure what is best,
but the bx-python model looks worth understanding to help decide.
Perhaps until this is settled, it would be premature to merge my
alignment class to the trunk. After all, we may need to tweak the
alignment object class hierarchy.
Peter
From bioinformed at gmail.com Tue Mar 2 07:29:38 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 07:29:38 -0500
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020402v6b2fab6j88f4c6fc90da15a9@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
<320fb6e01003020402v6b2fab6j88f4c6fc90da15a9@mail.gmail.com>
Message-ID: <2e1434c11003020429y37343796oddf02ad433ab82ea@mail.gmail.com>
On Tue, Mar 2, 2010 at 7:02 AM, Peter wrote:
> On Tue, Mar 2, 2010 at 10:08 AM, Peter wrote:
> > On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs wrote:
> >> The only glitch so far is that the indexed access mode yields sequences
> >> with no alphabet assigned. The solution is to add the following to the
> >> beginning of SffDict.__init__:
> >> if alphabet is None:
> >> alphabet = Alphabet.generic_dna
> >
> > Thanks - I'll look at that.
>
> Yes, that looks sensible - change committed. Would you like to be credited
> in our NEWS and CONTRIB file for this little bug fix?
>
>
I'm happy to contribute and be listed in the credits.
Thanks,
-Kevin
From bioinformed at gmail.com Tue Mar 2 07:36:27 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 07:36:27 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
Message-ID: <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
On Tue, Mar 2, 2010 at 7:25 AM, Peter wrote:
> On Wed, Oct 28, 2009 at 12:18 PM, Brad Chapman
> wrote:
> [...]
> My code does not (yet) attempt to deal with next-gen sequencing
> alignments, which would require padding all the (short) reads with
> leading and trailing gaps to ensure all rows of the alignment have
> the same length. Doing this in a memory efficient way could be
> done with a PaddedSeq object, or a very different alignment object
> (holding reads and their offsets in memory). I'm not sure what is best,
> but the bx-python model looks worth understanding to help decide.
>
> Perhaps until this is settled, it would be premature to merge my
> alignment class to the trunk. After all, we may need to tweak the
> alignment object class hierarchy.
Hi Peter,
I'm just jumping in here and have not yet read all of the background
material. However, I am working with next-gen alignments and am curious as
to what you have in mind. At first glance, it sounds like you want to
access aligned reads in a 'pileup' format (i.e., an object model akin to
http://samtools.sourceforge.net/pileup.shtml). Or are you thinking of
something different entirely?
Best regards,
-Kevin
From bioinformed at gmail.com Tue Mar 2 07:28:22 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 07:28:22 -0500
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
Message-ID: <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
On Tue, Mar 2, 2010 at 5:08 AM, Peter wrote:
> On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs
> wrote:
> > I've tried out the recently landed SFF SeqIO code and am pleased to
> > report that it works very well.
>
> Great :)
>
> If you have suggestions for the documentation please voice them.
> Also did the handling of trimmed reads seem sensible? Until we
> release this we can tweak the API.
I only looked at the module documentation and it was more than sufficient to
get started. I've never really used BioPython before, so I was pleasantly
surprised at how easy it was to get started. The BioPython SFF parser and
indexed access replaced a hairy process of extracting data using 454's
sffinfo and packing it into a BDB file.
> > I am parsing gsMapper 454PairAlign.txt output and
> > converting it to SAM/BAM format to view in IGV (among other things) and
> > wanted to include per-base quality score information from the SFF files.
>
> Are you reading and writing SAM/BAM format with Python? Looking
> into this is on my (long) todo list.
>
Yes-- so far I have code to populate the basic data for unpaired reads, but
none of the optional annotations. My script reads the 454 pairwise
alignment data, finds each read in the source SFF file, figures out if extra
trimming was applied by gsMapper, and extracts the matching PHRED quality
scores. Uniquely mapped reads are given a mapping quality (MAPQ) of 60 and
non-unique reads are assigned MAPQ of 0 (as recommended by the SAMtools
FAQ). The script can output SAM records or create a subprocess to sort the
records and recode to BAM format using samtools. I've attached the current
version of the script and you are welcome to use it for any purpose.
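
The MAPQ rule described above can be sketched on its own as follows. This is a Python 3 restatement under simplifying assumptions, not an excerpt from the attached script: rows are hypothetical (read name, reference name, position) tuples, and all alignments of one read are assumed to be adjacent in the input, which is what itertools.groupby requires.

```python
from itertools import groupby
from operator import itemgetter

UNIQUE_MAPQ = 60     # read aligned to exactly one place
AMBIGUOUS_MAPQ = 0   # read aligned to more than one place


def assign_mapq(alignments):
    """Yield (read, ref, pos, mapq) for rows already grouped by read name."""
    for _, group in groupby(alignments, key=itemgetter(0)):
        group = list(group)
        mapq = UNIQUE_MAPQ if len(group) == 1 else AMBIGUOUS_MAPQ
        for read_name, ref_name, pos in group:
            yield (read_name, ref_name, pos, mapq)


rows = [("r1", "chr1", 100),
        ("r2", "chr1", 250),
        ("r2", "chr2", 40),
        ("r3", "chr2", 900)]
result = list(assign_mapq(rows))
```

Read r2 appears twice, so both of its rows get MAPQ 0, while r1 and r3 get 60.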
> > My only other comment is that several file reads and struct.unpacks can be
> > merged in _sff_read_seq_record. Given the number of records in most 454 SFF
> > files, I suspect the micro-optimization effort will be worth the slight cost
> > in code clarity.
>
> [...] I guess you mean the flowgram values, flowgram index, bases
> and qualities might be loaded with a single read? That would
> be worth trying.
>
Exactly! Also, flowgrams do not need to be unpacked when trimming. My own
bias is to encode the quality scores and flowgrams in numpy arrays rather
than lists, however I understand that the goal is to keep the external
dependencies to a minimum (although NumPy is required elsewhere).
Also, the test "chr(0)*padding != handle.read(padding)" could be written
just as clearly as "handle.read(padding).count('\0') != padding" and not
generate as many temporary objects.
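
For comparison, the two forms of the padding test can be sketched side by side. This is a Python 3 version using bytes (the 2010 code used Python 2 strings), and the function names are hypothetical. Because read(padding) never returns more than padding bytes, the count equals padding only when every byte read is a null and none are missing, so the two tests agree.

```python
import io


def padding_ok_compare(handle, padding):
    """Original style: build a zero-filled string and compare."""
    return handle.read(padding) == b"\0" * padding


def padding_ok_count(handle, padding):
    """Suggested style: count the null bytes in what was read."""
    return handle.read(padding).count(b"\0") == padding


good = b"\0\0\0"      # three null bytes, as expected
bad = b"\0\x01\0"     # a non-null byte inside the padding
short = b"\0\0"       # file ends before the padding is complete

results = [(padding_ok_compare(io.BytesIO(data), 3),
            padding_ok_count(io.BytesIO(data), 3))
           for data in (good, bad, short)]
```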
Best regards,
-Kevin
-------------- next part --------------
# -*- coding: utf-8 -*-
# Convert 454PairAlign.txt and the corresponding SFF files into SAM/BAM format
import re
import sys
from operator import getitem, itemgetter
from itertools import izip, imap, groupby, repeat
from subprocess import Popen, PIPE
import numpy as np
try:
    # Import fancy versions of basic IO functions from my GLU package
    # see http://code.google.com/p/glu-genetics
    from glu.lib.fileutils import autofile,hyphen,table_writer,table_reader
except ImportError:
    import csv

    # The real version handles automatic gz/bz2 (de)compression
    autofile = file

    def hyphen(filename,default):
        if filename=='-' and default is not None:
            return default
        return filename

    # Write a tab-delimited ASCII file
    # The real version handles many more formats (CSV, XLS, Stata), column
    # selection, header options, row filters, and other toys.
    def table_writer(filename,hyphen=None):
        if filename=='-' and hyphen is not None:
            dest = hyphen
        else:
            dest = autofile(filename,'wb')
        return csv.writer(dest, dialect='excel-tab')

    # Read a tab-delimited ASCII file
    # The real version handles many more formats (CSV, XLS, Stata), column
    # selection, header options, row filters, and other toys.
    def table_reader(filename,hyphen=None):
        if filename=='-' and hyphen is not None:
            dest = hyphen
        else:
            dest = autofile(filename,'rb')
        return csv.reader(dest, dialect='excel-tab')

# Map each (query letter, reference letter) column to a CIGAR operation
CIGAR_map = { ('-','-'):'P' }
for a in 'NACGTacgt':
    CIGAR_map[a,'-'] = 'I'
    CIGAR_map['-',a] = 'D'
    for b in 'NACGTacgt':
        CIGAR_map[a,b] = 'M'


def make_cigar_py(query,ref):
    assert len(query)==len(ref)
    igar  = imap(getitem, repeat(CIGAR_map), izip(query,ref))
    cigar = ''.join('%d%s' % (len(list(run)),code) for code,run in groupby(igar))
    return cigar


# Try to import the optimized Cython version
# The Python version is pretty fast, but I wanted to play with Cython.
try:
    from cigar import make_cigar
except ImportError:
    make_cigar = make_cigar_py

class SFFIndex(object):
    def __init__(self, sfffiles):
        self.sffindex = sffindex = {}
        for sfffile in sfffiles:
            from Bio import SeqIO
            prefix,ext = sfffile[-13:].split('.')
            assert ext=='sff'
            print >> sys.stderr,'Loading SFF index for',sfffile
            reads = SeqIO.index(sfffile, 'sff-trim')
            sffindex[prefix] = reads

    def get_quality(self, qname, query, qstart, qstop):
        prefix = qname[:9]
        sff = self.sffindex.get(prefix)
        if not sff:
            return '*'
        rec = sff[qname]
        phred = rec.letter_annotations['phred_quality']
        sffqual = np.array(phred,dtype=np.uint8)
        sffqual += 33
        sffqual = sffqual.tostring()

        # Align the query to the original read to find the matching quality
        # score information. This is complicated by the extra trimming done by
        # gsMapper. We could obtain this information by parsing the
        # 454TrimStatus.txt, but it is easier to search for the sub-sequence in
        # the reference. One hopes the read maps uniquely, but this is not
        # checked.

        # CASE 1: Forward read alignment
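        if qstart<qstop:
            # NOTE: the list archive stripped everything between "if qstart<"
            # and the debug print below (it was mistaken for an HTML tag).
            # The lines marked "reconstructed" mirror CASE 2 and are an
            # assumption, not the original text.
            read = str(rec.seq)                         # reconstructed
            seq  = read[qstart-1:qstop]                 # reconstructed
            if seq==query:                              # reconstructed
                qual = sffqual[qstart-1:qstop]          # reconstructed
            else:                                       # reconstructed
                start = read.index(query)               # reconstructed
                seq   = read[start:start+len(query)]    # reconstructed
                if seq==query:                          # reconstructed
                    #print >> sys.stderr,'MATCHED TYPE F2: name=%s, qstart=%d(%d), qstop=%d, qlen=%d, len.query=%d' % (qname,start+1,qstart,qstop,qlen,len(query))
                    qual = sffqual[start:start+len(query)]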
        # CASE 2: Backward read alignment
        else:
            # Try using specified cut-points
            read = str(rec.seq.complement())
            seq  = read[qstop-1:qstart][::-1]
            read = read[::-1]
            # If it matches, then compute quality
            if seq==query:
                qual = sffqual[qstop-1:qstart][::-1]
            else:
                # otherwise gsMapper applied extra trimming, so we have to manually find the offset
                start = read.index(query)
                seq   = read[start:start+len(query)]
                if seq==query:
                    #print >> sys.stderr,'MATCHED TYPE R2: name=%s, qstart=%d, qstop=%d(%d), qlen=%d, len.query=%d' % (qname,qstart,start+1,qstop,qlen,len(query))
                    qual = sffqual[::-1][start:start+len(query)]

        assert seq==query
        assert len(qual) == len(query)
        return qual

def pair_align(filename, sffindex):
    records = autofile(filename)
    split   = re.compile('[\t ,.]+').split

    # Mate fields are fixed placeholders because only unpaired reads are handled
    mrnm  = '*'
    mpos  = 0
    isize = 0
    mapq  = 60

    for line in records:
        assert line.startswith('>')
        fields = split(line)
        qname  = fields[0][1:]
        qstart = int(fields[1])
        qstop  = int(fields[2])
        #qlen  = int(fields[4])
        rname  = fields[6]
        rstart = int(fields[7])
        rstop  = int(fields[8])
        #rlen  = int(fields[10])
        query  = split(records.next())[2]
        qq     = query.replace('-','')
        ref    = split(records.next())[2]
        cigar  = make_cigar(query,ref)
        qual   = sffindex.get_quality(qname, qq, qstart, qstop)
        flag   = 0
        if qstart>qstop:
            flag |= 0x10
        if rstart>rstop:
            flag |= 0x20
        yield [qname, flag, rname, rstart, mapq, cigar, mrnm, mpos, isize, qq, qual]

def option_parser():
    import optparse
    usage = 'usage: %prog [options] 454PairAlign.txt[.gz] [SFFfiles.sff..]'
    parser = optparse.OptionParser(usage=usage)
    parser.add_option('-r', '--reflist', dest='reflist', metavar='FILE',
                      help='Reference genome contig list')
    parser.add_option('-o', '--output', dest='output', metavar='FILE', default='-',
                      help='Output SAM file')
    return parser

def main():
    parser = option_parser()
    options,args = parser.parse_args()

    if not args:
        parser.print_help(sys.stderr)
        sys.exit(2)

    sffindex  = SFFIndex(args[1:])
    alignment = pair_align(hyphen(args[0],sys.stdin), sffindex)
    write_bam = options.output.endswith('.bam')

    if write_bam:
        if not options.reflist:
            raise ValueError('Conversion to BAM format requires a reference genome contig list (-r/--reflist)')

        # Creating the following two-stage pipeline deadlocks due to problems with subprocess
        # -- use the shell method below instead
        #sammer = Popen(['samtools','import',options.reflist,'-','-'],stdin=PIPE,stdout=PIPE)
        #bammer = Popen(['samtools','sort','-', options.output[:-4]], stdin=sammer.stdout)
        cmd    = 'samtools import "%s" - - | samtools sort - "%s"' % (options.reflist,options.output[:-4])
        bammer = Popen(cmd,stdin=PIPE,shell=True,bufsize=-1)
        out    = table_writer(bammer.stdin)
    else:
        out = table_writer(options.output,hyphen=sys.stdout)
        out.writerow(['@HD', 'VN:1.0'])

        if options.reflist:
            reflist = table_reader(options.reflist)
            for row in reflist:
                if len(row)<2:
                    continue
                contig_name = row[0]
                contig_len  = int(row[1])
                out.writerow(['@SQ', 'SN:%s' % contig_name, 'LN:%d' % contig_len])

    print >> sys.stderr, 'Generating alignment from %s to %s' % (args[0],options.output)

    for qname,qalign in groupby(alignment,itemgetter(0)):
        qalign = list(qalign)
        if len(qalign)>1:
            # Set MAPQ to 0 for multiply aligned reads
            for row in qalign:
                row[4] = 0
                out.writerow(row)
        else:
            out.writerow(qalign[0])

    if write_bam:
        print >> sys.stderr,'Finishing BAM encoding...'
        bammer.communicate()

if __name__=='__main__':
    # Change "if 1" to "if 0" to run under the profiler instead
    if 1:
        main()
    else:
        try:
            import cProfile as profile
        except ImportError:
            import profile
        import pstats

        prof = profile.Profile()
        try:
            prof.runcall(main)
        finally:
            stats = pstats.Stats(prof)
            stats.strip_dirs()
            stats.sort_stats('time', 'calls')
            stats.print_stats(25)
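
The CIGAR construction used by make_cigar_py in the script above can be restated as a standalone Python 3 sketch. It is an approximation rather than a port: the helper cigar_op is hypothetical and classifies any non-gap character as a base, whereas the script's CIGAR_map only accepts the letters in 'NACGTacgt' and raises KeyError for anything else.

```python
from itertools import groupby


def cigar_op(query_char, ref_char):
    """Classify one alignment column as a CIGAR operation."""
    if query_char == "-" and ref_char == "-":
        return "P"   # padding: gap in both rows
    if ref_char == "-":
        return "I"   # insertion: base in the query only
    if query_char == "-":
        return "D"   # deletion: base in the reference only
    return "M"       # match or mismatch


def make_cigar(query, ref):
    """Run-length encode the per-column operations into a CIGAR string."""
    if len(query) != len(ref):
        raise ValueError("query and ref must be the same length")
    ops = (cigar_op(q, r) for q, r in zip(query, ref))
    return "".join("%d%s" % (len(list(run)), op) for op, run in groupby(ops))


example = make_cigar("ACG-TT", "AC-GTT")
```

For the gapped pair shown, the columns classify as M, M, I, D, M, M, so example is "2M1I1D2M".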
From biopython at maubp.freeserve.co.uk Tue Mar 2 08:01:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 13:01:53 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
<2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
Message-ID: <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
Kevin wrote:
> I only looked at the module documentation and it was more than sufficient to
> get started. I've never really used BioPython before, so I was pleasantly
> surprised at how easy it was to get started. The BioPython SFF parser and
> indexed access replaced a hairy process of extracting data using 454's
> sffinfo and packing it into a BDB file.
Great :)
>> > I am parsing gsMapper 454PairAlign.txt output and
>> > converting it to SAM/BAM format to view in IGV (among other things) and
>> > wanted to include per-base quality score information from the SFF
>> > files.
>>
>> Are you reading and writing SAM/BAM format with Python? Looking
>> into this is on my (long) todo list.
>
> Yes-- so far I have code to populate the basic data for unpaired reads, but
> none of the optional annotations. My script reads the 454 pairwise
> alignment data, finds each read in the source SFF file, figures out if extra
> trimming was applied by gsMapper, and extracts the matching PHRED quality
> scores. Uniquely mapped reads are given a mapping quality (MAPQ) of 60 and
> non-unique reads are assigned MAPQ of 0 (as recommended by the SAMtools
> FAQ). The script can output SAM records or create a subprocess to sort the
> records and recode to BAM format using samtools. I've attached the current
> version of the script and you are welcome to use it for any purpose.
I'll take a look...
>> [...] I guess you mean the flowgram values, flowgram index, bases
>> and qualities might be loaded with a single read? That would
>> be worth trying.
>
> Exactly!
If I recall correctly, I felt the unpacking was more complicated (and not
needed for the sequence bases), but I agree that if this is faster it is
worthwhile.
> Also, flowgrams do not need to be unpacked when trimming.
True, that shouldn't make the function much more complex. I'll try
to look at that later today.
> My own bias is to encode the quality scores and flowgrams in numpy
> arrays rather than lists, however I understand that the goal is to keep
> the external dependencies to a minimum (although NumPy is required
> elsewhere).
Yes, I did wonder about using NumPy here but wanted to ensure that
the core of Biopython remains without an external dependency here.
> Also, the test "chr(0)*padding != handle.read(padding)" could be written
> just as clearly as "handle.read(padding).count('\0') != padding" and not
> generate as many temporary objects.
Good point, done - and you're in the contributors list now ;)
Thanks,
Peter
From biopython at maubp.freeserve.co.uk Tue Mar 2 09:34:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 14:34:07 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
Message-ID: <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
On Tue, Mar 2, 2010 at 12:36 PM, Kevin Jacobs wrote:
> On Tue, Mar 2, 2010 at 7:25 AM, Peter wrote:
>> My code does not (yet) attempt to deal with next-gen sequencing
>> alignments, which would require padding all the (short) reads with
>> leading and trailing gaps to ensure all rows of the alignment have
>> the same length. Doing this in a memory efficient way could be
>> done with a PaddedSeq object, or a very different alignment object
>> (holding reads and their offsets in memory). I'm not sure what is best,
>> but the bx-python model looks worth understanding to help decide.
>>
>> Perhaps until this is settled, it would be premature to merge my
>> alignment class to the trunk. After all, we may need to tweak the
>> alignment object class hierarchy.
>
>
> Hi Peter,
>
> I'm just jumping in here and have not yet read all of the background
> material. However, I am working with next-gen alignments and am
> curious as to what you have in mind. At first glance, it sounds like
> you want to access aligned reads in a 'pileup' format (i.e., an object
> model akin to http://samtools.sourceforge.net/pileup.shtml). Or are
> you thinking of something different entirely?
Probably something different. My general concern boils down to
the fact that the current Alignment model as an enhanced "list of
SeqRecord objects" is potentially limiting.
The alignment code in Biopython (and my branch which is basically
an extension to that) deals with classical multiple sequence alignments
like ClustalW etc. You can think of the alignment as a matrix of letters,
each row is a sequence (e.g. a gene), and there will be some gap
characters for insertions, and padding for leading/trailing omissions.
There may or may not be a consensus sequence too.
With assemblies you have a (long) consensus with many (short) reads
aligned to it. In order to hold this as a "matrix" representation, all the
(short) reads would require (lots of) leading/trailing padding. The same
applies when mapping reads to a reference genome.
So, while the current object model may work, all this extra padding
might mean too much of a memory overhead (especially as all the
rows are currently stored as SeqRecord objects). Instead, we might
just store the (short) read sequence, name, and its offset (and
perhaps the strand). We can then reconstruct columns or rows
mimicking the "matrix" interpretation on demand. However, the
API should make it easy to get the unpadded reads and their
offsets too - so the current alignment API might either be extended
or perhaps changed.
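
To make the offset-based idea concrete, here is a small sketch. It is an assumption about how such an object could look, not code from the branch: the class name OffsetAlignment and its methods are hypothetical. Only the unpadded read, its name and its offset are stored, and padded rows or columns of the "matrix" view are built on demand.

```python
class OffsetAlignment(object):
    """Store unpadded reads with offsets; build the matrix view lazily."""

    def __init__(self, length, gap="-"):
        self.length = length   # number of columns in the full alignment
        self.gap = gap
        self.reads = []        # list of (name, unpadded sequence, offset)

    def add(self, name, seq, offset):
        if offset < 0 or offset + len(seq) > self.length:
            raise ValueError("Read does not fit inside the alignment")
        self.reads.append((name, seq, offset))

    def row(self, index):
        """Return one read padded with leading and trailing gaps."""
        _, seq, offset = self.reads[index]
        trailing = self.length - offset - len(seq)
        return self.gap * offset + seq + self.gap * trailing

    def column(self, position):
        """Return one alignment column as a string, one letter per read."""
        if not 0 <= position < self.length:
            raise IndexError("Column out of range")
        letters = []
        for _, seq, offset in self.reads:
            local = position - offset
            letters.append(seq[local] if 0 <= local < len(seq) else self.gap)
        return "".join(letters)


align = OffsetAlignment(12)
align.add("read1", "ACGT", 0)
align.add("read2", "GTTA", 2)
align.add("read3", "TACC", 8)
padded = align.row(1)
col3 = align.column(3)
```

Memory use grows with the total length of the reads rather than with reads times columns, and the unpadded reads and offsets stay directly available in align.reads.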
Related to this, a "Lite" version of the alignment object might
be useful when there is no annotation requiring using SeqRecord
objects. e.g. For ClustalW, FASTA, PHYLIP alignments all we
need is the sequence and identifiers.
Regarding one of your points, accessing aligned reads (or rows)
from an alignment - currently this is only supported by index
(row number). In most cases the reads (rows) have a unique
identifier/name, and thus one idea I am considering for this
branch is overloading the align[...] syntax further to allow a
record's id to be used as an alternative. i.e. More like a dictionary.
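
A minimal sketch of that dictionary-like lookup, assuming unique ids and using (id, sequence) tuples in place of SeqRecord objects. The class name KeyedAlignment is hypothetical and this is not the API on the branch:

```python
class KeyedAlignment(object):
    """Rows can be fetched by position (like a list) or by id (like a dict)."""

    def __init__(self, records):
        self._records = list(records)   # (id, sequence) tuples
        self._by_id = {}
        for position, (record_id, _) in enumerate(self._records):
            if record_id in self._by_id:
                raise ValueError("Duplicate id: %s" % record_id)
            self._by_id[record_id] = position

    def __getitem__(self, key):
        if isinstance(key, str):
            # String keys are treated as record ids; KeyError if unknown
            return self._records[self._by_id[key]]
        # Integers and slices behave as they do for a list
        return self._records[key]


align = KeyedAlignment([("alpha", "ACGT-A"), ("beta", "AC-TTA")])
by_index = align[1]
by_name = align["beta"]
```

One design question this raises is what to do when ids are not unique; the sketch simply refuses to build the alignment in that case.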
Other ideas for enhancements on this branch include sorting
the rows (with a list like sort method, defaulting to sorting on the
record's id strings), per-column annotation (useful for PFAM
alignments and the match string in pairwise alignments), and
a general annotations dictionary (like we have on SeqRecord
objects).
Peter
From bioinformed at gmail.com Tue Mar 2 09:36:32 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 09:36:32 -0500
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
<2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
<320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
Message-ID: <2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com>
On Tue, Mar 2, 2010 at 8:01 AM, Peter wrote:
> Kevin wrote:
> > My own bias is to encode the quality scores and flowgrams in numpy
> > arrays rather than lists, however I understand that the goal is to keep
> > the external dependencies to a minimum (although NumPy is required
> > elsewhere).
>
> Yes, I did wonder about using NumPy here but wanted to ensure that
> the core of Biopython remains without an external dependency here.
>
In addition to not creating many little objects, my leanings toward using
NumPy are also due to the generality of tricks like the following to recode
quality scores to Sanger ASCII-33 format:
    sffqual = np.array(rec.letter_annotations['phred_quality'], dtype=np.uint8)
    sffqual += 33
    sffqual = sffqual.tostring()
That said, the alternatives aren't that slow and small integers are shared
from a pre-allocated pool, so this is not as big a concern.
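
For reference, one NumPy-free alternative could look like the sketch below. It is a Python 3 illustration with a hypothetical function name, and the range check (printable ASCII up to 126, i.e. PHRED 0 to 93) is an added assumption rather than something discussed in this thread.

```python
def phred_to_sanger(phred):
    """Encode PHRED scores as a Sanger (ASCII offset 33) quality string."""
    chars = []
    for score in phred:
        if not 0 <= score <= 93:
            # 33 + 93 = 126 is the last printable ASCII character
            raise ValueError("PHRED score out of range: %r" % (score,))
        chars.append(chr(score + 33))
    return "".join(chars)


encoded = phred_to_sanger([0, 10, 40])
```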
-Kevin
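For comparison, a minimal sketch of the same recoding without NumPy, using only the standard library. The function names are hypothetical and the list of integers stands in for rec.letter_annotations['phred_quality']; the second variant uses the standard array module as one possible way to avoid building many small objects, which is an assumption about what would be acceptable rather than anything proposed in the thread.

```python
from array import array


def phred_to_sanger(phred_scores):
    """Encode PHRED scores as a Sanger FASTQ quality string (ASCII offset 33)."""
    return "".join(chr(score + 33) for score in phred_scores)


def phred_to_sanger_compact(phred_scores):
    """Same result via a typed array of unsigned bytes instead of a str join."""
    encoded = array("B", (score + 33 for score in phred_scores))
    return encoded.tobytes().decode("ascii")


# Stand-in for rec.letter_annotations['phred_quality']
quals = [0, 10, 20, 30, 40]
sanger = phred_to_sanger(quals)
```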
From biopython at maubp.freeserve.co.uk Tue Mar 2 09:44:13 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 14:44:13 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
<2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
<320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
<2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com>
Message-ID: <320fb6e01003020644u229e6353ufee054403e562915@mail.gmail.com>
On Tue, Mar 2, 2010 at 2:36 PM, Kevin Jacobs
wrote:
> On Tue, Mar 2, 2010 at 8:01 AM, Peter wrote:
>> Yes, I did wonder about using NumPy here but wanted to ensure that
>> the core of Biopython remains without an external dependency here.
>
> In addition to not creating many little objects, my leanings toward using
> NumPy are also due to the generality of tricks like the following to recode
> quality scores to Sanger ASCII-33 format:
>
>     sffqual = np.array(rec.letter_annotations['phred_quality'], dtype=np.uint8)
>     sffqual += 33
>     sffqual = sffqual.tostring()
>
Yeah - I had this kind of thing in mind for the qualities, both when
looking at the SFF files and earlier when doing the FASTQ and
QUAL stuff.
You can probably make that more efficient with one line:
    sffqual = (np.array(rec.letter_annotations['phred_quality'],
                        dtype=np.uint8) + 33).tostring()
Not sure if it will make a measurable difference mind you ;)
> That said, the alternatives aren't that slow and small integers are shared
> from a pre-allocated pool, so this is not as big a concern.
Indeed.
Peter
From bioinformed at gmail.com Tue Mar 2 09:51:04 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 09:51:04 -0500
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020644u229e6353ufee054403e562915@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
<2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
<320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
<2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com>
<320fb6e01003020644u229e6353ufee054403e562915@mail.gmail.com>
Message-ID: <2e1434c11003020651y541ce3e5q92fb0fea308a59e9@mail.gmail.com>
On Tue, Mar 2, 2010 at 9:44 AM, Peter wrote:
> You can probably make that more efficient with one line:
>
>     sffqual = (np.array(rec.letter_annotations['phred_quality'],
>                         dtype=np.uint8) + 33).tostring()
>
> Not sure if it will make a measurable difference mind you ;)
>
I haven't measured, but my understanding is that the inplace "+= 33" will
avoid creating a temporary copy and thus be quicker. But as you said, not
likely to make a difference in practice.
-Kevin
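A minimal sketch of the two variants being compared. It assumes NumPy is installed, and falls back to plain Python if it is not, in which case it says nothing about NumPy's behaviour. The function names are made up for illustration, and tobytes() is used because tostring() was deprecated in later NumPy releases. The sketch only shows that both forms give the same bytes; it does not measure speed.

```python
try:
    import numpy as np
except ImportError:  # keep the sketch runnable without NumPy
    np = None


def recode_inplace(quals):
    """Add 33 in place (no temporary array for the sum), then serialise."""
    if np is None:
        return bytes(q + 33 for q in quals)
    arr = np.array(quals, dtype=np.uint8)
    arr += 33  # modifies arr's own buffer
    return arr.tobytes()


def recode_one_line(quals):
    """Add 33 in an expression, which allocates a second array for the sum."""
    if np is None:
        return bytes(q + 33 for q in quals)
    return (np.array(quals, dtype=np.uint8) + 33).tobytes()


quals = [0, 10, 20, 30, 40]  # stand-in for the phred_quality list
inplace_bytes = recode_inplace(quals)
one_line_bytes = recode_one_line(quals)
```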
From chapmanb at 50mail.com Tue Mar 2 10:03:08 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 2 Mar 2010 10:03:08 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
Message-ID: <20100302150308.GP98028@sobchak.mgh.harvard.edu>
Peter and Kevin;
> >> My code does not (yet) attempt to deal with next-gen sequencing
> >> alignments,
[...]
> >> Perhaps until this is settled, it would be premature to merge my
> >> alignment class to the trunk. After all, we may need to tweak the
> >> alignment object class hierarchy.
My vote would be to merge what you've done in for handling
standard multiple alignments, and then look at next-generation read
representation as an analogous but separate problem. All of the
SeqRecord objects which are useful for drilling in on multiple
alignments are likely going to be memory hogs for any real world
next gen work.
> > I'm just jumping in here and have not yet read all of the background
> > material. However, I am working with next-gen alignments and am
> > curious as to what you have in mind. At first glance, it sounds like
> > you want to access aligned reads in a 'pileup' format (i.e., an object
> > model akin to http://samtools.sourceforge.net/pileup.shtml). Or are
> > you thinking of something different entirely?
This is a good way to go. SAM is at least an emerging standard that
people are adopting, and samtools and the pysam module do a good job
of dealing with them:
http://code.google.com/p/pysam/
pysam exposes a Pileup style API from sorted and indexed BAM files
and scales great for large alignment files:
http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html
This is a good starting point for providing interoperability with
Biopython; it would be great to re-use what we can from these
projects.
Brad
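To make the 'pileup' idea concrete, here is a minimal pure-Python sketch that groups read bases by reference position. It is not the pysam API: pileup is a hypothetical function, and the reads are assumed to align without gaps, whereas a real implementation would have to interpret CIGAR strings, strands and qualities.

```python
from collections import defaultdict


def pileup(reads):
    """Group read bases by reference position.

    reads: iterable of (name, start, sequence) with 0-based start positions,
    assumed to align without gaps (an assumption made for brevity).
    Returns {reference_position: [(read_name, base), ...]}.
    """
    columns = defaultdict(list)
    for name, start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset].append((name, base))
    return dict(columns)


reads = [("r1", 0, "ACGT"), ("r2", 2, "GTTA"), ("r3", 3, "TTAC")]
columns = pileup(reads)
depth_at_3 = len(columns[3])
bases_at_3 = "".join(base for _, base in columns[3])
```

Each column lists every read covering that position, which is the view a genotype caller typically wants.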
From biopython at maubp.freeserve.co.uk Tue Mar 2 10:28:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 15:28:45 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
<2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
Message-ID: <320fb6e01003020728v760e8208h5da4288dfaef7ed7@mail.gmail.com>
On Tue, Mar 2, 2010 at 12:28 PM, Kevin Jacobs wrote:
>
> Also, flowgrams do not need to be unpacked when trimming.
>
True - change made on the trunk, should make parsing SFF files
as trimmed records a little bit faster.
Thanks
Peter
From biopython at maubp.freeserve.co.uk Tue Mar 2 11:43:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 16:43:18 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20100302150308.GP98028@sobchak.mgh.harvard.edu>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01003020843n72a23176wa023786c46ffb7b3@mail.gmail.com>
On Tue, Mar 2, 2010 at 3:03 PM, Brad Chapman wrote:
> Peter and Kevin;
>
>> >> My code does not (yet) attempt to deal with next-gen sequencing
>> >> alignments,
> [...]
>> >> Perhaps until this is settled, it would be premature to merge my
>> >> alignment class to the trunk. After all, we may need to tweak the
>> >> alignment object class hierarchy.
>
> My vote would be to merge what you've done in for handling
> standard multiple alignments, and then look at next-generation read
> representation as an analogous but separate problem. All of the
> SeqRecord objects which are useful for drilling in on multiple
> alignments are likely going to be memory hogs for any real world
> next gen work.
OK - that is what I was leaning towards.
What do you think about the fact I am introducing an "improved"
version of the existing Bio.Align.Generic.Alignment class under
Bio.Align.MultipleSeqAlignment?
That's actually several questions in one - should this be a new
object or just enhance the old one? I favour a new object here
because I want to *enforce* the fact that all the rows are the
same length, but I doubt people are using the flexibility of
the current alignment object in this way.
Next where should the new object live? I find the current use
of Bio.Align.Generic somewhat hidden away, thus my
suggestion of using Bio.Align directly.
Next, what should the new object be called? We could reuse
the old name of Alignment but it is a bit vague and would
cause confusion given the existing object is also called that.
I have used MultipleSeqAlignment but am open to suggestions
(e.g. MulSeqAlignment is shorter).
Peter
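A toy sketch of the two behaviours discussed here: rejecting rows of a different length, and two-dimensional slicing in the style of align[1:2, 5:-5]. MiniAlignment is a hypothetical class written for this note, holding (id, sequence) tuples rather than SeqRecord objects, and it only handles the case where both indices are slices. It is not the code on the branch.

```python
class MiniAlignment:
    """Toy alignment: rows are (id, sequence) pairs of equal length."""

    def __init__(self, rows=()):
        self._rows = []
        for row_id, seq in rows:
            self.append(row_id, seq)

    def append(self, row_id, seq):
        # Enforce the invariant that every row has the same length.
        if self._rows and len(seq) != len(self._rows[0][1]):
            raise ValueError("Sequences must all be the same length")
        self._rows.append((row_id, seq))

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        # align[rows, cols] where both parts are slices, e.g. align[1:2, 5:-5]
        row_index, col_index = index
        return MiniAlignment(
            (row_id, seq[col_index]) for row_id, seq in self._rows[row_index]
        )

    def rows(self):
        return list(self._rows)


align = MiniAlignment([("a", "ACGTACGTAC"), ("b", "ACG-ACGTAC"), ("c", "ACGTAC-TAC")])
sub = align[1:3, 2:-2]

try:
    align.append("d", "ACGT")  # wrong length, should be refused
    rejected = False
except ValueError:
    rejected = True
```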
From bioinformed at gmail.com Tue Mar 2 12:07:03 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 12:07:03 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20100302150308.GP98028@sobchak.mgh.harvard.edu>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
Message-ID: <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
On Tue, Mar 2, 2010 at 10:03 AM, Brad Chapman wrote:
> Kevin;
> > > I'm just jumping in here and have not yet read all of the background
> > > material. However, I am working with next-gen alignments and am
> > > curious as to what you have in mind. At first glance, it sounds like
> > > you want to access aligned reads in a 'pileup' format (i.e., an object
> > > model akin to http://samtools.sourceforge.net/pileup.shtml). Or are
> > > you thinking of something different entirely?
>
> This is a good way to go. SAM is at least an emerging standard that
> people are adopting, and samtools and the pysam module do a good job
> of dealing with them:
>
> http://code.google.com/p/pysam/
>
>
I find pysam pretty limited for doing more than reading and subsetting
SAM/BAM files. I'm planning to add a constructor and helper functions for
creating new aligned reads. The current AlignedRead object is also
read-only, which will need to be relaxed for many serious applications.
Until then, I'm writing (text) SAM records and piping them to samtools to
encode in BAM format (see the script attached to one of my earlier emails).
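A minimal sketch of what writing a text SAM record can look like, assuming the eleven mandatory tab-separated columns of the SAM format followed by optional TAG:TYPE:VALUE fields. format_sam_record is a hypothetical helper, not the attached script, and the step of piping the text to samtools is not shown.

```python
def format_sam_record(qname, flag, rname, pos, mapq, cigar, seq, qual,
                      rnext="*", pnext=0, tlen=0, tags=()):
    """Return one tab-separated SAM alignment line (without newline).

    pos is 1-based, as in the SAM text format. tags is an iterable of
    (tag, type, value) triples, e.g. ("NM", "i", 0).
    """
    fields = [qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual]
    fields.extend("%s:%s:%s" % tag for tag in tags)
    return "\t".join(str(field) for field in fields)


line = format_sam_record("read1", 0, "chr1", 100, 60, "4M", "ACGT", "IIII",
                         tags=[("NM", "i", 0)])
```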
> pysam exposes a Pileup style API from sorted and indexed BAM files
> and scales great for large alignment files:
>
> http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html
Scalability is okay for conversion to pileup format, but not what I'd
consider great. But I agree, pysam is a good starting point. I just wish
that the read identifiers and attributes were available via the C API,
since those are often needed when, e.g., writing a genotype caller.
-Kevin
From chapmanb at 50mail.com Wed Mar 3 09:12:15 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 3 Mar 2010 09:12:15 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
Message-ID: <20100303141215.GZ98028@sobchak.mgh.harvard.edu>
Kevin and Peter;
> I find pysam pretty limited for doing more than reading and subsetting
> SAM/BAM files. I'm planning to add a constructor and helper functions for
> creating new aligned reads. The current AlignedRead object is also
> read-only, which will need to be relaxed for many serious applications.
> Until then, I'm writing (text) SAM records and piping them to samtools to
> encode in BAM format (see the script attached to one of my earlier emails).
Agreed. These sound like good improvements.
> Scalability is okay for conversion to pileup format, but not what I'd
> consider great. But I agree, pysam is a good starting point. I just wish
> that the read identifiers and attributes were available via the C API,
> since those are often needed when, e.g., writing a genotype caller.
Do you think we could build off of what pysam has? The project hasn't
seemed especially active, but it would be great to have a unified
code base in python for dealing with BAM files. They use mercurial
for revision control, so worst case we can always fork this on
bitbucket and work off of that. Galaxy has a fork for their use:
http://bitbucket.org/kanwei/kanwei-pysam/
The bioconductor folks also seem to be standardizing around SAM/BAM for
their analysis pipelines, so practically we may be able to borrow
some of their APIs once they have a released version of Rsamtools.
> What do you think about the fact I am introducing an "improved"
> version of the existing Bio.Align.Generic.Alignment class under
> Bio.Align.MultipleSeqAlignment?
Yes please. I don't think Generic is that great and am happy to see
it improved upon.
> That's actually several questions in one - should this be a new
> object or just enhance the old one? I favour a new object here
> because I want to *enforce* the fact that all the rows are the
> same length, but I doubt people are using the flexibility of
> the current alignment object in this way.
>
> Next where should the new object live? I find the current use
> of Bio.Align.Generic somewhat hidden away, thus my
> suggestion of using Bio.Align directly.
>
> Next, what should the new object be called? We could reuse
> the old name of Alignment but it is a bit vague and would
> cause confusion given the existing object is also called that.
> I have used MultipleSeqAlignment but am open to suggestions
> (e.g. MulSeqAlignment is shorter).
I like MultipleSeqAlignment, and agree it should be as top level as
possible in Bio.Align. If you think a new object is better, go for
that and we can move Generic on a deprecation path. It's great you
are cleaning this up.
Brad
From biopython at maubp.freeserve.co.uk Wed Mar 3 10:03:38 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 3 Mar 2010 15:03:38 +0000
Subject: [Biopython-dev] EMBOSS eprimer3 parser
In-Reply-To: <320fb6e01001180833l6396cf23meb7e160fd6814e26@mail.gmail.com>
References: <320fb6e01001180833l6396cf23meb7e160fd6814e26@mail.gmail.com>
Message-ID: <320fb6e01003030703k691fdbe8i3ab3dfd5ba1640a6@mail.gmail.com>
On Mon, Jan 18, 2010 at 4:33 PM, Peter wrote:
> Hi all,
>
> Who on the dev list makes heavy use of the EMBOSS eprimer3 parser in
> Biopython? I'd like someone to look over Leighton's proposed enhancements
> to this code: http://bugzilla.open-bio.org/show_bug.cgi?id=2968
>
> There are two main issues. First, the current code doesn't cope with multiple
> primer sets (so Leighton introduces read/parse functions in line with other
> modules for single or multiple sets of primers). This seems entirely sensible
> to me, and worthwhile in itself.
I've made changes on github to do this based on Leighton's code.
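For readers unfamiliar with the read/parse convention, a minimal sketch follows. It uses a made-up text format (a line starting with '#' names a target sequence, other lines are its primers), not real eprimer3 output, and the functions are illustrative rather than the code on github. The idea is that parse() iterates over any number of records, while read() insists on exactly one.

```python
import io


def parse(handle):
    """Yield one record (a dict) per target sequence in a toy text format."""
    record = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A new target starts; hand back the previous one first.
            if record is not None:
                yield record
            record = {"target": line[1:].strip(), "primers": []}
        elif record is not None:
            record["primers"].append(line)
    if record is not None:
        yield record


def read(handle):
    """Return exactly one record, or raise ValueError otherwise."""
    records = parse(handle)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found in handle") from None
    for _ in records:
        raise ValueError("More than one record found in handle")
    return first


two_targets = "# seqA\nACGTAC\nGGTACC\n# seqB\nTTGACA\n"
records = list(parse(io.StringIO(two_targets)))
single = read(io.StringIO("# seqA\nACGTAC\n"))
```

Keeping primers grouped per target is what avoids associating primers with the wrong sequence downstream.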
> Second, Leighton makes some changes to the primer record objects.
> I'm not so sure about the necessity here, even if it is backwards
> compatible, but I haven't really used this code. What do the rest of
> you think?
I expect to be doing some work with eprimer3 this month, so I should
be able to make a more informed choice later.
Peter
From bugzilla-daemon at portal.open-bio.org Wed Mar 3 10:06:47 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Mar 2010 10:06:47 -0500
Subject: [Biopython-dev] [Bug 2968] Modifications to Emboss eprimer3 parser
and associated files
In-Reply-To:
Message-ID: <201003031506.o23F6lgb005243@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2968
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-03 10:06 EST -------
(In reply to comment #0)
> The existing Emboss primer3/eprimer3 code has a couple of issues, and some
> scope for improvement:
>
> - The existing Primer3.py parser code can only parse output when eprimer3 is
> applied to a single sequence. When eprimer3 is applied to multiple sequence
> input, it groups all primers for all sequences into a single record, which may
> incorrectly associate primers with the wrong sequences in downstream analysis.
> - The current parser lacks an iterator for iterating over multiple sequence
> output
I've made changes on github to support multiple targets (with a read and
a parse function). This is based on Leighton's code and addresses the above
issues.
> - The current parser creates 'ghost' primers for all primer pairs, with length
> zero and sequence as an empty string; it does not do this for internal oligos.
> A more intuitive solution might be to return None for absent primers/oligos
> - The current data model stores all primer data as individual attributes. It
> might be more useful to group the attributes of individual primers into their
> natural associations
Regarding the object changes, I'll be doing some work with eprimer3 this month,
so I should be able to make a more informed choice later.
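To help weigh the proposed object changes, here is a minimal sketch of what "group the attributes" and "None for absent primers" could look like. The class and field names are hypothetical, chosen for illustration, and the values are made up; this is neither the existing Primer3 record class nor Leighton's patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Oligo:
    """Attributes that naturally belong to one primer or internal oligo."""
    start: int
    length: int
    tm: float
    gc: float
    seq: str


@dataclass
class PrimerPair:
    """One primer set; an absent primer/oligo is None, not an empty 'ghost'."""
    size: int
    forward: Optional[Oligo] = None
    reverse: Optional[Oligo] = None
    internal: Optional[Oligo] = None


# Made-up values, for illustration only.
pair = PrimerPair(
    size=200,
    forward=Oligo(start=10, length=20, tm=60.0, gc=50.0, seq="ACGTACGTACGTACGTACGT"),
    reverse=Oligo(start=190, length=20, tm=60.0, gc=50.0, seq="TGCATGCATGCATGCATGCA"),
)
has_internal = pair.internal is not None
```

With this shape, code that needs the internal oligo tests for None instead of checking for a zero-length sequence.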
See also:
http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007255.html
http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007398.html
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Wed Mar 3 10:57:09 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 3 Mar 2010 15:57:09 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20100303141215.GZ98028@sobchak.mgh.harvard.edu>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
On Wed, Mar 3, 2010 at 2:12 PM, Brad Chapman wrote:
> Kevin and Peter;
>
>> I find pysam pretty limited for doing more than reading and subsetting
>> SAM/BAM files. I'm planning to add a constructor and helper functions for
>> creating new aligned reads. The current AlignedRead object is also
>> read-only, which will need to be relaxed for many serious applications.
>> Until then, I'm writing (text) SAM records and piping them to samtools to
>> encode in BAM format (see the script attached to one of my earlier emails).
>
> Agreed. These sound like good improvements.
>
>> Scalability is okay for conversion to pileup format, but not what I'd
>> consider great. But I agree, pysam is a good starting point. I just wish
>> that the read identifiers and attributes were available via the C API,
>> since those are often needed when, e.g., writing a genotype caller.
>
> Do you think we could build off of what pysam has? The project hasn't
> seemed especially active, but it would be great to have a unified
> code base in python for dealing with BAM files. They use mercurial
> for revision control, so worst case we can always fork this on
> bitbucket and work off of that. Galaxy has a fork for their use:
>
> http://bitbucket.org/kanwei/kanwei-pysam/
>
> The bioconductor folks also seem to be standardizing around
> SAM/BAM for their analysis pipelines, so practically we may be
> able to borrow some of their APIs once they have a released
> version of Rsamtools.
I agree that we should work towards supporting SAM (and perhaps
also BAM) in Biopython, and other projects' APIs can be very
useful for inspiration or guidance.
I was aware of pysam but am concerned about the dependencies:
pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools
itself - which may all be fine on Linux, but will likely be trouble for
us on other platforms (especially Windows).
Is anyone aware of any other SAM/BAM parser in Python?
>> What do you think about the fact I am introducing an "improved"
>> version of the existing Bio.Align.Generic.Alignment class under
>> Bio.Align.MultipleSeqAlignment?
>
> Yes please. I don't think Generic is that great and am happy to see
> it improved upon.
>
>> That's actually several questions in one - should this be a new
>> object or just enhance the old one? I favour a new object here
>> because I want to *enforce* the fact that all the rows are the
>> same length, but I doubt people are using the flexibility of
>> the current alignment object in this way.
>>
>> Next where should the new object live? I find the current use
>> of Bio.Align.Generic somewhat hidden away, thus my
>> suggestion of using Bio.Align directly.
>>
>> Next, what should the new object be called? We could reuse
>> the old name of Alignment but it is a bit vague and would
>> cause confusion given the existing object is also called that.
>> I have used MultipleSeqAlignment but am open to suggestions
>> (e.g. MulSeqAlignment is shorter).
>
> I like MultipleSeqAlignment, and agree it should be as top level as
> possible in Bio.Align. If you think a new object is better, go for
> that and we can move Generic on a deprecation path. It's great you
> are cleaning this up.
OK then - I've been wanting to "clean this up" for some time.
I'll make time to merge what I have so far (which shouldn't be
controversial) and update the tutorial.
I would also like to investigate moving the useful bits of the
SummaryInfo class into methods of the main alignment class.
Testing would be very welcome!
Peter
From biopython at maubp.freeserve.co.uk Wed Mar 3 12:51:41 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 3 Mar 2010 17:51:41 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
Message-ID: <320fb6e01003030951n261c124bq31578bc9cc5814c9@mail.gmail.com>
On Wed, Mar 3, 2010 at 3:57 PM, Peter wrote:
>
> OK then - I've been wanting to "clean this up" for some time.
> I'll make time to merge what I have so far (which shouldn't be
> controversial) and update the tutorial.
The merge is done, updates to the tutorial to show how to
use the new object pending (but already in the doctests).
Peter
From bioinformed at gmail.com Wed Mar 3 13:30:49 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Wed, 3 Mar 2010 13:30:49 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
Message-ID: <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
On Wed, Mar 3, 2010 at 10:57 AM, Peter wrote:
> I agree that we should work towards supporting SAM (and perhaps
> also BAM) in Biopython, and other projects APIs can be very
> useful for inspiration or guidance.
>
>
Honestly, the SAM/BAM format specification is pretty dodgy. Thankfully
between samtools and Picard source code, I've been able to work out most of
the tricky bits. I'm glad to know that the R folks are also working on
this, since they're usually very good about generating clear documentation.
> I was aware of pysam but am concerned about the dependencies:
> pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools
> itself - which may all be fine on Linux, but will likely be trouble for
> us on other platforms (especially Windows).
>
> Is anyone aware of any other SAM/BAM parser in Python?
Parsing SAM is pretty simple and I can certainly help with gluing it into
Biopython (with some help on the Biopython side, since I'm still a newb).
I'm about half-way to having a BAM reader and writer for my own purposes.
I'm coding the time-critical parts in Cython with a fallback to pure
Python, so it may not be ideal for use in Biopython.
-Kevin
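The "compiled code with a pure-Python fallback" arrangement usually comes down to an import guard like the one sketched below. The module name _fast_bam and the function unpack_int32s are made up for illustration and are not Kevin's code. The only format detail assumed is that BAM stores multi-byte integers in little-endian order.

```python
import struct

try:
    # Hypothetical compiled (e.g. Cython) extension module; the name is made up.
    from _fast_bam import unpack_int32s
except ImportError:
    def unpack_int32s(data):
        """Pure-Python fallback: decode little-endian 32-bit signed integers."""
        count = len(data) // 4
        return list(struct.unpack("<%di" % count, data[:count * 4]))


values = unpack_int32s(struct.pack("<3i", 1, -2, 300))
```

Callers import one name and get whichever implementation is available, so the compiled part stays optional.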
From chapmanb at 50mail.com Thu Mar 4 08:13:52 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 4 Mar 2010 08:13:52 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
<2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
Message-ID: <20100304131352.GB19053@sobchak.mgh.harvard.edu>
Kevin and Peter;
> I was aware of pysam but am concerned about the dependencies:
> pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools
> itself - which may all be fine on Linux, but will likely be trouble for
> us on other platforms (especially Windows).
I believe you can remove the pyrex requirement by shipping the
generated C file with the distribution. Samtools itself may be an
issue; however, right now it is probably a practical need for dealing
with SAM/BAM since it implements a lot of BAM generation, sorting,
merging and indexing you need in workflows. Also, the C code is
included with the distribution so it is more a matter of getting it
compiled than introducing extra dependencies. The bioconductor work
appears to do the same thing.
> > I agree that we should work towards supporting SAM (and perhaps
> > also BAM) in Biopython, and other projects APIs can be very
> > useful for inspiration or guidance.
All of my work converts SAM directly into sorted and indexed BAM,
and then builds from that. For me, direct SAM parsing wouldn't be as
useful as BAM.
> Honestly, the SAM/BAM format specification is pretty dodgy. Thankfully
> between samtools and Picard source code, I've been able to work out most of
> the tricky bits. I'm glad to know that the R folks are also working on
> this, since they're usually very good about generating clear documentation.
Agreed, but at least we are converging on something instead of
having to write a parser every time you use a new aligner. The
bioconductor SVN is here:
https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/Rsamtools/
(user: readonly, pass: readonly)
I think the pysam API does a decent job for reading and exposing
this. The higher level things that would be nice to add are:
- Converting the CIGAR string into something more useful.
- Smartly dealing with the X? fields from various aligners. These
  often contain very useful information missing from the SAM
  specification. Where the data actually is will be aligner
  specific.
- More generally, making the optional fields easier to deal with.
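A rough sketch of the first and last items, in plain Python: turning a CIGAR string into (length, operation) pairs, and turning TAG:TYPE:VALUE optional fields into a dictionary. The function names are hypothetical and this is not pysam's API. The operator set includes '=' and 'X', which follows later versions of the SAM specification, and the example tags are made up.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Turn a CIGAR string such as '3M1I2M' into [(3, 'M'), (1, 'I'), (2, 'M')]."""
    if cigar == "*":  # SAM uses '*' when no CIGAR is available
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern silently skipped.
    if "".join("%d%s" % pair for pair in ops) != cigar:
        raise ValueError("Malformed CIGAR string: %r" % cigar)
    return ops


def reference_length(ops):
    """Reference bases spanned: M, D, N, = and X consume the reference."""
    return sum(n for n, op in ops if op in "MDN=X")


def parse_optional_fields(fields):
    """Map TAG:TYPE:VALUE strings to {tag: value}, converting 'i' and 'f'."""
    converters = {"i": int, "f": float}
    tags = {}
    for field in fields:
        tag, kind, value = field.split(":", 2)  # the value may contain ':'
        tags[tag] = converters.get(kind, str)(value)
    return tags


ops = parse_cigar("3M1I2M2D4M")
span = reference_length(ops)
tags = parse_optional_fields(["NM:i:3", "XY:Z:a:b"])  # XY is a made-up tag
```

Aligner-specific X? tags would still need per-aligner interpretation on top of this; the sketch only covers the generic parsing.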
> Parsing SAM is pretty simple and I can certainly help with gluing it into
> Biopython (with some help on the Biopython side, since I'm still a newb).
> I'm about half-way to having a BAM reader and writer for my own purposes.
> I'm coding the time-critical parts in Cython with a fallback to pure
> Python, so it may not be ideal for use in Biopython.
Cool. Does the BAM reader require samtools C code or is it
independent of that?
Brad
From aaronquinlan at gmail.com Thu Mar 4 08:33:40 2010
From: aaronquinlan at gmail.com (Aaron Quinlan)
Date: Thu, 4 Mar 2010 08:33:40 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20100304131352.GB19053@sobchak.mgh.harvard.edu>
References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
<2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
<20100304131352.GB19053@sobchak.mgh.harvard.edu>
Message-ID:
Just an FYI for those interested in developing tools to work with BAM: it may also be worth looking into the BamTools C++ API developed by Derek Barnett at Boston College (http://sourceforge.net/projects/bamtools/). The API is quite nice and has much of the necessary functionality for iterators, getters/setters, etc.
I added BAM support for my BEDTools package (http://code.google.com/p/bedtools/) using the BamTools libraries. Save for a few minor bugs along the way, it was rather straightforward to include.
Aaron
Aaron Quinlan, Ph.D.
NRSA Postdoctoral Fellow
Hall Laboratory
University of Virginia
Biochem. & Mol. Genetics
aaronquinlan at gmail.com
On Mar 4, 2010, at 8:13 AM, Brad Chapman wrote:
> Kevin and Peter;
>
>> I was aware of pysam but am concerned about the dependencies:
>> pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools
>> itself - which may all be fine on Linux, but will likely be trouble for
>> us on other platforms (especially Windows).
>
> I believe you can remove the pyrex requirement by shipping the
> generated C file with the distribution. Samtools itself may be an
> issue; however, right now it is probably a practical need for dealing
> with SAM/BAM since it implements a lot of BAM generation, sorting,
> merging and indexing you need in workflows. Also, the C code is
> included with the distribution so it is more a matter of getting it
> compiled than introducing extra dependencies. The bioconductor work
> appears to do the same thing.
>
>>> I agree that we should work towards supporting SAM (and perhaps
>>> also BAM) in Biopython, and other projects APIs can be very
>>> useful for inspiration or guidance.
>
> All of my work converts SAM directly into sorted and indexed BAM,
> and then build from that. For me, direct SAM parsing wouldn't be as
> useful as BAM.
>
>> Honestly, the SAM/BAM format specification is pretty dodgy. Thankfully
>> between samtools and Picard source code, I've been able to work out most of
>> the tricky bits. I'm glad to know that the R folks are also working on
>> this, since they're usually very good about generating clear documentation.
>
> Agreed, but at least we are converging on something instead of
> having to write a parser every time you use a new aligner. The
> bioconductor SVN is here:
>
> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/Rsamtools/
> (user: readonly, pass: readonly)
>
> I think the pysam API does a decent job for reading and exposing
> this. The higher level things that would be nice to add are:
>
> - Converting the CIGAR string into something more useful.
> - Smartly dealing with the X? fields from various aligners. These
> often contain very useful information missing from the SAM
> specification. Where the data actually is will be aligner
> specific.
> - More generally easing dealing with the optional fields.
>
>> Parsing SAM is pretty simple and I can certainly help with gluing it into
>> Biopython (with some help on the Biopython side, since I'm still a newb).
>> I'm about half-way to having a BAM reader and writer for my own purposes.
>> I'm coding the time-critical parts in Cython with a fallback to pure
>> Python, so it may not be ideal for use in Biopython.
>
> Cool. Does the BAM reader require samtools C code or is it
> independent of that?
>
> Brad
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
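As an illustration of the first wish-list item in the quoted message (turning the CIGAR string into something more useful), here is a minimal Python sketch. It is not pysam's or Biopython's API: the function names parse_cigar and reference_length are hypothetical, and the operation letters are assumed to be those listed in the current SAM specification.

```python
import re

# One CIGAR element: a run length followed by an operation letter.
# The letter set is assumed from the current SAM specification.
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    if cigar == "*":
        return []  # "*" means no alignment information is available
    pairs = _CIGAR_RE.findall(cigar)
    # Reject strings containing anything the pattern did not consume.
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError("Malformed CIGAR string: %r" % cigar)
    return [(int(n), op) for n, op in pairs]


def reference_length(cigar):
    """Count the reference bases spanned by the alignment."""
    # Only these operations consume positions on the reference.
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
```

A list of tuples keeps the original order of operations, so later helpers (for example, one mapping read positions to reference positions) can walk it without re-parsing the string.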
From bioinformed at gmail.com Thu Mar 4 08:44:39 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Thu, 4 Mar 2010 08:44:39 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20100304131352.GB19053@sobchak.mgh.harvard.edu>
References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
<2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
<20100304131352.GB19053@sobchak.mgh.harvard.edu>
Message-ID: <2e1434c11003040544j278ffb0fya984cd2668a6d278@mail.gmail.com>
On Thu, Mar 4, 2010 at 8:13 AM, Brad Chapman wrote:
> All of my work converts SAM directly into sorted and indexed BAM,
> and then builds from that. For me, direct SAM parsing wouldn't be as
> useful as BAM.
Same here-- I construct and unserialize alignment data into SAM-like
records, but it would be foolish to actually store them natively to disk.
>
> > Parsing SAM is pretty simple and I can certainly help with gluing it into
> > Biopython (with some help on the Biopython side, since I'm still a newb).
> > I'm about half-way to having a BAM reader and writer for my own purposes.
> > I'm coding the time-critical parts in Cython with a fallback to pure
> > Python, so it may not be ideal for use in Biopython.
>
> Cool. Does the BAM reader require samtools C code or is it
> independent of that?
>
It is intended to be independent of the samtools distribution, though some
of the C code is currently duplicated (e.g., bgzf). Of course, a
Cython/Python re-write would be simple enough, though obviously extra work.
-Kevin
From bioinformed at gmail.com Thu Mar 4 08:52:33 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Thu, 4 Mar 2010 08:52:33 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To:
References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
<2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
<20100304131352.GB19053@sobchak.mgh.harvard.edu>
Message-ID: <2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com>
On Thu, Mar 4, 2010 at 8:33 AM, Aaron Quinlan wrote:
> Just an FYI for those interested in developing tools to work with BAM: it
> may also be worth looking into the BamTools C++ API developed by Derek
> Barnett at Boston College (http://sourceforge.net/projects/bamtools/).
> The API is quite nice and has much of the necessary functionality for
> iterators, getters/setters, etc.
>
> I added BAM support for my BEDTools package (
> http://code.google.com/p/bedtools/) using the BAMTools libraries. Save
> for a few minor bugs along the way, it was rather straightforward to
> include.
>
Thanks for the tip, Aaron. I was unaware of both bamtools and bedtools.
The bamtools code looks well designed and quite similar to my emerging
Cython/Python rendition.
-Kevin
From bioinformed at gmail.com Thu Mar 4 09:07:03 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Thu, 4 Mar 2010 09:07:03 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com>
References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
<2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
<20100304131352.GB19053@sobchak.mgh.harvard.edu>
<2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com>
Message-ID: <2e1434c11003040607i68904329rc122e3acad9cdbe3@mail.gmail.com>
On Thu, Mar 4, 2010 at 8:52 AM, Kevin Jacobs <
bioinformed at gmail.com> wrote:
> On Thu, Mar 4, 2010 at 8:33 AM, Aaron Quinlan wrote:
>
>> Just an FYI for those interested in developing tools to work with BAM: it
>> may also be worth looking into the BamTools C++ API developed by Derek
>> Barnett at Boston College (http://sourceforge.net/projects/bamtools/).
>> The API is quite nice and has much of the necessary functionality for
>> iterators, getters/setters, etc.
>>
>> I added BAM support for my BEDTools package (
>> http://code.google.com/p/bedtools/) using the BAMTools libraries. Save
>> for a few minor bugs along the way, it was rather straightforward to
>> include.
>>
>
> Thanks for the tip, Aaron. I was unaware of both bamtools and bedtools.
> The bamtools code looks well designed and quite similar to my emerging
> Cython/Python rendition.
>
>
Ouch-- never mind. The bamtools code isn't endian-clean -- it will only
work correctly on native little-endian architectures.
-Kevin
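To illustrate what "endian-clean" means here: the BAM specification stores multi-byte integers in little-endian order, so a portable reader should state the byte order explicitly instead of relying on the host's native layout. The following is a minimal Python sketch of that idea, not code from bamtools or from Kevin's reader; the helper name read_int32_le is hypothetical.

```python
import struct


def read_int32_le(data, offset=0):
    """Decode a signed 32-bit little-endian integer from a byte string.

    The "<" prefix fixes the byte order, so the result is the same on
    little-endian and big-endian hosts.
    """
    return struct.unpack_from("<i", data, offset)[0]


# The value 1000 (0x000003E8) as it would be laid out on disk.
raw = struct.pack("<i", 1000)
value = read_int32_le(raw)
```

Using the native format code ("=i" or plain "i") would decode the same four bytes differently on a big-endian machine, which is the kind of problem described above.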
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 05:47:36 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 05:47:36 -0500
Subject: [Biopython-dev] [Bug 2551] Adding advanced __getitem__ to generic
alignment, e.g. align[1:2, 5:-5]
In-Reply-To:
Message-ID: <201003051047.o25Ala5W006656@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2551
biopython-bugzilla at maubp.freeserve.co.uk changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:47 EST -------
Git branch merged to trunk as discussed on the dev mailing list, marking this
enhancement as fixed.
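For readers unfamiliar with the two-dimensional indexing named in the bug title, e.g. align[1:2, 5:-5], here is a minimal sketch of how such a __getitem__ can dispatch on its argument. ToyAlignment is a hypothetical stand-in that stores plain strings; it is not the class merged to the trunk, whose behaviour may differ in detail.

```python
class ToyAlignment(object):
    """Hypothetical stand-in for an alignment: equal-length rows of text."""

    def __init__(self, rows):
        rows = list(rows)
        if len(set(len(r) for r in rows)) > 1:
            raise ValueError("All rows must have the same length")
        self.rows = rows

    def __getitem__(self, index):
        if isinstance(index, tuple):
            # align[rows, columns] arrives as a 2-tuple.
            if len(index) != 2:
                raise TypeError("Expected align[rows, columns]")
            row_index, col_index = index
            if isinstance(row_index, int):
                # Single row: a character or a substring of that row.
                return self.rows[row_index][col_index]
            if isinstance(col_index, int):
                # Single column across the selected rows, as a string.
                return "".join(r[col_index] for r in self.rows[row_index])
            # Slices in both dimensions: a smaller alignment.
            return ToyAlignment(r[col_index] for r in self.rows[row_index])
        if isinstance(index, int):
            return self.rows[index]  # one whole row
        return ToyAlignment(self.rows[index])  # a slice of rows
```

With three rows of ten columns, align[1:2, 2:-2] would give a one-row alignment trimmed by two columns at each end, while align[:, 3] would give the fourth column as a string.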
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 05:48:18 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 05:48:18 -0500
Subject: [Biopython-dev] [Bug 2552] Adding alignments
In-Reply-To:
Message-ID: <201003051048.o25AmIoF006689@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2552
biopython-bugzilla at maubp.freeserve.co.uk changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:48 EST -------
Git branch merged to trunk as discussed on the dev mailing list, marking this
enhancement as fixed.
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 05:48:34 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 05:48:34 -0500
Subject: [Biopython-dev] [Bug 2553] Adding SeqRecord objects to an alignment
(append or extend)
In-Reply-To:
Message-ID: <201003051048.o25AmYYH006723@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2553
biopython-bugzilla at maubp.freeserve.co.uk changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:48 EST -------
Git branch merged to trunk as discussed on the dev mailing list, marking this
enhancement as fixed.
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 05:48:36 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 05:48:36 -0500
Subject: [Biopython-dev] [Bug 2554] Creating an Alignment from a list of
SeqRecord objects
In-Reply-To:
Message-ID: <201003051048.o25AmaIn006735@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2554
Bug 2554 depends on bug 2553, which changed state.
Bug 2553 Summary: Adding SeqRecord objects to an alignment (append or extend)
http://bugzilla.open-bio.org/show_bug.cgi?id=2553
           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 05:48:50 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 05:48:50 -0500
Subject: [Biopython-dev] [Bug 2554] Creating an Alignment from a list of
SeqRecord objects
In-Reply-To:
Message-ID: <201003051048.o25AmoWN006761@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2554
biopython-bugzilla at maubp.freeserve.co.uk changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:48 EST -------
Git branch merged to trunk as discussed on the dev mailing list, marking this
enhancement as fixed.
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 05:50:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 05:50:45 -0500
Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM
In-Reply-To:
Message-ID: <201003051050.o25Aojkg006835@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2905
biopython-bugzilla at maubp.freeserve.co.uk changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Short read alignment format |Short read alignment format
                   |                            |SAM / BAM
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:50 EST -------
Updating summary to include SAM and BAM keywords. See also recent mailing list
discussions such as this thread:
http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007397.html
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 06:40:05 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 06:40:05 -0500
Subject: [Biopython-dev] [Bug 3010] Bio.KDTree is leaking memory
In-Reply-To:
Message-ID: <201003051140.o25Be532008197@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3010
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 06:40 EST -------
I suspect any memory leak is within KDTree.c function KDTree_set_data. Looking
at this I wondered how the memory allocated by KDTree_add_point gets freed.
The following *might* help, but even if I am right, this is at best only a
partial fix:
diff --git a/Bio/KDTree/KDTree.c b/Bio/KDTree/KDTree.c
index d074f26..07cdc1f 100644
--- a/Bio/KDTree/KDTree.c
+++ b/Bio/KDTree/KDTree.c
@@ -621,9 +621,14 @@ int KDTree_set_data(struct KDTree* tree, float *coords, long
         tree->_radius_list = NULL;
     }
     tree->_count=0;
+    if (tree->_data_point_list) {
+        free(tree->_data_point_list);
+        tree->_data_point_list = NULL;
+        tree->_data_point_list_size = 0;
+    }
     /* keep pointer to coords to delete it */
     tree->_coords=coords;
From p.j.a.cock at googlemail.com Wed Mar 10 09:30:57 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 10 Mar 2010 14:30:57 +0000
Subject: [Biopython-dev] Biopython & Google Summer of Code 2010 (GSoc)
Message-ID: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com>
Dear Biopythoneers,
The Open Bioinformatics Foundation (the Bio* umbrella
organisation) is preparing an application for the 2010
Google Summer of Code (GSoC).
http://code.google.com/soc/
If you are interested in becoming a mentor for a Biopython
related project, you can join us in the application. If you are
a student and are interested in a project (or would like to
propose one), please take a look at these pages:
http://www.open-bio.org/wiki/Google_Summer_of_Code
http://biopython.org/wiki/Google_Summer_of_Code
Regards,
Brad & Peter
From biopython at maubp.freeserve.co.uk Thu Mar 11 06:21:50 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Mar 2010 11:21:50 +0000
Subject: [Biopython-dev] Bio.Phylo.Applications?
Message-ID: <320fb6e01003110321u6ac77a89uce77306d332e675c@mail.gmail.com>
Hi Eric et al,
We have started a collection of command line tool wrappers for
multiple sequence alignments under Bio.Align.Applications, so I was
thinking about where to put wrappers for phylogenetic tree command
line tools. How does Bio.Phylo.Applications sound (following the same
structure as the Bio.Align.Applications module). The kind of things I
am thinking about include:
QuickTree (neighbour joining, NJ)
http://www.sanger.ac.uk/resources/software/quicktree/
QuickJoin (NJ)
http://www.daimi.au.dk/~mailund/quick-join.html
RaxML (maximum likelihood, ML),
http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm
[We should talk to Biopython contributor Frank Kauff as he uses this
with Python]
And so on. Plus pointers in the documentation to the EMBOSS module for
PHYLIP tools.
Peter
From biopython at maubp.freeserve.co.uk Thu Mar 11 06:30:04 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Mar 2010 11:30:04 +0000
Subject: [Biopython-dev] Adding format method to phylo tree object?
Message-ID: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com>
Hi Eric (et al),
Are you familiar with the format method of the SeqRecord and alignment
object (plus the __format__ method which does the same thing aiming to
work nicely with the Python 2.6 built in function format)? This allows
the user to turn their data into a string in a specified output
format. Internally the method calls Bio.SeqIO.write (or AlignIO) with
a StringIO handle.
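The pattern described above can be sketched in a few lines. This is a self-contained illustration only: ToyRecord and write_fasta are hypothetical stand-ins for SeqRecord and Bio.SeqIO.write, and the sketch uses io.StringIO from Python 3 instead of the Python 2 StringIO module.

```python
from io import StringIO


def write_fasta(records, handle):
    """Hypothetical writer standing in for Bio.SeqIO.write."""
    count = 0
    for record in records:
        handle.write(">%s\n%s\n" % (record.id, record.seq))
        count += 1
    return count  # number of records written


# Maps a format name to the function that writes it.
_WRITERS = {"fasta": write_fasta}


class ToyRecord(object):
    """Hypothetical minimal record with an identifier and a sequence."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq

    def format(self, format_spec):
        """Return the record as a string in the requested file format."""
        try:
            writer = _WRITERS[format_spec]
        except KeyError:
            raise ValueError("Unknown format %r" % format_spec)
        handle = StringIO()  # in-memory handle in place of a real file
        writer([self], handle)
        return handle.getvalue()

    def __format__(self, format_spec):
        # Lets the built-in format(record, "fasta") work; an empty
        # spec falls back to str(), as the built-in expects.
        if not format_spec:
            return str(self)
        return self.format(format_spec)
```

Because the method only delegates to an existing writer, a tree object could follow the same shape by swapping in a tree writer and a format name such as "newick".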
Do you think it would make sense to have this for the tree objects
in Bio.Phylo, allowing easy access to the object as a Newick tree
format etc?
For people using IPython, the __pretty__ method looks related. I know
the Bio.Nexus tree has a "pretty print" method which might be exposed
like this. I wonder if this convention will become more widespread?
http://ipython.scipy.org/doc/stable/html/api/generated/IPython.external.pretty.html
Peter
From p.j.a.cock at googlemail.com Thu Mar 11 10:34:07 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Mar 2010 15:34:07 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
Message-ID: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Hi all,
It is probably time to start getting ready for Biopython 1.54,
perhaps aiming to release within about a month's time?
This means not landing any major additions to the trunk for now (keep
things like GFF and Geography on branches for now).
Other than finishing up any documentation for new stuff (especially
the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon,
are there any important issues we should address before the release?
Regards,
Peter
From tiagoantao at gmail.com Thu Mar 11 10:42:21 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 11 Mar 2010 15:42:21 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Message-ID: <6d941f121003110742p524a2d86wbe111ccf880d8bb@mail.gmail.com>
On Thu, Mar 11, 2010 at 3:34 PM, Peter Cock wrote:
> Other than finishing up any documentation for new stuff (especially
> the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon,
> are there any important issues we should address before the release?
I think I will be able to commit my code around the 20th. Currently I
need to address the issue of supporting thousands of markers in the
genepop parser as people do complain about that (like a couple of
times a month or so, not more).
--
"Heavier than air flying machines are impossible"
Lord Kelvin, President, Royal Society, c. 1895
From andrea at biocomp.unibo.it Thu Mar 11 12:11:00 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 11 Mar 2010 18:11:00 +0100 (CET)
Subject: [Biopython-dev] Planning for Biopython 1.54
Message-ID: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>
What about the Uniprot XML format parser?
The code is functional, and was reviewed, but it would be nice to have some
beta testing.
The only remaining "issue" is where to save the comment fields.
The actual implementation will work for biosql schema, and store most
of the data in the comment fields.
Andrea
From p.j.a.cock at googlemail.com Thu Mar 11 12:31:08 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Mar 2010 17:31:08 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>
References: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com>
On Thu, Mar 11, 2010 at 5:11 PM, Andrea Pierleoni
wrote:
> What about the Uniprot XML format parser?
> The code is functional, and was reviewed, but it would be nice to have some
> beta testing.
> The only remaining "issue" is where to save the comment fields.
> The actual implementation will work for biosql schema, and store most
> of the data in the comment fields.
>
> Andrea
Hi Andrea,
Your UniProt XML parser was one of the things I thought we should
delay until after getting Biopython 1.54 out the door, but I would
expect it to be included in Biopython 1.55.
There are at least two remaining issues, (1) where to save the comment
fields, and (2) what to call the format in SeqIO. Both of these should
ideally be run by BioPerl and EMBOSS on the openbio-l mailing list to
ensure the OBF projects which use simple strings for file formats are
consistent. Would you like me to start a discussion there regarding
the format name? e.g. Should it be "uniprot", "uniprot-xml", or maybe
even "uniprotxml". Personally, "uniprot" seems fine provided this is
going to be the primary file format for UniProt records in the short
to medium term.
Also I don't think any of the current Biopython developers have sat
down to review the code. As the Bio.SeqIO maintainer, I will do this,
but right now I think getting Biopython 1.54 out should be
prioritised. From a very quick look just now, the recent merging of
the SFF support to the trunk will require a few tweaks in
test_SeqIO.py (e.g. an empty file is not valid for SFF files as well
as the UniProt XML). Also including a UniProt XML file in
test_BioSQL_SeqIO.py would be worthwhile.
Regards,
Peter
From andrea at biocomp.unibo.it Thu Mar 11 12:43:13 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 11 Mar 2010 18:43:13 +0100 (CET)
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com>
References: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>
<320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com>
Message-ID: <4ee0d56a0ed98ff87b2dcf00b2c0d6e8.squirrel@lipid.biocomp.unibo.it>
>
> Hi Andrea,
>
> Your UniProt XML parser was one of the things I thought we should
> delay until after getting Biopython 1.54 out the door, but I would
> expect it to be included in Biopython 1.55.
>
> There are at least two remaining issues, (1) where to save the comment
> fields, and (2) what to call the format in SeqIO. Both of these should
> ideally be run by BioPerl and EMBOSS on the openbio-l mailing list to
> ensure the OBF projects which use simple strings for file formats are
> consistent. Would you like me to start a discussion there regarding
> the format name? e.g. Should it be "uniprot", "uniprot-xml", or maybe
> even "uniprotxml". Personally, "uniprot" seems fine provided this is
> going to be the primary file format for UniProt records in the short
> to medium term.
>
Of course you are free to open a discussion. I used 'uniprot' for the sake of
simplicity, but then I noticed that the format is called 'uniprotxml' in the
EBI REST web services. A common name will be easier for everybody.
> Also I don't think any of the current Biopython developers have sat
> down to review the code.
The code was reviewed by Mauro Amico, I don't know if he is one of the
"current Biopython developers", anyhow any additional review is welcome.
> As the Bio.SeqIO maintainer, I will do this,
> but right now I think getting Biopython 1.54 out should be
> prioritised. From a very quick look just now, the recent merging of
> the SFF support to the trunk will require a few tweaks in
> test_SeqIO.py (e.g. an empty file is not valid for SFF files as well
> as the UniProt XML). Also including a UniProt XML file in
> test_BioSQL_SeqIO.py would be worthwhile.
>
Mauro also added some unit testing that should be useful for this.
Let me know if you need any help/info.
Bests,
Andrea
From p.j.a.cock at googlemail.com Thu Mar 11 12:49:50 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Mar 2010 17:49:50 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <4ee0d56a0ed98ff87b2dcf00b2c0d6e8.squirrel@lipid.biocomp.unibo.it>
References: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>
<320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com>
<4ee0d56a0ed98ff87b2dcf00b2c0d6e8.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01003110949v206a1868g6360002198a41ddd@mail.gmail.com>
On Thu, Mar 11, 2010 at 5:43 PM, Andrea Pierleoni
wrote:
>
>>
>> Hi Andrea,
>>
>> Your UniProt XML parser was one of the things I thought we should
>> delay until after getting Biopython 1.54 out the door, but I would
>> expect it to be included in Biopython 1.55.
>>
>> There are at least two remaining issues, (1) where to save the comment
>> fields, and (2) what to call the format in SeqIO. Both of these should
>> ideally be run by BioPerl and EMBOSS on the openbio-l mailing list to
>> ensure the OBF projects which use simple strings for file formats are
>> consistent. Would you like me to start a discussion there regarding
>> the format name? e.g. Should it be "uniprot", "uniprot-xml", or maybe
>> even "uniprotxml". Personally, "uniprot" seems fine provided this is
>> going to be the primary file format for UniProt records in the short
>> to medium term.
>
> Of course you are free to open a discussion. I used 'uniprot' for sake of
> simplicity, but then I noticed that the format is called 'uniprotxml' in
> EBI REST web services. A common name will be easier for everybody.
In that case, given the EBI REST convention, uniprotxml may be wise.
>> Also I don't think any of the current Biopython developers have sat
>> down to review the code.
>
> The code was reviewed by Mauro Amico, I don't know if he is one of the
> "current Biopython developers", anyhow any additional review is welcome.
I don't recall Mauro Amico contributing to Biopython in the past, but
as you say, the more eyes on the code the better :)
Peter
From eric.talevich at gmail.com Thu Mar 11 17:54:38 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 11 Mar 2010 17:54:38 -0500
Subject: [Biopython-dev] Adding format method to phylo tree object?
In-Reply-To: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com>
References: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com>
Message-ID: <3f6baf361003111454l6d1f0409pcb732e006a8b8f67@mail.gmail.com>
On Thu, Mar 11, 2010 at 6:30 AM, Peter wrote:
> Hi Eric (et al),
>
> Are you familiar with the format method of the SeqRecord and alignment
> object (plus the __format__ method which does the same thing aiming to
> work nicely with the Python 2.6 built in function format)? This allows
> the user to turn their data into a string in a specified output
> format. Internally the method calls Bio.SeqIO.write (or AlignIO) with
> a StringIO handle.
>
> Do you think it would make sense to have this for the tree objects
> in Bio.Phylo, allowing easy access to the object as a Newick tree
> format etc?
>
Sure, I could do that. It makes a lot of sense for Newick trees, and could
be useful with the XML formats for debugging.
>
> For people using IPython, the __pretty__ method looks related. I know
> the Bio.Nexus tree has a "pretty print" method which might be exposed
> like this. I wonder if this convention will become more widespread?
>
> http://ipython.scipy.org/doc/stable/html/api/generated/IPython.external.pretty.html
>
I didn't know about that. I also have a pretty_print method in Bio.Phylo
which does something quite different from the Bio.Nexus printer -- the Nexus
one looks more useful for debugging the Tree object's
internal structure in terms of references, so (highly biased judgment) I'm
inclined to use the code from Bio.Phylo._utils.pretty_print to implement
__pretty__ for IPython. But I'll play with this IPython feature to see how
it's supposed to behave in general.
-Eric
From eric.talevich at gmail.com Thu Mar 11 18:03:59 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 11 Mar 2010 18:03:59 -0500
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Message-ID: <3f6baf361003111503m4656258av721852264516818f@mail.gmail.com>
On Thu, Mar 11, 2010 at 10:34 AM, Peter Cock wrote:
> Hi all,
>
> It is probably time to start getting ready for Biopython 1.54,
> perhaps aiming to release within about a month's time?
>
> This means not landing any major additions to the trunk for now (keep
> things like GFF and Geography on branches for now).
>
> Other than finishing up any documentation for new stuff (especially
> the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon,
> are there any important issues we should address before the release?
>
Is it all right to leave the documentation for Bio.Phylo on the wiki for
now, or should I try to add something to the main tutorial?
-Eric
From p.j.a.cock at googlemail.com Thu Mar 11 18:18:18 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Mar 2010 23:18:18 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <3f6baf361003111503m4656258av721852264516818f@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
<3f6baf361003111503m4656258av721852264516818f@mail.gmail.com>
Message-ID: <320fb6e01003111518o3f50b95bw6b2446611fbb9bf5@mail.gmail.com>
On Thu, Mar 11, 2010 at 11:03 PM, Eric Talevich wrote:
>> Other than finishing up any documentation for new stuff (especially
>> the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon,
>> are there any important issues we should address before the release?
>
> Is it all right to leave the documentation for Bio.Phylo on the wiki
> for now, or should I try to add something to the main tutorial?
I would like at least a short section in the tutorial mentioning
the new module with a link to the wiki. That way people just
browsing the tutorial to get an idea of what Biopython covers
will be made aware of it. In the long term I think the module
deserves a chapter (which can be based on the wiki text).
Are you familiar with LaTeX? (The mark up language the
tutorial is written in).
Also, I think it would be great to have a post on the news
server (which we can link to in the release announcement)
talking about what Bio.Phylo adds (and thank GSoC and
NESCent etc). A little advertising ;) How does that sound?
Regards,
Peter
From biopython at maubp.freeserve.co.uk Thu Mar 11 18:23:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Mar 2010 23:23:54 +0000
Subject: [Biopython-dev] Adding format method to phylo tree object?
In-Reply-To: <3f6baf361003111454l6d1f0409pcb732e006a8b8f67@mail.gmail.com>
References: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com>
<3f6baf361003111454l6d1f0409pcb732e006a8b8f67@mail.gmail.com>
Message-ID: <320fb6e01003111523r4fe5f4c7va9f77e089385ba0c@mail.gmail.com>
On Thu, Mar 11, 2010 at 10:54 PM, Eric Talevich wrote:
> On Thu, Mar 11, 2010 at 6:30 AM, Peter wrote:
>
>> Hi Eric (et al),
>>
>> Are you familiar with the format method of the SeqRecord and alignment
>> object (plus the __format__ method which does the same thing aiming to
>> work nicely with the Python 2.6 built in function format)? This allows
>> the user to turn their data into a string in a specified output
>> format. Internally the method calls Bio.SeqIO.write (or AlignIO) with
>> a StringIO handle.
>>
>> Do you think it would make sense to have this for the tree objects
>> in Bio.Phylo, allowing easy access to the object as a Newick tree
>> format etc?
>>
>
> Sure, I could do that. It makes a lot of sense for Newick trees, and could
> be useful with the XML formats for debugging.
>
Great.
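A minimal sketch of the pattern described above: a format method that hands the object to a writer function together with a StringIO handle and returns the resulting string. The names Tree, write_newick and WRITERS are hypothetical stand-ins for illustration, not the Bio.Phylo API.

```python
from io import StringIO


def write_newick(trees, handle):
    """Hypothetical writer: one Newick string per line, returns the count."""
    count = 0
    for tree in trees:
        handle.write(tree.newick + "\n")
        count += 1
    return count


# Hypothetical registry mapping format names to writer functions.
WRITERS = {"newick": write_newick}


class Tree:
    """Toy tree object that only stores its Newick representation."""

    def __init__(self, newick):
        self.newick = newick

    def format(self, format_spec):
        """Return the tree as a string in the requested file format."""
        try:
            writer = WRITERS[format_spec]
        except KeyError:
            raise ValueError("Unknown format %r" % format_spec)
        handle = StringIO()
        writer([self], handle)
        return handle.getvalue()

    def __format__(self, format_spec):
        """Hook used by the built-in format(); an empty spec means str()."""
        if not format_spec:
            return str(self)
        return self.format(format_spec)
```

With this in place both tree.format("newick") and the built-in format(tree, "newick") would return the same string.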
>> For people using IPython, the __pretty__ method looks related. I know
>> the Bio.Nexus tree has a "pretty print" method which might be exposed
>> like this. I wonder if this convention will become more widespread?
>>
>> http://ipython.scipy.org/doc/stable/html/api/generated/IPython.external.pretty.html
>>
>
> I didn't know about that.
I only read about it recently myself - it may not be worth doing.
(I'm not trying to invent work here *grin*, just looking for things
we can polish before your code gets its first proper release.)
Thanks,
Peter
From lpritc at scri.ac.uk Fri Mar 12 03:18:09 2010
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 12 Mar 2010 08:18:09 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Message-ID:
On 11/03/2010 Thursday, March 11, 15:34, "Peter Cock"
wrote:
> Other than finishing up any documentation for new stuff (especially
> the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon,
> are there any important issues we should address before the release?
There are those updates to ePrimer3/PrimerSearch EMBOSS interaction (that
you'll need for that differential primer script, BTW...)
Cheers,
L.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From biopython at maubp.freeserve.co.uk Fri Mar 12 08:22:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Mar 2010 13:22:55 +0000
Subject: [Biopython-dev] Daily builds of the Tutorial (PDF and HTML)
Message-ID: <320fb6e01003120522q22377f52nc0769ceb4e3add13@mail.gmail.com>
Hi all,
Back in November I set up a simple pair of cron jobs to update the code
snapshot on http://biopython.open-bio.org/SRC/biopython/ every hour:
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007002.html
I've just added another job which takes the latest Tutorial.tex file and
compiles it with pdflatex (already installed) and hevea (installed from
source under my user account) to make the PDF and HTML files.
These are then copied to the webserver and published as:
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf
These are currently updated once a day (at 2:40am which shouldn't
be too busy whichever USA timezone the server uses). Assuming
I got my crontab settings right - in the short term I'll keep an eye on
it to check ;)
In comparison the "official" versions at the following URLs are
generally updated only for releases:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
I know that not everyone has latex or hevea installed (installing
hevea from source is a bit of a hassle even on Linux), and furthermore
proofreading the raw markup in Tutorial.tex isn't that easy.
So, the point of all this effort is now anyone can help proofread
the latest version of the tutorial - this should also be of use to
those users/contributors actually running the latest code from
git rather than the official releases.
Regards,
Peter
From biopython at maubp.freeserve.co.uk Fri Mar 12 08:32:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Mar 2010 13:32:32 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com>
References:
<200911250945.20870.jblanca@btc.upv.es>
<320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
<200911251220.53881.jblanca@btc.upv.es>
<320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
<3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
<320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>
<3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com>
<320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com>
<320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com>
Message-ID: <320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com>
Hi all,
I'd like to proceed as outlined below for Biopython 1.54,
i.e. don't change the current Seq equality but add a warning
that we plan to change it.
Should we have a discussion on the main list first?
Peter
On Mon, Feb 22, 2010 at 2:48 PM, Peter wrote:
> Hi all,
>
> I've just got back from Japan - Brad and I were fortunate to be
> able to attend the DBCLS BioHackathon 2010 held in Tokyo,
> http://hackathon3.dbcls.jp/
>
> As Brad already mentioned in passing, we also managed to have
> dinner one evening with Michiel, and had an informal chat about
> Biopython plans. Expect a few more emails on other topics to
> follow.
>
> One of the short term aims we agreed on was to press ahead
> with the Seq equality changes outlined on this thread late last
> year. Mailing list archive link:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007021.html
>
> To recap, the agreed best behaviour was to make Seq equality
> act like string equality, but to raise a Python warning when
> incompatible alphabets are compared (e.g. DNA to Protein).
> This also applies to all the other comparison operators:
> not equal, less than, greater than, less than or equal, and
> greater than or equal.
>
> This is my outline plan for the change:
>
> For Biopython up to 1.53, Seq class uses object equality,
> seq1==seq2 acts as id(seq1)==id(seq2)
>
> For Biopython 1.54 (and perhaps a few more releases),
> the Seq classes will still use object equality but will trigger
> a warning suggesting explicit use of id(seq1)==id(seq2)
> or str(seq1)==str(seq2) as appropriate.
>
> For Biopython 1.xx (maybe 1.55 or 1.56?) the Seq classes
> will switch to using string equality (with an alphabet aware
> warning for comparing DNA to RNA etc), but will also trigger
> a warning that this is a change from previous releases, and
> suggest in the short term the continued explicit use of either
> id(seq1)==id(seq2) for object identity or str(seq1)==str(seq2)
> for string identity.
>
> For Biopython 1.yy (maybe 1.57?) the Seq classes will
> use string equality (with an alphabet aware warning for
> comparing DNA to RNA etc), without any warning about
> this being a change from historic behaviour.
>
> These warning messages could also point at a wiki page,
> and we'd need a FAQ entry in the tutorial as well. The
> aim of this slightly drawn out switch is to try and make
> sure all users are aware of the change, even if they
> only update their copy of Biopython every few releases.
>
> Does that all sound sensible? If so, we should probably
> have an announcement on the main mailing list, in case
> there are any other views.
>
> Other more complex options include a flag for switching
> between the modes - but that complexity doesn't seem
> such a good idea to me. All my own code and most of
> the unit tests use str(seq1)==str(seq2) explicitly anyway.
> The only exception is some of the genetic algorithm unit
> tests which do seem to want explicit object identity.
>
> Regards,
>
> Peter
>
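A rough sketch of the final behaviour in the plan quoted above: string equality plus a warning when the alphabets differ. This Seq class is a toy stand-in, and representing the alphabet as a plain string label is an assumption made purely for illustration.

```python
import warnings


class Seq:
    """Toy stand-in for Bio.Seq.Seq; the alphabet is only a string label."""

    def __init__(self, data, alphabet=None):
        self._data = data
        self.alphabet = alphabet  # e.g. "DNA", "RNA", "protein" or None

    def __str__(self):
        return self._data

    def _warn_if_incompatible(self, other):
        # Only warn when both sides declare an alphabet and they differ.
        other_alphabet = getattr(other, "alphabet", None)
        if None not in (self.alphabet, other_alphabet) \
                and self.alphabet != other_alphabet:
            warnings.warn(
                "Comparing %s to %s sequences"
                % (self.alphabet, other_alphabet),
                stacklevel=3,
            )

    def __eq__(self, other):
        self._warn_if_incompatible(other)
        return str(self) == str(other)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __lt__(self, other):
        self._warn_if_incompatible(other)
        return str(self) < str(other)

    def __hash__(self):
        # Keep hashing consistent with string-based equality.
        return hash(str(self))
```

The remaining comparison operators would follow the same shape as __lt__. During the transition releases the same hook could carry the "this behaviour is changing" warning instead.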
From kellrott at gmail.com Fri Mar 12 13:00:45 2010
From: kellrott at gmail.com (Kyle)
Date: Fri, 12 Mar 2010 10:00:45 -0800
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Message-ID:
>
>
> It is probably time to start getting ready for Biopython 1.54,
> perhaps aiming to release within about a month's time?
>
> This means not landing any major additions to the trunk for now (keep
> things like GFF and Geography on branches for now).
>
I think the zxJDBC support (Jython MySQL for BioSQL) was almost done. I
don't think it counts as a major addition. I think to finish it off, we
just needed to finalize the driver names.
For post 1.54 stuff, I have some HMMER3, Pfam, and GO parsing code (Chris
Lasher has a GO fork as well). But I need some community feedback to fill in
the interface holes.
Kyle
From p.j.a.cock at googlemail.com Fri Mar 12 13:09:39 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 12 Mar 2010 18:09:39 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To:
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Message-ID: <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
On Fri, Mar 12, 2010 at 6:00 PM, Kyle wrote:
>>
>> It is probably time to start getting ready for Biopython 1.54,
>> perhaps aiming to release within about a month's time?
>>
>> This means not landing any major additions to the trunk for now (keep
>> things like GFF and Geography on branches for now).
>
> I think the zxJDBC support (Jython MySQL for BioSQL) was almost done. I
> don't think it counts as a major addition. I think to finish it off, we
> just needed to finalize the driver names.
Oh yeah - I confess I'd forgotten about that. Has there been any news
on the Jython front about SQLite support?
> For post 1.54 stuff, I have some HMMER3, Pfam, and GO parsing code (Chris
> Lasher has a GO fork as well). But I need some community feedback to fill in
> the interface holes.
> Kyle
Lots of exciting stuff to come then :)
Peter
From kellrott at gmail.com Fri Mar 12 13:28:45 2010
From: kellrott at gmail.com (Kyle)
Date: Fri, 12 Mar 2010 10:28:45 -0800
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
<320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
Message-ID:
>
>
> Oh yeah - I confess I'd forgotten about that. Has there been any news
> on the Jython front about SQLite support?
>
There is no official support, but you can always work through existing Java
packages (
http://old.nabble.com/SQLite-%2B-JDBC-%2B-Jython.-Example-td13322270.html ).
Kyle
From eric.talevich at gmail.com Fri Mar 12 14:14:51 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 12 Mar 2010 14:14:51 -0500
Subject: [Biopython-dev] Bio.Phylo.Applications?
In-Reply-To: <320fb6e01003110321u6ac77a89uce77306d332e675c@mail.gmail.com>
References: <320fb6e01003110321u6ac77a89uce77306d332e675c@mail.gmail.com>
Message-ID: <3f6baf361003121114v36b8a311i5b4dc9cee27961c2@mail.gmail.com>
On Thu, Mar 11, 2010 at 6:21 AM, Peter wrote:
> Hi Eric et al,
>
> We have started a collection of command line tool wrappers for
> multiple sequence alignments under Bio.Align.Applications, so I was
> thinking about where to put wrappers for phylogenetic tree command
> line tools. How does Bio.Phylo.Applications sound (following the same
> structure as the Bio.Align.Applications module).
>
Sounds great to me! I don't have any code that would go there yet, but feel
free to add the directory and any new code you have.
-Eric
From bugzilla-daemon at portal.open-bio.org Fri Mar 12 16:57:53 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 12 Mar 2010 16:57:53 -0500
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
unparsed multiline entry?
In-Reply-To:
Message-ID: <201003122157.o2CLvrtP008861@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3000
------- Comment #4 from mmokrejs at ribosome.natur.cuni.cz 2010-03-12 16:57 EST -------
Hi Peter,
I finally got back to this. Thank you for all your work. I would be glad if
one could use the accession without the trailing ".1", etc for get_raw() and
get(). I think just any version of the record should be returned, and maybe a
list if there were multiple versions of the same.
>>> print data.get_raw("BC035166")
Traceback (most recent call last):
  File "", line 1, in
  File "Bio/SeqIO/_index.py", line 280, in get_raw
    handle.seek(dict.__getitem__(self, key))
KeyError: 'BC035166'
>>>
Similarly, if I loop over the entries I have to do:
>>> mylist = ['ACC1', 'ACC2', 'ACC3']
>>> sequences = []
>>> for acc in data.keys():
...     if data.get(acc).id.split('.')[0] in mylist:
...         sequences.append(data.get(acc))
Oh no, this is not what I wanted, in full:
from Bio import SeqIO
data = SeqIO.index("full.gb", "gb")
mylist = ['AC11111.1', 'AC2222.2', 'AC3333.3']
sequences = []
for acc in mylist:
    if acc in map(lambda x: x.split('.')[0], data.keys()):
        print "Found %s" % acc
        if data.get(acc + '.1'):
            sequences.append(data.get(acc + '.1'))
        else:
            if data.get(acc + '.2'):
                sequences.append(data.get(acc + '.2'))
            else:
                sequences.append(data.get(acc + '.3'))
    else:
        print "Missing %s" % acc
output_handle = open("filtered.gb", "w")
SeqIO.write(sequences, output_handle, "genbank")
There was already a discussion on the user mailing list; I do not think forcing
uppercase letters for GenBank files is a good idea. Just stick with what was
supplied. Myself, I typically use mixed case to emphasize ORFs, but sometimes
use lower case for low-quality regions. Anyway, I provided an original NCBI-web
GenBank file of an EST and the DNA sequence was in lowercase; Biopython returned
uppercase. In turn, the diff(1) command returns too many changed lines,
unnecessarily. I suggest giving users an opportunity to specify on input parsing
whether to keep mixed-case/lower-case or force uppercase. Also, I have often
seen protein sequences in lower-case, which is ugly to my eyes, btw.
Finally, the remaining differences are here (probably the first is in bug
#2578):
--- /tmp/orig.gb 2010-03-12 21:09:24.000000000 +0100
+++ /tmp/new.gb 2010-03-12 21:09:38.000000000 +0100
@@ -1,4 +1,4 @@
-LOCUS CR603932 1625 bp mRNA linear HTC
16-OCT-2008
+LOCUS CR603932 1625 bp DNA HTC
16-OCT-2008
DEFINITION full-length cDNA clone CS0DK007YH24 of HeLa cells Cot
25-normalized
of Homo sapiens (human).
ACCESSION CR603932
@@ -29,39 +29,39 @@
division of Invitrogen.
FEATURES Location/Qualifiers
source 1..1625
- /organism="Homo sapiens"
/mol_type="mRNA"
- /db_xref="taxon:9606"
/clone="CS0DK007YH24"
+ /db_xref="taxon:9606"
/tissue_type="HeLa cells Cot 25-normalized"
/plasmid="pCMVSPORT_6"
+ /organism="Homo sapiens"
ORIGIN
Thanks for all your work on this, it is a great service. ;-) Next, I will try to
filter by .features['tissue_type'] but sadly will have to search for the very
same string through COMMENT string as well.
From bugzilla-daemon at portal.open-bio.org Fri Mar 12 17:05:39 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 12 Mar 2010 17:05:39 -0500
Subject: [Biopython-dev] [Bug 3026] New:
Bio.SeqIO.InsdcIO._split_multi_line(): Your description
cannot be broken into nice lines!
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=3026
Summary: Bio.SeqIO.InsdcIO._split_multi_line(): Your description
cannot be broken into nice lines!
Product: Biopython
Version: 1.53
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: mmokrejs at ribosome.natur.cuni.cz
Traceback (most recent call last):
File "/home/mmokrejs/bin/filter-accessions.py", line 22, in
SeqIO.write(sequences, output_handle, "genbank")
File "/usr/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", line 363, in
write
count = writer_class(handle).write_file(sequences)
File "/usr/lib/python2.6/site-packages/Bio/SeqIO/Interfaces.py", line 271, in
write_file
count = self.write_records(records)
File "/usr/lib/python2.6/site-packages/Bio/SeqIO/Interfaces.py", line 256, in
write_records
self.write_record(record)
File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 691, in
write_record
self._write_comment(record)
File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 579, in
_write_comment
self._write_multi_line("", line)
File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 335, in
_write_multi_line
lines = self._split_multi_line(text, max_len)
File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 279, in
_split_multi_line
"Your description cannot be broken into nice lines!"
AssertionError: Your description cannot be broken into nice lines!
Please fix the message so it prints out the accession/version number. ;-)
LOCUS BF378302 501 bp mRNA linear EST 27-NOV-2000
DEFINITION CM0-UM0001-060300-270-g07 UM0001 Homo sapiens cDNA, mRNA sequence.
ACCESSION BF378302
VERSION BF378302.1 GI:11367336
KEYWORDS EST.
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 501)
AUTHORS Dias Neto,E., Garcia Correa,R., Verjovski-Almeida,S., Briones,M.R.,
Nagai,M.A., da Silva,W. Jr., Zago,M.A., Bordin,S., Costa,F.F.,
Goldman,G.H., Carvalho,A.F., Matsukuma,A., Baia,G.S., Simpson,D.H.,
Brunstein,A., deOliveira,P.S., Bucher,P., Jongeneel,C.V., O'Hare
,M.J., Soares,F., Brentani,R.R., Reis,L.F., de Souza,S.J. and
Simpson,A.J.
TITLE Shotgun sequencing of the human transcriptome with ORF expressed
sequence tags
JOURNAL Proc. Natl. Acad. Sci. U.S.A. 97 (7), 3491-3496 (2000)
PUBMED 10737800
COMMENT Contact: Simpson A.J.G.
Laboratory of Cancer Genetics
Ludwig Institute for Cancer Research
Rua Prof. Antonio Prudente 109, 4 andar, 01509-010, Sao Paulo-SP,
Brazil
Tel: +55-11-2704922
Fax: +55-11-2707001
Email: asimpson at ludwig.org.br
This sequence was derived from the FAPESP/LICR Human Cancer Genome
Project. This entry can be seen in the following URL
(http://www.ludwig.org.br/scripts/gethtml2.pl?t1=CM0&t2=CM0-UM0001-060300-270-g07&t3=2000-03-06&t4=1
)
Seq primer: puc 18 forward.
FEATURES Location/Qualifiers
[cut]
I have a few more examples like this from some dbEST data, I think all from the
same project, though.
From biopython at maubp.freeserve.co.uk Sat Mar 13 08:43:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 13 Mar 2010 13:43:53 +0000
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
unparsed multiline entry?
In-Reply-To: <4B9AA432.2050407@fold.natur.cuni.cz>
References: <201002021840.o12Ie88i015906@portal.open-bio.org>
<4B6995D0.3030405@fold.natur.cuni.cz>
<320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com>
<4B9AA432.2050407@fold.natur.cuni.cz>
Message-ID: <320fb6e01003130543p10a43e32kdb073879dc406e11@mail.gmail.com>
On Fri, Mar 12, 2010 at 8:29 PM, Martin MOKREJŠ
wrote:
>
> I finally got back to this. Thank you for all your work.
> I would be glad if one could use the accession without
> the trailing ".1", etc for get_raw() and get(). I think
> just any version of the record should be returned,
> and maybe a list if there were multiple versions of
> the same.
This is just a quick reply to answer this part of your email.
It would be unwise to try and be clever with the key
matching - in this case yes, for GenBank files we know
what the names mean, accession.version - but this is
not true in general.
In this case the answer for your needs would be to use
the Bio.SeqIO.index optional argument to specify the
keys. e.g. something like this:
from Bio import SeqIO
def strip_version(identifier):
    return identifier.rsplit(".",1)[0]
my_dict = SeqIO.index(filename, "gb", key_function=strip_version)
That way all the keys will have just the accession
without the version (assuming there are no clashes
which I think will raise an error).
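To illustrate what the key function does to the keys, and why a clash matters, here is a small stand-alone sketch. build_index is a hypothetical toy that only maps keys to identifiers; it is not how Bio.SeqIO.index is implemented, and raising ValueError on a duplicate key is an assumption.

```python
def strip_version(identifier):
    """Turn 'accession.version' into 'accession'; other ids pass through."""
    return identifier.rsplit(".", 1)[0]


def build_index(identifiers, key_function=None):
    """Toy index mapping each (possibly transformed) key to the original id."""
    index = {}
    for identifier in identifiers:
        key = key_function(identifier) if key_function else identifier
        if key in index:
            # Two versions of one accession would collapse onto the same key.
            raise ValueError("Duplicate key %r" % key)
        index[key] = identifier
    return index


index = build_index(["BC035166.1", "CR603932.1"], key_function=strip_version)
```

Here index["BC035166"] gives back "BC035166.1", while a file holding both BC035166.1 and BC035166.2 would trigger the duplicate key error in this sketch.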
Peter
From sbassi at clubdelarazon.org Sun Mar 14 03:16:25 2010
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Sun, 14 Mar 2010 04:16:25 -0300
Subject: [Biopython-dev] Biopython & Google Summer of Code 2010 (GSoc)
In-Reply-To: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com>
References: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com>
Message-ID: <9e2f512b1003132316j55a95ca7u6a87191ff877898d@mail.gmail.com>
On Wed, Mar 10, 2010 at 11:30 AM, Peter Cock wrote:
> related project, you can join us in the application. If you are
> a student and are interested in a project (or would like to
> propose one), please take a look at these pages:
> http://www.open-bio.org/wiki/Google_Summer_of_Code
> http://biopython.org/wiki/Google_Summer_of_Code
Regarding GSoC call in Biopython, I found the PDB-Tidy task pretty
interesting. I will study the proposal and write back to you. I am
working currently with microRNA but I use Bio.PDB a lot to help my
wife who does antigen structure prediction and works with modeller,
PyMol and PDB files. A tool like the proposed PDB-Tidy could come
in handy.
Best,
SB.
From biopython at maubp.freeserve.co.uk Sun Mar 14 09:50:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 14 Mar 2010 13:50:52 +0000
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
unparsed multiline entry?
In-Reply-To: <4B9BB1F6.9000505@fold.natur.cuni.cz>
References: <201002021840.o12Ie88i015906@portal.open-bio.org>
<4B6995D0.3030405@fold.natur.cuni.cz>
<320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com>
<4B9AA432.2050407@fold.natur.cuni.cz>
<320fb6e01003130543p10a43e32kdb073879dc406e11@mail.gmail.com>
<4B9BB1F6.9000505@fold.natur.cuni.cz>
Message-ID: <320fb6e01003140650o54a8eea2h66ea87abc42c754@mail.gmail.com>
On Sat, Mar 13, 2010 at 3:40 PM, Martin MOKREJŠ wrote:
>
> Thanks Peter,
> yes, that is what I already ended-up with in a more awkward way. ;-)
> But basically I have the same workaround.
> Best,
> M.
So does using the Bio.SeqIO.index() function's key_function
argument seem like a good solution to your key problem?
Peter
From biopython at maubp.freeserve.co.uk Sun Mar 14 16:30:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 14 Mar 2010 20:30:45 +0000
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
unparsed multiline entry?
In-Reply-To: <4B9AA432.2050407@fold.natur.cuni.cz>
References: <201002021840.o12Ie88i015906@portal.open-bio.org>
<4B6995D0.3030405@fold.natur.cuni.cz>
<320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com>
<4B9AA432.2050407@fold.natur.cuni.cz>
Message-ID: <320fb6e01003141330t199bbbcfm6bf32c5357b9fd77@mail.gmail.com>
On Fri, Mar 12, 2010 at 8:29 PM, Martin MOKREJŠ wrote:
>
> Finally, the remaining differences are here (probably the first is in bug #2578):
>
> --- /tmp/orig.gb        2010-03-12 21:09:24.000000000 +0100
> +++ /tmp/new.gb 2010-03-12 21:09:38.000000000 +0100
> @@ -1,4 +1,4 @@
> -LOCUS       CR603932                1625 bp    mRNA    linear   HTC 16-OCT-2008
> +LOCUS       CR603932                1625 bp    DNA              HTC 16-OCT-2008
>  DEFINITION  full-length cDNA clone CS0DK007YH24 of HeLa cells Cot 25-normalized
>             of Homo sapiens (human).
>  ACCESSION   CR603932
> @@ -29,39 +29,39 @@
>             division of Invitrogen.
>  FEATURES             Location/Qualifiers
>      source          1..1625
> -                     /organism="Homo sapiens"
>                      /mol_type="mRNA"
> -                     /db_xref="taxon:9606"
>                      /clone="CS0DK007YH24"
> +                     /db_xref="taxon:9606"
>                      /tissue_type="HeLa cells Cot 25-normalized"
>                      /plasmid="pCMVSPORT_6"
> +                     /organism="Homo sapiens"
>  ORIGIN
>
Yes, the LOCUS line issue would be part of Bug 2578.
As to the order of the feature qualifiers, these are stored
in a Python dictionary which does not preserve the order.
I personally don't think the order of the qualifiers is
important and thus don't care that it can change like
this. Assuming the NCBI have a defined sort order for
the qualifiers (I'm not aware of one), then we could sort
the feature qualifiers on output. Another option would
be to store the qualifiers in an ordered-dictionary. Or
just leave it as it is ;)
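A small sketch of the "sort on output" option. The preferred order used here is simply the order seen in the original record quoted above; it is not a documented NCBI ordering. qualifier_lines is a hypothetical helper, and single string values are used to keep the example short.

```python
def qualifier_lines(qualifiers, preferred_order=()):
    """Return qualifier lines in a stable order.

    Keys listed in preferred_order come first, in that order; any
    remaining keys follow alphabetically.
    """
    rank = dict((key, i) for i, key in enumerate(preferred_order))
    keys = sorted(qualifiers, key=lambda k: (rank.get(k, len(rank)), k))
    return ['/%s="%s"' % (key, qualifiers[key]) for key in keys]


# Order copied from the one example record in this thread (an assumption).
EXAMPLE_ORDER = ("organism", "mol_type", "db_xref", "clone")

lines = qualifier_lines(
    {
        "clone": "CS0DK007YH24",
        "plasmid": "pCMVSPORT_6",
        "organism": "Homo sapiens",
        "mol_type": "mRNA",
        "db_xref": "taxon:9606",
    },
    preferred_order=EXAMPLE_ORDER,
)
```

The ordered-dictionary option would instead preserve whatever order the parser saw, which would keep a parse and write round trip closer to the input file.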
Peter
From bugzilla-daemon at portal.open-bio.org Sun Mar 14 19:31:51 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 14 Mar 2010 19:31:51 -0400
Subject: [Biopython-dev] [Bug 3026] Bio.SeqIO.InsdcIO._split_multi_line():
Your description cannot be broken into nice lines!
In-Reply-To:
Message-ID: <201003142331.o2ENVp3v015452@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3026
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-14 19:31 EST -------
I just used the Entrez web interface, and it comes with the URL split already
to meet the 80 column limit. Also doing it via the API:
>>> from Bio import Entrez
>>> data = Entrez.efetch("nucest", id="BF378302", rettype="gb").read()
>>> print data[1095:1800]
PUBMED 10737800
COMMENT Contact: Simpson A.J.G.
Laboratory of Cancer Genetics
Ludwig Institute for Cancer Research
Rua Prof. Antonio Prudente 109, 4 andar, 01509-010, Sao Paulo-SP,
Brazil
Tel: +55-11-2704922
Fax: +55-11-2707001
Email: asimpson at ludwig.org.br
This sequence was derived from the FAPESP/LICR Human Cancer Genome
Project. This entry can be seen in the following URL
(http://www.ludwig.org.br/scripts/gethtml2.pl?t1=CM0&t2=CM0-UM0001-
060300-270-g07&t3=2000-03-06&t4=1)
Seq primer: puc 18 forward.
FEATURES Location/Qualifiers
In this particular case, it looks like splitting the string on a hyphen would
be a reasonable option (i.e. copy what the NCBI seems to be doing).
Did you just cut and paste it from the NCBI's HTML page, where the URL does
seem to be shown unbroken (giving a line more than 80
characters)? Or can we download a "broken" GenBank file from the NCBI
somewhere (maybe the FTP site)?
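A minimal sketch of the hyphen idea from this comment: break an over-long word just after the last hyphen that still fits, falling back to a hard break when there is none. split_long_word is a hypothetical helper, not the InsdcIO code, and the width of 68 used below is only an assumed text width.

```python
def split_long_word(word, max_len):
    """Split word into pieces of at most max_len characters.

    Prefers to break just after a hyphen; if no hyphen falls within
    the first max_len characters the word is cut at max_len.
    """
    pieces = []
    while len(word) > max_len:
        # Index just past the last hyphen within the first max_len chars.
        cut = word.rfind("-", 0, max_len) + 1
        if cut <= 0:
            cut = max_len  # no hyphen available, hard break
        pieces.append(word[:cut])
        word = word[cut:]
    pieces.append(word)
    return pieces


url = ("(http://www.ludwig.org.br/scripts/gethtml2.pl?t1=CM0"
       "&t2=CM0-UM0001-060300-270-g07&t3=2000-03-06&t4=1)")
pieces = split_long_word(url, 68)
```

Joining the pieces gives back the original text unchanged, so nothing is lost when the record is written out.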
From bugzilla-daemon at portal.open-bio.org Sun Mar 14 20:44:59 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 14 Mar 2010 20:44:59 -0400
Subject: [Biopython-dev] [Bug 3026] Bio.SeqIO.InsdcIO._split_multi_line():
Your description cannot be broken into nice lines!
In-Reply-To:
Message-ID: <201003150044.o2F0ixwP017517@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3026
------- Comment #2 from mmokrejs at ribosome.natur.cuni.cz 2010-03-14 20:44 EST -------
Mostly I copied and pasted from their website, so this is probably the case.
From biopython at maubp.freeserve.co.uk Mon Mar 15 11:40:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 15 Mar 2010 15:40:20 +0000
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
Message-ID: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com>
Hi all (especially Eric),
As recently discussed SeqIO and AlignIO will now take filenames as
well as handles. This matches the existing behaviour of Bio.Nexus,
Eric's Bio.Phylo, and several big 3rd party libraries like ReportLab.
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007352.html
I've updated most of the tutorial to take advantage of this, and quickly
got used to less typing when working at the Python prompt. It does make
things easier, and I probably should have conceded this earlier.
It made me wonder about relaxing another constraint of the SeqIO
and AlignIO write functions - they currently insist on a list or iterator
of records or alignments. Giving a single object raises an error,
but we could handle this unambiguously. Amusingly Eric just
updated Bio.Phylo to match this strict behaviour - one reason I
sat down and wrote this email.
So, should we continue to insist on:
record = SeqRecord(...)
SeqIO.write([record], filename, format)
or should we relax a little more and allow this too?:
record = SeqRecord(...)
SeqIO.write(record, filename, format)
For SeqIO and AlignIO we can do a simple isinstance check
for a SeqRecord or alignment object - there isn't really a
problem with ambiguity here. Probably also try for Phylo?
What's the general consensus on the dev list?
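A minimal sketch of the isinstance check suggested above. SeqRecord and write here are toy stand-ins so the example runs on its own; the only point is wrapping a lone record in a list before the normal loop.

```python
from io import StringIO


class SeqRecord:
    """Toy stand-in for Bio.SeqRecord.SeqRecord."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq


def write(sequences, handle, format):
    """Write records as FASTA and return how many were written."""
    if isinstance(sequences, SeqRecord):
        # Relaxed behaviour: accept a single record as well as an iterable.
        sequences = [sequences]
    if format != "fasta":
        raise ValueError("Only 'fasta' is supported in this sketch")
    count = 0
    for record in sequences:
        handle.write(">%s\n%s\n" % (record.id, record.seq))
        count += 1
    return count


handle = StringIO()
single_count = write(SeqRecord("id1", "ACGT"), handle, "fasta")
list_count = write([SeqRecord("id2", "GGCC")], handle, "fasta")
```

Because a record is checked for explicitly before iterating, existing calls that pass a list or generator would behave exactly as before.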
Peter
From updates at feedmyinbox.com Tue Mar 16 02:16:42 2010
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Tue, 16 Mar 2010 02:16:42 -0400
Subject: [Biopython-dev] 3/16 BioStar - Biopython Questions
Message-ID: <0ef45bfc18dff2fe627af99c71f3b412@74.63.51.88>
==================================================
1. Compare two protein sequences using local BLAST
==================================================
March 15, 2010 at 7:24 PM
Hi,
I have been given a task to compare all the protein sequences of a strain of campylobacter with a strain of E.coli. I would like to do this locally using Biopython and the inbuilt Blast tools. However, I'm stuck on how to program this and what tools I should be using. If anybody could point me in the right direction, I would be thankful!
Cheers
http://biostar.stackexchange.com/questions/302/compare-two-protein-sequences-using-local-blast
--------------------------------------------------
===========================================================
Source: http://biostar.stackexchange.com/questions/tagged/biopython
From mhampton at d.umn.edu Tue Mar 16 12:01:41 2010
From: mhampton at d.umn.edu (Marshall Hampton)
Date: Tue, 16 Mar 2010 11:01:41 -0500 (CDT)
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
In-Reply-To:
References:
Message-ID:
I'm strongly in favor of such relaxations. It would also be convenient if
SeqRecords had a write function.
-Marshall Hampton
>So, should we continue to insist on:
>
>record = SeqRecord(...)
>SeqIO.write([record], filename, format)
>or should we relax a little more and allow this too?:
>record = SeqRecord(...)
>SeqIO.write(record, filename, format)
>For SeqIO and AlignIO we can do a simple isinstance check
>for a SeqRecord or alignment object - there isn't really a
>problem with ambiguity here. Probably also try for Phylo?
>What's the general consensus on the dev list?
From rodrigo_faccioli at uol.com.br Tue Mar 16 15:24:58 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Tue, 16 Mar 2010 16:24:58 -0300
Subject: [Biopython-dev] Primary Sequence of all protein (help)
Message-ID: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
Hi all,
I want to know the primary sequence (FASTA file) of all proteins. In other
words, I would like a database which contains the FASTA files of all
proteins.
I'm a computer scientist and I don't know how hard this is. However, we have
worked with the SEQRES section of PDB files and Biopython. So, we want to work
with FASTA files and Biopython to check our results.
I searched the NCBI web site, where I found a lot of databases. I confess I'm
lost among them :)
Sorry if my email is a basic question, but I'm very lost.
Thanks in advance,
--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
From biopython at maubp.freeserve.co.uk Tue Mar 16 15:42:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Mar 2010 19:42:43 +0000
Subject: [Biopython-dev] Primary Sequence of all protein (help)
In-Reply-To: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
Message-ID: <320fb6e01003161242w2f111653y6dceb9853412c649@mail.gmail.com>
On Tue, Mar 16, 2010 at 7:24 PM, Rodrigo Faccioli
wrote:
>
> Hi all,
>
> I want to know the primary sequence (FASTA file) of all proteins. In other
> words, I would like a database which contains the FASTA files of all
> proteins.
>
> I'm a computer scientist and I don't know how hard this is. However, we have
> worked with the SEQRES section of PDB files and Biopython. So, we want to work
> with FASTA files and Biopython to check our results.
A single FASTA file of all known proteins would be enormous. Even the
non-redundant ("nr") dataset used by the NCBI for their hugely popular
BLAST search is pretty big.
It sounds like maybe all you need is a FASTA file containing all the
sequences with structures in the PDB - something you may be
able to download directly from the PDB FTP site.
Peter
From rodrigo_faccioli at uol.com.br Tue Mar 16 21:01:01 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Tue, 16 Mar 2010 22:01:01 -0300
Subject: [Biopython-dev] Primary Sequence of all protein (help)
In-Reply-To: <320fb6e01003161242w2f111653y6dceb9853412c649@mail.gmail.com>
References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
<320fb6e01003161242w2f111653y6dceb9853412c649@mail.gmail.com>
Message-ID: <3715adb71003161801n294d15ccwb3a52f6d5ea83c23@mail.gmail.com>
Peter,
Thank you for your reply.
Actually, we want to store the sequences from the FASTA files in a relational
database which has been developed by my research group. We have
developed some calculations based on the primary sequence of proteins.
We did not download the PDB database because our computation of protein
properties is based on primary sequence alone. Therefore, our idea is to
work with the primary sequence of all proteins. My understanding is that the PDB
database contains only the proteins whose tertiary structure is known. The
others are in other databases.
Thanks in advance,
--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
On Tue, Mar 16, 2010 at 4:42 PM, Peter wrote:
> On Tue, Mar 16, 2010 at 7:24 PM, Rodrigo Faccioli
> wrote:
> >
> > Hi all,
> >
> > I want to know the primary sequence (FASTA file) of all proteins. In other
> > words, I would like a database which contains the FASTA files of all
> > proteins.
> >
> > I'm a computer scientist and I don't know how hard this is. However, we have
> > worked with the SEQRES section of PDB files and Biopython. So, we want to
> > work with FASTA files and Biopython to check our results.
>
> A single FASTA file of all known proteins would be enormous. Even the
> non-redundant ("nr") dataset used by the NCBI for their hugely popular
> BLAST search is pretty big.
>
> It sounds like maybe all you need is a FASTA file containing all the
> sequences with structures in the PDB - something you may be
> able to download directly from the PDB FTP site.
>
> Peter
>
From bugzilla-daemon at portal.open-bio.org Wed Mar 17 07:33:09 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 17 Mar 2010 07:33:09 -0400
Subject: [Biopython-dev] [Bug 2966] Primer3Commandline does not use EMBOSS
6.1.0 arguments
In-Reply-To:
Message-ID: <201003171133.o2HBX9kO004765@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2966
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-17 07:33 EST -------
(In reply to comment #2)
> I also found an issue with the PrimerSearchCommandline. The command line
> options -sequences and -primers do not appear to be used in EMBOSS6.1.0, having
> been replaced by -seqall and -infile, respectively. I changed the options
> accordingly, and the modified files are available at
> http://github.com/widdowquinn/biopython/tree/emboss-branch.
I've merged that fix on the master,
http://github.com/biopython/biopython/commit/39708be130eb771eacccf96eed3e8ce0a44ea4f0
Will have a look at eprimer3 next.
From bugzilla-daemon at portal.open-bio.org Wed Mar 17 08:13:46 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 17 Mar 2010 08:13:46 -0400
Subject: [Biopython-dev] [Bug 2966] Primer3Commandline does not use EMBOSS
6.1.0 arguments
In-Reply-To:
Message-ID: <201003171213.o2HCDkf4006396@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2966
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-17 08:13 EST -------
(In reply to comment #1)
> I have made changes to Primer3Commandline that involve adding the EMBOSS 6.1.0
> arguments, and docstrings to each argument. I have also added doctests.
>
> The proposed code can be inspected at my GitHub repository:
>
> http://github.com/widdowquinn/biopython/commit/9c0643e333b0cafb4e356426fb4902e0e9d2385c
>
Cherry picked to merge to the trunk.
Marking bug as fixed - thanks.
From sbassi at clubdelarazon.org Wed Mar 17 14:32:17 2010
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Wed, 17 Mar 2010 15:32:17 -0300
Subject: [Biopython-dev] Primary Sequence of all protein (help)
In-Reply-To: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
Message-ID: <9e2f512b1003171132k680a52e5ob052e84d89e68c0b@mail.gmail.com>
On Tue, Mar 16, 2010 at 4:24 PM, Rodrigo Faccioli
wrote:
> I want to know the primary sequence (FASTA file) of all proteins. In other
> words, I would like a database which contains the FASTA files of all
> proteins.
You don't need Biopython to get this file. Just download the NR database and
use "fastacmd", a program found in the BLAST suite.
The BLAST FTP site is not working for me right now so I can't give you the
exact URL to download, but look here:
ftp://ftp.ncbi.nih.gov/blast/
Here is how to use fastacmd to retrieve sequences from NR database:
http://pwet.fr/man/linux/commandes/fastacmd
From kellrott at gmail.com Wed Mar 17 18:14:25 2010
From: kellrott at gmail.com (Kyle)
Date: Wed, 17 Mar 2010 15:14:25 -0700
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
<320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
Message-ID:
>
>
> > I think the zxJDBC support (Jython MySQL for BioSQL) was almost done. I
> > don't think it counts as a major addition. I think to finish it off, we
> > just needed to finalize the driver names.
>
> Oh yeah - I confess I'd forgotten about that.
>
I've posted a fork from the master branch on github (
http://github.com/kellrott/biopython/tree/zxjdbc ) with only the changes
related to zxjdbc. I've added two driver requests, "MySQL" and "PostgreSQL",
that select the appropriate driver based on the platform.
Kyle
From tiagoantao at gmail.com Wed Mar 17 18:28:36 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 17 Mar 2010 22:28:36 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <6d941f121003110742p524a2d86wbe111ccf880d8bb@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
<6d941f121003110742p524a2d86wbe111ccf880d8bb@mail.gmail.com>
Message-ID: <6d941f121003171528p1e60fbb8q419485f6c6f171c2@mail.gmail.com>
Hi,
2010/3/11 Tiago Antão:
> I think I will be able to commit my code around the 20th. Currently I
> need to address the issue of supporting thousands of markers in the
> genepop parser as people do complain about that (like a couple of
> times a month or so, not more).
I am going to add this, and support for haploid markers as well. When it's
done (soon!) I would like to ask for a code review of the part that
supports thousands of markers (the parser will change in nature, and
files will be kept open during the whole lifetime of the parser
object). No need for domain knowledge, just comments on code quality.
Also some help with merging into the main trunk would be appreciated,
as I don't use github for my stuff (bazaar fan here ;) ).
Thanks,
Tiago
--
"Heavier than air flying machines are impossible"
Lord Kelvin, President, Royal Society, c. 1895
From rodrigo_faccioli at uol.com.br Wed Mar 17 20:59:49 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Wed, 17 Mar 2010 21:59:49 -0300
Subject: [Biopython-dev] Primary Sequence of all protein (help)
In-Reply-To: <9e2f512b1003171132k680a52e5ob052e84d89e68c0b@mail.gmail.com>
References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
<9e2f512b1003171132k680a52e5ob052e84d89e68c0b@mail.gmail.com>
Message-ID: <3715adb71003171759p7107f2cbod85339a5335374d5@mail.gmail.com>
Sebastian,
Thank you for your reply. I'll study it.
--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
On Wed, Mar 17, 2010 at 3:32 PM, Sebastian Bassi
wrote:
> On Tue, Mar 16, 2010 at 4:24 PM, Rodrigo Faccioli
> wrote:
> > I want to know the primary sequence (FASTA file) of all proteins. In other
> > words, I would like a database which contains the FASTA files of all
> > proteins.
>
> You don't need Biopython to get this file. Just download the NR database and
> use "fastacmd", a program found in the BLAST suite.
> The BLAST FTP site is not working for me right now so I can't give you the
> exact URL to download, but look here:
> ftp://ftp.ncbi.nih.gov/blast/
> Here is how to use fastacmd to retrieve sequences from NR database:
> http://pwet.fr/man/linux/commandes/fastacmd
From biopython at maubp.freeserve.co.uk Thu Mar 18 07:19:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Mar 2010 11:19:03 +0000
Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54
Message-ID: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com>
On Wed, Mar 17, 2010 at 10:14 PM, Kyle wrote:
>>
>>> I think the zxJDBC support (Jython MySQL for BioSQL) was almost
>>> done. I don't think it counts as a major addition. I think to finish it off,
>>> we just needed to finalize the driver names.
>>
>> Oh yeah - I confess I'd forgotten about that.
>
> I've posted a fork from the master branch on github (
> http://github.com/kellrott/biopython/tree/zxjdbc ) with only the changes
> related to zxjdbc. I've added two driver requests, "MySQL" and
> "PostgreSQL", that select the appropriate driver based on the platform.
> Kyle
Hmm. I think it might be cleaner to have a new optional argument for the
database back end (MySQL, PostgreSQL, SQLite3). If the back end
is specified without the driver (which would be the encouraged usage)
then we will pick the driver at run time (based on whether we are in Jython, or
for PostgreSQL which drivers are installed). Existing scripts can continue
to specify the driver directly (but we can eventually deprecate this?).
Peter
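A rough sketch of the run-time driver selection described above. Everything in it is an assumption made for illustration: the function name `pick_driver`, the `backend` argument, the `is_importable` hook and the candidate driver lists are hypothetical, and none of this is the BioSQL interface.

```python
import importlib
import sys

# Hypothetical mapping from back end name to candidate driver modules,
# in order of preference. The module names are illustrative assumptions.
CANDIDATE_DRIVERS = {
    "MySQL": ["MySQLdb"],
    "PostgreSQL": ["psycopg2", "pgdb"],
    "SQLite3": ["sqlite3"],
}


def running_on_jython():
    """Return True when the interpreter reports itself as Jython."""
    return sys.platform.startswith("java")


def _can_import(name):
    """Report whether a module with this name can be imported."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


def pick_driver(backend, driver=None, is_importable=_can_import):
    """Return the name of the driver to use for the given back end.

    An explicit ``driver`` always wins, so existing scripts keep working.
    Otherwise the first installed candidate for the back end is chosen.
    """
    if driver is not None:
        return driver
    if backend not in CANDIDATE_DRIVERS:
        raise ValueError("Unknown back end: %r" % (backend,))
    if running_on_jython() and backend in ("MySQL", "PostgreSQL"):
        # Under Jython the JDBC bridge would be used instead (assumed name).
        return "zxJDBC"
    for name in CANDIDATE_DRIVERS[backend]:
        if is_importable(name):
            return name
    raise ImportError("No driver installed for back end %r" % (backend,))
```

With this shape a script would only say which back end it wants, and the choice between, say, two PostgreSQL drivers would depend on what is installed.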
From anaryin at gmail.com Thu Mar 18 07:33:05 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 18 Mar 2010 04:33:05 -0700
Subject: [Biopython-dev] Small Typo in PDBParser
Message-ID:
Hello All,
There's a small typo in the Bio.PDB PDBParser module. Line 159:
"PDBContructionError" should be "PDBConstructionError"
So that I learn, how do I submit a bug and a patch to the project, such as
in this case?
Best!
João [...] Rodrigues
@ http://stanford.edu/~joaor/
From anaryin at gmail.com Thu Mar 18 07:36:15 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 18 Mar 2010 04:36:15 -0700
Subject: [Biopython-dev] Small Typo in PDBParser
In-Reply-To:
References:
Message-ID:
Well, actually, PDBConstructionError is not even defined.. It should likely
be PDBConstructionException.
João [...] Rodrigues
@ http://stanford.edu/~joaor/
On Thu, Mar 18, 2010 at 4:33 AM, João Rodrigues wrote:
> Hello All,
>
> There's a small typo in the Bio.PDB PDBParser module. Line 159:
>
> "PDBContructionError" should be "PDBConstructionError"
>
> So that I learn, how do I submit a bug and a patch to the project, such as
> in this case?
>
> Best!
>
> João [...] Rodrigues
> @ http://stanford.edu/~joaor/
>
>
From biopython at maubp.freeserve.co.uk Thu Mar 18 08:02:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Mar 2010 12:02:32 +0000
Subject: [Biopython-dev] Small Typo in PDBParser
In-Reply-To:
References:
Message-ID: <320fb6e01003180502w573baa84od9924f4b8e2486c8@mail.gmail.com>
On Thu, Mar 18, 2010 at 11:33 AM, João Rodrigues wrote:
> Hello All,
>
> There's a small typo in the Bio.PDB PDBParser module. Line 159:
>
> "PDBContructionError" should be "PDBConstructionError"
>
> So that I learn, how do I submit a bug and a patch to the project, such as
> in this case?
>
> Best!
Hi João,
If you've found a bug in a release, and worked out how to fix it, one
of the first steps would be to try the latest code from the repository to
see if the bug is still there (and if your fix would need changing). In this
case the problem has already been fixed (February 23, 2010), see:
http://github.com/biopython/biopython/commits/master/Bio/PDB/PDBParser.py
For a simple change like this, you can use the command line tool diff
to generate a patch file (see "man diff" for details), which you can then
attach to a bug report on our bugzilla. The basic diff usage would be:
diff original_file.py fixed_file.py > bug_fix.patch
For more complex changes, I would suggest you look at learning git.
If you make a change locally you can get a patch file with this:
git diff > bug_fix.patch
Or, publish the fix to a public copy of the repository (e.g. on github).
See also http://biopython.org/wiki/GitUsage
I hope that helps, and that you'll have more patches for us in future :)
Peter
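Where no diff tool is to hand, a comparable patch can be built with Python's standard library. This is a minimal sketch rather than a recommended workflow: `make_patch` is a helper name invented here, the file names are the placeholders from the example above, and the output is in unified format (as produced by `diff -u` or `git diff`) rather than the default format of plain `diff`.

```python
import difflib


def make_patch(original_text, fixed_text,
               original_name="original_file.py", fixed_name="fixed_file.py"):
    """Return a unified diff between two versions of a file as one string."""
    diff_lines = difflib.unified_diff(
        original_text.splitlines(keepends=True),
        fixed_text.splitlines(keepends=True),
        fromfile=original_name,
        tofile=fixed_name,
    )
    return "".join(diff_lines)


if __name__ == "__main__":
    # Illustrative one-line change modelled on the typo discussed in this thread.
    before = "raise PDBContructionError(message)\n"
    after = "raise PDBConstructionException(message)\n"
    print(make_patch(before, after), end="")
```

Identical inputs give an empty string, which is a quick way to check that a fix really changed something before attaching it to a bug report.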
From biopython at maubp.freeserve.co.uk Thu Mar 18 15:01:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Mar 2010 19:01:32 +0000
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
In-Reply-To: <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com>
References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com>
<3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com>
Message-ID: <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com>
On Mon, Mar 15, 2010 at 5:26 PM, Eric Talevich wrote:
> On Mon, Mar 15, 2010 at 11:40 AM, Peter wrote:
>>
>> So, should we continue to insist on:
>>
>> record = SeqRecord(...)
>> SeqIO.write([record], filename, format)
>>
>> or should we relax a little more and allow this too?:
>>
>> record = SeqRecord(...)
>> SeqIO.write(record, filename, format)
>>
>> For SeqIO and AlignIO we can do a simple isinstance check
>> for a SeqRecord or alignment object - there isn't really a
>> problem with ambiguity here. Probably also try for Phylo?
>>
>> What's the general consensus on the dev list?
>
> Sounds good to me! The code I just deleted from Bio.Phylo._io
> was doing something foolish anyway (testing whether the
> argument is iterable) -- now that Bio.Phylo has a single legitimate
> base class, I can restore the feature with an isinstance(trees,
> BaseTree.Tree) check if we have a consensus here.
>
> -Eric
There was another +1 vote from Marshall Hampton, and no
comments against (so far). Let's leave it a few days, but unless
anyone speaks out in favour of the status-quo (keep the
current strict check in the write function), then make the change.
Peter
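To make the proposal concrete, here is a minimal sketch of the isinstance check under discussion. It does not use Biopython: the `SeqRecord` class is a stand-in with just an id and a sequence, and `as_record_list` and `write_fasta` are names invented for this illustration, not the real Bio.SeqIO code.

```python
import io


class SeqRecord(object):
    """Stand-in for Bio.SeqRecord.SeqRecord, just enough for this sketch."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq


def as_record_list(sequences):
    """Accept one SeqRecord or any iterable of them; always return a list."""
    if isinstance(sequences, SeqRecord):
        # The relaxation under discussion: wrap a lone record in a list.
        return [sequences]
    return list(sequences)


def write_fasta(sequences, handle):
    """Write records as unwrapped FASTA and return how many were written."""
    records = as_record_list(sequences)
    for record in records:
        handle.write(">%s\n%s\n" % (record.id, record.seq))
    return len(records)


if __name__ == "__main__":
    handle = io.StringIO()
    # A single record and a list of records are both accepted.
    write_fasta(SeqRecord("alpha", "ACGT"), handle)
    write_fasta([SeqRecord("beta", "GGCC")], handle)
    print(handle.getvalue(), end="")
```

Because a record is recognised by its type rather than by testing whether the argument is iterable, existing calls that pass a list or generator behave exactly as before.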
From biopython at maubp.freeserve.co.uk Thu Mar 18 15:04:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Mar 2010 19:04:10 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com>
References:
<320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
<200911251220.53881.jblanca@btc.upv.es>
<320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
<3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
<320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>
<3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com>
<320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com>
<320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com>
<320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com>
Message-ID: <320fb6e01003181204l5902cf37yc0cf9387b74994fd@mail.gmail.com>
On Fri, Mar 12, 2010 at 1:32 PM, Peter wrote:
> Hi all,
>
> I'd like to proceed as outlined below for Biopython 1.54,
> i.e. don't change the current Seq equality but add a warning
> that we plan to change it.
I've done that to Bio/Seq.py on the trunk (added two
FutureWarnings and docstring explanation). Assuming
this doesn't trigger any regressions, we'd need to work
on the documentation (in particular the tutorial, but also
perhaps a news post?) and fix the GA unit test before
the release.
If anyone on the dev list thinks this is a bad idea,
please speak up (sooner rather than later).
Thanks,
Peter
From kellrott at gmail.com Thu Mar 18 15:28:58 2010
From: kellrott at gmail.com (Kyle)
Date: Thu, 18 Mar 2010 12:28:58 -0700
Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54
In-Reply-To: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com>
References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com>
Message-ID:
What should the parameter be called? Possibilities: 'backend', 'dbtype', ...
ideas anyone?
Kyle
On Thu, Mar 18, 2010 at 4:19 AM, Peter wrote:
> Hmm. I think it might be cleaner to have a new optional argument for the
> database back end (MySQL, PostgreSQL, SQLite3). If the back end
> is specified without the driver (which would be the encouraged usage)
> then we will pick the driver at run time (based on whether we are in Jython, or
> for PostgreSQL which drivers are installed). Existing scripts can continue
> to specify the driver directly (but we can eventually deprecate this?).
>
> Peter
>
From biopython at maubp.freeserve.co.uk Thu Mar 18 15:34:39 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Mar 2010 19:34:39 +0000
Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54
In-Reply-To:
References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com>
Message-ID: <320fb6e01003181234m71cc777bxaf5f29f2fbe1f21f@mail.gmail.com>
On Thu, Mar 18, 2010 at 7:28 PM, Kyle wrote:
> What should the parameter be called? Possibilities:
> 'backend', 'dbtype', ... ideas anyone?
Just database would be too vague. I quite like backend.
Peter
From sbassi at clubdelarazon.org Thu Mar 18 15:39:40 2010
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Thu, 18 Mar 2010 16:39:40 -0300
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
In-Reply-To: <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com>
References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com>
<3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com>
<320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com>
Message-ID: <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com>
On Thu, Mar 18, 2010 at 4:01 PM, Peter wrote:
> There was another +1 vote from Marshall Hampton, and no
> comments against (so far). Let's leave it a few days, but unless
> anyone speaks out in favour of the status-quo (keep the
> current strict check in the write function), then make the change.
If we are going to change this, why not set "fasta" as the default
input/output format? This would also result in less typing when
processing FASTA files (most of the time in my workflow at least).
From bugzilla-daemon at portal.open-bio.org Thu Mar 18 17:27:48 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 18 Mar 2010 17:27:48 -0400
Subject: [Biopython-dev] [Bug 3029] New: PhyloXML.Phylogeny.is_preterminal()
fails
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=3029
Summary: PhyloXML.Phylogeny.is_preterminal() fails
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: joelb at lanl.gov
tree.is_preterminal()
raises an AttributeError
"'Phylogeny' object has no attribute 'clades'"
File BaseTree.py line 442.
git fetch on Feb. 22.
From p.j.a.cock at googlemail.com Thu Mar 18 18:03:09 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 18 Mar 2010 22:03:09 +0000
Subject: [Biopython-dev] Google Summer of Code is *ON* for OBF projects!
In-Reply-To: <4BA29706.8040606@cornell.edu>
References: <4BA29706.8040606@cornell.edu>
Message-ID: <320fb6e01003181503j7e3030aao7bce7ebf4d8be06@mail.gmail.com>
Good news for GSoC 2010 :)
---------- Forwarded message ----------
From: Robert Buels
Date: Thu, Mar 18, 2010 at 9:11 PM
Subject: Google Summer of Code is *ON* for OBF projects!
Hi all,
Great news: Google announced today that the Open Bioinformatics
Foundation has been accepted as a mentoring organization for this
summer's Google Summer of Code!
GSoC is a Google-sponsored student internship program for open-source
projects, open to students from around the world (not just US
residents). Students are paid a $5000 USD stipend to work as a
developer on an open-source project for the summer. For more on GSoC,
see GSoC 2010 FAQ at http://tinyurl.com/yzemdfo
Student applications are due April 9, 2010 at 19:00 UTC. Students who
are interested in participating should look at the OBF's GSoC page at
http://open-bio.org/wiki/Google_Summer_of_Code, which lists project
ideas, and who to contact about applying.
For current developers on OBF projects, please consider volunteering
to be a mentor if you have not already, and contribute project ideas.
Just list your name and project ideas on OBF wiki and on the relevant
project's GSoC wiki page.
Thanks to all who helped make OBF's application to GSoC a success, and
let's have a great, productive summer of code!
Rob Buels
OBF GSoC 2010 Administrator
From biopython at maubp.freeserve.co.uk Fri Mar 19 06:45:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 19 Mar 2010 10:45:55 +0000
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
In-Reply-To: <9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com>
References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com>
<3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com>
<320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com>
<9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com>
Message-ID: <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com>
Hi Sebastian,
On Thu, Mar 18, 2010 at 7:39 PM, Sebastian Bassi
wrote:
> On Thu, Mar 18, 2010 at 4:01 PM, Peter wrote:
>> There was another +1 vote from Marshall Hampton, and no
>> comments against (so far). Let's leave it a few days, but unless
>> anyone speaks out in favour of the status-quo (keep the
>> current strict check in the write function), then make the change.
>
> If we are going to change this, why not setting "fasta" as default
> input/output format? This would also results in less typing when
> processing fasta files (most of the time in my workflow at least).
Give an inch and they'll take a mile ;)
I agree that FASTA is likely to be the most common file format
for most users, but I don't think we should make it the default.
One specific reason is because the FASTA parser will allow and
ignore a header comment, you will get confusing results if the
file is not actually a FASTA file (typically it will parse other
text files like GenBank, EMBL or FASTQ with no errors, but
will return no records). I am worried that people will assume
that if they don't specify the format that Biopython will
determine it automatically - which it won't.
[Yes, I'm talking about the read/parse functions here, but it
would be odd if the write function defaulted to FASTA but they
did not.]
Also, could you clarify if you are in favour of relaxing the
requirement that the write function takes a list/iterator of
records/alignments to allow a single SeqRecord or alignment?
Thanks,
Peter
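The silent failure mode described above can be illustrated with a toy reader. This is not the Bio.SeqIO parser: `simple_fasta_parse` is a stand-in written for this note, assumed only to share the property that text before the first ">" line is skipped, and the GenBank-like lines are an invented fragment.

```python
def simple_fasta_parse(lines):
    """Yield (title, sequence) pairs from an iterable of text lines.

    Lines before the first '>' are skipped, mimicking a parser that
    tolerates a leading header comment.
    """
    title = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


# A real FASTA fragment and an invented GenBank-like fragment.
FASTA_LINES = [">seq1 demo\n", "ACGT\n", "ACGT\n"]
GENBANK_LIKE_LINES = [
    "LOCUS       EXAMPLE      12 bp    DNA     linear   UNK\n",
    "ORIGIN\n",
    "        1 acgtacgtac gt\n",
    "//\n",
]

if __name__ == "__main__":
    print(list(simple_fasta_parse(FASTA_LINES)))
    # No exception here, just an empty result: the whole file looks
    # like one long header comment to this reader.
    print(list(simple_fasta_parse(GENBANK_LIKE_LINES)))
```

With a default format, a user who forgot the format argument would see an empty result like the second one and might reasonably conclude the file itself was empty.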
From bugzilla-daemon at portal.open-bio.org Fri Mar 19 09:22:51 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 19 Mar 2010 09:22:51 -0400
Subject: [Biopython-dev] [Bug 3029] PhyloXML.Phylogeny.is_preterminal() fails
In-Reply-To:
Message-ID: <201003191322.o2JDMpYW015069@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3029
eric.talevich at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #1 from eric.talevich at gmail.com 2010-03-19 09:22 EST -------
(In reply to comment #0)
> tree.is_preterminal()
>
> raises an AttributeError
> "'Phylogeny' object has no attribute 'clades'"
> File BaseTree.py line 442.
>
> git fetch on Feb. 22.
>
Thanks for catching this. It's fixed on the trunk now.
I also checked the rest of TreeMixin for other occurrences of the same problem
(accessing self.clades directly instead of going through self.root.clades) and
found none, so it shouldn't happen again.
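For readers unfamiliar with the layout, here is a minimal sketch of why going through self.root matters. The classes below are simplified stand-ins written for this comment, not the Bio.Phylo.BaseTree source: the assumption is only that a tree object holds a root clade, while a clade holds its child clades and acts as its own root.

```python
class TreeMixin(object):
    """Methods shared by whole trees and by individual clades."""

    def is_preterminal(self):
        """Return True if every direct descendant of the root is terminal."""
        # Going through self.root works for both kinds of object; using
        # self.clades here would fail on a tree, which has no such attribute.
        if self.root.is_terminal():
            return False
        return all(child.is_terminal() for child in self.root.clades)


class Clade(TreeMixin):
    """A node holding a name and a list of child clades."""

    def __init__(self, name=None, clades=None):
        self.name = name
        self.clades = clades or []

    @property
    def root(self):
        # A clade is the root of its own subtree.
        return self

    def is_terminal(self):
        return not self.clades


class Tree(TreeMixin):
    """A whole tree: it has a root clade but no 'clades' attribute."""

    def __init__(self, root):
        self.root = root


if __name__ == "__main__":
    cherry = Clade(clades=[Clade("A"), Clade("B")])
    print(Tree(cherry).is_preterminal())
    print(Tree(Clade(clades=[cherry, Clade("C")])).is_preterminal())
```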
From sbassi at clubdelarazon.org Fri Mar 19 18:08:17 2010
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Fri, 19 Mar 2010 19:08:17 -0300
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
In-Reply-To: <320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com>
References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com>
<3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com>
<320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com>
<9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com>
<320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com>
Message-ID: <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com>
On Fri, Mar 19, 2010 at 7:45 AM, Peter wrote:
> Give an inch and they'll take a mile ;)
In Spanish we say: Give a hand and they'll take the whole arm :)
> that if they don't specify the format that Biopython will
> determine it automatically - which it won't.
In this respect, the Zen of Python favours being explicit, so I see your point.
> Also, could you clarify if you are in favour of relaxing the
> requirement that the write function takes a list/iterator of
> records/alignments to allow a single SeqRecord or alignment?
It is OK with me to allow a single record instead of an iterable; this
change will not break any existing code.
Best,
SB.
From bugzilla-daemon at portal.open-bio.org Sat Mar 20 02:24:39 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 20 Mar 2010 02:24:39 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE
handling
In-Reply-To:
Message-ID: <201003200624.o2K6OdCd010209@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2948
------- Comment #1 from crosvera at gmail.com 2010-03-20 02:24 EST -------
Created an attachment (id=1463)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1463&action=view)
propose patch bug2948.patch
From bugzilla-daemon at portal.open-bio.org Sat Mar 20 02:26:10 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 20 Mar 2010 02:26:10 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE
handling
In-Reply-To:
Message-ID: <201003200626.o2K6QAoV010279@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2948
crosvera at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |crosvera at gmail.com
------- Comment #2 from crosvera at gmail.com 2010-03-20 02:26 EST -------
Here is an example of what Paul describes:
bash-4.0$ python
Python 2.6.4 (r264:75706, Mar 10 2010, 15:54:34)
[GCC 4.4.1 (CRUX)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.PDB import *
>>> parser=PDBParser()
>>> structure = parser.get_structure("2beg", "2BEG.pdb")
>>> structure.header.keys()
['structure_method', 'head', 'journal', 'journal_reference', 'compound',
'keywords', 'name', 'author', 'deposition_date', 'release_date', 'source',
'resolution', 'structure_reference']
>>> structure.header['head']
'protein fibril'
>>> structure.header['name']
" d structure of alzheimer's abeta(1-42) fibrils"
I made a patch that changes the regex.
From: tail=re.sub("\A\w+\s+\d*\s*","",h)
To: tail=re.sub("\A\w+\s+\d*\s+","",h)
Seems that this patch works. The result I got is this:
bash-4.0$ python
Python 2.6.4 (r264:75706, Mar 10 2010, 15:54:34)
[GCC 4.4.1 (CRUX)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.PDB import *
>>> parser=PDBParser()
>>> structure = parser.get_structure("2beg", "2BEG.pdb")
>>> structure.header.keys()
['structure_method', 'head', 'journal', 'journal_reference', 'compound',
'keywords', 'name', 'author', 'deposition_date', 'release_date', 'source',
'resolution', 'structure_reference']
>>> structure.header['head']
'protein fibril'
>>> structure.header['name']
" 3d structure of alzheimer's abeta(1-42) fibrils"
>>>
I propose this patch (my first one).
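The effect of the two patterns can be reproduced in isolation. This is only a sketch: the TITLE line below is rebuilt by hand on the assumption that the title text starts in column 11, and it leaves out the lower-casing and joining that the real parser applies afterwards.

```python
import re

# Assumed layout of the first TITLE line of 2BEG.pdb: record name,
# blank continuation field, then the title text from column 11.
line = "TITLE" + " " * 5 + "3D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS"

old_pattern = r"\A\w+\s+\d*\s*"  # digits may be followed by no whitespace
new_pattern = r"\A\w+\s+\d*\s+"  # digits must be followed by whitespace

old_tail = re.sub(old_pattern, "", line)
new_tail = re.sub(new_pattern, "", line)

# A continuation line, where the digit really is a continuation number.
continued = "TITLE" + " " * 4 + "2 SECOND LINE OF TITLE"
continued_tail = re.sub(new_pattern, "", continued)

print(old_tail)   # the leading "3" has been eaten by \d*
print(new_tail)   # the leading "3" survives
```

With the old pattern, \d* consumes the "3" of "3D" because nothing forces whitespace after it. With the new pattern the regex backtracks and leaves the "3" in place. Either pattern would still strip a title whose first word is purely numeric.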
--
Carlos Ríos V.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sat Mar 20 02:56:53 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 20 Mar 2010 02:56:53 -0400
Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for
oldest entry.
In-Reply-To:
Message-ID: <201003200656.o2K6urZa011050@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2949
crosvera at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |crosvera at gmail.com
------- Comment #1 from crosvera at gmail.com 2010-03-20 02:56 EST -------
(In reply to comment #0)
> [...]
> elif key=="REVDAT":
> #Modified by Paul T. Bathen to get most recent date instead of
> oldest date.
> #Also added additional dict entries
> if dict['release_date'] == "1909-01-08": #set in init
> rr=re.search("\d\d-\w\w\w-\d\d",tail)
> if rr!=None:
> dict['release_date']=_format_date(_nice_case(rr.group()))
>
> dict['mod_number'] = hh[7:10].strip()
> dict['mod_id'] = hh[23:28].strip()
> dict['mod_type'] = hh[31:32].strip()
The Protein Data Bank Contents Guide (Version 3.20,
http://www.wwpdb.org/documentation/format32/sect2.html#REVDAT) says that
modNum uses columns 8-10, modId uses columns 24-27, and modType uses
column 32.
So the last part of your code should change to:
dict['mod_number'] = hh[7:9]
dict['mod_id'] = hh[23:26]
dict['mod_type'] = hh[31]
Regards.
--
Carlos Rios V.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sat Mar 20 19:02:16 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 20 Mar 2010 19:02:16 -0400
Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for
oldest entry.
In-Reply-To:
Message-ID: <201003202302.o2KN2GFb006461@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2949
------- Comment #2 from crosvera at gmail.com 2010-03-20 19:02 EST -------
This is what I currently get with the existing code:
bash-4.0$ python
Python 2.6.4 (r264:75706, Mar 10 2010, 15:54:34)
[GCC 4.4.1 (CRUX)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.PDB import *
>>> parser=PDBParser()
>>> structure = parser.get_structure("2beg", "../2BEG.pdb")
>>> structure.header.keys()
['structure_method', 'head', 'journal', 'journal_reference', 'compound',
'keywords', 'name', 'author', 'deposition_date', 'release_date', 'source',
'resolution', 'structure_reference']
>>> structure.header['release_date']
'2005-11-22'
>>>
but the grep command returns this:
bash-4.0$ grep REVDAT ../2BEG.pdb
REVDAT 3 24-FEB-09 2BEG 1 VERSN
REVDAT 2 20-DEC-05 2BEG 1 JRNL
REVDAT 1 22-NOV-05 2BEG 0
So the existing code reports the oldest date from REVDAT. I don't know whether you
(the developers) intend 'release_date' to mean the first revision or the latest,
but I think, as Paul said, that it should be the most recent date.
By the way, in my previous comment I said that the last part of the code pasted by
Paul should be:
dict['mod_number'] = hh[7:9]
dict['mod_id'] = hh[23:26]
dict['mod_type'] = hh[31]
But it has to be:
dict['mod_number'] = hh[7:10]
dict['mod_id'] = hh[23:27]
dict['mod_type'] = hh[31]
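The off-by-one comes from translating the guide's 1-based, inclusive column ranges into Python slices: columns a-b become [a-1:b]. A small sketch of that rule, where the function name parse_revdat is hypothetical and the example line is rebuilt by hand because the grep output in this message has its spacing collapsed:

```python
def parse_revdat(line):
    """Slice one REVDAT line using the column positions from the guide."""
    return {
        "mod_number": line[7:10].strip(),  # columns 8-10
        "mod_id": line[23:27].strip(),     # columns 24-27
        "mod_type": line[31:32].strip(),   # column 32
    }

# Fixed-width layout rebuilt by hand (an assumption, see above).
line = ("REVDAT" + " " * 3 + "3" + " " * 3 + "24-FEB-09" + " " + "2BEG"
        + " " * 4 + "1" + " " * 7 + "VERSN")

fields = parse_revdat(line)
too_short = line[23:26]  # the earlier suggestion drops the last character
```

Using line[31:32] rather than line[31] avoids an IndexError on a line that has been truncated before column 32.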
Another thing: instead of adding 'mod_number', 'mod_id' and 'mod_type' directly to
dict, I think these keys should go inside a 'release_data' key:
dict={'name':"",
      [...]
      'release_date' : "1909-01-08",
      'release_data' : {'mod_number' : "", 'mod_id' : "", 'mod_type' : ""},
      'structure_method' : "unknown",
      [...]
      }
Please comment :)
Regards.
--
Carlos Rios V.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From k.okonechnikov at gmail.com Sun Mar 21 00:29:30 2010
From: k.okonechnikov at gmail.com (Konstantin Okonechnikov)
Date: Sun, 21 Mar 2010 10:29:30 +0600
Subject: [Biopython-dev] GSOC 2010 Tiny-PDB project
Message-ID: <1408c93c1003202129l64a4c674t8f14185845ce6e25@mail.gmail.com>
Dear BioPython developers,
my name is Konstantin. I am a first-year master's student at Novosibirsk State
University, Russia.
The subject of my bachelor's diploma work was the development of a 3D biological
macromolecular structure visualization tool for an open-source bioinformatics
project called UGENE. This work was successfully finished about a year ago.
The task included a lot of work with the PDB format: parsing, correctness
testing etc. For testing purposes, the whole PDB database was even downloaded and
tested for simple assertions. Such stress testing revealed a lot of problems
and helped to improve the code significantly. So, one may say, I have some
experience with the PDB format :)
I used BioPython when I was studying bioinformatics basics and really liked
it. I would like to contribute to the project by improving the Bio.PDB module
and implementing a set of convenient tools to work with PDB files.
Best regards,
Konstantin
From tiagoantao at gmail.com Sun Mar 21 08:59:31 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Sun, 21 Mar 2010 12:59:31 +0000
Subject: [Biopython-dev] Changes to the main repo
Message-ID: <6d941f121003210559o506b853ci381927fed3aa836f@mail.gmail.com>
Hi,
I've made some changes in the main repository (my first changes with
github), some comments:
1. Many thanks for the GitUsage wiki page. REALLY useful.
2. That being said, if I made any mistakes, they are my own fault.
3. I've added support for big genepop files, something I tend to be
asked about quite a lot
4. And support for haploid data (nobody really asked for this)
5. I remember Peter sending an email about needed corrections to the
code. I am afraid I've lost that email :( . If you send it to me, I
will make them ASAP
6. New test cases and test data files
7. I might add support, in the future, for Arlequin (file format and
application), allowing for statistics over sequences and other goodies
with sequence data.
Regards,
Tiago
--
"If you want to get laid, go to college. If you want an education, go
to the library." - Frank Zappa
From eric.talevich at gmail.com Sun Mar 21 21:54:27 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 21 Mar 2010 21:54:27 -0400
Subject: [Biopython-dev] GSoC: Refining the PDB-Tidy project idea
Message-ID: <3f6baf361003211854g41a4d358pc7fc49c156dcbb7b@mail.gmail.com>
Hi GSoC'ers,
The PDB-Tidy idea on Biopython's Summer of Code page seems to have attracted
interest from a number of highly qualified students:
http://biopython.org/wiki/Google_Summer_of_Code#PDB-Tidy:_command-line_tools_for_manipulating_PDB_files
Please, don't let this deter you from applying! Google allocates student
slots to each organization based on the number of applications received, so
if OBF receives more applications, we can accept more students.
However, I'm also concerned that I've made the project description too
general. (Or is it too specific?) This article describes the characteristics
of a well-defined GSoC project idea:
http://en.flossmanuals.net/GSoCMentoring/SelectingProjects
In the interest of improving the opportunities for each student, I'm
suggesting that the proposals that are submitted under the PDB-Tidy theme
focus on a specific goal beyond the manipulation of PDB files. At the risk of
being "That Guy", I'll give some examples of what I mean:
(a) Improve interoperability with external tools like AutoDock or Modeller;
(b) Port some MolProbity-like functionality to Biopython;
(c) Improve interoperability and consistency between Bio.PDB and the rest of
Biopython;
(d) Write a parser for some useful format.
Also, would anyone else be interested in co-mentoring one of these projects?
It's good for a GSoC project to have a secondary mentor -- not required, but
helpful -- and I think some support from a more experienced structural
biologist would be valuable here.
Thanks & best regards,
Eric
From bugzilla-daemon at portal.open-bio.org Sun Mar 21 22:50:24 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 21 Mar 2010 22:50:24 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE
handling
In-Reply-To:
Message-ID: <201003220250.o2M2oOoP003409@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2948
------- Comment #3 from eric.talevich at gmail.com 2010-03-21 22:50 EST -------
(In reply to comment #2)
>
> I made a patch, which change the regex.
> From: tail=re.sub("\A\w+\s+\d*\s*","",h)
> TO: tail=re.sub("\A\w+\s+\d*\s+","",h
> Seems that this patch works. The result I got is this:
>
> ...
Thanks for triaging this, Carlos. However, I think it would be better if the
code is a direct reflection of the actual PDB specification:
http://www.wwpdb.org/documentation/format32/sect2.html
It looks like "continuation" numbers are ignored by this code, so only the text
starting in column 11 onward (hh[10:]) is ever used, also dropping leading
spaces. Similarly, the key found by regexp is just the first
whitespace-delimited word. Can you change your patch to use string methods
instead of regular expressions?
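A rough sketch of what the string-method version might look like, following the fixed-column layout of the PDB format. The function name split_header_line is hypothetical, and taking the key from columns 1-6 rather than from the first whitespace-delimited word is an assumption based on the record-name field in the specification.

```python
def split_header_line(line):
    """Return (key, tail) for one PDB header line using fixed columns."""
    key = line[:6].strip()    # record name, columns 1-6
    tail = line[10:].strip()  # text from column 11 onward, spaces dropped
    return key, tail

# First TITLE line and a continuation line, rebuilt by hand.
first = "TITLE" + " " * 5 + "3D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS"
second = "TITLE" + " " * 4 + "2 SECOND LINE OF TITLE"

key, tail = split_header_line(first)
key2, tail2 = split_header_line(second)
```

Because the continuation number sits in columns 9-10, slicing from index 10 skips it without having to guess whether a leading digit belongs to the title.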
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From eric.talevich at gmail.com Sun Mar 21 23:22:32 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 21 Mar 2010 23:22:32 -0400
Subject: [Biopython-dev] GSOC 2010 Tiny-PDB project
In-Reply-To: <1408c93c1003202129l64a4c674t8f14185845ce6e25@mail.gmail.com>
References: <1408c93c1003202129l64a4c674t8f14185845ce6e25@mail.gmail.com>
Message-ID: <3f6baf361003212022m73aeff9kdedd2a949871d5b@mail.gmail.com>
On Sun, Mar 21, 2010 at 12:29 AM, Konstantin Okonechnikov <
k.okonechnikov at gmail.com> wrote:
> Dear BioPython developers,
> my name is Konstantin. I am first-year master's student at Novosibirsk
> State
> University, Russia.
> The subject of my bachelor diploma work was development of 3D biological
> macromolecular structure visualization tool for open-source bioinformatics
> project called UGENE . This work was successfully
> finished about a
> year
> ago.
> The task included a lot of work with PDB format: parsing, correctness
> testing etc. For testing purposes even whole PDB database was downloaded
> and
> tested for simple assertions. Such stress testing revealed a lot of
> problems
> and helped to improve code significantly. So, one may say, I have some
> experience with PDB format :)
> I used BioPython when I was studying bioinformatics basics and really liked
> it. I would like to contribute to the project by improving Bio.PDB module
> and implementing a set of convenient tools to work with PDB files.
> Best regards,
> Konstantin
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
Hi Konstantin,
That's really cool. You might also be interested in a project based on this
idea from another GSoC organization, NESCent:
https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#Visualization_of_Protein_3D_Structure_Evolution
(It's OK to apply to more than one GSoC mentoring organization as a
student.)
I sent an e-mail earlier today describing some possible refinements to the
PDB-Tidy project; did any of those interest you?
While we're at it, here's a good place to start improving Bio.PDB before
Summer of Code begins:
http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&bug_status=NEW&bug_status=REOPENED
Feel free to e-mail or gchat me with any questions you have.
Thanks,
Eric
From bugzilla-daemon at portal.open-bio.org Sun Mar 21 23:50:47 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 21 Mar 2010 23:50:47 -0400
Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for
oldest entry.
In-Reply-To:
Message-ID: <201003220350.o2M3olGt004920@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2949
------- Comment #3 from eric.talevich at gmail.com 2010-03-21 23:50 EST -------
(In reply to comment #2)
> So, the actual code is showing the oldest date from REVDAT. I don't know if you
> (the developer) are trying to say with 'release_date' if is the first version
> or the last. But I think, as Paul said, that should be the most current date.
It's probably an accidental result of repeatedly setting the same field. Surely
the most recent revision date is at least as important as the date of the first
revision, given that the initial deposition date is recorded separately.
I'm not the original developer, but I'd say it would be best to keep a list or
dictionary in a new "revisions" attribute, leaving release_date alone or
deprecating it in case someone is actually relying on the current behavior.
We should discuss this on biopython-dev before implementing it.
> Other thing, instead to add 'mod_number', 'mod_id', 'mod_type' directly into
> dict. I think that these keys should be inside a 'release_data' key:
That name could lead to some typo-related confusion... but yes, a list-of-dicts
or dict-of-dicts would be a nice way to store this info.
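As a starting point for that discussion, here is a sketch of the list-of-dicts idea. All names in it (collect_revisions, parse_date, the 'mod_date' key) are hypothetical, the century cutoff at 50 is an assumption, and REVDAT continuation lines are not handled.

```python
import datetime

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

def parse_date(text):
    """Convert 'DD-MMM-YY' to a date (two-digit years below 50 mean 20YY)."""
    day, month, year = text.split("-")
    year = int(year)
    year += 2000 if year < 50 else 1900
    return datetime.date(year, MONTHS[month.upper()], int(day))

def collect_revisions(lines):
    """Build one dict per REVDAT line instead of overwriting one field."""
    revisions = []
    for line in lines:
        if line[:6] != "REVDAT":
            continue
        revisions.append({
            "mod_number": int(line[7:10]),        # columns 8-10
            "mod_date": parse_date(line[13:22]),  # columns 14-22
            "mod_id": line[23:27].strip(),        # columns 24-27
            "mod_type": line[31:32].strip(),      # column 32
        })
    return revisions

def revdat(num, date, mod_type):
    # Rebuild a fixed-width REVDAT line for 2BEG (layout is an assumption).
    return "REVDAT " + "%3d" % num + " " * 3 + date + " 2BEG" + " " * 4 + mod_type

lines = [revdat(3, "24-FEB-09", "1"),
         revdat(2, "20-DEC-05", "1"),
         revdat(1, "22-NOV-05", "0")]
revisions = collect_revisions(lines)
latest = max(revisions, key=lambda rev: rev["mod_number"])
first = min(revisions, key=lambda rev: rev["mod_number"])
```

With every revision kept, callers can choose the first or the latest date themselves, and release_date can keep its current meaning for anyone relying on it.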
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From k.okonechnikov at gmail.com Mon Mar 22 02:04:50 2010
From: k.okonechnikov at gmail.com (Konstantin Okonechnikov)
Date: Mon, 22 Mar 2010 12:04:50 +0600
Subject: [Biopython-dev] GSOC 2010 Tiny-PDB project
In-Reply-To: <3f6baf361003212022m73aeff9kdedd2a949871d5b@mail.gmail.com>
References: <1408c93c1003202129l64a4c674t8f14185845ce6e25@mail.gmail.com>
<3f6baf361003212022m73aeff9kdedd2a949871d5b@mail.gmail.com>
Message-ID: <884d1faa1003212304vcdc86d6t4a6931adce8214fc@mail.gmail.com>
Hi Eric!
> Hi Konstantin,
>
> That's really cool. You might also be interested in a project based on this
> idea from another GSoC organization, NESCent:
>
> https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#Visualization_of_Protein_3D_Structure_Evolution
>
This project looks really nice, though it requires some proficiency in
Java.
Actually, I don't like the idea of applying to many organizations; I would
rather choose one project and concentrate my efforts on it.
> (It's OK to apply to more than one GSoC mentoring organization as a
> student.)
>
> I sent an e-mail earlier today describing some possible refinements to the
> PDB-Tidy project; did any of those interest you?
>
I need some time to investigate them. There is one question so far: what
"useful formats" do you have in mind? AFAIK, there are not so many data
formats for storing 3D structures. I know about PDB XML and the NCBI data
format. The latter is an ASN.1 variation; it is used for different kinds of
data (sequences etc.).
> While we're at it, here's a good place to start improving Bio.PDB before
> Summer of Code begins:
>
> http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&bug_status=NEW&bug_status=REOPENED
>
OK, I will look at it. Fixing a couple of bugs is a good way to get acquainted
with the code :)
> Feel free to e-mail or gchat me with any questions you have.
>
> Thanks,
> Eric
>
P.S. Sorry for the misprint in the subject line; I hope that the project won't be
that small :)
--
Best regards,
Okonechnikov Konstantin
From p.j.a.cock at googlemail.com Mon Mar 22 05:19:38 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 22 Mar 2010 09:19:38 +0000
Subject: [Biopython-dev] pylint, was: Changes to the main repo
Message-ID: <320fb6e01003220219u5f2020e1v6826a4e331ceb96d@mail.gmail.com>
2010/3/21 Tiago Antão :
> Hi,
>
> I've made some changes in the main repository (my first changes with
> github), some comments:
> 1. Many thanks for the GitUsage wiki page. REALLY useful.
> 2. That being said, if I did any mistakes, they are my own fault.
> 3. I've added support for big genepop files, something I tend do be
> asked quite a lot
> 4. And support for haploid data (nobody really asked this)
> 5. I remember Peter sending an email about needed corrections to the
> code. I am afraid I've lost that email :( . If you send it to me, I
> will do them ASAP
> 6. New test cases and test data files
> 7. I might add support, in the future, to Arlequin (file format and
> application). Allowing for statistics over sequences and other goodies
> with sequence data.
>
> Regards,
> Tiago
Hi Tiago,
That sounds good. Regarding point 5, running pylint over the
code reported some possible errors in Bio.PopGen. Have a
look at this - they are all undefined variable issues:
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007354.html
I just ran it again on the latest code, and the line numbers have
changed a tiny bit but that is all:
$ pylint --disable-msg-cat=CRW --include-ids=y
--disable-msg=E1101,E1103,E0102 -r n Bio.PopGen
No config file found, using default configuration
************* Module Bio.PopGen.Async
E0602: 78:Async.get_result: Undefined variable 'done'
E0602: 79:Async.get_result: Undefined variable 'done'
************* Module Bio.PopGen.GenePop
E0602:166:Record.split_in_pops: Undefined variable 'GenePop'
E0602:183:Record.split_in_loci: Undefined variable 'GenePop'
************* Module Bio.PopGen.GenePop.Controller
E0602: 41:_read_allele_freq_table: Undefined variable 'self'
E0602:133:_hw_func: Undefined variable 'self'
E0602:393:GenePopController.test_pop_hw_prob: Undefined variable 'ext'
E0602:458:GenePopController.test_ld.ld_pop_func: Undefined variable
'currrent_pop'
************* Module Bio.PopGen.GenePop.FileParser
E1120:219:FileRecord.remove_locus_by_name: No value passed for
parameter 'fw' in function call
************* Module Bio.PopGen.SimCoal.Cache
E0602: 79:SimCoalCache.getSimulation: Undefined variable 'Config'
E0602: 88: Undefined variable 'Cache'
************* Module Bio.PopGen.SimCoal.Controller
E0602: 47:SimCoalController.run_simcoal: Undefined variable 'Config'
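To illustrate why E0602 reports like these are worth checking, here is a toy sketch (not code from Bio.PopGen; the function and data are made up). A misspelt name is only looked up when its line runs, so a test suite that never reaches that branch will not notice it, whereas pylint flags it statically.

```python
def pick_population(pops, index):
    """Toy function with a misspelt name on a rarely used branch."""
    current_pop = pops[index]
    if current_pop is None:
        # Typo: 'currrent_pop' is never defined, so this line raises
        # NameError, but only when this branch is actually executed.
        return currrent_pop
    return current_pop

ok = pick_population(["pop1", None], 0)  # the common path works

try:
    pick_population(["pop1", None], 1)   # the rare path hits the typo
    failed = False
except NameError:
    failed = True
```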
Peter
From krother at rubor.de Mon Mar 22 11:27:30 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 22 Mar 2010 16:27:30 +0100
Subject: [Biopython-dev] RNA secondary structure parsing
Message-ID:
Hi,
Took me a while to do some basic clean up in my code - finally managed to
contribute something.
I just added a branch 'rna' with basic RNA 2D format parsers (Vienna, CT,
BPSEQ), and a module that can extract 2D structure elements (helices,
loops, bulges, junctions).
http://github.com/krother/biopython/tree/rna
It's all in:
Bio.RNA
Tests.test_RNA_*
Any kind of feedback is welcome.
Best Regards,
Kristian
From biopython at maubp.freeserve.co.uk Mon Mar 22 12:08:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 16:08:27 +0000
Subject: [Biopython-dev] Daily builds of the Tutorial (PDF and HTML)
In-Reply-To: <320fb6e01003120522q22377f52nc0769ceb4e3add13@mail.gmail.com>
References: <320fb6e01003120522q22377f52nc0769ceb4e3add13@mail.gmail.com>
Message-ID: <320fb6e01003220908s264401e6s3dab9aa7f2a3f87b@mail.gmail.com>
On Fri, Mar 12, 2010 at 1:22 PM, Peter wrote:
> Hi all,
>
> Back in November I set up a simple pair of cron jobs to update the code
> snapshot on http://biopython.open-bio.org/SRC/biopython/ every hour:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007002.html
>
> I've just added another job which takes the latest Tutorial.tex file and
> compiles it with pdflatex (already installed) and hevea (installed from
> source under my user account) to make the PDF and HTML files.
> These are then copied to the webserver and published as:
>
> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html
> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf
>
> These are currently updated once a day (at 2:40am which shouldn't
> be too busy whichever USA timezone the server uses). Assuming
> I got my crontab settings right - in the short term I'll keep an eye on
> it to check ;)
It looks like the PDF is working (which happens first in the script),
but not the HTML. I'll look into this...
Peter
From biopython at maubp.freeserve.co.uk Mon Mar 22 12:21:16 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 16:21:16 +0000
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
Message-ID: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com>
Hi Eric,
I've got a real example of a simple tree manipulation that I would like to
handle via your new module. I have a (small) unrooted tree from a gene
family in Newick format, which by construction includes an out-group
(the same gene but from a more distant organism). I would like to reroot
the tree so that this out-group is at the basal level.
Can Bio.Phylo help me here?
Thanks,
Peter
P.S. Why is Bio.Phylo.trim_str a public method?
From bugzilla-daemon at portal.open-bio.org Mon Mar 22 12:28:24 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 22 Mar 2010 12:28:24 -0400
Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for
oldest entry.
In-Reply-To:
Message-ID: <201003221628.o2MGSOqs027450@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2949
krother at rubor.de changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |krother at rubor.de
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Mar 22 12:39:01 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 22 Mar 2010 12:39:01 -0400
Subject: [Biopython-dev] [Bug 2949] _parse_pdb_header_list: REVDAT is for
oldest entry.
In-Reply-To:
Message-ID: <201003221639.o2MGd1mk027807@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2949
------- Comment #4 from krother at rubor.de 2010-03-22 12:39 EST -------
I originally contributed the parse_pdb_header module a long time ago. I think
one or two persons added some changes in the meantime.
I like Eric's idea of adding a separate 'revisions' attribute. When the code
does what is needed I think it's time for me to do some cleanup work.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Mar 22 13:24:32 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 22 Mar 2010 13:24:32 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE
handling
In-Reply-To:
Message-ID: <201003221724.o2MHOWY7029072@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2948
crosvera at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #1463 is|0 |1
obsolete| |
------- Comment #4 from crosvera at gmail.com 2010-03-22 13:24 EST -------
Created an attachment (id=1464)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1464&action=view)
new proposed patch for bug2948
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Mar 22 13:25:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 22 Mar 2010 13:25:14 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE
handling
In-Reply-To:
Message-ID: <201003221725.o2MHPEbj029122@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2948
------- Comment #5 from crosvera at gmail.com 2010-03-22 13:25 EST -------
OK, I made another patch; this one replaces some regexes with string slicing.
What I got:
crosvera at cabernet:~/programming/biopython/Bio$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from PDB import *
>>> parser = PDBParser()
>>> structure = parser.get_structure("2beg", "PDB/2BEG.pdb")
>>> structure.header['name']
" 3d structure of alzheimer's abeta(1-42) fibrils"
>>>
patch file: 0001-modified-parse_pdb_header.py.patch
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Mar 22 14:17:34 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 22 Mar 2010 14:17:34 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE
handling
In-Reply-To:
Message-ID: <201003221817.o2MIHYpm030968@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2948
crosvera at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #1464 is|0 |1
obsolete| |
------- Comment #6 from crosvera at gmail.com 2010-03-22 14:17 EST -------
Created an attachment (id=1465)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1465&action=view)
new proposed patch for bug2948
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From eric.talevich at gmail.com Mon Mar 22 16:28:21 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 22 Mar 2010 16:28:21 -0400
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com>
References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com>
Message-ID: <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com>
On Mon, Mar 22, 2010 at 12:21 PM, Peter wrote:
> Hi Eric,
>
> I've got a real example of a simple tree manipulation that I would like to
> handle via your new module. I have a (small) unrooted tree from a gene
> family in Newick format, which by construction includes an out-group
> (the same gene but from a more distant organism). I would like to reroot
> the tree so that this out-group is at the basal level.
>
> Can Bio.Phylo help me here?
>
In Bio.Nexus, would you normally have handled this with the method
root_with_outgroup? I intend to port that method to Bio.Phylo once I
understand it, but the existing code has been kind of hard for me to figure
out.
Let's address it here, then. Is there a detailed plain-text description
somewhere of how this operation should work in general?
Given that the outgroup taxon is already somewhere inside the existing
unrooted tree, I would guess something like:
0. Load the tree:
tree = Phylo.read('example.nwk', 'newick')
1. Locate the outgroup in the tree, remembering the lineage for future
operations:
outgroup_path = tree.get_path({'name': 'OUTGROUP'}) # or however you can identify it
2. Tracing the outgroup lineage backwards, reattach the subclades to new
locations under a new root (or the old root, repurposed). Picturing the
unrooted tree as an arbitrarily rooted tree, invert everything above the
outgroup in the tree, but keep the descendants of those clades as they are:
# Untested, hardly even thought through, danger danger!
root = tree.root
old_clades = root.clades # needed?
root.clades = []
new_parent = root
last = outgroup_path[-1]
for parent in outgroup_path[-2::-1]:
    siblings = [kid for kid in parent.clades if kid != last]
    new_parent.clades = # TODO
    new_parent = last
    last = parent
tree.rooted = True
Bio.Phylo does no internal bookkeeping, so it's OK (i.e. sometimes required)
to shuffle clades directly.
Is this what "root with outgroup" is supposed to do? What functionality in
Bio.Nexus.Trees.root_with_outgroup is missing here? And, do you happen to
have an example of a tree with edge cases that I could use for testing?
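For discussion, one possible reading of "root with outgroup" as a self-contained sketch. It uses a minimal hypothetical Clade class rather than Bio.Phylo's own classes, ignores branch lengths and support values, and is not a port of Bio.Nexus.Trees.root_with_outgroup.

```python
class Clade(object):
    """Minimal tree node; a hypothetical stand-in, not Bio.Phylo's class."""
    def __init__(self, name=None, clades=None):
        self.name = name
        self.clades = list(clades or [])

def get_path(root, name):
    """List the clades from below root down to the clade with this name."""
    for child in root.clades:
        if child.name == name:
            return [child]
        below = get_path(child, name)
        if below:
            return [child] + below
    return []

def root_with_outgroup(root, name):
    """Return a new root with the named outgroup as one of its children."""
    path = [root] + get_path(root, name)
    if len(path) < 2:
        raise ValueError("outgroup %r not found" % name)
    outgroup = path[-1]
    new_root = Clade(clades=[outgroup])
    attach, grandparent, child = new_root, None, outgroup
    # Walk from the outgroup's parent up to the old root, turning each
    # former parent into a child of the node that used to be below it.
    for parent in path[-2::-1]:
        parent.clades = [kid for kid in parent.clades if kid is not child]
        attach.clades.append(parent)
        grandparent, attach, child = attach, parent, parent
    # The old root may be left with a single child; splice it out if so.
    if len(attach.clades) == 1:
        grandparent.clades[-1] = attach.clades[0]
    return new_root

def to_newick(clade):
    """Serialise the topology only (no branch lengths in this sketch)."""
    if not clade.clades:
        return clade.name
    return "(" + ",".join(to_newick(kid) for kid in clade.clades) + ")"

# Unrooted tree (A,B,(C,OUT)) stored with an arbitrary trifurcating root.
tree = Clade(clades=[Clade("A"), Clade("B"),
                     Clade(clades=[Clade("C"), Clade("OUT")])])
rerooted = root_with_outgroup(tree, "OUT")
print(to_newick(rerooted))
```

A real implementation would also have to decide how to split the outgroup's branch length between the two children of the new root, which is probably where the edge cases are.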
> P.S. Why is Bio.Phylo.trim_str a public method?
>
Oops, I'll fix it.
Thanks,
Eric
From biopython at maubp.freeserve.co.uk Mon Mar 22 17:48:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 21:48:31 +0000
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com>
References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com>
<3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com>
Message-ID: <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com>
On Mon, Mar 22, 2010 at 8:28 PM, Eric Talevich wrote:
> On Mon, Mar 22, 2010 at 12:21 PM, Peter wrote:
>
>> Hi Eric,
>>
>> I've got a real example of a simple tree manipulation that I would like to
>> handle via your new module. I have a (small) unrooted tree from a gene
>> family in Newick format, which by construction includes an out-group
>> (the same gene but from a more distant organism). I would like to reroot
>> the tree so that this out-group is at the basal level.
>>
>> Can Bio.Phylo help me here?
>>
>
> In Bio.Nexus, would you normally have handled this with the method
> root_with_outgroup? I intend to port that method to Bio.Phylo once I
> understand it, but the existing code has been kind of hard for me to figure
> out.
>
> Let's address it here, then. Is there a detailed plain-text description
> somewhere of how this operation should work in general?
I've just got a quick answer for you now tonight: I've not used Bio.Nexus
to try and do this - I'll try to get back to you in more depth tomorrow.
Thanks,
Peter
From p.j.a.cock at googlemail.com Tue Mar 23 07:50:24 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 23 Mar 2010 11:50:24 +0000
Subject: [Biopython-dev] pylint, was: Changes to the main repo
In-Reply-To: <320fb6e01003220219u5f2020e1v6826a4e331ceb96d@mail.gmail.com>
References: <320fb6e01003220219u5f2020e1v6826a4e331ceb96d@mail.gmail.com>
Message-ID: <320fb6e01003230450h502adce0p27080d3a00ddda23@mail.gmail.com>
2010/3/22 Peter Cock :
> 2010/3/21 Tiago Antão :
>> Hi,
>>
>> I've made some changes in the main repository (my first changes with
>> github), some comments:
>> 1. Many thanks for the GitUsage wiki page. REALLY useful.
>> 2. That being said, if I made any mistakes, they are my own fault.
>> 3. I've added support for big genepop files, something I tend to be
>> asked quite a lot
>> 4. And support for haploid data (nobody really asked this)
>> 5. I remember Peter sending an email about needed corrections to the
>> code. I am afraid I've lost that email :( . If you send it to me, I
>> will do them ASAP
>> 6. New test cases and test data files
>> 7. I might add support, in the future, to Arlequin (file format and
>> application). Allowing for statistics over sequences and other goodies
>> with sequence data.
>>
>> Regards,
>> Tiago
>
> Hi Tiago,
>
> That sounds good. Regarding point 5, running pylint over the
> code reported some possible errors in Bio.PopGen. Have a
> look at this - they are all undefined variable issues:
> http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007354.html
>
> I just ran it again on the latest code, and the line numbers have
> changed a tiny bit but that is all:
>
> $ pylint --disable-msg-cat=CRW --include-ids=y
> --disable-msg=E1101,E1103,E0102 -r n Bio.PopGen
> No config file found, using default configuration
> ************* Module Bio.PopGen.Async
> E0602: 78:Async.get_result: Undefined variable 'done'
> E0602: 79:Async.get_result: Undefined variable 'done'
> ************* Module Bio.PopGen.GenePop
> E0602:166:Record.split_in_pops: Undefined variable 'GenePop'
> E0602:183:Record.split_in_loci: Undefined variable 'GenePop'
> ************* Module Bio.PopGen.GenePop.Controller
> E0602: 41:_read_allele_freq_table: Undefined variable 'self'
> E0602:133:_hw_func: Undefined variable 'self'
> E0602:393:GenePopController.test_pop_hw_prob: Undefined variable 'ext'
> E0602:458:GenePopController.test_ld.ld_pop_func: Undefined variable
> 'currrent_pop'
> ************* Module Bio.PopGen.GenePop.FileParser
> E1120:219:FileRecord.remove_locus_by_name: No value passed for
> parameter 'fw' in function call
> ************* Module Bio.PopGen.SimCoal.Cache
> E0602: 79:SimCoalCache.getSimulation: Undefined variable 'Config'
> E0602: 88: Undefined variable 'Cache'
> ************* Module Bio.PopGen.SimCoal.Controller
> E0602: 47:SimCoalController.run_simcoal: Undefined variable 'Config'
>
> Peter
>
Hi Tiago,
This is looking much better after your fixes last night - just one left:
$ pylint --disable-msg-cat=CRW --include-ids=y
--disable-msg=E1101,E1103,E0102 -r n Bio.PopGen
No config file found, using default configuration
************* Module Bio.PopGen.GenePop.Controller
E0602: 41:_read_allele_freq_table: Undefined variable 'self'
Note if I turn off those particular error messages which in other
situations I had tentatively tagged as false positives, there could
be a few more issues:
$ pylint --disable-msg-cat=CRW --include-ids=y -r n Bio.PopGen
No config file found, using default configuration
************* Module Bio.PopGen.Async
E1101: 59:Async.run_program: Instance of 'Async' has no '_run_program' member
************* Module Bio.PopGen.GenePop.Controller
E0602: 41:_read_allele_freq_table: Undefined variable 'self'
************* Module Bio.PopGen.GenePop.EasyController
E1101: 33:EasyController.get_basic_info: Module 'Bio.PopGen.GenePop'
has no 'parse' member
E1101: 43:EasyController.test_hw_pop: Instance of 'GenePopController'
has no 'test_pop_hz_prob' member
************* Module Bio.PopGen.GenePop.FileParser
E1101:197:FileRecord.remove_population: Instance of 'FileRecord' has
no 'populations' member
E1101:206:FileRecord.remove_locus_by_position: Instance of
'FileRecord' has no 'populations' member
Some of these may be harmless, for example the Async class has a
run_program method which calls _run_program, which you expect to
be implemented in any subclass. You could add a dummy method to
show the expected arguments and just raise a NotImplementedError
exception with a comment that the subclass should implement it. e.g.
    def _run_program(self, program, parameters, input_files):
        """Actually run the program, handled by a subclass (PRIVATE).

        This method should be replaced by any derived class to do
        something useful. It will be called by the run_program method.
        """
        raise NotImplementedError("This object should be subclassed")
That particular change is probably worth doing anyway from a code
clarity point of view.
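A minimal sketch of how a subclass would fill in that placeholder. The
class names AsyncBase and EchoAsync are hypothetical stand-ins, not the
actual Bio.PopGen.Async code, and the real run_program may do more
bookkeeping than simply delegating:

```python
class AsyncBase(object):
    """Hypothetical stand-in for a base class such as Async."""

    def run_program(self, program, parameters, input_files):
        """Public entry point; delegates the real work to the subclass."""
        return self._run_program(program, parameters, input_files)

    def _run_program(self, program, parameters, input_files):
        """Actually run the program, handled by a subclass (PRIVATE)."""
        raise NotImplementedError("This object should be subclassed")


class EchoAsync(AsyncBase):
    """Hypothetical subclass that only reports what it was asked to run."""

    def _run_program(self, program, parameters, input_files):
        # A real subclass would launch the program here.
        return "%s %s" % (program, " ".join(parameters))
```

Calling run_program on the base class then fails loudly with
NotImplementedError instead of an AttributeError, which also gives
pylint a member to find.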
Peter
From biopython at maubp.freeserve.co.uk Tue Mar 23 11:26:56 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 23 Mar 2010 15:26:56 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e01003181204l5902cf37yc0cf9387b74994fd@mail.gmail.com>
References:
<200911251220.53881.jblanca@btc.upv.es>
<320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
<3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
<320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>
<3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com>
<320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com>
<320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com>
<320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com>
<320fb6e01003181204l5902cf37yc0cf9387b74994fd@mail.gmail.com>
Message-ID: <320fb6e01003230826r6080746el3327f05079f2651a@mail.gmail.com>
On Thu, Mar 18, 2010 at 7:04 PM, Peter wrote:
>
> I've done that to Bio/Seq.py on the trunk (added two
> FutureWarnings and docstring explanation). Assuming
> this doesn't trigger any regressions, we'd need to work
> on the documentation (in particular the tutorial, but also
> perhaps a news post?) and fix the GA unit test before
> the release.
>
I've fixed the GA unit tests, generally by explicit use
of string comparison when working with sequence
objects.
In the case of test_GAQueens.py, this required me
to "correct" the "abuse" of the alphabet object (letters
was a list of integers, not a string) and thus indirectly
the way the MutableSeq was being created. This has
always struck me as a very odd example - but should
perhaps be kept in mind for more complex sequence
like objects (e.g. sequences with 3-letter protein codes).
Peter
From tiagoantao at gmail.com Wed Mar 24 06:39:09 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 24 Mar 2010 10:39:09 +0000
Subject: [Biopython-dev] Spam on wiki
Message-ID: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
Hi,
I think we are being attacked, spam wise. The popgen_dev page was full
with external links.
I am clearing that page, but others might have the same problem. Maybe
there is some way to automate the deletion of contributions from spam
authors?
Tiago
--
"If you want to get laid, go to college. If you want an education, go
to the library." - Frank Zappa
From tiagoantao at gmail.com Wed Mar 24 06:47:43 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 24 Mar 2010 10:47:43 +0000
Subject: [Biopython-dev] Spam on wiki
In-Reply-To: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
References: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
Message-ID: <6d941f121003240347v1ff3f9d4uf46d6a6e4f4254ff@mail.gmail.com>
2010/3/24 Tiago Antão :
> I think we are being attacked, spam wise. The popgen_dev page was full
> with external links.
> I am clearing that page, but others might have the same problem. Maybe
> there is some way to automate the deletion of contributions from spam
> authors?
I am clearing this
http://www.biopython.org/wiki/Special:Contributions/Wiki0808
by hand, not much.
From biopython at maubp.freeserve.co.uk Wed Mar 24 06:47:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 10:47:43 +0000
Subject: [Biopython-dev] Spam on wiki
In-Reply-To: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
References: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
Message-ID: <320fb6e01003240347x5d10a1f3sa4c2c84fa9edcfbe@mail.gmail.com>
2010/3/24 Tiago Antão :
> Hi,
>
> I think we are being attacked, spam wise. The popgen_dev page was full
> with external links.
> I am clearing that page, but others might have the same problem. Maybe
> there is some way to automate the deletion of contributions from spam
> authors?
>
> Tiago
Hi,
I'm subscribed to the wiki RSS feed, but this happened overnight
so I hadn't seen it yet. This seems to happen about once a month
or so - I haven't noticed a big rise in attacks or anything. This guy
did about ten pages - normally only one or two get abused.
Dealing with it is fairly easy - you click on the page history, and
rollback to the last good page, and ban the user. Tell me your wiki
username and I should be able to give you the rights needed to
ban people.
Peter
From biopython at maubp.freeserve.co.uk Wed Mar 24 07:13:12 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 11:13:12 +0000
Subject: [Biopython-dev] Spam on wiki
In-Reply-To: <6d941f121003240347v1ff3f9d4uf46d6a6e4f4254ff@mail.gmail.com>
References: <6d941f121003240339h19e822ebp5c41451dd2c4a07a@mail.gmail.com>
<6d941f121003240347v1ff3f9d4uf46d6a6e4f4254ff@mail.gmail.com>
Message-ID: <320fb6e01003240413g27f3d87dp4762bd9f8c32befe@mail.gmail.com>
2010/3/24 Tiago Antão :
> 2010/3/24 Tiago Antão :
>> I think we are being attacked, spam wise. The popgen_dev page was full
>> with external links.
>> I am clearing that page, but others might have the same problem. Maybe
>> there is some way to automate the deletion of contributions from spam
>> authors?
>
> I am clearing this
> http://www.biopython.org/wiki/Special:Contributions/Wiki0808
> by hand, not much.
>
The "rollback" link on the page history is the simplest route
(you are now an administrator on the wiki so should be able
to do this, and ban spammers). I don't know if there is a
shortcut to revert all a user's recent changes.
I think between us we have fixed all the pages now.
Thanks,
Peter
From peter at maubp.freeserve.co.uk Wed Mar 24 10:08:26 2010
From: peter at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 14:08:26 +0000
Subject: [Biopython-dev] Fwd: [Utilities-announce] NCBI Revised E-utility
Usage Policy
In-Reply-To:
References:
Message-ID: <320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com>
Hi,
This is probably of interest to all the Bio* projects offering access
to the NCBI Entrez utilities. See forwarded message below.
I *think* the new guidelines basically say that the email & tool parameters are
optional BUT if your IP address ever gets banned for excessive use you then
have to register an email & tool combination.
Regarding the email address, the NCBI say to use the email of the developer
(not the end user). However, they do not distinguish between the developers
of a library (like us), and the developers of an application or script using a
library (who may also be the end user).
Currently we (Biopython) and I think BioPerl ask developers using our libraries
to populate the email address themselves. I *think* this is still the right
action.
Peter
---------- Forwarded message ----------
From:
Date: Wed, Mar 24, 2010 at 1:53 PM
Subject: [Utilities-announce] NCBI Revised E-utility Usage Policy
To: NLM/NCBI List utilities-announce
New E-utility documentation now on the NCBI Bookshelf
The Entrez Programming Utilities (E-Utilities) Help documentation has
been added to the NCBI Bookshelf, and so is now fully integrated with
the Entrez search and retrieval system as a part of the Bookshelf
database. This help document has been divided into chapters for better
organization and includes several new sample Perl scripts. At present
this book covers the standard URL interface for the E-utilities;
material about the SOAP interface will be added soon and is still
available at the same URL:
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html.
Revised E-utility usage policy
In December, 2009 NCBI announced a change to the usage policy for the
E-utilities that would require all requests to contain non-null values
for both the &email and &tool parameters. After several consultations
with our users and developers, we have decided to revise this policy
change, and the revised policy is described in detail at the following
link:
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpeutils&part=chapter2#chapter2.Usage_Guidelines_and_Requiremen
Please let us know if you have any questions or concerns about this
policy change.
Thank you,
The E-Utilities Team
NIH/NLM/NCBI
eutilities at ncbi.nlm.nih.gov.
_______________________________________________
Utilities-announce mailing list
http://www.ncbi.nlm.nih.gov/mailman/listinfo/utilities-announce
From biopython at maubp.freeserve.co.uk Wed Mar 24 10:51:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 14:51:46 +0000
Subject: [Biopython-dev] [Bioperl-l] Fwd: [Utilities-announce] NCBI
Revised E-utility Usage Policy
In-Reply-To: <38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu>
References:
<320fb6e01003240708o48eeb30eq3b09110dcc2d1873@mail.gmail.com>
<38D43B03-4A85-48CB-913A-CD564EB5168C@illinois.edu>
Message-ID: <320fb6e01003240751v2afd5d5bwa39590afa9b13209@mail.gmail.com>
On Wed, Mar 24, 2010 at 2:37 PM, Chris Fields wrote:
>
> On Mar 24, 2010, at 9:08 AM, Peter wrote:
>
>> Hi,
>>
>> This is probably of interest to all the Bio* projects offering access
>> to the NCBI Entrez utilities. See forwarded message below.
>>
>> I *think* the new guidelines basically say that the email & tool parameters are
>> optional BUT if your IP address ever gets banned for excessive use you then
>> have to register an email & tool combination.
>>
>> Regarding the email address, the NCBI say to use the email of the developer
>> (not the end user). However, they do not distinguish between the developers
>> of a library (like us), and the developers of an application or script using a
>> library (who may also be the end user).
>>
>> Currently we (Biopython) and I think BioPerl ask developers using our libraries
>> to populate the email address themselves. I *think* this is still the
>> right action.
>>
>> Peter
>
>
> Basically, that's the same tactic I'm going with with Bio::DB::EUtilities (and I
> think with the SOAP-based ones as well). We're providing a specific set of
> tools for users to write up their own end applications. I can try
> contacting them regarding this to get an official response to clarify this
> somewhat.
Please give the NCBI an email - you can CC me too if you like.
> Re: the tool parameter, we currently set the tool itself to 'BioPerl' as a
> default, but always leave the email blank and issue a warning if it isn't
> set. We could just as easily leave both blank and issue warnings for both.
We currently leave out the email and set the tool parameter to "Biopython"
by default but this can be overridden. Currently leaving out the email does
cause Biopython to give a warning.
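A minimal sketch of that default behaviour, written for Python 3. The
helper name build_query and its signature are hypothetical and only
illustrate the idea; this is not the actual Bio.Entrez code:

```python
import warnings
from urllib.parse import urlencode


def build_query(params, email=None, tool="Biopython"):
    """Hypothetical helper: add tool/email defaults to E-utility parameters."""
    # Drop parameters that were left unset.
    query = dict((key, value) for key, value in params.items()
                 if value is not None)
    # The tool name has a default, but the caller can override it.
    query.setdefault("tool", tool)
    if email is not None:
        query.setdefault("email", email)
    elif "email" not in query:
        # No email is ever filled in automatically; only warn about it.
        warnings.warn("No email address given for the NCBI E-utilities.",
                      UserWarning)
    # Sort so that the resulting query string is reproducible.
    return urlencode(sorted(query.items()))
```

For example, build_query({"db": "pubmed"}, email="a@example.org") gives
"db=pubmed&email=a%40example.org&tool=Biopython", while omitting the
email issues the warning and leaves that parameter out.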
Peter
From biopython at maubp.freeserve.co.uk Wed Mar 24 11:16:51 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 15:16:51 +0000
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com>
References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com>
<3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com>
<320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com>
Message-ID: <320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com>
On Mon, Mar 22, 2010 at 9:48 PM, Peter wrote:
>> In Bio.Nexus, would you normally have handled this with the method
>> root_with_outgroup? I intend to port that method to Bio.Phylo once I
>> understand it, but the existing code has been kind of hard for me to figure
>> out.
>
> I've just got a quick answer for you now tonight: I've not used Bio.Nexus
> to try and do this - I'll try to get back to you in more depth tomorrow.
Here is an example using Bio.Nexus.Trees to reroot with an outgroup.
#I have encoded the tree here as a string:
tree_string = """((gi|6273291|gb|AF191665.1|AF191:0.00418,
(gi|6273290|gb|AF191664.1|AF191:0.00189,
gi|6273289|gb|AF191663.1|AF191:0.00145)
:0.00083):0.00770,
(gi|6273287|gb|AF191661.1|AF191:0.00489,
gi|6273286|gb|AF191660.1|AF191:0.00295)
:0.00014, (gi|6273285|gb|AF191659.1|AF191:0.00094,
gi|6273284|gb|AF191658.1|AF191:0.00018)
:0.00125);"""
from Bio.Nexus import Trees
tree = Trees.Tree(tree_string)
print "Old"
print tree
print tree.display()
print
print "New"
#This acts in situ:
tree.root_with_outgroup(["gi|6273289|gb|AF191663.1|AF191"])
print tree
print tree.display()
Old
tree a_tree = ((gi|6273291|gb|AF191665.1|AF191,(gi|6273290|gb|AF191664.1|AF191,gi|6273289|gb|AF191663.1|AF191)),(gi|6273287|gb|AF191661.1|AF191,gi|6273286|gb|AF191660.1|AF191),(gi|6273285|gb|AF191659.1|AF191,gi|6273284|gb|AF191658.1|AF191));
#   taxon                           prev  succ       brlen    blen (sum)  support  comment
0   -                               None  [1, 6, 9]  0.0      0.0         -        -
1   -                               0     [2, 3]     0.0077   0.0077      -        -
2   gi|6273291|gb|AF191665.1|AF191  1     []         0.00418  0.01188     -        -
3   -                               1     [4, 5]     0.00083  0.00853     -        -
4   gi|6273290|gb|AF191664.1|AF191  3     []         0.00189  0.01042     -        -
5   gi|6273289|gb|AF191663.1|AF191  3     []         0.00145  0.00998     -        -
6   -                               0     [7, 8]     0.00014  0.00014     -        -
7   gi|6273287|gb|AF191661.1|AF191  6     []         0.00489  0.00503     -        -
8   gi|6273286|gb|AF191660.1|AF191  6     []         0.00295  0.00309     -        -
9   -                               0     [10, 11]   0.00125  0.00125     -        -
10  gi|6273285|gb|AF191659.1|AF191  9     []         0.00094  0.00219     -        -
11  gi|6273284|gb|AF191658.1|AF191  9     []         0.00018  0.00143     -        -
Root: 0
None
New
tree a_tree = (((((gi|6273287|gb|AF191661.1|AF191,gi|6273286|gb|AF191660.1|AF191),(gi|6273285|gb|AF191659.1|AF191,gi|6273284|gb|AF191658.1|AF191)),gi|6273291|gb|AF191665.1|AF191),gi|6273290|gb|AF191664.1|AF191),gi|6273289|gb|AF191663.1|AF191);
#   taxon                           prev  succ      brlen    blen (sum)  support  comment
0   -                               1     [6, 9]    0.0077   0.00998     -        -
1   -                               3     [0, 2]    0.00083  0.00228     -        -
2   gi|6273291|gb|AF191665.1|AF191  1     []        0.00418  0.00646     -        -
3   -                               12    [1, 4]    0.00145  0.00145     -        -
4   gi|6273290|gb|AF191664.1|AF191  3     []        0.00189  0.00334     -        -
5   gi|6273289|gb|AF191663.1|AF191  12    []        0.0      0.0         0.0      -
6   -                               0     [7, 8]    0.00014  0.01012     -        -
7   gi|6273287|gb|AF191661.1|AF191  6     []        0.00489  0.01501     -        -
8   gi|6273286|gb|AF191660.1|AF191  6     []        0.00295  0.01307     -        -
9   -                               0     [10, 11]  0.00125  0.01123     -        -
10  gi|6273285|gb|AF191659.1|AF191  9     []        0.00094  0.01217     -        -
11  gi|6273284|gb|AF191658.1|AF191  9     []        0.00018  0.01141     -        -
12  -                               None  [3, 5]    0.0      0.0         -        -
Root: 12
None
Here the root_with_outgroup method acts in situ, and returns the new
root ID number (not applicable to Bio.Phylo). The outgroup argument
seems to be a list of taxon names (here just one).
In my example, the outgroup originally has a branch length of 0.00145.
A new root node was created (here #12) with two children, one with a
branch length of zero (#5, the outgroup) and one with the full length
(#3, branch length 0.00145). Essentially this new root node (#12) and
the outgroup (#5) are now both right at the base of the tree.
There is more than one way to do this though. For example FigTree
seems to introduce a new root node half way along the outgroup branch
(replacing the edge with two edges of half its length). This way the
new root node represents the last common ancestor of the outgroup and
the ingroup (everything else), although putting it at the mid point is
perhaps a little arbitrary.
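The operation described above can be sketched on a bare-bones tree. This
is only an illustration under stated assumptions: the Clade class, the
path_to helper and the outgroup_fraction argument are all hypothetical
names (not Bio.Nexus or Bio.Phylo API), and a former root left with a
single child is not collapsed. An outgroup_fraction of 0.0 reproduces
the branch lengths in the Bio.Nexus output above, and 0.5 gives the
midpoint placement that FigTree seems to use.

```python
class Clade(object):
    """Minimal stand-in for a tree node (hypothetical, for illustration)."""

    def __init__(self, name=None, branch_length=0.0, clades=None):
        self.name = name
        self.branch_length = branch_length
        self.clades = list(clades or [])


def path_to(root, target):
    """Return the clades from root down to target (inclusive), or None."""
    if root is target:
        return [root]
    for child in root.clades:
        below = path_to(child, target)
        if below is not None:
            return [root] + below
    return None


def root_with_outgroup(root, outgroup, outgroup_fraction=0.0):
    """Reroot on the branch above outgroup and return the new root clade.

    outgroup_fraction is the share of the original outgroup branch that
    stays with the outgroup; the remainder goes to the ingroup side.
    """
    path = path_to(root, outgroup)
    if path is None:
        raise ValueError("outgroup is not in this tree")
    if len(path) == 1:
        return root  # the outgroup already is the root
    total = outgroup.branch_length
    outgroup.branch_length = total * outgroup_fraction
    carried = total - outgroup.branch_length
    ancestors = path[-2::-1]  # parent, grandparent, ..., old root
    below = outgroup
    for node in ancestors:
        # Detach the child we arrived from; it will become this node's parent.
        node.clades = [kid for kid in node.clades if kid is not below]
        # A reversed edge keeps its length: this node takes over the length
        # that used to be stored on the child below it.
        carried, node.branch_length = node.branch_length, carried
        below = node
    # Re-attach each former parent underneath its former child.
    for parent, child in zip(ancestors, ancestors[1:]):
        parent.clades.append(child)
    return Clade(clades=[outgroup, ancestors[0]])


# The example tree from this message, with shortened taxon names.
t665, t664, t663 = Clade("665", 0.00418), Clade("664", 0.00189), Clade("663", 0.00145)
n3 = Clade(None, 0.00083, [t664, t663])
n1 = Clade(None, 0.0077, [t665, n3])
n6 = Clade(None, 0.00014, [Clade("661", 0.00489), Clade("660", 0.00295)])
n9 = Clade(None, 0.00125, [Clade("659", 0.00094), Clade("658", 0.00018)])
n0 = Clade(None, 0.0, [n1, n6, n9])
new_root = root_with_outgroup(n0, t663)
```

After this call the outgroup has length 0.0 and the former nodes 3, 1 and
0 carry 0.00145, 0.00083 and 0.0077 respectively, matching the brlen
column of the "New" table.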
Peter
From nicolas.rapin at bric.ku.dk Thu Mar 25 07:58:53 2010
From: nicolas.rapin at bric.ku.dk (Nicolas Rapin)
Date: Thu, 25 Mar 2010 12:58:53 +0100
Subject: [Biopython-dev] GEO database and bio-python
Message-ID:
Dear all,
I just started with Python, and have been using Biopython quite a lot lately. It's a nice package, and is very convenient. Oh, and I'm also new on the mailing list...
I need to get access to a lot of data from GEO, and I noticed that it might be a good idea to have the database locally, which led me to write a little class that can download the compressed files from NCBI (the GSE/GPLxxx_family.tgz files), and parse the MINiML sort of XML they have in there together with the actual data that is in the compressed files. In the end I have a nicely organized HDF5 file that I can use to do data mining.
I wondered if that was for Biopython.
If yes, how do I contribute ?
best,
Nico
From biopython at maubp.freeserve.co.uk Thu Mar 25 08:22:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 25 Mar 2010 12:22:40 +0000
Subject: [Biopython-dev] GEO database and bio-python
In-Reply-To:
References:
Message-ID: <320fb6e01003250522l3c730081y143cc4799f038754@mail.gmail.com>
On Thu, Mar 25, 2010 at 11:58 AM, Nicolas Rapin
wrote:
> Dear all,
>
> I just started with Python, and have been using Biopython quite a lot lately.
> It's a nice package, and is very convenient. Oh, and I'm also new on the
> mailing list...
Great, and welcome :)
> I need to get access to a lot of data from GEO, and I noticed that it might be
> a good idea to have the database locally, which led me to write a little class
> that can download the compressed files from NCBI (the GSE/GPLxxx_family.tgz
> files), and parse the MINiML sort of XML they have in there together with the
> actual data that is in the compressed files. In the end I have a nicely organized
> HDF5 file that I can use to do data mining.
Have you looked at the existing Bio.GEO module? It hasn't got an active
maintainer at the moment, and in some ways is rather simplistic. I found that
Sean Davis' GEOquery package for R/Bioconductor was much more
complete.
> I wondered if that was for Biopython.
This sounds like a useful addition.
> If yes, how do I contribute ?
First of all we use the public mailing lists to discuss things. In
terms of code,
starting a branch on github would let you show us what you are working on
and makes it easier to eventually merge things. See
http://biopython.org/wiki/GitUsage
Peter
From sdavis2 at mail.nih.gov Thu Mar 25 08:29:52 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 25 Mar 2010 08:29:52 -0400
Subject: [Biopython-dev] GEO database and bio-python
In-Reply-To:
References:
Message-ID: <264855a01003250529p7d2290f3qc441228c34f5e720@mail.gmail.com>
On Thu, Mar 25, 2010 at 7:58 AM, Nicolas Rapin wrote:
> Dear all,
>
> I just started with Python, and have been using Biopython quite a lot lately. It's a nice package, and is very convenient. Oh, and I'm also new on the mailing list...
>
> I need to get access to a lot of data from GEO, and I noticed that it might be a good idea to have the database locally, which led me to write a little class that can download the compressed files from NCBI (the GSE/GPLxxx_family.tgz files), and parse the MINiML sort of XML they have in there together with the actual data that is in the compressed files. In the end I have a nicely organized HDF5 file that I can use to do data mining.
>
> I wondered if that was for Biopython.
Hi, Nico.
Not a direct answer to your question, but have a look at the
Bioconductor package GEOmetadb. (There is also an online version.)
We have parsed all of GEO metadata into a SQLite database and made it
available within R. However, the SQLite database can be used
standalone and python has built in support for SQLite, as of late.
http://gbnci.abcc.ncifcrf.gov/geo/
http://gbnci.abcc.ncifcrf.gov/geo/GEOmetadb.sqlite.gz
http://watson.nci.nih.gov/bioc_mirror/packages/2.6/bioc/html/GEOmetadb.html
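As a rough sketch of the standalone route, Python's built-in sqlite3
module can open such a file and list its tables without assuming
anything about the GEOmetadb schema. The helper name list_tables is
hypothetical, and the demo below uses an in-memory database rather than
the real download:

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in an SQLite database, sorted."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [row[0] for row in rows]


# Demo on a throwaway in-memory database with one made-up table.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_table (id INTEGER)")
demo_tables = list_tables(demo)
```

With the decompressed file in the working directory, the same helper
would be called as list_tables(sqlite3.connect("GEOmetadb.sqlite")).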
Also, as for the data, if you are inclined to use R for anything (or
rpy2), the GEOquery package can download and parse all the record
types in GEO into objects within R and the number of tools for data
analysis of microarray data in R/Bioconductor is enormous.
http://watson.nci.nih.gov/bioc_mirror/packages/2.6/bioc/html/GEOquery.html
Sorry for the advertisement-like email....
Sean
> If yes, how do I contribute ?
>
>
> best,
>
> Nico
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
From biopython at maubp.freeserve.co.uk Thu Mar 25 12:25:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 25 Mar 2010 16:25:01 +0000
Subject: [Biopython-dev] NCBI E-utility 100 requests rule in Bio.Entrez?
Message-ID: <320fb6e01003250925h76fc91d3r541092eb540af112@mail.gmail.com>
Hi all,
The NCBI recently announced revised guidelines for the Entrez
utilities, which we've started discussing on the OBF mailing list:
http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007499.html
http://lists.open-bio.org/pipermail/open-bio-l/2010-March/000623.html
As part of this I decided to look at the peak hour rules:
http://lists.open-bio.org/pipermail/open-bio-l/2010-March/000644.html
The old guideline was:
http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements
"Run retrieval scripts on weekends or between 9 pm and 5 am
Eastern Time weekdays for any series of more than 100 requests."
This doesn't define a series - for example, would it be OK to run
a script making 75 requests every two hours? This could be regarded
as multiple separate series each under 100 requests, but the
cumulative count over the 8 peak hours is 600 requests.
Sadly the new guidelines are even more vague:
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpeutils&part=chapter2#chapter2.Usage_Guidelines_and_Requiremen
"... and limit large jobs to either weekends or between 9:00 PM
and 5:00 AM Eastern time during weekdays."
Not very helpful.
Also neither version raises the issue of summer/winter time
(daylight savings times) but simply gives Eastern Time (EST).
While we may get clarification from the NCBI, the following patch
to Bio.Entrez may be worth considering. It simply counts the
number of Entrez requests during peak hours, and issues a
warning if this exceeds 100 (based on a strict interpretation of
the older guidelines).
Does this seem worth checking in, or should we try to get some
clarification from the NCBI first?
Peter
diff --git a/Bio/Entrez/__init__.py b/Bio/Entrez/__init__.py
index 33d8d14..f670354 100644
--- a/Bio/Entrez/__init__.py
+++ b/Bio/Entrez/__init__.py
@@ -285,6 +285,26 @@ def _open(cgi, params={}, post=False):
         _open.previous = current + wait
     else:
         _open.previous = current
+
+    # Max 100 requests from 09:00 to 17:00 Eastern Time (EST), which is
+    # 5 hours behind Coordinated Universal Time (UTC) aka Greenwich
+    # Mean Time (GMT), thus 14:00 to 22:00 UTC/GMT. The NCBI don't
+    # mention summer/winter time (daylight saving time), so ignore that.
+    if 14 <= time.gmtime(current).tm_hour < 22 \
+    and time.gmtime(current).tm_wday <= 4:
+        # Peak time, weekdays only (Monday = 0, Friday = 4)
+        _open.peak_requests += 1
+        if _open.peak_requests > 100:
+            import warnings
+            warnings.warn("The NCBI request you make at most 100 Entrez "
+                          "requests during the peak time 9AM to 5PM EST "
+                          "(which is 14:00 to 22:00 UTC/GMT). "
+                          "You have exceeded this limit.")
+    else:
+        # Off peak
+        # Reset the counter (in case this is a long running script)
+        _open.peak_requests = 0
+
     # Remove None values from the parameters
     for key, value in params.items():
         if value is None:
@@ -368,3 +388,4 @@ E-utilities.""", UserWarning)
     return uhandle
_open.previous = 0
+_open.peak_requests = 0
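
The counting rule in this patch can also be sketched as standalone code,
which makes it easier to test without touching Bio.Entrez. The names
is_peak_time and PeakRequestCounter are hypothetical. As in the patch,
peak time is taken as 14:00 to 22:00 UTC with daylight saving ignored,
and weekdays are tm_wday 0 (Monday) to 4 (Friday):

```python
import calendar
import time


def is_peak_time(timestamp):
    """True if the timestamp falls in 14:00-22:00 UTC on a weekday."""
    utc = time.gmtime(timestamp)
    return utc.tm_wday <= 4 and 14 <= utc.tm_hour < 22


class PeakRequestCounter(object):
    """Hypothetical counter for requests made during peak hours."""

    def __init__(self, limit=100):
        self.limit = limit
        self.count = 0

    def record(self, timestamp):
        """Count one request; return True once the limit is exceeded."""
        if is_peak_time(timestamp):
            self.count += 1
        else:
            # Off peak: reset, in case this is a long running script.
            self.count = 0
        return self.count > self.limit


# Monday 22 March 2010 at 15:00 UTC, and the following Saturday.
monday_peak = calendar.timegm((2010, 3, 22, 15, 0, 0))
saturday = calendar.timegm((2010, 3, 27, 15, 0, 0))
```

A caller would issue the warning whenever record() returns True; the
reset on the first off-peak request mirrors the else branch of the patch.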
From eric.talevich at gmail.com Thu Mar 25 16:27:23 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 25 Mar 2010 16:27:23 -0400
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com>
References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com>
<3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com>
<320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com>
<320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com>
Message-ID: <3f6baf361003251327o20cdda2bkeac3c3a6a87b468a@mail.gmail.com>
On Wed, Mar 24, 2010 at 11:16 AM, Peter wrote:
> On Mon, Mar 22, 2010 at 9:48 PM, Peter
> wrote:
> >> In Bio.Nexus, would you normally have handled this with the method
> >> root_with_outgroup? I intend to port that method to Bio.Phylo once I
> >> understand it, but the existing code has been kind of hard for me to
> figure
> >> out.
> >
> > I've just got a quick answer for you now tonight: I've not used Bio.Nexus
> > to try and do this - I'll try to get back to you in more depth tomorrow.
>
> Here is an example using Bio.Nexus.Trees to reroot with an outgroup.
>
> [...]
>
> In my example, the outgroup originally has a branch length of 0.00145.
> A new root node was created (here #12) with two children, one with a
> branch length of zero (#5, the outgroup) and one with the full length
> (#3, branch length 0.00145). Essentially this new root node (#12) and
> the outgroup (#5) are now both right at the base of the tree.
>
> There is more than one way to do this though. For example FigTree
> seems to introduce a new root node half way along the outgroup branch
> (replacing the edge with two edges of half its length). This way the
> new root node represents the last common ancestor of the outgroup and
> the ingroup (everything else), although putting it at the mid point is
> perhaps a little arbitrary.
>
> Peter
>
I looked up this section in *Inferring Phylogenies* and found no decisive
statement on how it should be done. I gathered:
1. The new root can be placed anywhere along the branch between the outgroup
and its ancestor.
2. Another way to root a tree is by assuming a molecular clock -- place the
root so that the distances to all the tips are roughly equal.
So FigTree and Bio.Nexus are both doing reasonable things. (PyCogent doesn't
seem to support this operation, as far as I can tell.)
Thinking of this operation as extending the tree further back in time, where
the (monophyletic) tree without the outgroup is a sub-clade of the larger
rooted tree we're introducing -- it makes sense to me that the branch length
of the outgroup should represent the total evolutionary distance from the
root of the monophyletic sub-clade to the outgroup. Based on that, I'm
tempted to do the opposite of Bio.Nexus, letting the outgroup keep its
original branch length, and assigning a length of 0 to the branch leading to
the remaining sub-clade. Then by default we get something resembling a
trifurcating root, and the user can shift the actual location of the root
further back without too much difficulty.
Alternatives:
- Take a hint from the molecular clock, and try to equalize the distance
from the root to the outgroup and the farthest tip of the main subclade.
Problem: in your example the outgroup is not the longest branch, so this
would be equivalent to the version I proposed above. The root->subclade
branch would only be nonzero sometimes, and it might surprise you when that
happens.
- Offer a separate method, root_by_clock, which does the expected thing, and
can be used to determine good branch lengths at the root after the outgroup
operation, if desired.
- Combine: add a keyword argument to root_with_outgroup (like
molecular_clock=False) which triggers Alternative #1.
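A small sketch of what the molecular clock variant (Alternative #1) could
compute. The function name clock_root_split and its arguments are
hypothetical. It places the root on the outgroup branch so that the
outgroup and the farthest ingroup tip are equally distant where
possible, and otherwise falls back to a zero-length ingroup branch:

```python
def clock_root_split(outgroup_length, farthest_ingroup_tip):
    """Split the outgroup branch to equalize root-to-tip distances.

    outgroup_length: length of the branch joining outgroup and ingroup.
    farthest_ingroup_tip: distance from the ingroup end of that branch
        to the farthest ingroup tip.
    Returns (outgroup_branch, ingroup_branch) for the two root branches.
    """
    # Solve ingroup_branch + farthest_ingroup_tip
    #       == outgroup_length - ingroup_branch.
    ingroup_branch = (outgroup_length - farthest_ingroup_tip) / 2.0
    # The root has to stay on the outgroup branch itself.
    ingroup_branch = min(max(ingroup_branch, 0.0), outgroup_length)
    return outgroup_length - ingroup_branch, ingroup_branch
```

With a long outgroup branch, clock_root_split(10.0, 4.0) returns
(7.0, 3.0), so both sides end up 7.0 from the root. In Peter's example
the farthest ingroup tip (0.01356 away) exceeds the outgroup branch
(0.00145), so the result is (0.00145, 0.0), the same as the default
proposed above.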
I'll play with this some more and post an example implementation for you to
review.
Thanks for your help,
Eric
From mjldehoon at yahoo.com Thu Mar 25 21:37:26 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 25 Mar 2010 18:37:26 -0700 (PDT)
Subject: [Biopython-dev] NCBI E-utility 100 requests rule in Bio.Entrez?
In-Reply-To: <320fb6e01003250925h76fc91d3r541092eb540af112@mail.gmail.com>
Message-ID: <996381.46391.qm@web62407.mail.re1.yahoo.com>
I have no objections, but basically I think that this can be left to the responsibility of the end user.
--Michiel.
--- On Thu, 3/25/10, Peter wrote:
> From: Peter
> Subject: NCBI E-utility 100 requests rule in Bio.Entrez?
> To: "Biopython-Dev Mailing List" , "Michiel de Hoon"
> Date: Thursday, March 25, 2010, 12:25 PM
> Hi all,
>
> The NCBI recently announced revised guidelines for the
> Entrez
> utilities, which we've started discussing on the OBF
> mailing list:
> http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007499.html
> http://lists.open-bio.org/pipermail/open-bio-l/2010-March/000623.html
>
> As part of this I decided to look at the peak hour rules:
> http://lists.open-bio.org/pipermail/open-bio-l/2010-March/000644.html
>
> The old guideline was:
>
> http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements
> "Run retrieval scripts on weekends or between 9 pm and 5
> am
> Eastern Time weekdays for any series of more than 100
> requests."
>
> This doesn't define a series - for example, would it be OK
> to run
> a script making 75 requests every two hours? This could be
> regarded
> as multiple separate series each under 100 requests, but
> the
> cumulative count over the 8 peak hours is 600 requests.
>
> Sadly the new guidelines are even more vague:
>
> http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpeutils?=chapter2#chapter2.Usage_Guidelines_and_Requiremen
> "... and limit large jobs to either weekends or between
> 9:00 PM
> and 5:00 AM Eastern time during weekdays."
>
> Not very helpful.
>
> Also neither version raises the issue of summer/winter
> time
> (daylight savings times) but simply gives Eastern Time
> (EST).
>
> While we may get clarification from the NCBI, the following
> patch
> to Bio.Entrez may be worth considering. It simply counts
> the
> number of Entrez requests during peak hours, and issues a
> warning if this exceeds 100 (based on a strict
> interpretation of
> the older guidelines).
>
> Does this seem worth checking in, or should we try to get
> some
> clarification from the NCBI first?
>
> Peter
>
> diff --git a/Bio/Entrez/__init__.py b/Bio/Entrez/__init__.py
> index 33d8d14..f670354 100644
> --- a/Bio/Entrez/__init__.py
> +++ b/Bio/Entrez/__init__.py
> @@ -285,6 +285,26 @@ def _open(cgi, params={}, post=False):
>          _open.previous = current + wait
>      else:
>          _open.previous = current
> +
> +    # Max 100 requests from 09:00 to 17:00 Eastern Time (EST), which is
> +    # 5 hours behind Coordinated Universal Time (UTC) aka Greenwich
> +    # Mean Time (GMT), thus 14:00 to 22:00 UTC/GMT. The NCBI don't
> +    # mention summer/winter time (daylight saving time), so ignore that.
> +    if 14 <= time.gmtime(current).tm_hour < 22 \
> +    and time.gmtime(current).tm_wday <= 5:
> +        # Peak time (Monday = 0, Friday = 5)
> +        _open.peak_requests += 1
> +        if _open.peak_requests > 100:
> +            import warnings
> +            warnings.warn("The NCBI request you make at most 100 Entrez "
> +                          "requests during the peak time 9AM to 5PM EST "
> +                          "(which is 14:00 to 22:00 UTC/GMT). "
> +                          "You have exceeded this limit.")
> +    else:
> +        # Off peak
> +        # Reset the counter (in case this is a long running script)
> +        _open.peak_requests = 0
> +
>      # Remove None values from the parameters
>      for key, value in params.items():
>          if value is None:
> @@ -368,3 +388,4 @@ E-utilities.""", UserWarning)
>      return uhandle
>
>  _open.previous = 0
> +_open.peak_requests = 0
>
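
The counting idea in the quoted patch can be tried outside Bio.Entrez with a standalone sketch like the one below. The names `is_peak` and `PeakCounter` are invented for this illustration and are not Biopython API. It keeps the patch's assumptions (a fixed 14:00 to 22:00 UTC window, no daylight saving time), with one difference: in `time.struct_time`, `tm_wday` runs from Monday = 0 to Sunday = 6, so Monday to Friday is 0 to 4. The window is left as module constants because the old guideline quoted above gives 9 pm to 5 am as the off-peak period, which may imply a wider peak window than 09:00 to 17:00.

```python
import calendar
import time

# Assumed peak window in UTC hours, as in the quoted patch
# (09:00 to 17:00 EST, ignoring daylight saving time).
PEAK_START_UTC = 14
PEAK_END_UTC = 22


def is_peak(timestamp):
    """Return True if a Unix timestamp falls in the assumed weekday peak window."""
    t = time.gmtime(timestamp)
    # tm_wday: Monday = 0 ... Friday = 4, Saturday = 5, Sunday = 6.
    return t.tm_wday <= 4 and PEAK_START_UTC <= t.tm_hour < PEAK_END_UTC


class PeakCounter(object):
    """Count requests made during peak time; reset when off peak."""

    def __init__(self, limit=100):
        self.limit = limit
        self.count = 0

    def record(self, timestamp):
        """Register one request; return True if the limit is now exceeded."""
        if is_peak(timestamp):
            self.count += 1
            return self.count > self.limit
        # Off peak: reset, in case this is a long running script.
        self.count = 0
        return False


# Thursday 25 March 2010, 15:00 UTC, and Saturday 27 March 2010, 15:00 UTC.
thursday = calendar.timegm((2010, 3, 25, 15, 0, 0))
saturday = calendar.timegm((2010, 3, 27, 15, 0, 0))
print(is_peak(thursday), is_peak(saturday))
```

A caller would issue a warning when `record()` returns True, which is what the patch does inside `_open`.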
From cy at cymon.org Fri Mar 26 07:38:56 2010
From: cy at cymon.org (Cymon Cox)
Date: Fri, 26 Mar 2010 11:38:56 +0000
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <3f6baf361003251327o20cdda2bkeac3c3a6a87b468a@mail.gmail.com>
References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com>
<3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com>
<320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com>
<320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com>
<3f6baf361003251327o20cdda2bkeac3c3a6a87b468a@mail.gmail.com>
Message-ID: <7265d4f1003260438x100cc73nf80cc6b5a992691c@mail.gmail.com>
Hi Eric and Peter,
On 25 March 2010 20:27, Eric Talevich wrote:
> On Wed, Mar 24, 2010 at 11:16 AM, Peter >wrote:
>
> > On Mon, Mar 22, 2010 at 9:48 PM, Peter
> > wrote:
> > >> In Bio.Nexus, would you normally have handled this with the method
> > >> root_with_outgroup? I intend to port that method to Bio.Phylo once I
> > >> understand it, but the existing code has been kind of hard for me to
> > figure
> > >> out.
> > >
> > > I've just got a quick answer for you now tonight: I've not used
> Bio.Nexus
> > > to try and do this - I'll try to get back to you in more depth
> tomorrow.
> >
> > Here is an example using Bio.Nexus.Trees to reroot with an outgroup.
> >
> > [...]
> >
> > In my example, the outgroup originally has a branch length of 0.00145.
> > A new root node was created (here #12) with two children, one with a
> > branch length of zero (#5, the outgroup) and one with the full length
> > (#3, branch length 0.00145). Essentially this new root node (#12) and
> > the outgroup (#5) are now both right at the base of the tree.
> >
> > There is more than one way to do this though. For example FigTree
> > seems to introduce a new root node half way along the outgroup branch
> > (replacing the edge with two edges of half its length). This way the
> > new root node represents the last common ancestor of the outgroup and
> > the ingroup (everything else), although putting it at the mid point is
> > perhaps a little arbitrary.
>
Yes, what FigTree is doing is arbitrary, it introduces information into the
displayed tree that is not present, and is open to misinterpretation. But
it's doing so purely for the graphical presentation because you are trying
to root on a terminal branch. Thankfully, if you save this tree in FigTree
it writes the original trifurcating tree.
> I looked up this section in *Inferring Phylogenies* and found no decisive
> statement on how it should be done. I gathered:
>
> 1. The new root can be placed anywhere along the branch between the
> outgroup
> and its ancestor.
>
The root may in biological reality be anywhere along that branch but, in the
absence of further information, the question is where do you place it in
this situation, i.e. rooting (making a bifurcating root node) on that terminal
branch.
> 2. Another way to root a tree is by assuming a molecular clock -- place the
> root so that the distances to all the tips are roughly equal.
>
> So FigTree and Bio.Nexus are both doing reasonable things. (PyCogent
> doesn't
> seem to support this operation, as far as I can tell.)
>
> Thinking of this operation as extending the tree further back in time,
> where
> the (monophyletic) tree without the outgroup is a sub-clade of the larger
> rooted tree we're introducing -- it makes sense to me that the branch
> length
> of the outgroup should represent the total evolutionary distance from the
> root of the monophyletic sub-clade to the outgroup.
Yes, the outgroup taxa are included in analyses to orientate the
relationships (including br lens) of the ingroup. In this case, with a
single outgroup taxon you do not have a very good estimate of the ingroup
br len (it's presumably not the immediate ancestor of the ingroup), but it's
all you've got given the way the experiment was set up - including more
outgroups would have been a good idea.
> Based on that, I'm
> tempted to do the opposite of Bio.Nexus,
Curious, because given that I think Bio.Nexus is doing the right thing ;) By
using this function you are rooting (making a dichotomous root node) using
an outgroup (1 taxon in this case), and the biological interpretation is
that the length belongs to the ingroup.
> letting the outgroup keep its
> original branch length, and assigning a length of 0 to the branch leading
> to
> the remaining sub-clade. Then by default we get something resembling a
> trifurcating root, and the user can shift the actual location of the root
> further back without too much difficulty.
>
I don't understand what you are getting at here...
Other points:
The way that FigTree displays the rooted tree from root_with_outgroup() is
how I would expect the tree to be presented if you only had a single
outgroup taxon.
There is a case to be made for not making a dichotomous root, but making the
nearest trifurcating node to the designated outgroup the root node - this is
what PAUP does (it won't write a dichotomously rooted tree even if you tell
it to root it).
I think the whole problem stems from only having a single outgroup (which
when you root to it ends up 'looking' like the immediate ancestor of the
ingroup). Typically, you would include multiple outgroups and present/display
the tree with a trifurcating root node, one of which lineages is the ingroup
- unless you are using a non-reversible model you don't need dichotomously
rooted trees.
Cheers, C.
--
From bugzilla-daemon at portal.open-bio.org Fri Mar 26 18:28:10 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Mar 2010 18:28:10 -0400
Subject: [Biopython-dev] [Bug 3036] New: PhyloXML cannot read node colors
created by PhyloXML
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=3036
Summary: PhyloXML cannot read node colors created by PhyloXML
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: joelb at lanl.gov
Using a simple example file provided:
>>> tree = Phylo.read('bcl_2.xml','phyloxml')
>>> tree.clade[0].color = Phylo.PhyloXML.BranchColor(255,0,255)
>>> tree.clade[0].color
BranchColor(blue='255', green='0', red='255')
>>> Phylo.write(tree,'colored.phyloxml','phyloxml')
1
>>> tree2=Phylo.read('colored.phyloxml','phyloxml')
Traceback (innermost last):
File "", line 1, in
File "/usr/lib64/python2.6/site-packages/Bio/Phylo/_io.py", line 57, in read
tree = tree_gen.next()
File "/usr/lib64/python2.6/site-packages/Bio/Phylo/_io.py", line 42, in parse
for tree in getattr(supported_formats[format], 'parse')(file):
File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 317,
in parse
yield self._parse_phylogeny(elem)
File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 342,
in _parse_phylogeny
phylogeny.root = self._parse_clade(elem)
File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 388,
in _parse_clade
clade.clades.append(self._parse_clade(elem))
File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 410,
in _parse_clade
setattr(clade, tag, getattr(self, tag)(elem))
File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXMLIO.py", line 518,
in color
return PX.BranchColor(red, green, blue)
File "/usr/lib64/python2.6/site-packages/Bio/Phylo/PhyloXML.py", line 432, in
__init__
), "Color values must be integers between 0 and 255."
AssertionError: Color values must be integers between 0 and 255.
This is not a problem with an example file not written by biopython:
>>> tree = Phylo.parse('made_up.xml','phyloxml').next()
>>> tree.clade[0].color
BranchColor(blue='28', green='220', red='128')
Also, forester/Archaeopteryx is able to correctly read colors written by
biopython.
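
The report does not say where the values go wrong, so the following is not a diagnosis. It only illustrates one general property of ElementTree that matters for a check like the assertion above: element text always comes back as a string and has to be converted before an integer range test. `parse_color` is a hypothetical helper written for this note, not the PhyloXMLIO code, and the sketch ignores the phyloXML namespace for brevity.

```python
import xml.etree.ElementTree as ET


def parse_color(elem):
    """Hypothetical helper: read red/green/blue children as ints in 0..255."""
    values = []
    for tag in ("red", "green", "blue"):
        text = elem.findtext(tag)
        if text is None:
            raise ValueError("missing <%s> element" % tag)
        # Element text is a str, so convert before the range check.
        value = int(text)
        if not 0 <= value <= 255:
            raise ValueError("Color values must be integers between 0 and 255.")
        values.append(value)
    return tuple(values)


elem = ET.fromstring(
    "<color><red>255</red><green>0</green><blue>255</blue></color>"
)
print(parse_color(elem))
```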
From bugzilla-daemon at portal.open-bio.org Fri Mar 26 18:30:36 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Mar 2010 18:30:36 -0400
Subject: [Biopython-dev] [Bug 3037] New: PhyloXMLIO creates extremely ugly
xml
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=3037
Summary: PhyloXMLIO creates extremely ugly xml
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P3
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: joelb at lanl.gov
This is a request for an enhancement.
The xml code PhyloXMLIO creates has no linefeeds. It would be very helpful for
debugging if the XML is prettyprinted.
From biopython at maubp.freeserve.co.uk Sat Mar 27 08:45:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 27 Mar 2010 12:45:28 +0000
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <7265d4f1003260438x100cc73nf80cc6b5a992691c@mail.gmail.com>
References: <320fb6e01003220921p1d6be73cg48814711ed6f2991@mail.gmail.com>
<3f6baf361003221328u5f1cae6bo18ee7dcb249307e@mail.gmail.com>
<320fb6e01003221448y2b40976bi6792a762b1feda07@mail.gmail.com>
<320fb6e01003240816i6d4e31a6j96fa51a2467e31d1@mail.gmail.com>
<3f6baf361003251327o20cdda2bkeac3c3a6a87b468a@mail.gmail.com>
<7265d4f1003260438x100cc73nf80cc6b5a992691c@mail.gmail.com>
Message-ID: <320fb6e01003270545l7e43c3bbu7a1174397a45ce99@mail.gmail.com>
On Fri, Mar 26, 2010 at 11:38 AM, Cymon Cox wrote:
>
> I think the whole problem stems from only having a single outgroup (which
> when you root to it ends up 'looking' like the immediate ancestor of the
> ingroup). Typically, you would include multiple outgroups and present/display
> the tree with a trifurcating root node, one of which lineages is the ingroup
> - unless you are using a non-reversible model you don't need dichotomously
> rooted trees.
>
And I thought it would be simpler from this example to use a single
out group ;)
Thanks for the comments both of you.
Peter
From bugzilla-daemon at portal.open-bio.org Sat Mar 27 22:58:59 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 27 Mar 2010 22:58:59 -0400
Subject: [Biopython-dev] [Bug 3037] PhyloXMLIO creates extremely ugly xml
In-Reply-To:
Message-ID: <201003280258.o2S2wxut000915@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3037
------- Comment #1 from eric.talevich at gmail.com 2010-03-27 22:58 EST -------
(In reply to comment #0)
> This is a request for an enhancement.
>
> The xml code PhyloXMLIO creates has no linefeeds. It would be very helpful for
> debugging if the XML is prettyprinted.
>
This is a shortcoming of the ElementTree module in the Python standard library
-- the writer doesn't have an option for setting whitespace. But I agree it
would be nice to have this feature, so I'll leave the bug open as a reminder to
look for other ways to do this.
In the meantime I recommend using some external tool to reformat the XML if you
want to look at the raw data. XML Starlet can do this:
http://xmlstar.sourceforge.net/
From bugzilla-daemon at portal.open-bio.org Sun Mar 28 12:35:34 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 28 Mar 2010 12:35:34 -0400
Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records
without model id
In-Reply-To:
Message-ID: <201003281635.o2SGZYuv009361@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2950
k.okonechnikov at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
AssignedTo|biopython-dev at biopython.org |k.okonechnikov at gmail.com
Status|NEW |ASSIGNED
------- Comment #6 from k.okonechnikov at gmail.com 2010-03-28 12:35 EST -------
Created an attachment (id=1468)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1468&action=view)
This patch solves this issue and also Bug 2951
From bugzilla-daemon at portal.open-bio.org Sun Mar 28 14:03:21 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 28 Mar 2010 14:03:21 -0400
Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records
without model id
In-Reply-To:
Message-ID: <201003281803.o2SI3LhC011261@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2950
eric.talevich at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |biopython-dev at biopython.org
From bugzilla-daemon at portal.open-bio.org Sun Mar 28 14:18:24 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 28 Mar 2010 14:18:24 -0400
Subject: [Biopython-dev] [Bug 3037] PhyloXMLIO creates extremely ugly xml
In-Reply-To:
Message-ID: <201003281818.o2SIIOLM011602@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3037
------- Comment #2 from chapmanb at 50mail.com 2010-03-28 14:18 EST -------
Eric, check out Fredrik Lundh's indent function for ElementTree. I'm not sure
this ever made it into the source, but it's small enough to copy/paste:
http://effbot.org/zone/element-lib.htm#prettyprint
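
For reference, an in-place indent helper along those lines might look like the sketch below. It is written from scratch for this note rather than copied from the linked page, so treat the details as an assumption about how such a helper could work. Newer Python versions (3.9 onwards) also ship `xml.etree.ElementTree.indent()`, which does the same job.

```python
import xml.etree.ElementTree as ET


def indent(elem, level=0, step="  "):
    """Insert newlines and indentation into an element tree, in place.

    Only whitespace-only text and tail values are touched, so real text
    content is left alone.
    """
    pad = "\n" + level * step
    children = list(elem)
    if not children:
        return
    if not (elem.text and elem.text.strip()):
        # Break the line before the first child.
        elem.text = pad + step
    for i, child in enumerate(children):
        indent(child, level + 1, step)
        if not (child.tail and child.tail.strip()):
            # The last child closes back at the parent's level.
            child.tail = pad + step if i < len(children) - 1 else pad


root = ET.fromstring(
    "<phylogeny><clade><name>A</name><name>B</name></clade></phylogeny>"
)
indent(root)
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Calling this on the root element just before writing would give one tag per line without changing the data.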
From biopython at maubp.freeserve.co.uk Mon Mar 29 06:58:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 29 Mar 2010 11:58:28 +0100
Subject: [Biopython-dev] NCBI E-utility 100 requests rule in Bio.Entrez?
In-Reply-To: <996381.46391.qm@web62407.mail.re1.yahoo.com>
References: <320fb6e01003250925h76fc91d3r541092eb540af112@mail.gmail.com>
<996381.46391.qm@web62407.mail.re1.yahoo.com>
Message-ID: <320fb6e01003290358y1e30fc6eme2028a126a36cdb@mail.gmail.com>
On Fri, Mar 26, 2010 at 2:37 AM, Michiel de Hoon wrote:
> I have no objections, but basically I think that this can be left to the responsibility of the end user.
>
> --Michiel.
OK, unless the NCBI decide to clarify what exactly they mean,
then let's just leave this as it is (the responsibility of the end user).
Peter
From biopython at maubp.freeserve.co.uk Mon Mar 29 07:05:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 29 Mar 2010 12:05:25 +0100
Subject: [Biopython-dev] Setting the NCBI Entrez tool parameter globally
Message-ID: <320fb6e01003290405s2c76f875ucf39c077277f1916@mail.gmail.com>
Hi Michiel et al,
The NCBI looks to be encouraging more use of the email and tool
parameters in their revised guidelines. To make this easy to use
we have a global setting for the email - I think we should do the
same for the tool (for when users are building their own application
or script on top of Biopython). Something like this patch?
What do you think?
Peter
------------------------
diff --git a/Bio/Entrez/__init__.py b/Bio/Entrez/__init__.py
index 33d8d14..f64015c 100644
--- a/Bio/Entrez/__init__.py
+++ b/Bio/Entrez/__init__.py
@@ -12,6 +12,9 @@ http://www.ncbi.nlm.nih.gov/Entrez/
 A list of the Entrez utilities is available at:
 http://www.ncbi.nlm.nih.gov/entrez/utils/utils_index.html
 
+Variables:
+email        Set the Entrez email parameter globally (default is not set).
+tool         Set the Entrez tool parameter globally (defaults to biopython).
 
 Functions:
 efetch       Retrieves records in the requested format from a list of one or
@@ -50,7 +53,7 @@ from Bio import File
 
 
 email = None
-
+tool = "biopython"
 
 # XXX retmode?
 def epost(db, **keywds):
@@ -275,6 +278,7 @@ def _open(cgi, params={}, post=False):
     This function also enforces the "up to three queries per second rule"
     to avoid abusing the NCBI servers.
     """
+    global tool, email
     # NCBI requirement: At most three queries per second.
     # Equivalently, at least a third of second between queries
     delay = 0.333333334
@@ -291,7 +295,7 @@ def _open(cgi, params={}, post=False):
             del params[key]
     # Tell Entrez that we are using Biopython
     if not "tool" in params:
-        params["tool"] = "biopython"
+        params["tool"] = tool
     # Tell Entrez who we are
     if not "email" in params:
         if email!=None:
From biopython at maubp.freeserve.co.uk Mon Mar 29 08:36:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 29 Mar 2010 13:36:19 +0100
Subject: [Biopython-dev] Deprecating PropertyManager,
Encodings and Bio.utils?
Message-ID: <320fb6e01003290536p4c61e1d1u8bc6e2ad83f1c9e5@mail.gmail.com>
Hi all,
I think we've done pretty well at carefully removing, fixing or
replacing most of the dusty bits of code Biopython had acquired
over the years. There are still things to clean up though... in
particular modules Bio.PropertyManager and Bio.Encodings
seem rather unnecessary.
Bio.Encodings is tied into the old (and now deprecated)
Bio.Translate and Bio.Transcribe code. Once they are
removed (after the next release) we can at least cut a lot
of Bio.Encodings.
Bio.PropertyManager and Bio.Encodings only seem to be
used by Bio.utils, which I would also like to deprecate. This
is an undocumented module with no unit tests. It offers a
few bits of sequence related functionality which would be
better off in Bio.Seq or Bio.SeqUtils, and some fairly trivial
functions we could just deprecate.
These strike me as the only bits of functionality worth keeping
in Bio.utils:
Function verify_alphabet (which is being used by the code in
Bio.NeuralNetwork.Gene) just checks a Seq object's sequence
obeys the alphabet letters. This essentially is something I think
the Seq object should do itself, during initialisation (Bug 2597).
With that done, then Bio.utils.verify_alphabet could be
deprecated.
There are a few functions for getting molecular weights via
the IUPAC alphabet objects. These could be reimplemented
by using weight tables belonging to the IUPAC alphabet
classes explicitly, perhaps exposed as new functions under
Bio.SeqUtils. It would be interesting to look at refinements
like handling the start/end of the sequence explicitly (i.e.
the 5' and 3' ends of a nucleotide sequence, or the N and C
terminals of a peptide).
Function reduce_sequence (linked to Bio.Alphabet.Reduced)
is for things like mapping a protein sequence to a simplified
sequence using the Murphy alphabet (e.g. using a single
letter for all the aliphatics: I,L,V). This is perhaps interesting
enough to retain - again perhaps under Bio.SeqUtils. It does
need documentation and unit tests though.
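
To make the two simpler pieces concrete, here is a rough sketch of an alphabet check and a reduced-alphabet mapping on plain strings. The function names and signatures are invented for this note and are not the Bio.utils ones, and `TOY_REDUCTION` only encodes the aliphatic example mentioned above (I, L, V to one letter); it is not the actual Murphy table.

```python
def letters_ok(sequence, letters):
    """Return True if every character of `sequence` is in `letters`."""
    allowed = set(letters)
    return all(ch in allowed for ch in sequence)


# Toy mapping for illustration only: collapse the aliphatics I, L, V to "L".
TOY_REDUCTION = {"I": "L", "L": "L", "V": "L"}


def reduce_letters(sequence, mapping):
    """Map each character through `mapping`, leaving unmapped ones unchanged."""
    return "".join(mapping.get(ch, ch) for ch in sequence)


print(letters_ok("ACGT", "ACGT"))
print(letters_ok("ACGU", "ACGT"))
print(reduce_letters("MIVLK", TOY_REDUCTION))
```

Both are small enough that unit tests would be easy to add if they move under Bio.SeqUtils.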
Is anyone interested in updating, documenting and then
testing the molecular weight and reduced alphabet code?
[I suggest starting a new thread if you are.]
If not, should we consider just deprecating Bio.utils,
Bio.PropertyManager and Bio.Encodings in the next
release?
Peter
From mjldehoon at yahoo.com Mon Mar 29 09:54:01 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 29 Mar 2010 06:54:01 -0700 (PDT)
Subject: [Biopython-dev] Setting the NCBI Entrez tool parameter globally
In-Reply-To: <320fb6e01003290405s2c76f875ucf39c077277f1916@mail.gmail.com>
Message-ID: <614002.50617.qm@web62406.mail.re1.yahoo.com>
Basically I think that this patch is OK, but why do tool and email need to be global inside the _open function?
--Michiel
--- On Mon, 3/29/10, Peter wrote:
> From: Peter
> Subject: Setting the NCBI Entrez tool parameter globally
> To: "Michiel de Hoon" , "Biopython-Dev Mailing List"
> Date: Monday, March 29, 2010, 7:05 AM
> Hi Michiel et al,
>
> The NCBI looks to be encouraging more use of the email and
> tool
> parameters in their revised guidelines. To make this easy
> to use
> we have a global setting for the email - I think we should
> do the
> same for the tool (for when users are building their own
> application
> or script on top of Biopython). Something like this patch?
>
> What do you think?
>
> Peter
>
> ------------------------
>
> diff --git a/Bio/Entrez/__init__.py b/Bio/Entrez/__init__.py
> index 33d8d14..f64015c 100644
> --- a/Bio/Entrez/__init__.py
> +++ b/Bio/Entrez/__init__.py
> @@ -12,6 +12,9 @@ http://www.ncbi.nlm.nih.gov/Entrez/
>  A list of the Entrez utilities is available at:
>  http://www.ncbi.nlm.nih.gov/entrez/utils/utils_index.html
>
> +Variables:
> +email        Set the Entrez email parameter globally (default is not set).
> +tool         Set the Entrez tool parameter globally (defaults to biopython).
>
>  Functions:
>  efetch       Retrieves records in the requested format from a list of one or
> @@ -50,7 +53,7 @@ from Bio import File
>
>
>  email = None
> -
> +tool = "biopython"
>
>  # XXX retmode?
>  def epost(db, **keywds):
> @@ -275,6 +278,7 @@ def _open(cgi, params={}, post=False):
>      This function also enforces the "up to three queries per second rule"
>      to avoid abusing the NCBI servers.
>      """
> +    global tool, email
>      # NCBI requirement: At most three queries per second.
>      # Equivalently, at least a third of second between queries
>      delay = 0.333333334
> @@ -291,7 +295,7 @@ def _open(cgi, params={}, post=False):
>              del params[key]
>      # Tell Entrez that we are using Biopython
>      if not "tool" in params:
> -        params["tool"] = "biopython"
> +        params["tool"] = tool
>      # Tell Entrez who we are
>      if not "email" in params:
>          if email!=None:
>
From chapmanb at 50mail.com Mon Mar 29 09:50:23 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 29 Mar 2010 09:50:23 -0400
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
Message-ID: <20100329135023.GF42657@sobchak.mgh.harvard.edu>
Eric, Peter and Cymon;
> I've got a real example of a simple tree manipulation that I would
> like to handle via your new module. I have a (small) unrooted tree from a
> gene family in Newick format, which by construction includes an
> out-group (the same gene but from a more distant organism). I would like to
> reroot the tree so that this out-group is at the basal level.
Really enjoying the discussion on this. It's a bit outside my area
of expertise but I stumbled across DendroPy this weekend:
http://packages.python.org/DendroPy/index.html
which has a reroot_at function that might be worth looking into:
http://github.com/jeetsukumaran/DendroPy/blob/master/dendropy/dataobject/tree.py
Hope this helps,
Brad
From biopython at maubp.freeserve.co.uk Mon Mar 29 11:22:11 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 29 Mar 2010 16:22:11 +0100
Subject: [Biopython-dev] Setting the NCBI Entrez tool parameter globally
In-Reply-To: <614002.50617.qm@web62406.mail.re1.yahoo.com>
References: <320fb6e01003290405s2c76f875ucf39c077277f1916@mail.gmail.com>
<614002.50617.qm@web62406.mail.re1.yahoo.com>
Message-ID: <320fb6e01003290822u62190365j9c147df71a9de46e@mail.gmail.com>
On Mon, Mar 29, 2010 at 2:54 PM, Michiel de Hoon wrote:
> Basically I think that this patch is OK, but why do tool and email
> need to be global inside the _open function?
I just thought it was clearer than implicit scope rules,
I'll omit that line and commit the rest.
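
To illustrate the scoping point: a function that only reads module-level names does not need a `global` statement; the statement is only required when the function assigns to them. The sketch below is a standalone stand-in, with `build_params` as a hypothetical helper name rather than anything in Bio.Entrez.

```python
# Module-level defaults, mirroring the names used in the patch.
email = None
tool = "biopython"


def build_params(params=None):
    """Return a copy of `params` with the tool/email defaults filled in."""
    params = dict(params or {})
    # Reading `tool` and `email` here works without `global`,
    # because they are looked up in the module namespace.
    if "tool" not in params:
        params["tool"] = tool
    if "email" not in params and email is not None:
        params["email"] = email
    return params


print(build_params({"db": "pubmed"}))
print(build_params({"tool": "my_script"}))
```

A script built on top of this would set the module-level `tool` once and let every call pick it up, unless a call passes its own value.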
Peter
From biopython at maubp.freeserve.co.uk Mon Mar 29 11:35:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 29 Mar 2010 16:35:01 +0100
Subject: [Biopython-dev] Rerooting a tree with Bio.Phylo
In-Reply-To: <20100329135023.GF42657@sobchak.mgh.harvard.edu>
References: <20100329135023.GF42657@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01003290835n25dd2a35kcc8c40587dd10c05@mail.gmail.com>
On Mon, Mar 29, 2010 at 2:50 PM, Brad Chapman wrote:
> Eric, Peter and Cymon;
>
>> I've got a real example of a simple tree manipulation that I would
>> like to handle via your new module. I have a (small) unrooted tree from a
>> gene family in Newick format, which by construction includes an
>> out-group (the same gene but from a more distant organism). I would like to
>> reroot the tree so that this out-group is at the basal level.
>
> Really enjoying the discussion on this. It's a bit outside my area
> of expertise but I stumbled across DendroPy this weekend:
>
> http://packages.python.org/DendroPy/index.html
>
> which has a reroot_at function that might be worth looking into:
>
> http://github.com/jeetsukumaran/DendroPy/blob/master/dendropy/dataobject/tree.py
>
> Hope this helps,
> Brad
Hey Brad,
I also spotted DendroPy recently (via a blog post or something),
but hadn't yet looked to see how they handled this.
It looks like their reroot_at function takes an *internal* node as the
argument to specify the new root. This neatly avoids the problem
about having to introduce a new node when rerooting with a given
terminal node (taxon) as the out group.
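
A toy version of that idea, using a plain dictionary of neighbours instead of DendroPy or Bio.Phylo objects: rooting at an existing internal node only needs the edges to be oriented away from it, and no node is added. The function name and the node labels are made up for this sketch; 0.00145 is the outgroup branch length from my earlier example and the other lengths are arbitrary.

```python
def orient(adj, root):
    """Return {child: (parent, branch_length)} for a tree rooted at `root`.

    `adj` maps each node to a dict of {neighbour: branch_length} and
    describes an unrooted tree.
    """
    parents = {}
    seen = set([root])
    stack = [root]
    while stack:
        node = stack.pop()
        for neighbour, length in adj[node].items():
            if neighbour not in seen:
                seen.add(neighbour)
                parents[neighbour] = (node, length)
                stack.append(neighbour)
    return parents


# Unrooted tree with one internal node "n1" and three tips.
adj = {
    "n1": {"A": 0.1, "B": 0.2, "Out": 0.00145},
    "A": {"n1": 0.1},
    "B": {"n1": 0.2},
    "Out": {"n1": 0.00145},
}
parents = orient(adj, "n1")
print(parents)
```

Rooting on the branch to "Out" instead would mean splitting that edge with a new node, which is where the branch length question comes from.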
Peter
From bugzilla-daemon at portal.open-bio.org Mon Mar 29 12:07:12 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 29 Mar 2010 12:07:12 -0400
Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records
without model id
In-Reply-To:
Message-ID: <201003291607.o2TG7CDf013375@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2950
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |biopython-
| |bugzilla at maubp.freeserve.co.
| |uk
------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-29 12:07 EST -------
(In reply to comment #6)
> Created an attachment (id=1468)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1468&action=view) [details]
> This patch solves this issue and also Bug 2951
>
Just by eye there is something wrong with your indentation in that patch.
Maybe you have mixed tabs and spaces?
Peter
From bugzilla-daemon at portal.open-bio.org Mon Mar 29 13:28:18 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 29 Mar 2010 13:28:18 -0400
Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records
without model id
In-Reply-To:
Message-ID: <201003291728.o2THSIAf015768@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2950
------- Comment #8 from k.okonechnikov at gmail.com 2010-03-29 13:28 EST -------
Created an attachment (id=1469)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1469&action=view)
Improved version of the patch
Added default value for serial_num in Model constructor, fixed indentation
issues.
From bugzilla-daemon at portal.open-bio.org Mon Mar 29 13:39:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 29 Mar 2010 13:39:14 -0400
Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records
without model id
In-Reply-To:
Message-ID: <201003291739.o2THdEoU016014@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2950
------- Comment #9 from k.okonechnikov at gmail.com 2010-03-29 13:39 EST -------
Created an attachment (id=1470)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1470&action=view)
Simple test script
It downloads NMR structure, checks model serial numbers and writes structure to
file.
From biopython at maubp.freeserve.co.uk Mon Mar 29 17:41:14 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 29 Mar 2010 22:41:14 +0100
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
In-Reply-To: <9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com>
References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com>
<3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com>
<320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com>
<9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com>
<320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com>
<9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com>
Message-ID: <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com>
On Fri, Mar 19, 2010 at 11:08 PM, Sebastian Bassi wrote:
> On Fri, Mar 19, 2010 at 7:45 AM, Peter wrote:
>> Give an inch and they'll take a mile ;)
>
> In Spanish we say: Give a hand and they'll take the whole arm :)
I think I like that version more :)
>> that if they don't specify the format that Biopython will
>> determine it automatically - which it won't.
>
> In this respect, Python zen favours being explicit,so I see your point.
>
>> Also, could you clarify if you are in favour of relaxing the
>> requirement that the write function takes a list/iterator of
>> records/alignments to allow a single SeqRecord or alignment?
>
> Is OK for me to allow a single record instead of a iterable, this
> change will not break any existing code so it is OK for me.
That sounds like you don't object, but are not strongly in
favour either.
No-one else has commented (other than Eric and Marshall
who were in favour).
Maybe it would be prudent to leave it? [Will this suggestion
provoke any further comments I wonder?]
Peter
From eric.talevich at gmail.com Mon Mar 29 23:05:51 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 29 Mar 2010 23:05:51 -0400
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
In-Reply-To: <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com>
References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com>
<3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com>
<320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com>
<9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com>
<320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com>
<9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com>
<320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com>
Message-ID: <3f6baf361003292005y5075df0ch1fa929304f3c501a@mail.gmail.com>
On Mon, Mar 29, 2010 at 5:41 PM, Peter wrote:
> On Fri, Mar 19, 2010 at 11:08 PM, Sebastian Bassi wrote:
> > On Fri, Mar 19, 2010 at 7:45 AM, Peter wrote:
> >> Also, could you clarify if you are in favour of relaxing the
> >> requirement that the write function takes a list/iterator of
> >> records/alignments to allow a single SeqRecord or alignment?
> >
> > Is OK for me to allow a single record instead of a iterable, this
> > change will not break any existing code so it is OK for me.
>
> That sounds like you don't object, but are not strongly in
> favour either.
>
> No-one else has commented (other than Eric and Marshall
> who were in favour).
>
> Maybe it would be prudent to leave it? [Will this suggestion
> provoke any further comments I wonder?]
I know I've already voted, but here's another thought: if we're going to
make this change eventually, it would be nice if the very first release of
Bio.Phylo had the right behavior and retained the same behavior through
later releases. Otherwise we'd have one or more isolated releases where
Phylo.write doesn't handle single trees directly, and when documentation is
updated to track later releases that do handle single trees, that could
cause some confusion for some folks still using Biopython 1.54.
-Eric
>
From bugzilla-daemon at portal.open-bio.org Tue Mar 30 01:17:44 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 30 Mar 2010 01:17:44 -0400
Subject: [Biopython-dev] [Bug 3036] PhyloXML cannot read node colors created
by PhyloXML
In-Reply-To:
Message-ID: <201003300517.o2U5HiVN001772@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3036
eric.talevich at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #1 from eric.talevich at gmail.com 2010-03-30 01:17 EST -------
(In reply to comment #0)
> Using a simple example file provided:
> ...
> This is not a problem with an example file not written by biopython:
> >>> tree = Phylo.parse('made_up.xml','phyloxml').next()
> >>> tree.clade[0].color
> BranchColor(blue='28', green='220', red='128')
Thanks for catching this! I pushed a fix to GitHub:
http://github.com/biopython/biopython/commit/6e2eac9612f600507491c3bb45fc19ffdc987169
The problem was occurring for color values of 0 -- PhyloXMLIO was using an
inline and-or test instead of if-else (Py2.4 compatibility hack) to check and
convert the node text to an integer. Since 0 evaluates as boolean False, the
expression was returning None instead of integer 0, causing the BranchColor
constructor to vom.
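For readers who have not met the idiom, the failure mode described above can be reproduced in a few lines. This is only a minimal sketch, not the actual PhyloXMLIO code; both function names are hypothetical.

```python
def to_int_and_or(text):
    # Pre-Python-2.5 "conditional": cond and value or fallback.
    # When the value itself is falsy (int("0") == 0), the `or`
    # branch wins and the fallback is returned instead of 0.
    return text is not None and int(text) or None


def to_int_if_else(text):
    # Explicit test: a parsed 0 is returned as 0.
    if text is None:
        return None
    return int(text)


broken = to_int_and_or("0")   # falls through to None
fixed = to_int_if_else("0")   # stays an integer
```

Any non-zero color value behaves the same in both versions, which is why the problem only showed up for a channel value of 0.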
> Also, forester/archaeoptryx is able to correctly read colors written by
> biopython.
Good to know. Thanks again for testing this.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Mar 30 01:24:09 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 30 Mar 2010 01:24:09 -0400
Subject: [Biopython-dev] [Bug 3037] PhyloXMLIO creates extremely ugly xml
In-Reply-To:
Message-ID: <201003300524.o2U5O9Hq001917@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3037
eric.talevich at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #3 from eric.talevich at gmail.com 2010-03-30 01:24 EST -------
(In reply to comment #2)
> Eric, check out Fredrik Lundh's indent function for ElementTree. I'm not sure
> this ever made it into the source, but it's small enough to copy/paste:
>
> http://effbot.org/zone/element-lib.htm#prettyprint
>
Thanks! I did just that:
http://github.com/biopython/biopython/commit/3d892a39015c5659c91ff819ceeea043585f3607
The write() functions in Phylo and PhyloXMLIO, and the Writer class, now take an 'indent'
argument, which defaults to False for the sake of I/O performance and file
size.
Side note: apparently the ElementTree module has just emerged from a 4-year
hibernation, so new features (like this one) and bug fixes will begin appearing
in the Python stdlib as of version 3.2.
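For reference, an in-place indent helper in the spirit of the recipe linked above could look like the sketch below. It is written from memory as an illustration, so it is neither the code on effbot.org nor what was committed to Biopython; the helper name and the two-space step are assumptions. (Python 3.9 and later also ship xml.etree.ElementTree.indent() in the standard library.)

```python
import xml.etree.ElementTree as ET


def indent(elem, level=0):
    """Add newline/indent whitespace to elem and its children, in place."""
    pad = "\n" + level * "  "
    if len(elem):
        # Whitespace before the first child.
        if not elem.text or not elem.text.strip():
            elem.text = pad + "  "
        for child in elem:
            indent(child, level + 1)
        # The last child's tail must line up with this element's closing tag.
        if not child.tail or not child.tail.strip():
            child.tail = pad
    # Whitespace after this element (not needed for the root).
    if level and (not elem.tail or not elem.tail.strip()):
        elem.tail = pad


root = ET.fromstring("<phylogeny><clade><name>A</name></clade><clade/></phylogeny>")
indent(root)
pretty = ET.tostring(root, encoding="unicode")
```

Because the helper only touches whitespace-only text and tails, element content such as the name above is left alone.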
From biopython at maubp.freeserve.co.uk Tue Mar 30 05:46:04 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 30 Mar 2010 10:46:04 +0100
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
In-Reply-To: <3f6baf361003292005y5075df0ch1fa929304f3c501a@mail.gmail.com>
References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com>
<3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com>
<320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com>
<9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com>
<320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com>
<9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com>
<320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com>
<3f6baf361003292005y5075df0ch1fa929304f3c501a@mail.gmail.com>
Message-ID: <320fb6e01003300246s61b75b9v123141ac23240b18@mail.gmail.com>
On Tue, Mar 30, 2010 at 4:05 AM, Eric Talevich wrote:
>
> I know I've already voted, but here's another thought: if we're going to
> make this change eventually, it would be nice if the very first release of
> Bio.Phylo had the right behavior and retained the same behavior through
> later releases. Otherwise we'd have one or more isolated releases where
> Phylo.write doesn't handle single trees directly, and when documentation is
> updated to track later releases that do handle single trees, that could
> cause some confusion for some folks still using Biopython 1.54.
>
True. Another plus for doing it now is we're relaxing the filename/handle
thing, so it makes sense to make this change now (get these changes
all done in one go to reduce end user confusion).
Peter
From chapmanb at 50mail.com Tue Mar 30 08:16:00 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 30 Mar 2010 08:16:00 -0400
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
In-Reply-To: <320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com>
References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com>
<3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com>
<320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com>
<9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com>
<320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com>
<9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com>
<320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com>
Message-ID: <20100330121600.GF35248@sobchak.mgh.harvard.edu>
Peter;
> >> Also, could you clarify if you are in favour of relaxing the
> >> requirement that the write function takes a list/iterator of
> >> records/alignments to allow a single SeqRecord or alignment?
[...]
> No-one else has commented (other than Eric and Marshall
> who were in favour).
>
> Maybe it would be prudent to leave it? [Will this suggestion
> provoke any further comments I wonder?]
+1 from me for making it more flexible. I don't see a lot of downside:
it helps avoid a common source of initial confusion and is fully back
compatible.
Brad
From biopython at maubp.freeserve.co.uk Tue Mar 30 08:42:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 30 Mar 2010 13:42:46 +0100
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
In-Reply-To: <20100330121600.GF35248@sobchak.mgh.harvard.edu>
References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com>
<3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com>
<320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com>
<9e2f512b1003181239j875b1d5h7d4bbf3039b4da79@mail.gmail.com>
<320fb6e01003190345u4d88d8aeme189c445f3e8d0c9@mail.gmail.com>
<9e2f512b1003191508w2fae969ciecb8627639abcefe@mail.gmail.com>
<320fb6e01003291441m56e81288t7ae1518a237816e3@mail.gmail.com>
<20100330121600.GF35248@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01003300542m17c02e3fic15ccdfc60dcd33c@mail.gmail.com>
On Tue, Mar 30, 2010 at 1:16 PM, Brad Chapman wrote:
> Peter;
>
>> >> Also, could you clarify if you are in favour of relaxing the
>> >> requirement that the write function takes a list/iterator of
>> >> records/alignments to allow a single SeqRecord or alignment?
> [...]
>> No-one else has commented (other than Eric and Marshall
>> who were in favour).
>>
>> Maybe it would be prudent to leave it? [Will this suggestion
>> provoke any further comments I wonder?]
>
> +1 from me for making it more flexible. I don't see a lot of downside:
> it helps avoid a common source of initial confusion and is fully back
> compatible.
OK, checked in. We'll need to at least add an FAQ entry to
the tutorial on this, and the SeqIO/AlignIO chapters may
need clarification too.
Eric - could you look at Bio.Phylo.write() this week?
Thanks,
Peter
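The relaxation discussed in this thread amounts to wrapping a lone record in a list before the usual loop. The sketch below shows the idea with a hypothetical Record class and write_fasta function standing in for SeqRecord and the real write functions; it is not the code that was checked in.

```python
import io


class Record:
    """Hypothetical stand-in for a SeqRecord (just an id and a sequence)."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq


def write_fasta(records, handle):
    """Write one Record, or any iterable of Records; return the count."""
    if isinstance(records, Record):
        # Relaxed case: wrap the single record so the loop below is unchanged.
        records = [records]
    count = 0
    for record in records:
        handle.write(">%s\n%s\n" % (record.id, record.seq))
        count += 1
    return count


single = io.StringIO()
n_single = write_fasta(Record("alpha", "ACGT"), single)

many = io.StringIO()
n_many = write_fasta(iter([Record("alpha", "ACGT"), Record("beta", "GGCC")]), many)
```

Existing callers that pass a list or iterator take the same path as before, which is the sense in which the change is backwards compatible.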
From aaronquinlan at gmail.com Tue Mar 30 13:03:56 2010
From: aaronquinlan at gmail.com (Aaron Quinlan)
Date: Tue, 30 Mar 2010 13:03:56 -0400
Subject: [Biopython-dev] Alignment object
In-Reply-To: <2e1434c11003040607i68904329rc122e3acad9cdbe3@mail.gmail.com>
References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
<2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
<20100304131352.GB19053@sobchak.mgh.harvard.edu>
<2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com>
<2e1434c11003040607i68904329rc122e3acad9cdbe3@mail.gmail.com>
Message-ID:
Hi Kevin,
Per the author, BamTools supposedly now supports all endian flavors.
Aaron
On Mar 4, 2010, at 9:07 AM, Kevin Jacobs wrote:
> On Thu, Mar 4, 2010 at 8:52 AM, Kevin Jacobs wrote:
> On Thu, Mar 4, 2010 at 8:33 AM, Aaron Quinlan wrote:
> Just an FYI for those interested in developing tools to work with BAM: it may also be worth looking into the BamTools C++ API developed by Derek Barnett at Boston College (http://sourceforge.net/projects/bamtools/). The API is quite nice and has much of the necessary functionality for iterators, getters/setters, etc.
>
> I added BAM support for my BEDTools package (http://code.google.com/p/bedtools/) using the BAMTools libraries. Save for a few minor bugs along the way, it was rather straightforward to include.
>
> Thanks for the tip, Aaron. I was unaware of both bamtools and bedtools. The bamtools code looks well designed and quite similar to my emerging Cython/Python rendition.
>
>
> Ouch-- never mind. The bamtools code isn't endian-clean -- it will only work correctly on native little-endian architectures.
>
> -Kevin
>
From bioinformed at gmail.com Tue Mar 30 20:57:50 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 30 Mar 2010 20:57:50 -0400
Subject: [Biopython-dev] Alignment object
In-Reply-To:
References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
<2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
<20100304131352.GB19053@sobchak.mgh.harvard.edu>
<2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com>
<2e1434c11003040607i68904329rc122e3acad9cdbe3@mail.gmail.com>
Message-ID:
On Tue, Mar 30, 2010 at 1:03 PM, Aaron Quinlan wrote:
> Hi Kevin,
> Per the author, BamTools supposedly now supports all endian flavors.
> Aaron
>
>
Thanks, Aaron. I've been in touch with Derek and am testing his new version
on Power PC. I was very gratified to see him respond and address several of
the minor issues I raised.
-Kevin
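As background on what "endian-clean" means here: BAM stores its multi-byte integers little-endian, so a portable reader has to decode them with an explicit byte order rather than the host's native one. A minimal Python illustration using the standard struct module (the sample bytes are made up for the example):

```python
import struct
import sys

raw = b"\x2a\x00\x00\x00"  # the int32 value 42, stored little-endian

# Explicit little-endian decoding gives the same answer on every host.
explicit = struct.unpack("<i", raw)[0]

# Native byte order only agrees with the file on little-endian machines.
native = struct.unpack("=i", raw)[0]

host_is_little = sys.byteorder == "little"
```

Code that implicitly relies on the native interpretation works on x86 and silently misreads the same file on a big-endian Power PC.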
From updates at feedmyinbox.com Wed Mar 31 02:14:22 2010
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Wed, 31 Mar 2010 02:14:22 -0400
Subject: [Biopython-dev] 3/31 BioStar - Biopython Questions
Message-ID:
==================================================
1. Is there a non-perl alternative to accessing Ensembl's API?
==================================================
March 30, 2010 at 11:06 AM
I'm looking for a programmatic way to access Ensembl or UCSC's genome browser using something like python or ruby. Perl is just not my thing, sadly.
PyCogent seems to have something that I have not yet tested properly, but just wanted to ping the community before going ahead and coding something that might already exist.
Any ideas?
http://biostar.stackexchange.com/questions/536/is-there-a-non-perl-alternative-to-accessing-ensembls-api
--------------------------------------------------
===========================================================
Source: http://biostar.stackexchange.com/questions/tagged/biopython
From crosvera at gmail.com Wed Mar 31 18:39:06 2010
From: crosvera at gmail.com (Carlos Ríos Vera)
Date: Wed, 31 Mar 2010 18:39:06 -0400
Subject: [Biopython-dev] PDB-Tidy proposal
Message-ID:
Dear Biopythoners,
I'm Carlos Ríos, a student from Chile. As some of you may know, I'm
very interested in applying to the Google Summer of Code with the
PDB-Tidy idea, so I wrote a draft of my proposal.
I'm open to any comments, feedback, or disagreement.
Here is the link to the draft:
http://github.com/crosvera/pdbtidy_proposal/blob/master/proposal
Regards.
Ps: sorry if my English is not so good.
--
http://crosvera.blogspot.com
Carlos Ríos V.
Estudiante de Ing. (E) en Computación e Informática.
Universidad del Bío-Bío
VIII Región, Chile
Linux user number 425502
From bugzilla-daemon at portal.open-bio.org Mon Mar 1 18:14:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Mar 2010 13:14:45 -0500
Subject: [Biopython-dev] [Bug 2551] Adding advanced __getitem__ to generic
alignment, e.g. align[1:2, 5:-5]
In-Reply-To:
Message-ID: <201003011814.o21IEjcK024496@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2551
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-01 13:14 EST -------
I've started a possible implementation of an improved multiple
sequence alignment object on a github branch:
http://github.com/peterjc/biopython/commits/alignment-obj
This now covers:
Bug 2551 - Adding advanced __getitem__ e.g. align[1:2,5:-5]
Bug 2552 - Adding alignments
Bug 2553 - Adding SeqRecord objects to an alignment (append or extend)
Bug 2554 - Creating an Alignment from a list of SeqRecord objects
From bioinformed at gmail.com Mon Mar 1 23:22:42 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Mon, 1 Mar 2010 18:22:42 -0500
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
Message-ID: <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
On Thu, Feb 11, 2010 at 12:29 AM, Peter wrote:
> On Mon, Jan 11, 2010 at 5:11 PM, Peter
> wrote:
> > I didn't want to rush the SFF support into Biopython 1.53, but it's been
> > waiting "ready" for a while now. Any objections or comments about
> > me merging this now?
>
> There were no objections, and I ran this by Brad and Michiel and
> have just merged this into the master branch. Time for some more
> testing!
>
>
I've tried out the recently landed SFF SeqIO code and am pleased to report
that it works very well. I am parsing gsMapper 454PairAlign.txt output and
converting it to SAM/BAM format to view in IGV (among other things) and
wanted to include per-base quality score information from the SFF files.
The only glitch so far is that the indexed access mode yields sequences
with no alphabet assigned. The solution is to add the following to the
beginning of SffDict.__init__:
        if alphabet is None:
            alphabet = Alphabet.generic_dna
My only other comment is that several file reads and struct.unpacks can be
merged in _sff_read_seq_record. Given the number of records in most 454 SFF
files, I suspect the micro-optimization effort will be worth the slight cost
in code clarity.
Thanks to Peter and Jose for all of their hard work!
Best regards,
-Kevin
From biopython at maubp.freeserve.co.uk Tue Mar 2 10:08:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 10:08:27 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
Message-ID: <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs
wrote:
> On Thu, Feb 11, 2010 at 12:29 AM, Peter
> wrote:
>>
>> On Mon, Jan 11, 2010 at 5:11 PM, Peter
>> wrote:
>> > I didn't want to rush the SFF support into Biopython 1.53, but it's been
>> > waiting "ready" for a while now. Any objections or comments about
>> > me merging this now?
>>
>> There were no objections, and I ran this by Brad and Michiel and
>> have just merged this into the master branch. Time for some more
>> testing!
>>
>
> I've tried out the recently landed SFF SeqIO code and am pleased to
> report that it works very well.
Great :)
If you have suggestions for the documentation please voice them.
Also did the handling of trimmed reads seem sensible? Until we
release this we can tweak the API.
> I am parsing gsMapper 454PairAlign.txt output and
> converting it to SAM/BAM format to view in IGV (among other things) and
> wanted to include per-base quality score information from the SFF files.
Are you reading and writing SAM/BAM format with Python? Looking
into this is on my (long) todo list.
> The only glitch so far is that the indexed access mode yields sequences
> with no alphabet assigned. The solution is to add the following to the
> beginning of SffDict.__init__:
>         if alphabet is None:
>             alphabet = Alphabet.generic_dna
Thanks - I'll look at that.
> My only other comment is that several file reads and struct.unpacks can be
> merged in _sff_read_seq_record. Given the number of records in most 454 SFF
> files, I suspect the micro-optimization effort will be worth the slight cost
> in code clarity.
I did try and spend some effort on the run time, but it wouldn't
surprise me that there was still room for improvement. I found
that since most of my SFF files were only up to 2GB with under
a million reads, that this wasn't such an issue (compared to
FASTQ files with Solexa data).
I guess you mean the flowgram values, flowgram index, bases
and qualities might be loaded with a single read? That would
be worth trying.
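To make the suggestion concrete, here is a rough sketch of replacing four read/unpack pairs with one. It is not the Bio.SeqIO.SffIO code: the function names are hypothetical, and the field layout (big-endian uint16 flowgram values, then one byte per base for the flow index, the bases, and the qualities) is an assumption modeled on the SFF read data section.

```python
import io
import struct


def read_fields_separately(handle, n_flows, n_bases):
    # Four reads and three unpack calls per record.
    flows = struct.unpack(">%iH" % n_flows, handle.read(2 * n_flows))
    index = struct.unpack(">%iB" % n_bases, handle.read(n_bases))
    bases = handle.read(n_bases)
    quals = struct.unpack(">%iB" % n_bases, handle.read(n_bases))
    return flows, index, bases, quals


def read_fields_merged(handle, n_flows, n_bases):
    # One read and one unpack; the combined tuple is then sliced apart.
    fmt = ">%iH%iB%is%iB" % (n_flows, n_bases, n_bases, n_bases)
    fields = struct.unpack(fmt, handle.read(struct.calcsize(fmt)))
    flows = fields[:n_flows]
    index = fields[n_flows:n_flows + n_bases]
    bases = fields[n_flows + n_bases]  # "%is" yields a single bytes object
    quals = fields[n_flows + n_bases + 1:]
    return flows, index, bases, quals


# A made-up record: 3 flows, 4 bases.
payload = (struct.pack(">3H", 100, 0, 205) + bytes([1, 2, 1, 1])
           + b"TCAG" + bytes([30, 31, 32, 33]))
separate = read_fields_separately(io.BytesIO(payload), 3, 4)
merged = read_fields_merged(io.BytesIO(payload), 3, 4)
```

Whether the merged form is measurably faster on real SFF files would need timing; the sketch only shows that the two approaches decode the same values.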
> Thanks to Peter and Jose for all of their hard work!
> Best regards,
> -Kevin
And thanks for the feedback :)
Peter
From biopython at maubp.freeserve.co.uk Tue Mar 2 12:02:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 12:02:53 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
Message-ID: <320fb6e01003020402v6b2fab6j88f4c6fc90da15a9@mail.gmail.com>
On Tue, Mar 2, 2010 at 10:08 AM, Peter wrote:
> On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs wrote:
>> The only glitch so far is that the indexed access mode yields sequences
>> with no alphabet assigned. The solution is to add the following to the
>> beginning of SffDict.__init__:
>>         if alphabet is None:
>>             alphabet = Alphabet.generic_dna
>
> Thanks - I'll look at that.
Yes, that looks sensible - change committed. Would you like to be credited
in our NEWS and CONTRIB file for this little bug fix?
Peter
From biopython at maubp.freeserve.co.uk Tue Mar 2 12:25:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 12:25:05 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20091028121833.GC22395@sobchak.mgh.harvard.edu>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
On Wed, Oct 28, 2009 at 12:18 PM, Brad Chapman wrote:
>Peter wrote:
>> My rough work in progress is on github - at the moment I'm still trying
>> things out, and don't assume anything is set in stone. If you want to
>> have a play with this code, feedback is very welcome - probably best
>> on the dev list rather than here. See:
>>
>> http://github.com/peterjc/biopython/tree/seqrecords
>>
>> (a lot of the alignment things I want to support, like slicing and adding
>> are very closely linked to doing the same operations to SeqRecords)
Here is a new branch implementing a multiple-sequence-alignment
class (living under Bio.Align for now) based on the recent support
for slicing and adding SeqRecord objects:
http://github.com/peterjc/biopython/tree/alignment-obj
This handles most of the basic tasks I want to be able to easily do
with classical alignments, based on previous discussions on the
mailing list and/or bugzilla:
http://bugzilla.open-bio.org/show_bug.cgi?id=2551
http://bugzilla.open-bio.org/show_bug.cgi?id=2552
http://bugzilla.open-bio.org/show_bug.cgi?id=2553
http://bugzilla.open-bio.org/show_bug.cgi?id=2554
At its core, the alignment is still held as a list of SeqRecord objects,
which should mean minimal problems with backwards compatibility.
If anyone would like to try out the code, comments would be very
welcome. There are plenty of doctests in the docstrings which
should explain how I expect things to work.
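As a rough illustration of the behaviour requested in bugs 2551 and 2552, the toy class below supports align[rows, columns] indexing and column-wise addition over plain strings. It is a hypothetical sketch (ToyAlignment is not a Biopython class) and says nothing about how the branch implements this on top of SeqRecord objects.

```python
class ToyAlignment:
    """Hypothetical alignment: a list of equal-length strings."""

    def __init__(self, rows):
        rows = list(rows)
        if len(set(len(row) for row in rows)) > 1:
            raise ValueError("all rows must have the same length")
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __add__(self, other):
        # Bug 2552 style addition: join two alignments column-wise.
        if len(self) != len(other):
            raise ValueError("alignments must have the same number of rows")
        return ToyAlignment(a + b for a, b in zip(self.rows, other.rows))

    def __getitem__(self, key):
        # Accept align[rows] as well as align[rows, columns].
        if isinstance(key, tuple):
            row_key, col_key = key
        else:
            row_key, col_key = key, slice(None)
        selected = self.rows[row_key]
        if isinstance(row_key, int):
            return selected[col_key]          # one row, or one letter of it
        if isinstance(col_key, int):
            return "".join(row[col_key] for row in selected)  # one column
        return ToyAlignment(row[col_key] for row in selected)


aln = ToyAlignment(["ACGT-ACGTA", "ACGTTACGTA", "AC-TTACGAA"])
sub = aln[1:2, 2:-2]             # sub-alignment: row 1, inner columns
first_column = aln[:, 0]         # a column comes back as a string
ends = aln[:, :2] + aln[:, -2:]  # first and last two columns, joined
```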
> The bx-python alignment object is nice and goes to/from MAF
> and AXT formats:
>
> http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/core.py
>
> This supports slicing by alignment coordinates and by reference
> coordinates for a species in the alignment. Some other useful
> features are limiting the alignment to specific species and removing
> all gap columns that can result. The representation is a high level
> Alignment object containing multiple Components.
My code does not (yet) attempt to deal with next-gen sequencing
alignments, which would require padding all the (short) reads with
leading and trailing gaps to ensure all rows of the alignment have
the same length. Doing this in a memory efficient way could be
done with a PaddedSeq object, or a very different alignment object
(hold reads and their offsets in memory). I'm not sure what is best,
but the bx-python model looks worth understanding to help decide.
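To illustrate the second option mentioned above (holding reads and their offsets rather than padded rows), here is a small hypothetical sketch. OffsetReads is an invented name, not a proposed Biopython class; gaps are generated only when a padded row or a column is requested.

```python
class OffsetReads:
    """Hypothetical container: unpadded reads plus their start offsets."""

    def __init__(self, width):
        self.width = width   # total number of alignment columns
        self._reads = []     # list of (offset, sequence) pairs

    def add(self, offset, seq):
        if offset < 0 or offset + len(seq) > self.width:
            raise ValueError("read does not fit inside the alignment")
        self._reads.append((offset, seq))

    def row(self, i, gap="-"):
        # Build the padded row on demand instead of storing it.
        offset, seq = self._reads[i]
        return gap * offset + seq + gap * (self.width - offset - len(seq))

    def column(self, j, gap="-"):
        # One letter per read: the base covering column j, or a gap.
        return "".join(
            seq[j - offset] if offset <= j < offset + len(seq) else gap
            for offset, seq in self._reads
        )


reads = OffsetReads(width=10)
reads.add(0, "ACGT")
reads.add(3, "TTAC")
reads.add(6, "CGTA")
padded = reads.row(1)
pileup = reads.column(6)
```

Memory use then scales with the total read length rather than with reads times alignment width, which is the trade-off the paragraph above is weighing.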
Perhaps until this is settled, it would be premature to merge my
alignment class to the trunk. After all, we may need to tweak the
alignment object class hierarchy.
Peter
From bioinformed at gmail.com Tue Mar 2 12:29:38 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 07:29:38 -0500
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020402v6b2fab6j88f4c6fc90da15a9@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
<320fb6e01003020402v6b2fab6j88f4c6fc90da15a9@mail.gmail.com>
Message-ID: <2e1434c11003020429y37343796oddf02ad433ab82ea@mail.gmail.com>
On Tue, Mar 2, 2010 at 7:02 AM, Peter wrote:
> On Tue, Mar 2, 2010 at 10:08 AM, Peter wrote:
> > On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs wrote:
> >> The only glitch so far is that the indexed access mode yields sequences
> >> with no alphabet assigned. The solution is to add the following to the
> >> beginning of SffDict.__init__:
> >>         if alphabet is None:
> >>             alphabet = Alphabet.generic_dna
> >
> > Thanks - I'll look at that.
>
> Yes, that looks sensible - change committed. Would you like to be credited
> in our NEWS and CONTRIB file for this little bug fix?
>
>
I'm happy to contribute and be listed in the credits.
Thanks,
-Kevin
From bioinformed at gmail.com Tue Mar 2 12:36:27 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 07:36:27 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
Message-ID: <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
On Tue, Mar 2, 2010 at 7:25 AM, Peter wrote:
> On Wed, Oct 28, 2009 at 12:18 PM, Brad Chapman
> wrote:
> My code does not (yet) attempt to deal with next-gen sequencing
> alignments, which would require padding all the (short) reads with
> leading and trailing gaps to ensure all rows of the alignment have
> the same length. Doing this in a memory efficient way could be
> done with a PaddedSeq object, or a very different alignment object
> (hold reads and their offsets in memory). I'm not sure what is best,
> but the bx-python model looks worth understanding to help decide.
>
> Perhaps until this is settled, it would be premature to merge my
> alignment class to the trunk. After all, we may need to tweak the
> alignment object class hierarchy.
Hi Peter,
I'm just jumping in here and have not yet read all of the background
material. However, I am working with next-gen alignments and am curious as
to what you have in mind. At first glance, it sounds like you want to
access aligned reads in a 'pileup' format (i.e., an object model akin to
http://samtools.sourceforge.net/pileup.shtml). Or are you thinking of
something different entirely?
Best regards,
-Kevin
From bioinformed at gmail.com Tue Mar 2 12:28:22 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 07:28:22 -0500
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
Message-ID: <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
On Tue, Mar 2, 2010 at 5:08 AM, Peter wrote:
> On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs
> wrote:
> > I've tried out the recently landed SFF SeqIO code and am pleased to
> > report that it works very well.
>
> Great :)
>
> If you have suggestions for the documentation please voice them.
> Also did the handling of trimmed reads seem sensible? Until we
> release this we can tweak the API.
I only looked at the module documentation and it was more than sufficient to
get started. I've never really used BioPython before, so I was pleasantly
surprised at how easy it was to get started. The BioPython SFF parser and
indexed access replaced a hairy process of extracting data using 454's
sffinfo and packing it into a BDB file.
> > I am parsing gsMapper 454PairAlign.txt output and
> > converting it to SAM/BAM format to view in IGV (among other things) and
> > wanted to include per-base quality score information from the SFF files.
>
> Are you reading and writing SAM/BAM format with Python? Looking
> into this is on my (long) todo list.
>
Yes-- so far I have code to populate the basic data for unpaired reads, but
none of the optional annotations. My script reads the 454 pairwise
alignment data, finds each read in the source SFF file, figures out if extra
trimming was applied by gsMapper, and extracts the matching PHRED quality
scores. Uniquely mapped reads are given a mapping quality (MAPQ) of 60 and
non-unique reads are assigned MAPQ of 0 (as recommended by the SAMtools
FAQ). The script can output SAM records or create a subprocess to sort the
records and recode to BAM format using samtools. I've attached the current
version of the script and you are welcome to use it for any purpose.
> > My only other comment is that several file reads and struct.unpacks can be
> > merged in _sff_read_seq_record. Given the number of records in most 454 SFF
> > files, I suspect the micro-optimization effort will be worth the slight cost
> > in code clarity.
>
> [...] I guess you mean the flowgram values, flowgram index, bases
> and qualities might be loaded with a single read? That would
> be worth trying.
>
Exactly! Also, flowgrams do not need to be unpacked when trimming. My own
bias is to encode the quality scores and flowgrams in numpy arrays rather
than lists, however I understand that the goal is to keep the external
dependencies to a minimum (although NumPy is required elsewhere).
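A rough sketch of the "single read, single unpack" idea, written in Python 3 rather than the Python 2 of the thread. It assumes the usual layout of an SFF read data section: flowgram values as big-endian unsigned 16-bit integers, then one unsigned byte per base for the flow index, the bases themselves, and one unsigned byte per base for the qualities. The function name `read_sff_data_block` and the sample record are made up for illustration; this is not the actual `_sff_read_seq_record` code.

```python
import io
import struct


def read_sff_data_block(handle, num_flows, num_bases):
    """Read flowgram values, flow index, bases and qualities in one go."""
    # One format string covering all four fields (big-endian, no alignment).
    fmt = ">%dH%dB%ds%dB" % (num_flows, num_bases, num_bases, num_bases)
    size = struct.calcsize(fmt)
    data = handle.read(size)
    if len(data) != size:
        raise ValueError("Premature end of file in read data section")
    fields = struct.unpack(fmt, data)
    flows = fields[:num_flows]
    flow_index = fields[num_flows:num_flows + num_bases]
    bases = fields[num_flows + num_bases]  # a single bytes object
    quals = fields[num_flows + num_bases + 1:]
    return flows, flow_index, bases, quals


# Hypothetical two-flow, three-base record, just to exercise the function.
raw = struct.pack(">2H3B3s3B", 100, 205, 1, 1, 2, b"ACG", 30, 31, 32)
flows, flow_index, bases, quals = read_sff_data_block(io.BytesIO(raw), 2, 3)
```

Whether this is measurably faster than four separate reads would need timing on a real SFF file.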
Also, the test "chr(0)*padding != handle.read(padding)" could be written
just as clearly as "handle.read(padding).count('\0') != padding" and not
generate as many temporary objects.
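For illustration, the two forms of the padding test side by side, as a Python 3 sketch using byte strings. The function names are hypothetical and the handles are in-memory stand-ins for an SFF file.

```python
import io


def padding_ok_concat(handle, padding):
    """Original style: build a string of nulls and compare."""
    return b"\0" * padding == handle.read(padding)


def padding_ok_count(handle, padding):
    """Suggested style: count the null bytes in what was read."""
    return handle.read(padding).count(b"\0") == padding


good = b"\0\0\0\0rest"
bad = b"\0\0x\0rest"
results = [
    (padding_ok_concat(io.BytesIO(data), 4), padding_ok_count(io.BytesIO(data), 4))
    for data in (good, bad)
]
```

Both versions also reject a truncated file, since a short read can never contain `padding` null bytes.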
Best regards,
-Kevin
-------------- next part --------------
# -*- coding: utf-8 -*-
# Convert 454PairAlign.txt and the corresponding SFF files into SAM/BAM format
import re
import sys
from operator import getitem, itemgetter
from itertools import izip, imap, groupby, repeat
from subprocess import Popen, PIPE
import numpy as np
try:
    # Import fancy versions of basic IO functions from my GLU package
    # see http://code.google.com/p/glu-genetics
    from glu.lib.fileutils import autofile,hyphen,table_writer,table_reader
except ImportError:
    import csv
    # The real version handles automatic gz/bz2 (de)compression
    autofile = file
    def hyphen(filename,default):
        if filename=='-' and default is not None:
            return default
        return filename
    # Write a tab-delimited ASCII file
    # The real version handles many more formats (CSV, XLS, Stata), column
    # selection, header options, row filters, and other toys.
    def table_writer(filename,hyphen=None):
        if filename=='-' and hyphen is not None:
            dest = hyphen
        else:
            dest = autofile(filename,'wb')
        return csv.writer(dest, dialect='excel-tab')
    # Read a tab-delimited ASCII file
    # The real version handles many more formats (CSV, XLS, Stata), column
    # selection, header options, row filters, and other toys.
    def table_reader(filename,hyphen=None):
        if filename=='-' and hyphen is not None:
            dest = hyphen
        else:
            dest = autofile(filename,'rb')
        return csv.reader(dest, dialect='excel-tab')
CIGAR_map = { ('-','-'):'P' }
for a in 'NACGTacgt':
    CIGAR_map[a,'-'] = 'I'
    CIGAR_map['-',a] = 'D'
    for b in 'NACGTacgt':
        CIGAR_map[a,b] = 'M'
def make_cigar_py(query,ref):
    assert len(query)==len(ref)
    igar = imap(getitem, repeat(CIGAR_map), izip(query,ref))
    cigar = ''.join('%d%s' % (len(list(run)),code) for code,run in groupby(igar))
    return cigar
# Try to import the optimized Cython version
# The Python version is pretty fast, but I wanted to play with Cython.
try:
    from cigar import make_cigar
except ImportError:
    make_cigar = make_cigar_py
class SFFIndex(object):
    def __init__(self, sfffiles):
        self.sffindex = sffindex = {}
        for sfffile in sfffiles:
            from Bio import SeqIO
            prefix,ext = sfffile[-13:].split('.')
            assert ext=='sff'
            print >> sys.stderr,'Loading SFF index for',sfffile
            reads = SeqIO.index(sfffile, 'sff-trim')
            sffindex[prefix] = reads
    def get_quality(self, qname, query, qstart, qstop):
        prefix = qname[:9]
        sff = self.sffindex.get(prefix)
        if not sff:
            return '*'
        rec = sff[qname]
        phred = rec.letter_annotations['phred_quality']
        sffqual = np.array(phred,dtype=np.uint8)
        sffqual += 33
        sffqual = sffqual.tostring()
        # Align the query to the original read to find the matching quality
        # score information. This is complicated by the extra trimming done by
        # gsMapper. We could obtain this information by parsing the
        # 454TrimStatus.txt, but it is easier to search for the sub-sequence in
        # the reference. One hopes the read maps uniquely, but this is not
        # checked.
        # CASE 1: Forward read alignment
        if qstart<qstop:
            # [...]
                    #print >> sys.stderr,'MATCHED TYPE F2: name=%s, qstart=%d(%d), qstop=%d, qlen=%d, len.query=%d' % (qname,start+1,qstart,qstop,qlen,len(query))
                    qual = sffqual[start:start+len(query)]
        # CASE 2: Backward read alignment
        else:
            # Try using specified cut-points
            read = str(rec.seq.complement())
            seq = read[qstop-1:qstart][::-1]
            read = read[::-1]
            # If it matches, then compute quality
            if seq==query:
                qual = sffqual[qstop-1:qstart][::-1]
            else:
                # otherwise gsMapper applied extra trimming, so we have to manually find the offset
                start = read.index(query)
                seq = read[start:start+len(query)]
                if seq==query:
                    #print >> sys.stderr,'MATCHED TYPE R2: name=%s, qstart=%d, qstop=%d(%d), qlen=%d, len.query=%d' % (qname,qstart,start+1,qstop,qlen,len(query))
                    qual = sffqual[::-1][start:start+len(query)]
        assert seq==query
        assert len(qual) == len(query)
        return qual
def pair_align(filename, sffindex):
    records = autofile(filename)
    split = re.compile('[\t ,.]+').split
    mrnm = '*'
    mpos = 0
    isize = 0
    mapq = 60
    for line in records:
        assert line.startswith('>')
        fields = split(line)
        qname = fields[0][1:]
        qstart = int(fields[1])
        qstop = int(fields[2])
        #qlen = int(fields[4])
        rname = fields[6]
        rstart = int(fields[7])
        rstop = int(fields[8])
        #rlen = int(fields[10])
        query = split(records.next())[2]
        qq = query.replace('-','')
        ref = split(records.next())[2]
        cigar = make_cigar(query,ref)
        qual = sffindex.get_quality(qname, qq, qstart, qstop)
        flag = 0
        if qstart>qstop:
            flag |= 0x10
        if rstart>rstop:
            flag |= 0x20
        yield [qname, flag, rname, rstart, mapq, cigar, mrnm, mpos, isize, qq, qual]
def option_parser():
    import optparse
    usage = 'usage: %prog [options] 454PairAlign.txt[.gz] [SFFfiles.sff..]'
    parser = optparse.OptionParser(usage=usage)
    parser.add_option('-r', '--reflist', dest='reflist', metavar='FILE',
                      help='Reference genome contig list')
    parser.add_option('-o', '--output', dest='output', metavar='FILE', default='-',
                      help='Output SAM file')
    return parser
def main():
    parser = option_parser()
    options,args = parser.parse_args()
    if not args:
        parser.print_help(sys.stderr)
        sys.exit(2)
    sffindex = SFFIndex(args[1:])
    alignment = pair_align(hyphen(args[0],sys.stdin), sffindex)
    write_bam = options.output.endswith('.bam')
    if write_bam:
        if not options.reflist:
            raise ValueError('Conversion to BAM format requires a reference genome contig list (-r/--reflist)')
        # Creating the following two-stage pipeline deadlocks due to problems with subprocess
        # -- use the shell method below instead
        #sammer = Popen(['samtools','import',options.reflist,'-','-'],stdin=PIPE,stdout=PIPE)
        #bammer = Popen(['samtools','sort','-', options.output[:-4]], stdin=sammer.stdout)
        cmd = 'samtools import "%s" - - | samtools sort - "%s"' % (options.reflist,options.output[:-4])
        bammer = Popen(cmd,stdin=PIPE,shell=True,bufsize=-1)
        out = table_writer(bammer.stdin)
    else:
        out = table_writer(options.output,hyphen=sys.stdout)
    out.writerow(['@HD', 'VN:1.0'])
    if options.reflist:
        reflist = table_reader(options.reflist)
        for row in reflist:
            if len(row)<2:
                continue
            contig_name = row[0]
            contig_len = int(row[1])
            out.writerow(['@SQ', 'SN:%s' % contig_name, 'LN:%d' % contig_len])
    print >> sys.stderr, 'Generating alignment from %s to %s' % (args[0],options.output)
    for qname,qalign in groupby(alignment,itemgetter(0)):
        qalign = list(qalign)
        if len(qalign)>1:
            # Set MAPQ to 0 for multiply aligned reads
            for row in qalign:
                row[4] = 0
                out.writerow(row)
        else:
            out.writerow(qalign[0])
    if write_bam:
        print >> sys.stderr,'Finishing BAM encoding...'
        bammer.communicate()
if __name__=='__main__':
    if 1:
        main()
    else:
        try:
            import cProfile as profile
        except ImportError:
            import profile
        import pstats
        prof = profile.Profile()
        try:
            prof.runcall(main)
        finally:
            stats = pstats.Stats(prof)
            stats.strip_dirs()
            stats.sort_stats('time', 'calls')
            stats.print_stats(25)
From biopython at maubp.freeserve.co.uk Tue Mar 2 13:01:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 13:01:53 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
<2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
Message-ID: <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
Kevin wrote:
> I only looked at the module documentation and it was more than sufficient to
> get started. I've never really used BioPython before, so I was pleasantly
> surprised at how easy it was to get started. The BioPython SFF parser and
> indexed access replaced a hairy process of extracting data using 454's
> sffinfo and packing it into a BDB file.
Great :)
>> > I am parsing gsMapper 454PairAlign.txt output and
>> > converting it to SAM/BAM format to view in IGV (among other things) and
>> > wanted to include per-base quality score information from the SFF
>> > files.
>>
>> Are you reading and writing SAM/BAM format with Python? Looking
>> into this is on my (long) todo list.
>
> Yes-- so far I have code to populate the basic data for unpaired reads, but
> none of the optional annotations. My script reads the 454 pairwise
> alignment data, finds each read in the source SFF file, figures out if extra
> trimming was applied by gsMapper, and extracts the matching PHRED quality
> scores. Uniquely mapped reads are given a mapping quality (MAPQ) of 60 and
> non-unique reads are assigned MAPQ of 0 (as recommended by the SAMtools
> FAQ). The script can output SAM records or create a subprocess to sort the
> records and recode to BAM format using samtools. I've attached the current
> version of the script and you are welcome to use it for any purpose.
I'll take a look...
>> [...] I guess you mean the flowgram values, flowgram index, bases
>> and qualities might be loaded with a single read? That would
>> be worth trying.
>
> Exactly!
If I recall, I felt the unpacking was more complicated (and not needed
for the sequence bases), but I agree that if this is faster it is worthwhile.
> Also, flowgrams do not need to be unpacked when trimming.
True, that shouldn't make the function much more complex. I'll try
to look at that later today.
> My own bias is to encode the quality scores and flowgrams in numpy
> arrays rather than lists, however I understand that the goal is to keep
> the external dependencies to a minimum (although NumPy is required
> elsewhere).
Yes, I did wonder about using NumPy here but wanted to ensure that
the core of Biopython remains without an external dependency here.
> Also, the test "chr(0)*padding != handle.read(padding)" could be written
> just as clearly as "handle.read(padding).count('\0') != padding" and not
> generate as many temporary objects.
Good point, done - and you're in the contributors list now ;)
Thanks,
Peter
From biopython at maubp.freeserve.co.uk Tue Mar 2 14:34:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 14:34:07 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
Message-ID: <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
On Tue, Mar 2, 2010 at 12:36 PM, Kevin Jacobs wrote:
> On Tue, Mar 2, 2010 at 7:25 AM, Peter wrote:
>> My code does not (yet) attempt to deal with next-gen sequencing
>> alignments, which would require padding all the (short) reads with
>> leading and trailing gaps to ensure all rows of the alignment have
>> the same length. Doing this in a memory efficient way could be
>> done with a PaddedSeq object, or a very different alignment object
>> (hold read and their offsets in memory). I'm not sure what is best,
>> but the bx-python model looks worth understanding to help decide.
>>
>> Perhaps until this is settled, it would be premature to merge my
>> alignment class to the trunk. After all, we may need to tweak the
>> alignment object class hierarchy.
>
>
> Hi Peter,
>
> I'm just jumping in here and have not yet read all of the background
> material. However, I am working with next-gen alignments and am
> curious as to what you have in mind. At first glance, it sounds like
> you want to access aligned reads in a 'pileup' format (i.e., an object
> model akin to http://samtools.sourceforge.net/pileup.shtml). Or are
> you thinking of something different entirely?
Probably something different. My general concern boils down to
the fact that the current Alignment model as an enhanced "list of
SeqRecord objects" is potentially limiting.
The alignment code in Biopython (and my branch which is basically
an extension to that) deals with classical multiple sequence alignments
like ClustalW etc. You can think of the alignment as a matrix of letters,
each row is a sequence (e.g. a gene), and there will be some gap
characters for insertions, and padding for leading/trailing omissions.
There may or may not be a consensus sequence too.
With assemblies you have a (long) consensus with many (short) reads
aligned to it. In order to hold this as a "matrix" representation, all the
(short) reads would require (lots of) leading/trailing padding. The same
applies when mapping reads to a reference genome.
So, while the current object model may work, all this extra padding
might mean too much of a memory overhead (especially as all the
rows are currently stored as SeqRecord objects). Instead, we might
just store the (short) read sequence, name, and its offset (and
perhaps the strand). We can then reconstruct columns or rows
mimicking the "matrix" interpretation on demand. However, the
API should make it easy to get the unpadded reads and their
offsets too - so the current alignment API might either be extended
or perhaps changed.
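A minimal sketch of the offset-based idea described above: store only each read's name, offset and unpadded sequence, and rebuild padded rows or columns when asked. The class `OffsetAlignment` and its method names are hypothetical, not part of Biopython or of the branch under discussion, and strand is left out to keep it short.

```python
class OffsetAlignment(object):
    """Hypothetical store of reads plus offsets, padded only on demand."""

    def __init__(self, length, gap="-"):
        self.length = length  # number of columns in the "matrix" view
        self.gap = gap
        self._reads = []  # list of (name, offset, unpadded sequence)

    def add_read(self, name, offset, seq):
        if offset < 0 or offset + len(seq) > self.length:
            raise ValueError("Read %s does not fit in the alignment" % name)
        self._reads.append((name, offset, seq))

    def row(self, index):
        """Rebuild one padded row of the matrix view."""
        name, offset, seq = self._reads[index]
        trailing = self.length - offset - len(seq)
        return self.gap * offset + seq + self.gap * trailing

    def column(self, position):
        """Rebuild one column without padding every read first."""
        letters = []
        for name, offset, seq in self._reads:
            if offset <= position < offset + len(seq):
                letters.append(seq[position - offset])
            else:
                letters.append(self.gap)
        return "".join(letters)

    def unpadded(self):
        """Return the raw (name, offset, sequence) tuples."""
        return list(self._reads)


aln = OffsetAlignment(10)
aln.add_read("read1", 0, "ACGT")
aln.add_read("read2", 3, "TTGCA")
aln.add_read("read3", 6, "CATG")
```

Here `aln.row(1)` gives `---TTGCA--` and `aln.column(3)` gives `TT-`, while memory use grows with the total read length rather than with reads times alignment width.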
Related to this, a "Lite" version of the alignment object might
be useful when there is no annotation requiring using SeqRecord
objects. e.g. For ClustalW, FASTA, PHYLIP alignments all we
need is the sequence and identifiers.
Regarding one of your points, accessing aligned reads (or rows)
from an alignment - currently this is only supported by index
(row number). In most cases the reads (rows) have a unique
identifier/name, and thus one idea I am considering for this
branch is overloading the align[...] syntax further to allow a
record's id to be used as an alternative. i.e. More like a dictionary.
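As an illustration of overloading the square-bracket syntax this way, here is a small sketch in which rows can be fetched by integer index, by slice, or by identifier string. The class `KeyedRows` is hypothetical and holds plain (id, sequence) pairs rather than SeqRecord objects; it assumes identifiers are unique.

```python
class KeyedRows(object):
    """Hypothetical container whose rows can be fetched by index or by id."""

    def __init__(self, records):
        # records: iterable of (identifier, sequence) pairs
        self._records = list(records)
        self._index = {}
        for i, (identifier, seq) in enumerate(self._records):
            if identifier in self._index:
                raise ValueError("Duplicate id: %s" % identifier)
            self._index[identifier] = i

    def __getitem__(self, key):
        if isinstance(key, str):
            # Dictionary-like access; raises KeyError for an unknown id.
            return self._records[self._index[key]]
        # List-like access by integer or slice.
        return self._records[key]


rows = KeyedRows([("alpha", "ACGT-"), ("beta", "AC-TT"), ("gamma", "ACGTT")])
```

With this, `rows["beta"]` and `rows[1]` return the same pair. One design question the sketch glosses over is what to do when identifiers are not unique.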
Other ideas for enhancements on this branch include sorting
the rows (with a list like sort method, defaulting to sorting on the
record's id strings), per-column annotation (useful for PFAM
alignments and the match string in pairwise alignments), and
a general annotations dictionary (like we have on SeqRecord
objects).
Peter
From bioinformed at gmail.com Tue Mar 2 14:36:32 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 09:36:32 -0500
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
<2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
<320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
Message-ID: <2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com>
On Tue, Mar 2, 2010 at 8:01 AM, Peter wrote:
> Kevin wrote:
> > My own bias is to encode the quality scores and flowgrams in numpy
> > arrays rather than lists, however I understand that the goal is to keep
> > the external dependencies to a minimum (although NumPy is required
> > elsewhere).
>
> Yes, I did wonder about using NumPy here but wanted to ensure that
> the core of Biopython remains without an external dependency here.
>
In addition to not creating many little objects, my leanings toward using
NumPy are also due to the generality of tricks like the following to recode
quality scores to Sanger ASCII-33 format:
    sffqual  = np.array(rec.letter_annotations['phred_quality'],dtype=np.uint8)
    sffqual += 33
    sffqual  = sffqual.tostring()
That said, the alternatives aren't that slow and small integers are shared
from a pre-allocated pool, so this is not as big a concern.
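For comparison, a sketch of the kind of NumPy-free alternative alluded to here, in Python 3. The function names are made up for illustration; the input is any sequence of PHRED scores such as `rec.letter_annotations['phred_quality']`.

```python
def phred_to_sanger(qualities):
    """Encode PHRED scores as a Sanger (ASCII offset 33) text string."""
    return "".join(chr(q + 33) for q in qualities)


def phred_to_sanger_bytes(qualities):
    """Same encoding, returned as a bytes object."""
    return bytes(q + 33 for q in qualities)


encoded = phred_to_sanger([0, 10, 40])
encoded_bytes = phred_to_sanger_bytes([0, 10, 40])
```

Neither version has been timed against the NumPy recipe.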
-Kevin
From biopython at maubp.freeserve.co.uk Tue Mar 2 14:44:13 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 14:44:13 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
<2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
<320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
<2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com>
Message-ID: <320fb6e01003020644u229e6353ufee054403e562915@mail.gmail.com>
On Tue, Mar 2, 2010 at 2:36 PM, Kevin Jacobs
wrote:
> On Tue, Mar 2, 2010 at 8:01 AM, Peter wrote:
>> Yes, I did wonder about using NumPy here but wanted to ensure that
>> the core of Biopython remains without an external dependency here.
>
> In addition to not creating many little objects, my leanings toward using
> NumPy are also due to the generality of tricks like the following to recode
> quality scores to Sanger ASCII-33 format:
>
>    sffqual  = np.array(rec.letter_annotations['phred_quality'],dtype=np.uint8)
>    sffqual += 33
>    sffqual  = sffqual.tostring()
>
Yeah - I had this kind of thing in mind for the qualities, both when
looking at the SFF files and earlier when doing the FASTQ and
QUAL stuff.
You can probably make that more efficient with one line:
    sffqual = (np.array(rec.letter_annotations['phred_quality'],dtype=np.uint8)
               + 33).tostring()
Not sure if it will make a measurable difference mind you ;)
> That said, the alternatives aren't that slow and small integers are shared
> from a pre-allocated pool, so this is not as big a concern.
Indeed.
Peter
From bioinformed at gmail.com Tue Mar 2 14:51:04 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 09:51:04 -0500
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01003020644u229e6353ufee054403e562915@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
<2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
<320fb6e01003020501hc1c8a70jd4a93b9ddbe1ea26@mail.gmail.com>
<2e1434c11003020636j570a2994u7e4275a7d3e3fd2@mail.gmail.com>
<320fb6e01003020644u229e6353ufee054403e562915@mail.gmail.com>
Message-ID: <2e1434c11003020651y541ce3e5q92fb0fea308a59e9@mail.gmail.com>
On Tue, Mar 2, 2010 at 9:44 AM, Peter wrote:
> You can probably make that more efficient with one line:
>
> sffqual =
> (np.array(rec.letter_annotations['phred_quality'],dtype=np.uint8)
> + 33).tostring()
>
> Not sure if it will make a measurable difference mind you ;)
>
I haven't measured, but my understanding is that the inplace "+= 33" will
avoid creating a temporary copy and thus be quicker. But as you said, not
likely to make a difference in practice.
-Kevin
From chapmanb at 50mail.com Tue Mar 2 15:03:08 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 2 Mar 2010 10:03:08 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
Message-ID: <20100302150308.GP98028@sobchak.mgh.harvard.edu>
Peter and Kevin;
> >> My code does not (yet) attempt to deal with next-gen sequencing
> >> alignments,
[...]
> >> Perhaps until this is settled, it would be premature to merge my
> >> alignment class to the trunk. After all, we may need to tweak the
> >> alignment object class hierarchy.
My vote would be to merge what you've done in for handling
standard multiple alignments, and then look at next-generation read
representation as an analogous but separate problem. All of the
SeqRecord objects which are useful for drilling in on multiple
alignments are likely going to be memory hogs for any real world
next gen work.
> > I'm just jumping in here and have not yet read all of the background
> > material. However, I am working with next-gen alignments and am
> > curious as to what you have in mind. At first glance, it sounds like
> > you want to access aligned reads in a 'pileup' format (i.e., an object
> > model akin to http://samtools.sourceforge.net/pileup.shtml). Or are
> > you thinking of something different entirely?
This is a good way to go. SAM is at least an emerging standard that
people are adopting, and samtools and the pysam module do a good job
of dealing with them:
http://code.google.com/p/pysam/
pysam exposes a Pileup style API from sorted and indexed BAM files
and scales great for large alignment files:
http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html
This is a good starting point for providing interoperability with
Biopython; it would be great to re-use what we can from these
projects.
Brad
From biopython at maubp.freeserve.co.uk Tue Mar 2 15:28:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 15:28:45 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
<320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>
<2e1434c11003011522l5d08c64dh546997449e9528fd@mail.gmail.com>
<320fb6e01003020208i6b38c79dvba5b523a9f146cd3@mail.gmail.com>
<2e1434c11003020428w34d7f3e9rb459573f70683db7@mail.gmail.com>
Message-ID: <320fb6e01003020728v760e8208h5da4288dfaef7ed7@mail.gmail.com>
On Tue, Mar 2, 2010 at 12:28 PM, Kevin Jacobs wrote:
>
> Also, flowgrams do not need to be unpacked when trimming.
>
True - change made on the trunk, should make parsing SFF files
as trimmed records a little bit faster.
Thanks
Peter
From biopython at maubp.freeserve.co.uk Tue Mar 2 16:43:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 16:43:18 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20100302150308.GP98028@sobchak.mgh.harvard.edu>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01003020843n72a23176wa023786c46ffb7b3@mail.gmail.com>
On Tue, Mar 2, 2010 at 3:03 PM, Brad Chapman wrote:
> Peter and Kevin;
>
>> >> My code does not (yet) attempt to deal with next-gen sequencing
>> >> alignments,
> [...]
>> >> Perhaps until this is settled, it would be premature to merge my
>> >> alignment class to the trunk. After all, we may need to tweak the
>> >> alignment object class hierarchy.
>
> My vote would be to merge what you've done in for handling
> standard multiple alignments, and then look at next-generation read
> representation as an analogous but separate problem. All of the
> SeqRecord objects which are useful for drilling in on multiple
> alignments are likely going to be memory hogs for any real world
> next gen work.
OK - that is what I was leaning towards.
What do you think about the fact I am introducing an "improved"
version of the existing Bio.Align.Generic.Alignment class under
Bio.Align.MultipleSeqAlignment?
That's actually several questions in one - should this be a new
object or just enhance the old one? I favour a new object here
because I want to *enforce* the fact that all the rows are the
same length, but I doubt people are using the flexibility of
the current alignment object in this way.
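To make the "enforce equal row lengths" point concrete, a minimal sketch of a container that rejects a row of the wrong length at the moment it is added. The class `EqualLengthRows` is hypothetical and stores plain strings; it is not the MultipleSeqAlignment code from the branch.

```python
class EqualLengthRows(object):
    """Hypothetical alignment-like list that enforces one row length."""

    def __init__(self, rows=()):
        self._rows = []
        for row in rows:
            self.append(row)

    def append(self, row):
        # The first row fixes the alignment length; later rows must match it.
        if self._rows and len(row) != len(self._rows[0]):
            raise ValueError("Sequences must all be the same length")
        self._rows.append(row)

    def get_alignment_length(self):
        return len(self._rows[0]) if self._rows else 0

    def __len__(self):
        return len(self._rows)


align = EqualLengthRows(["ACGT-", "AC-TT"])
align.append("ACGTT")
```

Calling `align.append("ACG")` would raise `ValueError`, which is the behaviour a permissive list of records cannot guarantee.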
Next where should the new object live? I find the current use
of Bio.Align.Generic somewhat hidden away, thus my
suggestion of using Bio.Align directly.
Next, what should the new object be called? We could reuse
the old name of Alignment but it is a bit vague and would
cause confusion given the existing object is also called that.
I have used MultipleSeqAlignment but am open to suggestions
(e.g. MulSeqAlignment is shorter).
Peter
From bioinformed at gmail.com Tue Mar 2 17:07:03 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Tue, 2 Mar 2010 12:07:03 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20100302150308.GP98028@sobchak.mgh.harvard.edu>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
Message-ID: <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
On Tue, Mar 2, 2010 at 10:03 AM, Brad Chapman wrote:
> Kevin;
> > > I'm just jumping in here and have not yet read all of the background
> > > material. However, I am working with next-gen alignments and am
> > > curious as to what you have in mind. At first glance, it sounds like
> > > you want to access aligned reads in a 'pileup' format (i.e., an object
> > > model akin to http://samtools.sourceforge.net/pileup.shtml). Or are
> > > you thinking of something different entirely?
>
> This is a good way to go. SAM is at least an emerging standard that
> people are adopting, and samtools and the pysam module do a good job
> of dealing with them:
>
> http://code.google.com/p/pysam/
>
>
I find pysam pretty limited for doing more than reading and subsetting
SAM/BAM files. I'm planning to add a constructor and helper functions for
creating new aligned reads. The current AlignedRead object is also
read-only, which will need to be relaxed for many serious applications.
Until then, I'm writing (text) SAM records and piping them to samtools to
encode in BAM format (see the script attached to one of my earlier emails).
> pysam exposes a Pileup style API from sorted and indexed BAM files
> and scales great for large alignment files:
>
> http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html
Scalability is okay for conversion to pileup format, but not what I'd
consider great. But I agree, pysam is a good starting point. I just wish
that the read identifiers and attributes were available via the C API,
since those are often needed when, e.g., writing a genotype caller.
-Kevin
From chapmanb at 50mail.com Wed Mar 3 14:12:15 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 3 Mar 2010 09:12:15 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
Message-ID: <20100303141215.GZ98028@sobchak.mgh.harvard.edu>
Kevin and Peter;
> I find pysam pretty limited for doing more than reading and subsetting
> SAM/BAM files. I'm planning to add a constructor and helper functions for
> creating new aligned reads. The current AlignedRead object is also
> read-only, which will need to be relaxed for many serious applications.
> Until then, I'm writing (text) SAM records and piping them to samtools to
> encode in BAM format (see the script attached to one of my earlier emails).
Agreed. These sound like good improvements.
> Scalability is okay for conversion to pileup format, but not what I'd
> consider great. But I agree, pysam is a good starting point. I just wish
> that the read identifiers and attributes were available via the C API,
> since those are often needed when, e.g., writing a genotype caller.
Do you think we could build off of what pysam has? The project hasn't
seemed especially active, but it would be great to have a unified
code base in python for dealing with BAM files. They use mercurial
for revision control, so worst case we can always fork this on
bitbucket and work off of that. Galaxy has a fork for their use:
http://bitbucket.org/kanwei/kanwei-pysam/
The bioconductor folks also seem to be standardizing around SAM/BAM for
their analysis pipelines, so practically we may be able to borrow
some of their APIs once they have a released version of Rsamtools.
> What do you think about the fact I am introducing an "improved"
> version of the existing Bio.Align.Generic.Alignment class under
> Bio.Align.MultipleSeqAlignment?
Yes please. I don't think Generic is that great and am happy to see
it improved upon.
> That's actually several questions in one - should this be a new
> object or just enhance the old one? I favour a new object here
> because I want to *enforce* the fact that all the rows are the
> same length, but I doubt people are using the flexibility of
> the current alignment object in this way.
>
> Next where should the new object live? I find the current use
> of Bio.Align.Generic somewhat hidden away, thus my
> suggestion of using Bio.Align directly.
>
> Next, what should the new object be called? We could reuse
> the old name of Alignment but it is a bit vague and would
> cause confusion given the existing object is also called that.
> I have used MultipleSeqAlignment but am open to suggestions
> (e.g. MulSeqAlignment is shorter).
I like MultipleSeqAlignment, and agree it should be as top level as
possible in Bio.Align. If you think a new object is better, go for
that and we can move Generic on a deprecation path. It's great you
are cleaning this up.
Brad
From biopython at maubp.freeserve.co.uk Wed Mar 3 15:03:38 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 3 Mar 2010 15:03:38 +0000
Subject: [Biopython-dev] EMBOSS eprimer3 parser
In-Reply-To: <320fb6e01001180833l6396cf23meb7e160fd6814e26@mail.gmail.com>
References: <320fb6e01001180833l6396cf23meb7e160fd6814e26@mail.gmail.com>
Message-ID: <320fb6e01003030703k691fdbe8i3ab3dfd5ba1640a6@mail.gmail.com>
On Mon, Jan 18, 2010 at 4:33 PM, Peter wrote:
> Hi all,
>
> Who on the dev list makes heavy use of the EMBOSS eprimer3 parser in
> Biopython? I'd like someone to look over Leighton's proposed enhancements
> to this code: http://bugzilla.open-bio.org/show_bug.cgi?id=2968
>
> There are two main issues. First, the current code doesn't cope with multiple
> primer sets (so Leighton introduces read/parse functions in line with other
> modules for single or multiple sets of primers). This seems entirely sensible
> to me, and worthwhile in itself.
I've made changes on github to do this based on Leighton's code.
> Second, Leighton makes some changes to the primer record objects.
> I'm not so sure about the necessity here, even if it is backwards
> compatible, but I haven't really used this code. What do the rest of
> you think?
I expect to be doing some work with eprimer3 this month, so I should be
able to make a more informed choice later.
Peter
From bugzilla-daemon at portal.open-bio.org Wed Mar 3 15:06:47 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Mar 2010 10:06:47 -0500
Subject: [Biopython-dev] [Bug 2968] Modifications to Emboss eprimer3 parser
and associated files
In-Reply-To:
Message-ID: <201003031506.o23F6lgb005243@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2968
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-03 10:06 EST -------
(In reply to comment #0)
> The existing Emboss primer3/eprimer3 code has a couple of issues, and some
> scope for improvement:
>
> - The existing Primer3.py parser code can only parse output when eprimer3 is
> applied to a single sequence. When eprimer3 is applied to multiple sequence
> input, it groups all primers for all sequences into a single record, which may
> incorrectly associate primers with the wrong sequences in downstream analysis.
> - The current parser lacks an iterator for iterating over multiple sequence
> output
I've made changes on github to support multiple targets (with a read and
a parse function), based on Leighton's code, which address the above
issues.
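As a rough illustration of the read/parse convention referred to here: parse iterates over any number of records, while read insists on exactly one. The sketch below is not the Primer3 module's code, and the "# RESULTS FOR" header line is a made-up stand-in for the real eprimer3 output layout.

```python
def parse(handle):
    """Yield one (target_name, primer_lines) record per target in the handle."""
    name, lines = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("# RESULTS FOR "):  # hypothetical record header
            if name is not None:
                yield name, lines
            name, lines = line[len("# RESULTS FOR "):], []
        elif name is not None and line.strip():
            lines.append(line)
    if name is not None:
        yield name, lines


def read(handle):
    """Return the single record in the handle, or raise ValueError."""
    records = parse(handle)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found in handle") from None
    try:
        next(records)
    except StopIteration:
        return first
    raise ValueError("More than one record found in handle")
```

With this split, primers from different input sequences stay in separate records instead of being pooled into one.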
> - The current parser creates 'ghost' primers for all primer pairs, with length
> zero and sequence as an empty string; it does not do this for internal oligos.
> A more intuitive solution might be to return None for absent primers/oligos
> - The current data model stores all primer data as individual attributes. It
> might be more useful to group the attributes of individual primers into their
> natural associations
Regarding the object changes, I'll be doing some work with eprimer3 this month,
so I should be able to make a more informed choice later.
See also:
http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007255.html
http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007398.html
From biopython at maubp.freeserve.co.uk Wed Mar 3 15:57:09 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 3 Mar 2010 15:57:09 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20100303141215.GZ98028@sobchak.mgh.harvard.edu>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
On Wed, Mar 3, 2010 at 2:12 PM, Brad Chapman wrote:
> Kevin and Peter;
>
>> I find pysam pretty limited for doing more than reading and subsetting
>> SAM/BAM files. I'm planning to add a constructor and helper functions for
>> creating new aligned reads. The current AlignedRead object is also
>> read-only, which will need to be relaxed for many serious applications.
>> Until then, I'm writing (text) SAM records and piping them to samtools to
>> encode in BAM format (see the script attached to one of my earlier emails).
>
> Agreed. These sound like good improvements.
>
>> Scalability is okay for conversion to pileup format, but not what I'd
>> consider great. But I agree, pysam is a good starting point. I just wish
>> that the read identifiers and attributes were available via the C API,
>> since those are often needed when, e.g., writing a genotype caller.
>
> Do you think we could build off of what pysam has? The project hasn't
> seemed especially active, but it would be great to have a unified
> code base in python for dealing with BAM files. They use mercurial
> for revision control, so worst case we can always fork this on
> bitbucket and work off of that. Galaxy has a fork for their use:
>
> http://bitbucket.org/kanwei/kanwei-pysam/
>
> The bioconductor folks also seem to be standardizing around
> SAM/BAM for their analysis pipelines, so practically we may be
> able to borrow some of their APIs once they have a released
> version of Rsamtools.
I agree that we should work towards supporting SAM (and perhaps
also BAM) in Biopython, and other projects' APIs can be very
useful for inspiration or guidance.
I was aware of pysam but am concerned about the dependencies:
pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools
itself - which may all be fine on Linux, but will likely be trouble for
us on other platforms (especially Windows).
Is anyone aware of any other SAM/BAM parser in Python?
>> What do you think about the fact I am introducing an "improved"
>> version of the existing Bio.Align.Generic.Alignment class under
>> Bio.Align.MultipleSeqAlignment?
>
> Yes please. I don't think Generic is that great and am happy to see
> it improved upon.
>
>> That's actually several questions in one - should this be a new
>> object or just enhance the old one? I favour a new object here
>> because I want to *enforce* the fact that all the rows are the
>> same length, but I doubt people are using the flexibility of
>> the current alignment object in this way.
>>
>> Next where should the new object live? I find the current use
>> of Bio.Align.Generic somewhat hidden away, thus my
>> suggestion of using Bio.Align directly.
>>
>> Next, what should the new object be called? We could reuse
>> the old name of Alignment but it is a bit vague and would
>> cause confusion given the existing object is also called that.
>> I have used MultipleSeqAlignment but am open to suggestions
>> (e.g. MulSeqAlignment is shorter).
>
> I like MultipleSeqAlignment, and agree it should be as top level as
> possible in Bio.Align. If you think a new object is better, go for
> that and we can move Generic on a deprecation path. It's great you
> are cleaning this up.
OK then - I've been wanting to "clean this up" for some time.
I'll make time to merge what I have so far (which shouldn't be
controversial) and update the tutorial.
I would also like to investigate moving the useful bits of the
SummaryInfo class into methods of the main alignment class.
Testing would be very welcome!
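For anyone wanting to test, the behaviour under discussion (rows forced to the same length, two-dimensional slicing such as align[1:2, 5:-5], append/extend, and adding alignments) can be pictured with the toy class below. It works on plain strings and is only a sketch under that assumption; the ToyAlignment name and its methods are hypothetical and this is not the MultipleSeqAlignment code.

```python
class ToyAlignment:
    """Toy stand-in for an alignment whose rows must all be the same length."""

    def __init__(self, rows=()):
        self._rows = []
        self.extend(rows)

    def append(self, row):
        # Enforce the rule from the thread: every row has equal length.
        if self._rows and len(row) != len(self._rows[0]):
            raise ValueError("Sequences must all be the same length")
        self._rows.append(row)

    def extend(self, rows):
        for row in rows:
            self.append(row)

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        # align[rows, cols] slices both dimensions; align[i] returns one row.
        if isinstance(index, tuple):
            row_index, col_index = index
            rows = self._rows[row_index]
            if isinstance(row_index, int):
                return rows[col_index]
            if isinstance(col_index, int):
                return "".join(row[col_index] for row in rows)
            return ToyAlignment(row[col_index] for row in rows)
        if isinstance(index, slice):
            return ToyAlignment(self._rows[index])
        return self._rows[index]

    def __add__(self, other):
        # Adding alignments joins them column-wise, so row counts must match.
        if len(self) != len(other):
            raise ValueError("Alignments must have the same number of rows")
        return ToyAlignment(a + b for a, b in zip(self._rows, other._rows))
```

A new class makes the length check unconditional, which is the argument given in the thread for not simply patching the old object.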
Peter
From biopython at maubp.freeserve.co.uk Wed Mar 3 17:51:41 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 3 Mar 2010 17:51:41 +0000
Subject: [Biopython-dev] Alignment object
In-Reply-To: <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
Message-ID: <320fb6e01003030951n261c124bq31578bc9cc5814c9@mail.gmail.com>
On Wed, Mar 3, 2010 at 3:57 PM, Peter wrote:
>
> OK then - I've been wanting to "clean this up" for some time.
> I'll make time to merge what I have so far (which shouldn't be
> controversial) and update the tutorial.
The merge is done, updates to the tutorial to show how to
use the new object pending (but already in the doctests).
Peter
From bioinformed at gmail.com Wed Mar 3 18:30:49 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Wed, 3 Mar 2010 13:30:49 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
<320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
Message-ID: <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
On Wed, Mar 3, 2010 at 10:57 AM, Peter wrote:
> I agree that we should work towards supporting SAM (and perhaps
> also BAM) in Biopython, and other projects APIs can be very
> useful for inspiration or guidance.
>
>
Honestly, the SAM/BAM format specification is pretty dodgy. Thankfully
between samtools and Picard source code, I've been able to work out most of
the tricky bits. I'm glad to know that the R folks are also working on
this, since they're usually very good about generating clear documentation.
> I was aware of pysam but am concerned about the dependencies:
> pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools
> itself - which may all be fine on Linux, but will likely be trouble for
> us on other platforms (especially Windows).
>
> Is anyone aware of any other SAM/BAM parser in Python?
Parsing SAM is pretty simple and I can certainly help with gluing it into
Biopython (with some help on the Biopython side, since I'm still a newb).
I'm about half-way to having a BAM reader and writer for my own purposes.
I'm coding the time-critical parts in Cython with a fallback to pure
Python, so it may not be ideal for use in Biopython.
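To show what "pretty simple" means for the text format, here is a minimal pure-Python sketch that splits a SAM alignment line into its eleven mandatory tab-separated columns and keeps the optional fields as raw strings. The SamRecord name and the function names are made up for illustration, and the column names follow the current specification (rnext, pnext, tlen), which differ from the names in use when this thread was written.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory fields plus tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("Expected at least 11 tab-separated fields")
    (qname, flag, rname, pos, mapq, cigar,
     rnext, pnext, tlen, seq, qual) = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual, fields[11:])


def parse_sam(handle):
    """Yield SamRecord objects, skipping '@' header lines and blank lines."""
    for line in handle:
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_line(line)
```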
-Kevin
From chapmanb at 50mail.com Thu Mar 4 13:13:52 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 4 Mar 2010 08:13:52 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
<2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
Message-ID: <20100304131352.GB19053@sobchak.mgh.harvard.edu>
Kevin and Peter;
> I was aware of pysam but am concerned about the dependencies:
> pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools
> itself - which may all be fine on Linux, but will likely be trouble for
> us on other platforms (especially Windows).
I believe you can remove the pyrex requirement by shipping the
generated C file with the distribution. Samtools itself may be an
issue; however, right now it is probably a practical need for dealing
with SAM/BAM since it implements a lot of BAM generation, sorting,
merging and indexing you need in workflows. Also, the C code is
included with the distribution so it is more a matter of getting it
compiled than introducing extra dependencies. The bioconductor work
appears to do the same thing.
> > I agree that we should work towards supporting SAM (and perhaps
> > also BAM) in Biopython, and other projects APIs can be very
> > useful for inspiration or guidance.
All of my work converts SAM directly into sorted and indexed BAM,
and then builds from that. For me, direct SAM parsing wouldn't be as
useful as BAM.
> Honestly, the SAM/BAM format specification is pretty dodgy. Thankfully
> between samtools and Picard source code, I've been able to work out most of
> the tricky bits. I'm glad to know that the R folks are also working on
> this, since they're usually very good about generating clear documentation.
Agreed, but at least we are converging on something instead of
having to write a parser every time you use a new aligner. The
bioconductor SVN is here:
https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/Rsamtools/
(user: readonly, pass: readonly)
I think the pysam API does a decent job for reading and exposing
this. The higher level things that would be nice to add are:
- Converting the CIGAR string into something more useful.
- Smartly dealing with the X? fields from various aligners. These
often contain very useful information missing from the SAM
specification. Where the data actually is will be aligner
specific.
- More generally easing dealing with the optional fields.
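As one possible reading of the wish list above, the sketch below turns a CIGAR string into (length, operation) pairs and converts optional TAG:TYPE:VALUE fields into a dictionary. The function names are hypothetical, this is not pysam's API, and aligner-specific X? tags are simply passed through as strings rather than interpreted.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Turn '3M1I2M' into [(3, 'M'), (1, 'I'), (2, 'M')]; '*' gives []."""
    if cigar == "*":
        return []
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Rebuild the string to catch characters the pattern silently skipped.
    if "".join("%d%s" % pair for pair in ops) != cigar:
        raise ValueError("Malformed CIGAR string: %r" % cigar)
    return ops


def reference_length(ops):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in ops if op in "MDN=X")


def parse_tags(raw_tags):
    """Convert ['NM:i:1', 'XA:Z:foo'] into {'NM': 1, 'XA': 'foo'}."""
    converters = {"i": int, "f": float}
    tags = {}
    for raw in raw_tags:
        tag, type_code, value = raw.split(":", 2)
        tags[tag] = converters.get(type_code, str)(value)
    return tags
```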
> Parsing SAM is pretty simple and I can certainly help with gluing it into
> Biopython (with some help on the Biopython side, since I'm still a newb).
> I'm about half-way to having a BAM reader and writer for my own purposes.
> I'm coding the time-critical parts in Cython with a fallback to pure
> Python, so it may not be ideal for use in Biopython.
Cool. Does the BAM reader require samtools C code or is it
independent of that?
Brad
From aaronquinlan at gmail.com Thu Mar 4 13:33:40 2010
From: aaronquinlan at gmail.com (Aaron Quinlan)
Date: Thu, 4 Mar 2010 08:33:40 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20100304131352.GB19053@sobchak.mgh.harvard.edu>
References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<20091028121833.GC22395@sobchak.mgh.harvard.edu>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
<2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
<20100304131352.GB19053@sobchak.mgh.harvard.edu>
Message-ID:
Just an FYI for those interested in developing tools to work with BAM: it may also be worth looking into the BamTools C++ API developed by Derek Barnett at Boston College (http://sourceforge.net/projects/bamtools/). The API is quite nice and has much of the necessary functionality for iterators, getters/setters, etc.
I added BAM support for my BEDTools package (http://code.google.com/p/bedtools/) using the BAMTools libraries. Save for a few minor bugs along the way, it was rather straightforward to include.
Aaron
Aaron Quinlan, Ph.D.
NRSA Postdoctoral Fellow
Hall Laboratory
University of Virginia
Biochem. & Mol. Genetics
aaronquinlan at gmail.com
On Mar 4, 2010, at 8:13 AM, Brad Chapman wrote:
> Kevin and Peter;
>
>> I was aware of pysam but am concerned about the dependencies:
>> pyrex 0.9.8 or later, python 2.6 or later, plus of course SAMtools
>> itself - which may all be fine on Linux, but will likely be trouble for
>> us on other platforms (especially Windows).
>
> I believe you can remove the pyrex requirement by shipping the
> generated C file with the distribution. Samtools itself may be an
> issue; however, right now it is probably a practical need for dealing
> with SAM/BAM since it implements a lot of BAM generation, sorting,
> merging and indexing you need in workflows. Also, the C code is
> included with the distribution so it is more a matter of getting it
> compiled than introducing extra dependencies. The bioconductor work
> appears to do the same thing.
>
>>> I agree that we should work towards supporting SAM (and perhaps
>>> also BAM) in Biopython, and other projects APIs can be very
>>> useful for inspiration or guidance.
>
> All of my work converts SAM directly into sorted and indexed BAM,
> and then build from that. For me, direct SAM parsing wouldn't be as
> useful as BAM.
>
>> Honestly, the SAM/BAM format specification is pretty dodgy. Thankfully
>> between samtools and Picard source code, I've been able to work out most of
>> the tricky bits. I'm glad to know that the R folks are also working on
>> this, since they're usually very good about generating clear documentation.
>
> Agreed, but at least we are converging on something instead of
> having to write a parser every time you use a new aligner. The
> bioconductor SVN is here:
>
> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/Rsamtools/
> (user: readonly, pass: readonly)
>
> I think the pysam API does a decent job for reading and exposing
> this. The higher level things that would be nice to add are:
>
> - Converting the CIGAR string into something more useful.
> - Smartly dealing with the X? fields from various aligners. These
> often contain very useful information missing from the SAM
> specification. Where the data actually is will be aligner
> specific.
> - More generally easing dealing with the optional fields.
>
>> Parsing SAM is pretty simple and I can certainly help with gluing it into
>> Biopython (with some help on the Biopython side, since I'm still a newb).
>> I'm about half-way to having a BAM reader and writer for my own purposes.
>> I'm coding the time-critical parts in Cython with a fallback to pure
>> Python, so it may not be ideal for use in Biopython.
>
> Cool. Does the BAM reader require samtools C code or is it
> independent of that?
>
> Brad
From bioinformed at gmail.com Thu Mar 4 13:44:39 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Thu, 4 Mar 2010 08:44:39 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <20100304131352.GB19053@sobchak.mgh.harvard.edu>
References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<320fb6e01003020425y1455fc59ub2f04f96a079569a@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
<2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
<20100304131352.GB19053@sobchak.mgh.harvard.edu>
Message-ID: <2e1434c11003040544j278ffb0fya984cd2668a6d278@mail.gmail.com>
On Thu, Mar 4, 2010 at 8:13 AM, Brad Chapman wrote:
> All of my work converts SAM directly into sorted and indexed BAM,
> and then build from that. For me, direct SAM parsing wouldn't be as
> useful as BAM.
Same here-- I construct and unserialize alignment data into SAM-like
records, but it would be foolish to actually store them natively to disk.
>
> > Parsing SAM is pretty simple and I can certainly help with gluing it into
> > Biopython (with some help on the Biopython side, since I'm still a newb).
> > I'm about half-way to having a BAM reader and writer for my own purposes.
> > I'm coding the time-critical parts in Cython with a fallback to pure
> > Python, so it may not be ideal for use in Biopython.
>
> Cool. Does the BAM reader require samtools C code or is it
> independent of that?
>
It is intended to be independent of the samtools distribution, though some
of the C code is currently duplicated (e.g., bgzf). Of course, a
Cython/Python re-write would be simple enough, though obviously extra work.
-Kevin
From bioinformed at gmail.com Thu Mar 4 13:52:33 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Thu, 4 Mar 2010 08:52:33 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To:
References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<2e1434c11003020436g62a65774q184e7b9c001f87d2@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
<2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
<20100304131352.GB19053@sobchak.mgh.harvard.edu>
Message-ID: <2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com>
On Thu, Mar 4, 2010 at 8:33 AM, Aaron Quinlan wrote:
> Just an FYI for those interested in developing tools to work with BAM: it
> may also be worth looking into the BamTools C++ API developed by Derek
> Barnett at Boston College (http://sourceforge.net/projects/bamtools/).
> The API is quite nice and has much of the necessary functionality for
> iterators, getters/setters, etc.
>
> I added BAM support for my BEDTools package (
> http://code.google.com/p/bedtools/) using the BAMTools libraries. Save
> for a few minor bugs along the way, it was rather straightforward to
> include.
>
Thanks for the tip, Aaron. I was unaware of both bamtools and bedtools.
The bamtools code looks well designed and quite similar to my emerging
Cython/Python rendition.
-Kevin
From bioinformed at gmail.com Thu Mar 4 14:07:03 2010
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Thu, 4 Mar 2010 09:07:03 -0500
Subject: [Biopython-dev] Alignment object
In-Reply-To: <2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com>
References: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
<320fb6e01003020634o1684c29fl68ea24540ec7f0af@mail.gmail.com>
<20100302150308.GP98028@sobchak.mgh.harvard.edu>
<2e1434c11003020907v195359bfm87b3139d5e73f60b@mail.gmail.com>
<20100303141215.GZ98028@sobchak.mgh.harvard.edu>
<320fb6e01003030757g54ead66i8cafdcad2e179058@mail.gmail.com>
<2e1434c11003031030i3951672ck7b59ab1a3fdf3660@mail.gmail.com>
<20100304131352.GB19053@sobchak.mgh.harvard.edu>
<2e1434c11003040552y2ec38b01gc456c310249bb3e5@mail.gmail.com>
Message-ID: <2e1434c11003040607i68904329rc122e3acad9cdbe3@mail.gmail.com>
On Thu, Mar 4, 2010 at 8:52 AM, Kevin Jacobs <
bioinformed at gmail.com> wrote:
> On Thu, Mar 4, 2010 at 8:33 AM, Aaron Quinlan wrote:
>
>> Just an FYI for those interested in developing tools to work with BAM: it
>> may also be worth looking into the BamTools C++ API developed by Derek
>> Barnett at Boston College (http://sourceforge.net/projects/bamtools/).
>> The API is quite nice and has much of the necessary functionality for
>> iterators, getters/setters, etc.
>>
>> I added BAM support for my BEDTools package (
>> http://code.google.com/p/bedtools/) using the BAMTools libraries. Save
>> for a few minor bugs along the way, it was rather straightforward to
>> include.
>>
>
> Thanks for the tip, Aaron. I was unaware of both bamtools and bedtools.
> The bamtools code looks well designed and quite similar to my emerging
> Cython/Python rendition.
>
>
Ouch-- never mind. The bamtools code isn't endian-clean -- it will only
work correctly on native little-endian architectures.
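For context, BAM stores its integers in little-endian byte order, so a portable reader has to say so explicitly instead of relying on the host's native layout. A minimal Python sketch of the idea follows; the function names are hypothetical, and the input is assumed to be BAM data that has already been decompressed.

```python
import struct


def read_int32_le(data, offset=0):
    """Decode a signed 32-bit little-endian integer on any host byte order."""
    # "<i" fixes the byte order; a bare "i" would use the machine's native order.
    return struct.unpack_from("<i", data, offset)[0]


def bam_header_text_length(data):
    """Return the header text length stored after the 4-byte BAM magic."""
    if data[:4] != b"BAM\x01":
        raise ValueError("Not uncompressed BAM data")
    return read_int32_le(data, 4)
```

Reading the raw bytes straight into native integers gives the same answer on little-endian machines, which is why this kind of bug tends to go unnoticed there.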
-Kevin
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 10:47:36 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 05:47:36 -0500
Subject: [Biopython-dev] [Bug 2551] Adding advanced __getitem__ to generic
alignment, e.g. align[1:2, 5:-5]
In-Reply-To:
Message-ID: <201003051047.o25Ala5W006656@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2551
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:47 EST -------
Git branch merged to trunk as discussed on the dev mailing list, marking this
enhancement as fixed.
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 10:48:18 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 05:48:18 -0500
Subject: [Biopython-dev] [Bug 2552] Adding alignments
In-Reply-To:
Message-ID: <201003051048.o25AmIoF006689@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2552
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:48 EST -------
Git branch merged to trunk as discussed on the dev mailing list, marking this
enhancement as fixed.
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 10:48:34 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 05:48:34 -0500
Subject: [Biopython-dev] [Bug 2553] Adding SeqRecord objects to an alignment
(append or extend)
In-Reply-To:
Message-ID: <201003051048.o25AmYYH006723@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2553
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:48 EST -------
Git branch merged to trunk as discussed on the dev mailing list, marking this
enhancement as fixed.
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 10:48:36 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 05:48:36 -0500
Subject: [Biopython-dev] [Bug 2554] Creating an Alignment from a list of
SeqRecord objects
In-Reply-To:
Message-ID: <201003051048.o25AmaIn006735@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2554
Bug 2554 depends on bug 2553, which changed state.
Bug 2553 Summary: Adding SeqRecord objects to an alignment (append or extend)
http://bugzilla.open-bio.org/show_bug.cgi?id=2553
What |Old Value |New Value
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 10:48:50 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 05:48:50 -0500
Subject: [Biopython-dev] [Bug 2554] Creating an Alignment from a list of
SeqRecord objects
In-Reply-To:
Message-ID: <201003051048.o25AmoWN006761@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2554
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:48 EST -------
Git branch merged to trunk as discussed on the dev mailing list, marking this
enhancement as fixed.
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 10:50:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 05:50:45 -0500
Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM
In-Reply-To:
Message-ID: <201003051050.o25Aojkg006835@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2905
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Summary|Short read alignment format |Short read alignment format
| |SAM / BAM
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 05:50 EST -------
Updating summary to include SAM and BAM keywords. See also recent mailing list
discussions such as this thread:
http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007397.html
From bugzilla-daemon at portal.open-bio.org Fri Mar 5 11:40:05 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Mar 2010 06:40:05 -0500
Subject: [Biopython-dev] [Bug 3010] Bio.KDTree is leaking memory
In-Reply-To:
Message-ID: <201003051140.o25Be532008197@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3010
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-05 06:40 EST -------
I suspect any memory leak is within KDTree.c function KDTree_set_data. Looking
at this I wondered how the memory allocated by KDTree_add_point gets freed.
The following *might* help, but even if I am right, this is at best only a
partial fix:
diff --git a/Bio/KDTree/KDTree.c b/Bio/KDTree/KDTree.c
index d074f26..07cdc1f 100644
--- a/Bio/KDTree/KDTree.c
+++ b/Bio/KDTree/KDTree.c
@@ -621,9 +621,14 @@ int KDTree_set_data(struct KDTree* tree, float *coords, long
         tree->_radius_list = NULL;
     }
     tree->_count=0;
+    if (tree->_data_point_list) {
+        free(tree->_data_point_list);
+        tree->_data_point_list = NULL;
+        tree->_data_point_list_size = 0;
+    }
     /* keep pointer to coords to delete it */
     tree->_coords=coords;
From p.j.a.cock at googlemail.com Wed Mar 10 14:30:57 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 10 Mar 2010 14:30:57 +0000
Subject: [Biopython-dev] Biopython & Google Summer of Code 2010 (GSoc)
Message-ID: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com>
Dear Biopythoneers,
The Open Bioinformatics Foundation (the Bio* umbrella
organisation) is preparing an application for the 2010
Google Summer of Code (GSoC).
http://code.google.com/soc/
If you are interested in becoming a mentor for a Biopython
related project, you can join us in the application. If you are
a student and are interested in a project (or would like to
propose one), please take a look at these pages:
http://www.open-bio.org/wiki/Google_Summer_of_Code
http://biopython.org/wiki/Google_Summer_of_Code
Regards,
Brad & Peter
From biopython at maubp.freeserve.co.uk Thu Mar 11 11:21:50 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Mar 2010 11:21:50 +0000
Subject: [Biopython-dev] Bio.Phylo.Applications?
Message-ID: <320fb6e01003110321u6ac77a89uce77306d332e675c@mail.gmail.com>
Hi Eric et al,
We have started a collection of command line tool wrappers for
multiple sequence alignments under Bio.Align.Applications, so I was
thinking about where to put wrappers for phylogenetic tree command
line tools. How does Bio.Phylo.Applications sound (following the same
structure as the Bio.Align.Applications module)? The kind of things I
am thinking about include:
QuickTree (neighbour joining, NJ)
http://www.sanger.ac.uk/resources/software/quicktree/
QuickJoin (NJ)
http://www.daimi.au.dk/~mailund/quick-join.html
RAxML (maximum likelihood, ML)
http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm
[We should talk to Biopython contributor Frank Kauff as he uses this
with Python]
And so on. Plus pointers in the documentation to the EMBOSS module for
PHYLIP tools.
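As a rough sketch of what such a wrapper boils down to: the class name
ToyCommandline, the tool name "mytreetool" and its switches are all made up
for illustration. Real wrappers would presumably follow the existing
Bio.Align.Applications structure and each tool's actual options.

```python
import shlex


class ToyCommandline(object):
    """Hypothetical minimal wrapper that builds a command line string."""

    def __init__(self, cmd, **options):
        self.cmd = cmd
        # Keyword names double as the (made up) command line switches.
        self.options = dict(options)

    def __str__(self):
        parts = [self.cmd]
        for name in sorted(self.options):
            parts.append("-%s" % name)
            # Quote values so spaces in filenames survive the shell.
            parts.append(shlex.quote(str(self.options[name])))
        return " ".join(parts)


# "mytreetool" and its switches are invented; no such program is assumed.
cline = ToyCommandline("mytreetool", infile="example.aln",
                       outfile="example tree.nwk")
command_string = str(cline)
```

The resulting string could be handed to the subprocess module, which keeps
the wrapper itself free of any process handling.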
Peter
From biopython at maubp.freeserve.co.uk Thu Mar 11 11:30:04 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Mar 2010 11:30:04 +0000
Subject: [Biopython-dev] Adding format method to phylo tree object?
Message-ID: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com>
Hi Eric (et al),
Are you familiar with the format method of the SeqRecord and alignment
object (plus the __format__ method which does the same thing aiming to
work nicely with the Python 2.6 built in function format)? This allows
the user to turn their data into a string in a specified output
format. Internally the method calls Bio.SeqIO.write (or AlignIO) with
a StringIO handle.
Do you think it would make sense to have this for the tree objects
in Bio.Phylo, allowing easy access to the object as a Newick tree
format etc?
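A minimal sketch of the pattern described above, for illustration only:
ToyTree, write_newick and the _WRITERS table are hypothetical stand-ins, not
the Bio.Phylo API. Only the StringIO round trip mirrors what the SeqRecord
method is described as doing.

```python
from io import StringIO


def write_newick(tree, handle):
    """Hypothetical writer: serialise a nested-tuple tree as Newick."""
    def render(node):
        # Leaves are plain strings; internal nodes are tuples of children.
        if isinstance(node, str):
            return node
        return "(" + ",".join(render(child) for child in node) + ")"

    handle.write(render(tree.root) + ";\n")
    return 1  # number of trees written


# Hypothetical registry mapping format names to writer functions.
_WRITERS = {"newick": write_newick}


class ToyTree(object):
    """Stand-in for a tree object, for illustration only."""

    def __init__(self, root):
        self.root = root

    def format(self, format_name):
        """Return the tree as a string in the requested file format."""
        try:
            writer = _WRITERS[format_name]
        except KeyError:
            raise ValueError("Unknown format %r" % format_name)
        handle = StringIO()
        writer(self, handle)
        return handle.getvalue()

    def __format__(self, format_spec):
        # Lets the built-in format() work; an empty spec falls back to str().
        if not format_spec:
            return str(self)
        return self.format(format_spec)


tree = ToyTree((("A", "B"), "C"))
newick_text = tree.format("newick")
```

With this in place, format(tree, "newick") and tree.format("newick") return
the same string, so adding a new output format only means registering
another writer.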
For people using IPython, the __pretty__ method looks related. I know
the Bio.Nexus tree has a "pretty print" method which might be exposed
like this. I wonder if this convention will become more widespread?
http://ipython.scipy.org/doc/stable/html/api/generated/IPython.external.pretty.html
Peter
From p.j.a.cock at googlemail.com Thu Mar 11 15:34:07 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Mar 2010 15:34:07 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
Message-ID: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Hi all,
It is probably time to start getting ready for Biopython 1.54,
perhaps aiming to release within about a month's time?
This means not landing any major additions to the trunk for now (keep
things like GFF and Geography on branches for now).
Other than finishing up any documentation for new stuff (especially
the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon,
are there any important issues we should address before the release?
Regards,
Peter
From tiagoantao at gmail.com Thu Mar 11 15:42:21 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 11 Mar 2010 15:42:21 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Message-ID: <6d941f121003110742p524a2d86wbe111ccf880d8bb@mail.gmail.com>
On Thu, Mar 11, 2010 at 3:34 PM, Peter Cock wrote:
> Other than finishing up any documentation for new stuff (especially
> the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon,
> are there any important issues we should address before the release?
I think I will be able to commit my code around the 20th. Currently I
need to address the issue of supporting thousands of markers in the
genepop parser as people do complain about that (like a couple of
times a month or so, not more).
--
"Heavier than air flying machines are impossible"
Lord Kelvin, President, Royal Society, c. 1895
From andrea at biocomp.unibo.it Thu Mar 11 17:11:00 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 11 Mar 2010 18:11:00 +0100 (CET)
Subject: [Biopython-dev] Planning for Biopython 1.54
Message-ID: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>
What about the Uniprot XML format parser?
The code is functional, and was reviewed, but it would be nice to have some
beta testing.
The only remaining "issue" is where to save the comment fields.
The actual implementation will work for biosql schema, and store most
of the data in the comment fields.
Andrea
From p.j.a.cock at googlemail.com Thu Mar 11 17:31:08 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Mar 2010 17:31:08 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>
References: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com>
On Thu, Mar 11, 2010 at 5:11 PM, Andrea Pierleoni
wrote:
> What about the Uniprot XML format parser?
> The code is functional, and was reviewed, but it would be nice to have some
> beta testing.
> The only remaining "issue" is where to save the comment fields.
> The actual implementation will work for biosql schema, and store most
> of the data in the comment fields.
>
> Andrea
Hi Andrea,
Your UniProt XML parser was one of the things I thought we should
delay until after getting Biopython 1.54 out the door, but I would
expect it to be included in Biopython 1.55.
There are at least two remaining issues, (1) where to save the comment
fields, and (2) what to call the format in SeqIO. Both of these should
ideally be run by BioPerl and EMBOSS on the openbio-l mailing list to
ensure the OBF projects which use simple strings for file formats are
consistent. Would you like me to start a discussion there regarding
the format name? e.g. Should it be "uniprot", "uniprot-xml", or maybe
even "uniprotxml". Personally, "uniprot" seems fine provided this is
going to be the primary file format for UniProt records in the short
to medium term.
Also I don't think any of the current Biopython developers have sat
down to review the code. As the Bio.SeqIO maintainer, I will do this,
but right now I think getting Biopython 1.54 out should be
prioritised. From a very quick look just now, the recent merging of
the SFF support to the trunk will require a few tweaks in
test_SeqIO.py (e.g. an empty file is not valid for SFF files as well
as the UniProt XML). Also including a UniProt XML file in
test_BioSQL_SeqIO.py would be worthwhile.
Regards,
Peter
From andrea at biocomp.unibo.it Thu Mar 11 17:43:13 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 11 Mar 2010 18:43:13 +0100 (CET)
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com>
References: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>
<320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com>
Message-ID: <4ee0d56a0ed98ff87b2dcf00b2c0d6e8.squirrel@lipid.biocomp.unibo.it>
>
> Hi Andrea,
>
> Your UniProt XML parser was one of the things I thought we should
> delay until after getting Biopython 1.54 out the door, but I would
> expect it to be included in Biopython 1.55.
>
> There are at least two remaining issues, (1) where to save the comment
> fields, and (2) what to call the format in SeqIO. Both of these should
> ideally be run by BioPerl and EMBOSS on the openbio-l mailing list to
> ensure the OBF projects which use simple strings for file formats are
> consistent. Would you like me to start a discussion there regarding
> the format name? e.g. Should it be "uniprot", "uniprot-xml", or maybe
> even "uniprotxml". Personally, "uniprot" seems fine provided this is
> going to be the primary file format for UniProt records in the short
> to medium term.
>
Of course you are free to open a discussion. I used 'uniprot' for the sake of
simplicity, but then I noticed that the format is called 'uniprotxml' in
EBI REST web services. A common name will be easier for everybody.
> Also I don't think any of the current Biopython developers have sat
> down to review the code.
The code was reviewed by Mauro Amico, I don't know if he is one of the
"current Biopython developers", anyhow any additional review is welcome.
> As the Bio.SeqIO maintainer, I will do this,
> but right now I think getting Biopython 1.54 out should be
> prioritised. From a very quick look just now, the recent merging of
> the SFF support to the trunk will require a few tweaks in
> test_SeqIO.py (e.g. an empty file is not valid for SFF files as well
> as the UniProt XML). Also including a UniProt XML file in
> test_BioSQL_SeqIO.py would be worthwhile.
>
Mauro also added some unit testing that should be useful for this.
Let me know if you need any help/info.
Bests,
Andrea
From p.j.a.cock at googlemail.com Thu Mar 11 17:49:50 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Mar 2010 17:49:50 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <4ee0d56a0ed98ff87b2dcf00b2c0d6e8.squirrel@lipid.biocomp.unibo.it>
References: <686fb59bbdd586934afd4f47c41b923d.squirrel@lipid.biocomp.unibo.it>
<320fb6e01003110931h71fba0dcm2a392e43ca045088@mail.gmail.com>
<4ee0d56a0ed98ff87b2dcf00b2c0d6e8.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01003110949v206a1868g6360002198a41ddd@mail.gmail.com>
On Thu, Mar 11, 2010 at 5:43 PM, Andrea Pierleoni
wrote:
>
>>
>> Hi Andrea,
>>
>> Your UniProt XML parser was one of the things I thought we should
>> delay until after getting Biopython 1.54 out the door, but I would
>> expect it to be included in Biopython 1.55.
>>
>> There are at least two remaining issues, (1) where to save the comment
>> fields, and (2) what to call the format in SeqIO. Both of these should
>> ideally be run by BioPerl and EMBOSS on the openbio-l mailing list to
>> ensure the OBF projects which use simple strings for file formats are
>> consistent. Would you like me to start a discussion there regarding
>> the format name? e.g. Should it be "uniprot", "uniprot-xml", or maybe
>> even "uniprotxml". Personally, "uniprot" seems fine provided this is
>> going to be the primary file format for UniProt records in the short
>> to medium term.
>
> Of course you are free to open a discussion. I used 'uniprot' for the sake of
> simplicity, but then I noticed that the format is called 'uniprotxml' in
> EBI REST web services. A common name will be easier for everybody.
In that case, given the EBI REST convention, uniprotxml may be wise.
>> Also I don't think any of the current Biopython developers have sat
>> down to review the code.
>
> The code was reviewed by Mauro Amico, I don't know if he is one of the
> "current Biopython developers", anyhow any additional review is welcome.
I don't recall Mauro Amico contributing to Biopython in the past, but
as you say, the more eyes on the code the better :)
Peter
From eric.talevich at gmail.com Thu Mar 11 22:54:38 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 11 Mar 2010 17:54:38 -0500
Subject: [Biopython-dev] Adding format method to phylo tree object?
In-Reply-To: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com>
References: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com>
Message-ID: <3f6baf361003111454l6d1f0409pcb732e006a8b8f67@mail.gmail.com>
On Thu, Mar 11, 2010 at 6:30 AM, Peter wrote:
> Hi Eric (et al),
>
> Are you familiar with the format method of the SeqRecord and alignment
> object (plus the __format__ method which does the same thing aiming to
> work nicely with the Python 2.6 built in function format)? This allows
> the user to turn their data into a string in a specified output
> format. Internally the method calls Bio.SeqIO.write (or AlignIO) with
> a StringIO handle.
>
> Do you think it would make sense to have this for the tree objects
> in Bio.Phylo, allowing easy access to the object as a Newick tree
> format etc?
>
Sure, I could do that. It makes a lot of sense for Newick trees, and could
be useful with the XML formats for debugging.
>
> For people using IPython, the __pretty__ method looks related. I know
> the Bio.Nexus tree has a "pretty print" method which might be exposed
> like this. I wonder if this convention will become more widespread?
>
> http://ipython.scipy.org/doc/stable/html/api/generated/IPython.external.pretty.html
>
I didn't know about that. I also have a pretty_print method in Bio.Phylo
which does something quite different from the Bio.Nexus printer -- the Nexus
one looks more useful for debugging the Tree object's
internal structure in terms of references, so (highly biased judgment) I'm
inclined to use the code from Bio.Phylo._utils.pretty_print to implement
__pretty__ for IPython. But I'll play with this IPython feature to see how
it's supposed to behave in general.
-Eric
From eric.talevich at gmail.com Thu Mar 11 23:03:59 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 11 Mar 2010 18:03:59 -0500
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Message-ID: <3f6baf361003111503m4656258av721852264516818f@mail.gmail.com>
On Thu, Mar 11, 2010 at 10:34 AM, Peter Cock wrote:
> Hi all,
>
> It is probably time to start getting ready for Biopython 1.54,
> perhaps aiming to release within about a month's time?
>
> This means not landing any major additions to the trunk for now (keep
> things like GFF and Geography on branches for now).
>
> Other than finishing up any documentation for new stuff (especially
> the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon,
> are there any important issues we should address before the release?
>
Is it all right to leave the documentation for Bio.Phylo on the wiki for
now, or should I try to add something to the main tutorial?
-Eric
From p.j.a.cock at googlemail.com Thu Mar 11 23:18:18 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Mar 2010 23:18:18 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <3f6baf361003111503m4656258av721852264516818f@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
<3f6baf361003111503m4656258av721852264516818f@mail.gmail.com>
Message-ID: <320fb6e01003111518o3f50b95bw6b2446611fbb9bf5@mail.gmail.com>
On Thu, Mar 11, 2010 at 11:03 PM, Eric Talevich wrote:
>> Other than finishing up any documentation for new stuff (especially
>> the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon,
>> are there any important issues we should address before the release?
>
> Is it all right to leave the documentation for Bio.Phylo on the wiki
> for now, or should I try to add something to the main tutorial?
I would like at least a short section in the tutorial mentioning
the new module with a link to the wiki. That way people just
browsing the tutorial to get an idea of what Biopython covers
will be made aware of it. In the long term I think the module
deserves a chapter (which can be based on the wiki text).
Are you familiar with LaTeX? (The mark up language the
tutorial is written in).
Also, I think it would be great to have a post on the news
server (which we can link to in the release announcement)
talking about what Bio.Phylo adds (and thank GSoC and
NESCent etc). A little advertising ;) How does that sound?
Regards,
Peter
From biopython at maubp.freeserve.co.uk Thu Mar 11 23:23:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Mar 2010 23:23:54 +0000
Subject: [Biopython-dev] Adding format method to phylo tree object?
In-Reply-To: <3f6baf361003111454l6d1f0409pcb732e006a8b8f67@mail.gmail.com>
References: <320fb6e01003110330u63c9317av537b0a2c552052fc@mail.gmail.com>
<3f6baf361003111454l6d1f0409pcb732e006a8b8f67@mail.gmail.com>
Message-ID: <320fb6e01003111523r4fe5f4c7va9f77e089385ba0c@mail.gmail.com>
On Thu, Mar 11, 2010 at 10:54 PM, Eric Talevich wrote:
> On Thu, Mar 11, 2010 at 6:30 AM, Peter wrote:
>
>> Hi Eric (et al),
>>
>> Are you familiar with the format method of the SeqRecord and alignment
>> object (plus the __format__ method which does the same thing aiming to
>> work nicely with the Python 2.6 built in function format)? This allows
>> the user to turn their data into a string in a specified output
>> format. Internally the method calls Bio.SeqIO.write (or AlignIO) with
>> a StringIO handle.
>>
>> Do you think it would make sense to have this for the tree objects
>> in Bio.Phylo, allowing easy access to the object as a Newick tree
>> format etc?
>>
>
> Sure, I could do that. It makes a lot of sense for Newick trees, and could
> be useful with the XML formats for debugging.
>
Great.
>> For people using IPython, the __pretty__ method looks related. I know
>> the Bio.Nexus tree has a "pretty print" method which might be exposed
>> like this. I wonder if this convention will become more widespread?
>>
>> http://ipython.scipy.org/doc/stable/html/api/generated/IPython.external.pretty.html
>>
>
> I didn't know about that.
I only read about it recently myself - it may not be worth doing.
(I'm not trying to invent work here *grin*, just looking for things
we can polish before your code gets its first proper release.)
Thanks,
Peter
From lpritc at scri.ac.uk Fri Mar 12 08:18:09 2010
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 12 Mar 2010 08:18:09 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Message-ID:
On 11/03/2010 Thursday, March 11, 15:34, "Peter Cock"
wrote:
> Other than finishing up any documentation for new stuff (especially
> the Tutorial), and the Bio.PopGen stuff Tiago hopes to tackle soon,
> are there any important issues we should address before the release?
There are those updates to ePrimer3/PrimerSearch EMBOSS interaction (that
you'll need for that differential primer script, BTW...)
Cheers,
L.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From biopython at maubp.freeserve.co.uk Fri Mar 12 13:22:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Mar 2010 13:22:55 +0000
Subject: [Biopython-dev] Daily builds of the Tutorial (PDF and HTML)
Message-ID: <320fb6e01003120522q22377f52nc0769ceb4e3add13@mail.gmail.com>
Hi all,
Back in November I set up a simple pair of cron jobs to update the code
snapshot on http://biopython.open-bio.org/SRC/biopython/ every hour:
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007002.html
I've just added another job which takes the latest Tutorial.tex file and
compiles it with pdflatex (already installed) and hevea (installed from
source under my user account) to make the PDF and HTML files.
These are then copied to the webserver and published as:
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf
These are currently updated once a day (at 2:40am which shouldn't
be too busy whichever USA timezone the server uses). Assuming
I got my crontab settings right - in the short term I'll keep an eye on
it to check ;)
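Roughly, the job amounts to something like the following sketch. The paths
and the function name are placeholders rather than the actual server setup,
and the commands are only assembled here, not run:

```python
import os


def tutorial_build_commands(tex_file, out_dir):
    """Return the commands a nightly job might run (hypothetical layout)."""
    base = os.path.splitext(os.path.basename(tex_file))[0]
    html_file = os.path.join(out_dir, base + "-dev.html")
    return [
        # pdflatex writes <base>.pdf into the output directory.
        ["pdflatex", "-interaction=nonstopmode",
         "-output-directory", out_dir, tex_file],
        # hevea converts the same LaTeX source to HTML.
        ["hevea", "-o", html_file, tex_file],
    ]


# Placeholder paths, chosen only for the example.
commands = tutorial_build_commands("Doc/Tutorial.tex", "/tmp/tutorial")
```

Each entry could be passed to subprocess.check_call() from the cron job, so
that a failed LaTeX run stops the files from being copied to the webserver.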
In comparison the "official" versions at the following URLs are
generally updated only for releases:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
I know that not everyone has latex or hevea installed (installing
hevea from source is a bit of a hassle even on Linux), and
furthermore, proofreading the raw markup in Tutorial.tex isn't that easy.
So, the point of all this effort is now anyone can help proofread
the latest version of the tutorial - this should also be of use to
those users/contributors actually running the latest code from
git rather than the official releases.
Regards,
Peter
From biopython at maubp.freeserve.co.uk Fri Mar 12 13:32:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Mar 2010 13:32:32 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com>
References:
<200911250945.20870.jblanca@btc.upv.es>
<320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
<200911251220.53881.jblanca@btc.upv.es>
<320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
<3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
<320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>
<3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com>
<320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com>
<320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com>
Message-ID: <320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com>
Hi all,
I'd like to proceed as outlined below for Biopython 1.54,
i.e. don't change the current Seq equality but add a warning
that we plan to change it.
Should we have a discussion on the main list first?
Peter
On Mon, Feb 22, 2010 at 2:48 PM, Peter wrote:
> Hi all,
>
> I've just got back from Japan - Brad and I were fortunate to be
> able to attend the DBCLS BioHackathon 2010 held in Tokyo,
> http://hackathon3.dbcls.jp/
>
> As Brad already mentioned in passing, we also managed to have
> dinner one evening with Michiel, and had an informal chat about
> Biopython plans. Expect a few more emails on other topics to
> follow.
>
> One of the short term aims we agreed on was to press ahead
> with the Seq equality changes outlined on this thread late last
> year. Mailing list archive link:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007021.html
>
> To recap, the agreed best behaviour was to make Seq equality
> act like string equality, but to raise a Python warning when
> incompatible alphabets are compared (e.g. DNA to Protein).
> This also applies to all the other comparison operators:
> not equal, less than, greater than, less than or equal, and
> greater than or equal.
>
> This is my outline plan for the change:
>
> For Biopython up to 1.53, Seq class uses object equality,
> seq1==seq2 acts as id(seq1)==id(seq2)
>
> For Biopython 1.54 (and perhaps a few more releases),
> the Seq classes will still use object equality but will trigger
> a warning suggesting explicit use of id(seq1)==id(seq2)
> or str(seq1)==str(seq2) as appropriate.
>
> For Biopython 1.xx (maybe 1.55 or 1.56?) the Seq classes
> will switch to using string equality (with an alphabet aware
> warning for comparing DNA to RNA etc), but will also trigger
> a warning that this is a change from previous releases, and
> suggest in the short term the continued explicit use of either
> id(seq1)==id(seq2) for object identity or str(seq1)==str(seq2)
> for string identity.
>
> For Biopython 1.yy (maybe 1.57?) the Seq classes will
> use string equality (with an alphabet aware warning for
> comparing DNA to RNA etc), without any warning about
> this being a change from historic behaviour.
>
> These warning messages could also point at a wiki page,
> and we'd need a FAQ entry in the tutorial as well. The
> aim of this slightly drawn out switch is to try and make
> sure all users are aware of the change, even if they
> only update their copy of Biopython every few releases.
>
> Does that all sound sensible? If so, we should probably
> have an announcement on the main mailing list, in case
> there are any other views.
>
> Other more complex options include a flag for switching
> between the modes - but that complexity doesn't seem
> such a good idea to me. All my own code and most of
> the unit tests use str(seq1)==str(seq2) explicitly anyway.
> The only exception is some of the genetic algorithm unit
> tests which do seem to want explicit object identity.
>
> Regards,
>
> Peter
>
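The end state of the plan quoted above (string-like equality plus a warning
for mismatched alphabets) could be sketched as follows. ToySeq and its plain
string alphabets are hypothetical simplifications; the real Seq class and its
alphabet compatibility rules are more involved.

```python
import warnings


class ToySeq(object):
    """Hypothetical stand-in for Seq, showing string-like equality."""

    def __init__(self, data, alphabet):
        self.data = data
        self.alphabet = alphabet  # e.g. "DNA", "RNA" or "protein"

    def __str__(self):
        return self.data

    def __eq__(self, other):
        other_alphabet = getattr(other, "alphabet", None)
        if other_alphabet is not None and other_alphabet != self.alphabet:
            # Warn, but still compare, when the alphabets differ.
            warnings.warn("Comparing %s to %s sequence"
                          % (self.alphabet, other_alphabet))
        return str(self) == str(other)

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        # Keep hashing consistent with the string-based equality.
        return hash(str(self))


dna = ToySeq("ACGT", "DNA")
same = ToySeq("ACGT", "DNA")
protein = ToySeq("ACGT", "protein")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    same_alphabet_equal = (dna == same)  # distinct objects, same letters
    mixed_equal = (dna == protein)       # same letters, different alphabets
```

Code that really wants object identity can keep using id(seq1)==id(seq2),
or the "is" operator, which this change would not affect.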
From kellrott at gmail.com Fri Mar 12 18:00:45 2010
From: kellrott at gmail.com (Kyle)
Date: Fri, 12 Mar 2010 10:00:45 -0800
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Message-ID:
>
>
> It is probably time to start getting ready for Biopython 1.54,
> perhaps aiming to release within about a month's time?
>
> This means not landing any major additions to the trunk for now (keep
> things like GFF and Geography on branches for now).
>
I think the zxJDBC support (Jython MySQL for BioSQL) was almost done. I
don't think it counts as a major addition. I think to finish it off, we
just needed to finalize the driver names.
For post 1.54 stuff, I have some HMMER3, Pfam, and GO parsing code (Chris
Lasher has a GO fork as well). But I need some community feedback to fill in
the interface holes.
Kyle
From p.j.a.cock at googlemail.com Fri Mar 12 18:09:39 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 12 Mar 2010 18:09:39 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To:
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
Message-ID: <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
On Fri, Mar 12, 2010 at 6:00 PM, Kyle wrote:
>>
>> It is probably time to start getting ready for Biopython 1.54,
>> perhaps aiming to release within about a month's time?
>>
>> This means not landing any major additions to the trunk for now (keep
>> things like GFF and Geography on branches for now).
>
> I think the zxJDBC support (Jython MySQL for BioSQL) was almost done. I
> don't think it counts as a major addition. I think to finish it off, we
> just needed to finalize the driver names.
Oh yeah - I confess I'd forgotten about that. Has there been any news
on the Jython front about SQLite support?
> For post 1.54 stuff, I have some HMMER3, Pfam, and GO parsing code (Chris
> Lasher has a GO fork as well). But I need some community feedback to fill in
> the interface holes.
> Kyle
Lots of exciting stuff to come then :)
Peter
From kellrott at gmail.com Fri Mar 12 18:28:45 2010
From: kellrott at gmail.com (Kyle)
Date: Fri, 12 Mar 2010 10:28:45 -0800
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
<320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
Message-ID:
>
>
> Oh yeah - I confess I'd forgotten about that. Has there been any news
> on the Jython front about SQLite support?
>
There is no official support, but you can always work through existing Java
packages (
http://old.nabble.com/SQLite-%2B-JDBC-%2B-Jython.-Example-td13322270.html ).
Kyle
From eric.talevich at gmail.com Fri Mar 12 19:14:51 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 12 Mar 2010 14:14:51 -0500
Subject: [Biopython-dev] Bio.Phylo.Applications?
In-Reply-To: <320fb6e01003110321u6ac77a89uce77306d332e675c@mail.gmail.com>
References: <320fb6e01003110321u6ac77a89uce77306d332e675c@mail.gmail.com>
Message-ID: <3f6baf361003121114v36b8a311i5b4dc9cee27961c2@mail.gmail.com>
On Thu, Mar 11, 2010 at 6:21 AM, Peter wrote:
> Hi Eric et al,
>
> We have started a collection of command line tool wrappers for
> multiple sequence alignments under Bio.Align.Applications, so I was
> thinking about where to put wrappers for phylogenetic tree command
> line tools. How does Bio.Phylo.Applications sound (following the same
> structure as the Bio.Align.Applications module)?
>
Sounds great to me! I don't have any code that would go there yet, but feel
free to add the directory and any new code you have.
-Eric
From bugzilla-daemon at portal.open-bio.org Fri Mar 12 21:57:53 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 12 Mar 2010 16:57:53 -0500
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
unparsed multiline entry?
In-Reply-To:
Message-ID: <201003122157.o2CLvrtP008861@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3000
------- Comment #4 from mmokrejs at ribosome.natur.cuni.cz 2010-03-12 16:57 EST -------
Hi Peter,
I finally got back to this. Thank you for all your work. I would be glad if
one could use the accession without the trailing ".1", etc for get_raw() and
get(). I think just any version of the record should be returned, and maybe a
list if there were multiple versions of the same.
>>> print data.get_raw("BC035166")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/SeqIO/_index.py", line 280, in get_raw
    handle.seek(dict.__getitem__(self, key))
KeyError: 'BC035166'
>>>
Similarly, if I loop over the entries I have to do:
>>> mylist = ['ACC1', 'ACC2', 'ACC3']
>>> sequences = []
>>> for acc in data.keys():
...     if data.get(acc).id.split('.')[0] in mylist:
...         sequences.append(data.get(acc))
Oh no, this is not what I wanted, in full:
from Bio import SeqIO
data = SeqIO.index("full.gb", "gb")
mylist = ['AC11111.1', 'AC2222.2', 'AC3333.3']
sequences = []
for acc in mylist:
    if acc in map(lambda x: x.split('.')[0], data.keys()):
        print "Found %s" % acc
        if data.get(acc + '.1'):
            sequences.append(data.get(acc + '.1'))
        else:
            if data.get(acc + '.2'):
                sequences.append(data.get(acc + '.2'))
            else:
                sequences.append(data.get(acc + '.3'))
    else:
        print "Missing %s" % acc
output_handle = open("filtered.gb", "w")
SeqIO.write(sequences, output_handle, "genbank")
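A shorter way to express the same lookup, sketched under the assumption that
the index keys are versioned identifiers like 'BC035166.1'. The helper name
group_by_accession is made up, and a plain list stands in for data.keys():

```python
def group_by_accession(keys):
    """Map each unversioned accession to its versioned keys, sorted."""
    lookup = {}
    for key in keys:
        # "BC035166.1" -> "BC035166"; keys without a version map to themselves.
        accession = key.rsplit(".", 1)[0]
        lookup.setdefault(accession, []).append(key)
    for versions in lookup.values():
        # Plain string sort: fine for single-digit versions only.
        versions.sort()
    return lookup


# Stand-in for data.keys() from a SeqIO.index() dictionary.
keys = ["BC035166.1", "BC035166.2", "CR603932.1"]
lookup = group_by_accession(keys)

wanted = ["BC035166", "CR603932", "XX000000"]
# Take the last (highest sorted) version of each accession that is present.
found = [lookup[acc][-1] for acc in wanted if acc in lookup]
missing = [acc for acc in wanted if acc not in lookup]
```

The entries in found are full keys, so they can be passed straight to
data.get() or data.get_raw(). Version numbers of 10 or more would need a
numeric sort instead of the string sort used here.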
There was already a discussion on the user mailing list; I do not think forcing
uppercase letters for GenBank files is a good idea. Just stick with what was
supplied. I typically use mixed case to emphasize ORFs, and sometimes lower case
for low-quality regions. Anyway, I provided the original NCBI web GenBank
file of an EST where the DNA sequence was in lowercase, and Biopython returned
uppercase. As a result, the diff(1) command reports too many changed lines,
unnecessarily. I suggest giving users an option to specify at parse time
whether to keep mixed/lower case or force uppercase. Also, I have often seen
protein sequences in lower case, which is ugly to my eyes, btw.
Finally, the remaining differences are here (probably the first is in bug
#2578):
--- /tmp/orig.gb 2010-03-12 21:09:24.000000000 +0100
+++ /tmp/new.gb 2010-03-12 21:09:38.000000000 +0100
@@ -1,4 +1,4 @@
-LOCUS       CR603932                1625 bp    mRNA    linear   HTC 16-OCT-2008
+LOCUS       CR603932                1625 bp    DNA              HTC 16-OCT-2008
 DEFINITION  full-length cDNA clone CS0DK007YH24 of HeLa cells Cot 25-normalized
             of Homo sapiens (human).
 ACCESSION   CR603932
@@ -29,39 +29,39 @@
             division of Invitrogen.
 FEATURES             Location/Qualifiers
      source          1..1625
-                     /organism="Homo sapiens"
                      /mol_type="mRNA"
-                     /db_xref="taxon:9606"
                      /clone="CS0DK007YH24"
+                     /db_xref="taxon:9606"
                      /tissue_type="HeLa cells Cot 25-normalized"
                      /plasmid="pCMVSPORT_6"
+                     /organism="Homo sapiens"
 ORIGIN
Thanks for all your work on this, it is a great service. ;-) Next, I will try to
filter by .features['tissue_type'] but sadly will have to search for the very
same string through the COMMENT string as well.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Mar 12 22:05:39 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 12 Mar 2010 17:05:39 -0500
Subject: [Biopython-dev] [Bug 3026] New:
Bio.SeqIO.InsdcIO._split_multi_line(): Your description
cannot be broken into nice lines!
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=3026
Summary: Bio.SeqIO.InsdcIO._split_multi_line(): Your description
cannot be broken into nice lines!
Product: Biopython
Version: 1.53
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: mmokrejs at ribosome.natur.cuni.cz
Traceback (most recent call last):
  File "/home/mmokrejs/bin/filter-accessions.py", line 22, in
    SeqIO.write(sequences, output_handle, "genbank")
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", line 363, in write
    count = writer_class(handle).write_file(sequences)
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/Interfaces.py", line 271, in write_file
    count = self.write_records(records)
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/Interfaces.py", line 256, in write_records
    self.write_record(record)
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 691, in write_record
    self._write_comment(record)
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 579, in _write_comment
    self._write_multi_line("", line)
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 335, in _write_multi_line
    lines = self._split_multi_line(text, max_len)
  File "/usr/lib/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 279, in _split_multi_line
    "Your description cannot be broken into nice lines!"
AssertionError: Your description cannot be broken into nice lines!
Please fix the message so it prints out the accession/version number. ;-)
LOCUS BF378302 501 bp mRNA linear EST 27-NOV-2000
DEFINITION CM0-UM0001-060300-270-g07 UM0001 Homo sapiens cDNA, mRNA sequence.
ACCESSION BF378302
VERSION BF378302.1 GI:11367336
KEYWORDS EST.
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 501)
AUTHORS Dias Neto,E., Garcia Correa,R., Verjovski-Almeida,S., Briones,M.R.,
Nagai,M.A., da Silva,W. Jr., Zago,M.A., Bordin,S., Costa,F.F.,
Goldman,G.H., Carvalho,A.F., Matsukuma,A., Baia,G.S., Simpson,D.H.,
Brunstein,A., deOliveira,P.S., Bucher,P., Jongeneel,C.V., O'Hare
,M.J., Soares,F., Brentani,R.R., Reis,L.F., de Souza,S.J. and
Simpson,A.J.
TITLE Shotgun sequencing of the human transcriptome with ORF expressed
sequence tags
JOURNAL Proc. Natl. Acad. Sci. U.S.A. 97 (7), 3491-3496 (2000)
PUBMED 10737800
COMMENT Contact: Simpson A.J.G.
Laboratory of Cancer Genetics
Ludwig Institute for Cancer Research
Rua Prof. Antonio Prudente 109, 4 andar, 01509-010, Sao Paulo-SP,
Brazil
Tel: +55-11-2704922
Fax: +55-11-2707001
Email: asimpson at ludwig.org.br
This sequence was derived from the FAPESP/LICR Human Cancer Genome
Project. This entry can be seen in the following URL
(http://www.ludwig.org.br/scripts/gethtml2.pl?t1=CM0&t2=CM0-UM0001-060300-270-g07&t3=2000-03-06&t4=1
)
Seq primer: puc 18 forward.
FEATURES Location/Qualifiers
[cut]
I have a few more examples like this from some dbEST data, I think all from the
same project, though.
From biopython at maubp.freeserve.co.uk Sat Mar 13 13:43:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 13 Mar 2010 13:43:53 +0000
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
unparsed multiline entry?
In-Reply-To: <4B9AA432.2050407@fold.natur.cuni.cz>
References: <201002021840.o12Ie88i015906@portal.open-bio.org>
<4B6995D0.3030405@fold.natur.cuni.cz>
<320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com>
<4B9AA432.2050407@fold.natur.cuni.cz>
Message-ID: <320fb6e01003130543p10a43e32kdb073879dc406e11@mail.gmail.com>
On Fri, Mar 12, 2010 at 8:29 PM, Martin MOKREJŠ
wrote:
>
> I finally got back to this. Thank you for all your work.
> I would be glad if one could use the accession without
> the trailing ".1", etc for get_raw() and get(). I think
> just any version of the record should be returned,
> and maybe a list if there were multiple versions of
> the same.
This is just a quick reply to answer this part of your email.
It would be unwise to try and be clever with the key
matching - in this case yes, for GenBank files we know
what the names mean, accession.version - but this is
not true in general.
In this case the answer for your needs would be to use
the Bio.SeqIO.index optional argument to specify the
keys. e.g. something like this:
from Bio import SeqIO
def strip_version(identifier):
    return identifier.rsplit(".",1)[0]
my_dict = SeqIO.index(filename, "gb", key_function=strip_version)
That way all the keys will have just the accession
without the version (assuming there are no clashes
which I think will raise an error).
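To illustrate the idea without needing a GenBank file, here is a minimal sketch in plain Python. The build_index helper is hypothetical (it is not how Bio.SeqIO.index is implemented, and the exception type is an assumption); it only shows how a key function maps versioned identifiers to bare accessions, and why two versions of the same accession have to be treated as a clash.

```python
def strip_version(identifier):
    """Return the accession without a trailing ".N" version suffix."""
    return identifier.rsplit(".", 1)[0]


def build_index(identifiers, key_function=None):
    """Map each (possibly rewritten) key to the original identifier.

    Hypothetical stand-in for an index; raises ValueError on duplicate keys.
    """
    index = {}
    for identifier in identifiers:
        key = key_function(identifier) if key_function else identifier
        if key in index:
            # Two records would share one key, so refuse to guess.
            raise ValueError("Duplicate key %r" % key)
        index[key] = identifier
    return index


ids = ["BC035166.1", "CR603932.1", "BF378302.1"]
by_accession = build_index(ids, key_function=strip_version)
print(by_accession["BC035166"])
```

With a file that holds both X.1 and X.2, the sketch raises rather than silently keeping one of them, which is the behaviour you would want to check for in the real index too.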
Peter
From sbassi at clubdelarazon.org Sun Mar 14 07:16:25 2010
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Sun, 14 Mar 2010 04:16:25 -0300
Subject: [Biopython-dev] Biopython & Google Summer of Code 2010 (GSoc)
In-Reply-To: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com>
References: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com>
Message-ID: <9e2f512b1003132316j55a95ca7u6a87191ff877898d@mail.gmail.com>
On Wed, Mar 10, 2010 at 11:30 AM, Peter Cock wrote:
> related project, you can join us in the application. If you are
> a student and are interested in a project (or would like to
> propose one), please take a look at these pages:
> http://www.open-bio.org/wiki/Google_Summer_of_Code
> http://biopython.org/wiki/Google_Summer_of_Code
Regarding GSoC call in Biopython, I found the PDB-Tidy task pretty
interesting. I will study the proposal and write back to you. I am
working currently with microRNA but I use Bio.PDB a lot to help my
wife who does antigen structure prediction and works with modeller,
PyMol and PDB files. A tool like the proposed PDB-Tidy could come
in handy.
Best,
SB.
From biopython at maubp.freeserve.co.uk Sun Mar 14 13:50:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 14 Mar 2010 13:50:52 +0000
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
unparsed multiline entry?
In-Reply-To: <4B9BB1F6.9000505@fold.natur.cuni.cz>
References: <201002021840.o12Ie88i015906@portal.open-bio.org>
<4B6995D0.3030405@fold.natur.cuni.cz>
<320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com>
<4B9AA432.2050407@fold.natur.cuni.cz>
<320fb6e01003130543p10a43e32kdb073879dc406e11@mail.gmail.com>
<4B9BB1F6.9000505@fold.natur.cuni.cz>
Message-ID: <320fb6e01003140650o54a8eea2h66ea87abc42c754@mail.gmail.com>
On Sat, Mar 13, 2010 at 3:40 PM, Martin MOKREJŠ wrote:
>
> Thanks Peter,
> yes, that is what I already ended up with in a more awkward way. ;-)
> But basically I have the same workaround.
> Best,
> M.
So does using the Bio.SeqIO.index() function's key_function
argument seem like a good solution to your key problem?
Peter
From biopython at maubp.freeserve.co.uk Sun Mar 14 20:30:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 14 Mar 2010 20:30:45 +0000
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
unparsed multiline entry?
In-Reply-To: <4B9AA432.2050407@fold.natur.cuni.cz>
References: <201002021840.o12Ie88i015906@portal.open-bio.org>
<4B6995D0.3030405@fold.natur.cuni.cz>
<320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com>
<4B9AA432.2050407@fold.natur.cuni.cz>
Message-ID: <320fb6e01003141330t199bbbcfm6bf32c5357b9fd77@mail.gmail.com>
On Fri, Mar 12, 2010 at 8:29 PM, Martin MOKREJŠ wrote:
>
> Finally, the remaining differences are here (probably the first is in bug #2578):
>
> --- /tmp/orig.gb 2010-03-12 21:09:24.000000000 +0100
> +++ /tmp/new.gb 2010-03-12 21:09:38.000000000 +0100
> @@ -1,4 +1,4 @@
> -LOCUS       CR603932                1625 bp    mRNA    linear   HTC 16-OCT-2008
> +LOCUS       CR603932                1625 bp    DNA              HTC 16-OCT-2008
>  DEFINITION  full-length cDNA clone CS0DK007YH24 of HeLa cells Cot 25-normalized
>              of Homo sapiens (human).
>  ACCESSION   CR603932
> @@ -29,39 +29,39 @@
>              division of Invitrogen.
>  FEATURES             Location/Qualifiers
>       source          1..1625
> -                     /organism="Homo sapiens"
>                       /mol_type="mRNA"
> -                     /db_xref="taxon:9606"
>                       /clone="CS0DK007YH24"
> +                     /db_xref="taxon:9606"
>                       /tissue_type="HeLa cells Cot 25-normalized"
>                       /plasmid="pCMVSPORT_6"
> +                     /organism="Homo sapiens"
>  ORIGIN
>
Yes, the LOCUS line issue would be part of Bug 2578.
As to the order of the feature qualifiers, these are stored
in a Python dictionary which does not preserve the order.
I personally don't think the order of the qualifiers is
important and thus don't care that it can change like
this. Assuming the NCBI have a defined sort order for
the qualifiers (I'm not aware of one), then we could sort
the feature qualifiers on output. Another option would
be to store the qualifiers in an ordered-dictionary. Or
just leave it as it is ;)
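The first two options can be sketched in plain Python. The qualifier names are taken from the diff above; format_qualifiers is a hypothetical helper, not part of Biopython, and alphabetical sorting is only an assumption, since I don't know of an official NCBI order.

```python
from collections import OrderedDict


def format_qualifiers(qualifiers, sort_keys=False):
    """Return GenBank-style qualifier lines, optionally sorted by key."""
    keys = sorted(qualifiers) if sort_keys else list(qualifiers)
    return ['/%s="%s"' % (key, qualifiers[key]) for key in keys]


# An ordered mapping remembers insertion order, i.e. the order in the file.
qualifiers = OrderedDict()
qualifiers["organism"] = "Homo sapiens"
qualifiers["mol_type"] = "mRNA"
qualifiers["db_xref"] = "taxon:9606"
qualifiers["clone"] = "CS0DK007YH24"

# Option 1: write the qualifiers back in the order they were read.
for line in format_qualifiers(qualifiers):
    print(line)

# Option 2: ignore the input order and sort on output.
for line in format_qualifiers(qualifiers, sort_keys=True):
    print(line)
```

Keeping the input order gives the smallest diff against the original file; sorting gives stable output regardless of how the record was built.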
Peter
From bugzilla-daemon at portal.open-bio.org Sun Mar 14 23:31:51 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 14 Mar 2010 19:31:51 -0400
Subject: [Biopython-dev] [Bug 3026] Bio.SeqIO.InsdcIO._split_multi_line():
Your description cannot be broken into nice lines!
In-Reply-To:
Message-ID: <201003142331.o2ENVp3v015452@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3026
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-14 19:31 EST -------
I just used the Entrez web interface, and it comes with the URL split already
to meet the 80 column limit. Also doing it via the API:
>>> from Bio import Entrez
>>> data = Entrez.efetch("nucest", id="BF378302", rettype="gb").read()
>>> print data[1095:1800]
PUBMED 10737800
COMMENT Contact: Simpson A.J.G.
Laboratory of Cancer Genetics
Ludwig Institute for Cancer Research
Rua Prof. Antonio Prudente 109, 4 andar, 01509-010, Sao Paulo-SP,
Brazil
Tel: +55-11-2704922
Fax: +55-11-2707001
Email: asimpson at ludwig.org.br
This sequence was derived from the FAPESP/LICR Human Cancer Genome
Project. This entry can be seen in the following URL
(http://www.ludwig.org.br/scripts/gethtml2.pl?t1=CM0&t2=CM0-UM0001-
060300-270-g07&t3=2000-03-06&t4=1)
Seq primer: puc 18 forward.
FEATURES Location/Qualifiers
In this particular case, it looks like splitting the string on a hyphen would
be a reasonable option (i.e. copy what the NCBI seems to be doing).
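A rough sketch of that fallback, in plain Python: break on whitespace where possible, break an over-long word after a hyphen, and only hard-split as a last resort. The function name echoes _split_multi_line, but this is an illustration written for this comment, not the code in InsdcIO.py, and the 68 column width (80 minus the 12 column indent) is an assumption.

```python
def split_multi_line(text, max_len):
    """Split text into lines of at most max_len characters.

    Breaks on whitespace where possible; an over-long word is broken
    after a hyphen, or at max_len if it contains no usable hyphen.
    """
    lines = []
    current = ""
    for word in text.split():
        # First break up any single word that cannot fit on a line.
        while len(word) > max_len:
            if current:
                lines.append(current)
                current = ""
            cut = word.rfind("-", 0, max_len) + 1  # keep the hyphen
            if cut <= 0:
                cut = max_len  # no hyphen: fall back to a hard split
            lines.append(word[:cut])
            word = word[cut:]
        if not current:
            current = word
        elif len(current) + 1 + len(word) <= max_len:
            current += " " + word
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


url = ("(http://www.ludwig.org.br/scripts/gethtml2.pl?t1=CM0"
       "&t2=CM0-UM0001-060300-270-g07&t3=2000-03-06&t4=1)")
for line in split_multi_line("This entry can be seen in the following URL " + url, 68):
    print(line)
```

Raising an error would then only be needed when max_len is too small to make any progress at all, which this sketch avoids by hard-splitting.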
Did you just cut and paste it from the NCBI's HTML page, where the URL does
seem to be shown unbroken (giving a line more than 80
characters)? Or can we download a "broken" GenBank file from the NCBI
somewhere (maybe the FTP site)?
From bugzilla-daemon at portal.open-bio.org Mon Mar 15 00:44:59 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 14 Mar 2010 20:44:59 -0400
Subject: [Biopython-dev] [Bug 3026] Bio.SeqIO.InsdcIO._split_multi_line():
Your description cannot be broken into nice lines!
In-Reply-To:
Message-ID: <201003150044.o2F0ixwP017517@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3026
------- Comment #2 from mmokrejs at ribosome.natur.cuni.cz 2010-03-14 20:44 EST -------
Mostly I copy&pasted from their web site, so this is probably the case.
From biopython at maubp.freeserve.co.uk Mon Mar 15 15:40:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 15 Mar 2010 15:40:20 +0000
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
Message-ID: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com>
Hi all (especially Eric),
As recently discussed SeqIO and AlignIO will now take filenames as
well as handles. This matches the existing behaviour of Bio.Nexus,
Eric's Bio.Phylo, and several big 3rd party libraries like ReportLab.
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007352.html
I've updated most of the tutorial to take advantage of this, and quickly
got used to less typing when working at the Python prompt. It does make
things easier, and I probably should have conceded this earlier.
It made me wonder about relaxing another restraint of the SeqIO
and AlignIO write functions - they currently insist on a list or iterator
of records or alignments. Giving a single object raises an error,
but we could handle this unambiguously. Amusingly Eric just
updated Bio.Phylo to match this strict behaviour - one reason I
sat down and wrote this email.
So, should we continue to insist on:
record = SeqRecord(...)
SeqIO.write([record], filename, format)
or should we relax a little more and allow this too?:
record = SeqRecord(...)
SeqIO.write(record, filename, format)
For SeqIO and AlignIO we can do a simple isinstance check
for a SeqRecord or alignment object - there isn't really a
problem with ambiguity here. Probably also try for Phylo?
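As a minimal sketch of that check (the SeqRecord class here is a bare stand-in so the example runs without Biopython, and this write function is hypothetical, not the current Bio.SeqIO.write):

```python
import io


class SeqRecord(object):
    """Minimal stand-in for Bio.SeqRecord.SeqRecord (illustration only)."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq


def write(sequences, handle, format="fasta"):
    """Write one record, or an iterable of records, as FASTA.

    Hypothetical sketch of the relaxed behaviour; returns the record count.
    """
    if isinstance(sequences, SeqRecord):
        # A single record was given, so wrap it in a list.
        sequences = [sequences]
    count = 0
    for record in sequences:
        handle.write(">%s\n%s\n" % (record.id, record.seq))
        count += 1
    return count


handle = io.StringIO()
print(write(SeqRecord("alpha", "ACGT"), handle))
print(write([SeqRecord("beta", "GGCC"), SeqRecord("gamma", "TTAA")], handle))
```

The check only fires for an actual SeqRecord, so lists, generators and other iterators go through the existing code path unchanged.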
What's the general consensus on the dev list?
Peter
From updates at feedmyinbox.com Tue Mar 16 06:16:42 2010
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Tue, 16 Mar 2010 02:16:42 -0400
Subject: [Biopython-dev] 3/16 BioStar - Biopython Questions
Message-ID: <0ef45bfc18dff2fe627af99c71f3b412@74.63.51.88>
==================================================
1. Compare two protein sequences using local BLAST
==================================================
March 15, 2010 at 7:24 PM
Hi,
I have been given the task of comparing all the protein sequences of a strain of Campylobacter with a strain of E. coli. I would like to do this locally using Biopython and the inbuilt BLAST tools. However, I'm stuck on how to program this and what tools I should be using. If anybody could point me in the right direction, I would be thankful!
Cheers
http://biostar.stackexchange.com/questions/302/compare-two-protein-sequences-using-local-blast
--------------------------------------------------
===========================================================
Source: http://biostar.stackexchange.com/questions/tagged/biopython
From mhampton at d.umn.edu Tue Mar 16 16:01:41 2010
From: mhampton at d.umn.edu (Marshall Hampton)
Date: Tue, 16 Mar 2010 11:01:41 -0500 (CDT)
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
In-Reply-To:
References:
Message-ID:
I'm strongly in favor of such relaxations. It would also be convenient if
SeqRecords had a write function.
-Marshall Hampton
>So, should we continue to insist on:
>
>record = SeqRecord(...)
>SeqIO.write([record], filename, format)
>or should we relax a little more and allow this too?:
>record = SeqRecord(...)
>SeqIO.write(record, filename, format)
>For SeqIO and AlignIO we can do a simple isinstance check
>for a SeqRecord or alignment object - there isn't really a
>problem with ambiguity here. Probably also try for Phylo?
>What's the general consensus on the dev list?
From rodrigo_faccioli at uol.com.br Tue Mar 16 19:24:58 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Tue, 16 Mar 2010 16:24:58 -0300
Subject: [Biopython-dev] Primary Sequence of all protein (help)
Message-ID: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
Hi all,
I want to know the primary sequence (fasta file) of all proteins. In other
words, I would like a database which contains the fasta files of all
proteins.
I'm a computer scientist and I don't know how hard it is. However, we have
worked with SEQRES section of PDB files and BioPython. So, we want to work
with fasta files and BioPython to check our results.
I searched the NCBI web-site where I found a lot of databases. I confess I'm
lost with them :)
Sorry if my email is a basic question. But, I'm very lost.
Thanks in advance,
--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
From biopython at maubp.freeserve.co.uk Tue Mar 16 19:42:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Mar 2010 19:42:43 +0000
Subject: [Biopython-dev] Primary Sequence of all protein (help)
In-Reply-To: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
Message-ID: <320fb6e01003161242w2f111653y6dceb9853412c649@mail.gmail.com>
On Tue, Mar 16, 2010 at 7:24 PM, Rodrigo Faccioli
wrote:
>
> Hi all,
>
> I want to know the primary sequence (fasta file) of all proteins. In other
> words, I would like a database which contains the fasta files of all
> proteins.
>
> I'm a computer scientist and I don't know how hard it is. However, we have
> worked with SEQRES section of PDB files and BioPython. So, we want to work
> with fasta files and BioPython to check our results.
A single FASTA file of all known proteins would be enormous. Even the
non-redundant ("nr") dataset used by the NCBI for their hugely popular
BLAST search is pretty big.
It sounds like maybe all you need is a FASTA file containing all the
sequences with structures in the PDB - something you may be
able to download directly from the PDB FTP site.
Peter
From rodrigo_faccioli at uol.com.br Wed Mar 17 01:01:01 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Tue, 16 Mar 2010 22:01:01 -0300
Subject: [Biopython-dev] Primary Sequence of all protein (help)
In-Reply-To: <320fb6e01003161242w2f111653y6dceb9853412c649@mail.gmail.com>
References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
<320fb6e01003161242w2f111653y6dceb9853412c649@mail.gmail.com>
Message-ID: <3715adb71003161801n294d15ccwb3a52f6d5ea83c23@mail.gmail.com>
Peter,
Thank you for your reply.
Actually, we want to store the sequences from the fasta files in a relational
database which has been developed by my research group. We have
developed some calculations on the primary sequence of proteins.
We did not download the PDB database because our computation of protein
properties is based on primary sequence. Therefore, our idea is to
work with the primary sequence of all proteins. My understanding is that the
PDB database contains the proteins whose tertiary structure is known. The
others are in other databases.
Thanks in advance,
--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
On Tue, Mar 16, 2010 at 4:42 PM, Peter wrote:
> On Tue, Mar 16, 2010 at 7:24 PM, Rodrigo Faccioli
> wrote:
> >
> > Hi all,
> >
> > I want to know the primary sequence (fasta file) of all proteins. In other
> > words, I would like a database which contains the fasta files of all
> > proteins.
> >
> > I'm a computer scientist and I don't know how hard it is. However, we have
> > worked with SEQRES section of PDB files and BioPython. So, we want to work
> > with fasta files and BioPython to check our results.
>
> A single FASTA file of all known proteins would be enormous. Even the
> non-redundant ("nr") dataset used by the NCBI for their hugely popular
> BLAST search is pretty big.
>
> It sounds like maybe all you need is a FASTA file containing all the
> sequences with structures in the PDB - something you may be
> able to download directly from the PDB FTP site.
>
> Peter
>
From bugzilla-daemon at portal.open-bio.org Wed Mar 17 11:33:09 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 17 Mar 2010 07:33:09 -0400
Subject: [Biopython-dev] [Bug 2966] Primer3Commandline does not use EMBOSS
6.1.0 arguments
In-Reply-To:
Message-ID: <201003171133.o2HBX9kO004765@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2966
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-17 07:33 EST -------
(In reply to comment #2)
> I also found an issue with the PrimerSearchCommandline. The command line
> options -sequences and -primers do not appear to be used in EMBOSS6.1.0, having
> been replaced by -seqall and -infile, respectively. I changed the options
> accordingly, and the modified files are available at
> http://github.com/widdowquinn/biopython/tree/emboss-branch.
I've merged that fix on the master,
http://github.com/biopython/biopython/commit/39708be130eb771eacccf96eed3e8ce0a44ea4f0
Will have a look at eprimer3 next.
From bugzilla-daemon at portal.open-bio.org Wed Mar 17 12:13:46 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 17 Mar 2010 08:13:46 -0400
Subject: [Biopython-dev] [Bug 2966] Primer3Commandline does not use EMBOSS
6.1.0 arguments
In-Reply-To:
Message-ID: <201003171213.o2HCDkf4006396@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2966
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-03-17 08:13 EST -------
(In reply to comment #1)
> I have made changes to Primer3Commandline that involve adding the EMBOSS 6.1.0
> arguments, and docstrings to each argument. I have also added doctests.
>
> The proposed code can be inspected at my GitHub repository:
>
> http://github.com/widdowquinn/biopython/commit/9c0643e333b0cafb4e356426fb4902e0e9d2385c
>
Cherry picked to merge to the trunk.
Marking bug as fixed - thanks.
From sbassi at clubdelarazon.org Wed Mar 17 18:32:17 2010
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Wed, 17 Mar 2010 15:32:17 -0300
Subject: [Biopython-dev] Primary Sequence of all protein (help)
In-Reply-To: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
Message-ID: <9e2f512b1003171132k680a52e5ob052e84d89e68c0b@mail.gmail.com>
On Tue, Mar 16, 2010 at 4:24 PM, Rodrigo Faccioli
wrote:
> I want to know the primary sequence (fasta file) of all proteins. In other
> words, I would like a database which contains the fasta files of all
> proteins.
You don't need Biopython to get this file. Just download the NR database and
use "fastacmd", a program found in the blast suite.
BLAST FTP is not working for me right now so I can't give you the
exact URL to download, but look from here:
ftp://ftp.ncbi.nih.gov/blast/
Here is how to use fastacmd to retrieve sequences from NR database:
http://pwet.fr/man/linux/commandes/fastacmd
From kellrott at gmail.com Wed Mar 17 22:14:25 2010
From: kellrott at gmail.com (Kyle)
Date: Wed, 17 Mar 2010 15:14:25 -0700
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
<320fb6e01003121009v5bb78abajb892364bf49d3360@mail.gmail.com>
Message-ID:
>
>
> > I think the zxJDBC support (Jython MySQL for BioSQL) was almost done. I
> > don't think it counts as a major addition. I think to finish it off, we
> > just needed to finalize the driver names.
>
> Oh yeah - I confess I'd forgotten about that.
>
I've posted a fork from the master branch on github (
http://github.com/kellrott/biopython/tree/zxjdbc ) with only the changes
related to zxjdbc. I've added two driver requests, "MySQL" and "PostgreSQL",
that select the appropriate driver based on the platform.
Kyle
From tiagoantao at gmail.com Wed Mar 17 22:28:36 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 17 Mar 2010 22:28:36 +0000
Subject: [Biopython-dev] Planning for Biopython 1.54
In-Reply-To: <6d941f121003110742p524a2d86wbe111ccf880d8bb@mail.gmail.com>
References: <320fb6e01003110734o47986192k80f27c969ff8aa3a@mail.gmail.com>
<6d941f121003110742p524a2d86wbe111ccf880d8bb@mail.gmail.com>
Message-ID: <6d941f121003171528p1e60fbb8q419485f6c6f171c2@mail.gmail.com>
Hi,
2010/3/11 Tiago Antão :
> I think I will be able to commit my code around the 20th. Currently I
> need to address the issue of supporting thousands of markers in the
> genepop parser as people do complain about that (like a couple of
> times a month or so, not more).
I am going to add this and support for haploid markers also. When it's
done (soon!), I would like to ask for a code review of the part that
supports thousands of markers (the parser will change in nature, and
files will be kept open during the whole existence of the parser
object). No need for domain knowledge, just comments on code quality.
Also some help with merging with the main trunk would be appreciated,
as I don't use github for my stuff (bazaar fan here ;) ).
Thanks,
Tiago
--
"Heavier than air flying machines are impossible"
Lord Kelvin, President, Royal Society, c. 1895
From rodrigo_faccioli at uol.com.br Thu Mar 18 00:59:49 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Wed, 17 Mar 2010 21:59:49 -0300
Subject: [Biopython-dev] Primary Sequence of all protein (help)
In-Reply-To: <9e2f512b1003171132k680a52e5ob052e84d89e68c0b@mail.gmail.com>
References: <3715adb71003161224i78e56c0bg2d4bb7e98d95fcd@mail.gmail.com>
<9e2f512b1003171132k680a52e5ob052e84d89e68c0b@mail.gmail.com>
Message-ID: <3715adb71003171759p7107f2cbod85339a5335374d5@mail.gmail.com>
Sebastian,
Thank you for your reply. I'll study it.
--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
On Wed, Mar 17, 2010 at 3:32 PM, Sebastian Bassi
wrote:
> On Tue, Mar 16, 2010 at 4:24 PM, Rodrigo Faccioli
> wrote:
> > I want to know the primary sequence (fasta file) of all proteins. In other
> > words, I would like a database which contains the fasta files of all
> > proteins.
>
> You don't need Biopython to get this file. Just download the NR database and
> use "fastacmd", a program found in the blast suite.
> BLAST FTP is not working for me right now so I can't give you the
> exact URL to download, but look from here:
> ftp://ftp.ncbi.nih.gov/blast/
> Here is how to use fastacmd to retrieve sequences from NR database:
> http://pwet.fr/man/linux/commandes/fastacmd
From biopython at maubp.freeserve.co.uk Thu Mar 18 11:19:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Mar 2010 11:19:03 +0000
Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54
Message-ID: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com>
On Wed, Mar 17, 2010 at 10:14 PM, Kyle wrote:
>>
>>> I think the zxJDBC support (Jython MySQL for BioSQL) was almost
>>> done. I don't think it counts as a major addition. I think to finish it off,
>>> we just needed to finalize the driver names.
>>
>> Oh yeah - I confess I'd forgotten about that.
>
> I've posted a fork from the master branch on github (
> http://github.com/kellrott/biopython/tree/zxjdbc ) with only the changes
> related to zxjdbc. I've added two driver requests, "MySQL" and
> "PostgreSQL", that select the appropriate driver based on the platform.
> Kyle
Hmm. I think it might be cleaner to have a new optional argument for the
database back end (MySQL, PostgreSQL, SQLite3). If the back end
is specified without the driver (which would be the encouraged usage)
then we will pick the driver at run time (based on whether we are in Jython,
or for PostgreSQL which drivers are installed). Existing scripts can continue
to specify the driver directly (but we can eventually deprecate this?).
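
A minimal sketch of how that run-time selection might look. Everything
here is an assumption for illustration: the helper name choose_driver, its
arguments, the driver names and their preference order are hypothetical,
and this is not the BioSQL API.

```python
def choose_driver(backend, is_jython=False, installed=()):
    """Pick a driver name for a database back end (hypothetical helper).

    backend   -- back end name, e.g. "MySQL", "PostgreSQL" or "SQLite3"
    is_jython -- True if running under Jython (assumed to be passed in)
    installed -- names of the driver modules known to be importable
    """
    # Assumed preference order per back end; purely illustrative.
    candidates = {
        "mysql": ["MySQLdb"],
        "postgresql": ["psycopg2", "pgdb"],
        "sqlite3": ["sqlite3"],
    }
    key = backend.lower()
    if key not in candidates:
        raise ValueError("Unknown database back end: %r" % backend)
    if is_jython and key in ("mysql", "postgresql"):
        # Under Jython the sketch always goes via JDBC.
        return "zxJDBC"
    for name in candidates[key]:
        if name in installed:
            return name
    raise ImportError("No driver installed for back end %r" % backend)


print(choose_driver("PostgreSQL", installed=["pgdb"]))
```

Existing scripts that name a driver explicitly would simply bypass a
helper like this, so the two styles could coexist during any deprecation
period.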
Peter
From anaryin at gmail.com Thu Mar 18 11:33:05 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 18 Mar 2010 04:33:05 -0700
Subject: [Biopython-dev] Small Typo in PDBParser
Message-ID:
Hello All,
There's a small typo in the Bio.PDB PDBParser module. Line 159:
"PDBContructionError" should be "PDBConstructionError"
So that I learn, how do I submit a bug and a patch to the project, such as
in this case?
Best!
João [...] Rodrigues
@ http://stanford.edu/~joaor/
From anaryin at gmail.com Thu Mar 18 11:36:15 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 18 Mar 2010 04:36:15 -0700
Subject: [Biopython-dev] Small Typo in PDBParser
In-Reply-To:
References:
Message-ID:
Well, actually, PDBConstructionError is not even defined.. It should likely
be PDBConstructionException.
João [...] Rodrigues
@ http://stanford.edu/~joaor/
On Thu, Mar 18, 2010 at 4:33 AM, João Rodrigues wrote:
> Hello All,
>
> There's a small typo in the Bio.PDB PDBParser module. Line 159:
>
> "PDBContructionError" should be "PDBConstructionError"
>
> So that I learn, how do I submit a bug and a patch to the project, such as
> in this case?
>
> Best!
>
> João [...] Rodrigues
> @ http://stanford.edu/~joaor/
>
>
From biopython at maubp.freeserve.co.uk Thu Mar 18 12:02:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Mar 2010 12:02:32 +0000
Subject: [Biopython-dev] Small Typo in PDBParser
In-Reply-To:
References:
Message-ID: <320fb6e01003180502w573baa84od9924f4b8e2486c8@mail.gmail.com>
On Thu, Mar 18, 2010 at 11:33 AM, João Rodrigues wrote:
> Hello All,
>
> There's a small typo in the Bio.PDB PDBParser module. Line 159:
>
> "PDBContructionError" should be "PDBConstructionError"
>
> So that I learn, how do I submit a bug and a patch to the project, such as
> in this case?
>
> Best!
Hi João,
If you've found a bug in a release, and worked out how to fix it, one
of the first steps would be to try the latest code from the repository to
see if the bug is still there (and if your fix would need changing). In this
case the problem has already been fixed (February 23, 2010), see:
http://github.com/biopython/biopython/commits/master/Bio/PDB/PDBParser.py
For a simple change like this, you can use the command line tool diff
to generate a patch file (see "man diff" for details), which you can then
attach to a bug report on our bugzilla. The basic diff usage would be:
diff original_file.py fixed_file.py > bug_fix.patch
For more complex changes, I would suggest you look at learning git.
If you make a change locally you can get a patch file with this:
git diff > bug_fix.patch
Or, publish the fix to a public copy of the repository (e.g. on github).
See also http://biopython.org/wiki/GitUsage
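
To show what such a patch contains, here is a small sketch using
Python's standard difflib module. The file contents are made up for
illustration and are not the real PDBParser.py source. It produces the
unified format, as "diff -u" and "git diff" do.

```python
import difflib

# Illustrative file contents only; not the real PDBParser.py source.
original = [
    "try:\n",
    "    parse()\n",
    "except PDBContructionError:\n",
    "    pass\n",
]
fixed = [
    "try:\n",
    "    parse()\n",
    "except PDBConstructionException:\n",
    "    pass\n",
]

# unified_diff yields the patch line by line; join it into one string.
patch = "".join(
    difflib.unified_diff(
        original, fixed, fromfile="original_file.py", tofile="fixed_file.py"
    )
)
print(patch)
```

Removed lines are prefixed with "-" and added lines with "+", which is
what a reviewer reads when the patch is attached to a bug report.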
I hope that helps, and that you'll have more patches for us in future :)
Peter
From biopython at maubp.freeserve.co.uk Thu Mar 18 19:01:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Mar 2010 19:01:32 +0000
Subject: [Biopython-dev] Relaxing SeqIO, AlignIO, etc write functions?
In-Reply-To: <3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com>
References: <320fb6e01003150840x2db094b9l4e0663dab3b40bc6@mail.gmail.com>
<3f6baf361003151026w40d66e44m9d795c28eda9567c@mail.gmail.com>
Message-ID: <320fb6e01003181201j3b486964y3b5223ab480bdde@mail.gmail.com>
On Mon, Mar 15, 2010 at 5:26 PM, Eric Talevich wrote:
> On Mon, Mar 15, 2010 at 11:40 AM, Peter wrote:
>>
>> So, should we continue to insist on:
>>
>> record = SeqRecord(...)
>> SeqIO.write([record], filename, format)
>>
>> or should we relax a little more and allow this too?:
>>
>> record = SeqRecord(...)
>> SeqIO.write(record, filename, format)
>>
>> For SeqIO and AlignIO we can do a simple isinstance check
>> for a SeqRecord or alignment object - there isn't really a
>> problem with ambiguity here. Probably also try for Phylo?
>>
>> What's the general consensus on the dev list?
>
> Sounds good to me! The code I just deleted from Bio.Phylo._io
> was doing something foolish anyway (testing whether the
> argument is iterable) -- now that Bio.Phylo has a single legitimate
> base class, I can restore the feature with an isinstance(trees,
> BaseTree.Tree) check if we have a consensus here.
>
> -Eric
There was another +1 vote from Marshall Hampton, and no
comments against (so far). Let's leave it a few days, and unless
anyone speaks out in favour of the status quo (keeping the
current strict check in the write function), we'll make the change.
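
For reference, a rough sketch of the relaxed behaviour under discussion.
The Record class and the write_fasta function are hypothetical stand-ins
so the example runs without Biopython. This is not the actual SeqIO code.

```python
import io


class Record(object):
    """Hypothetical stand-in for SeqRecord, to keep the sketch self-contained."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq


def write_fasta(records, handle):
    """Write one record, or an iterable of records, and return the count."""
    if isinstance(records, Record):
        # Relaxed behaviour: wrap a single record in a list.
        records = [records]
    count = 0
    for record in records:
        handle.write(">%s\n%s\n" % (record.id, record.seq))
        count += 1
    return count


handle = io.StringIO()
write_fasta(Record("alpha", "ACGT"), handle)    # single record
write_fasta([Record("beta", "GGCC")], handle)   # list of records, as before
print(handle.getvalue())
```

The isinstance test is unambiguous here because a single record is never
itself treated as a collection of records to write.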
Peter
From biopython at maubp.freeserve.co.uk Thu Mar 18 19:04:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Mar 2010 19:04:10 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com>
References:
<320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
<200911251220.53881.jblanca@btc.upv.es>
<320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
<3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
<320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>
<3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com>
<320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com>
<320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com>
<320fb6e01003120532v1564eb75s370ec9f1ff43294f@mail.gmail.com>
Message-ID: <320fb6e01003181204l5902cf37yc0cf9387b74994fd@mail.gmail.com>
On Fri, Mar 12, 2010 at 1:32 PM, Peter