From winda002 at student.otago.ac.nz  Tue Mar  3 17:03:36 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 04 Mar 2009 11:03:36 +1300
Subject: [BioPython] ACE contig to alignment
Message-ID: <49ADA938.80408@student.otago.ac.nz>

Hi all,

I'd like to start by thanking everyone that's contributed to biopython 
and especially the cookbook/tutorial - it's been a great help to this 
empiricist getting into some (decidedly amateur) bioinformatics. However, 
for the first time I've run into a problem the available docs can't help 
me with.

I want to be able to represent all of the reads that contribute to a 454 
sequencing contig as a generic biopython alignment. I've written some 
code that I thought would pad/cut the reads to size and add them to an 
alignment but when I run it a significant minority of the contigs in the 
files I'm working with have misalignments. I was wondering if someone 
more familiar with the ace parser or generic alignment class could tell 
me if I'm making some elementary mistake (it is possible that the original 
alignment was bad, it just seems more likely I did something dumb). I can 
send along an ACE file if you want to run the script (didn't want to 
spam the list with attachments).

Thanks in advance for any pointers and I'm sorry to force people to read 
what I'm sure is inelegant code:
 
from Bio.Sequencing import Ace
from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC, Gapped

ace_handle = open('eldoni.ace', 'r')
contigs = Ace.parse(ace_handle)
alignments = [] #start the list to which we'll add the contig data

for contig in contigs:    
  conname = contig.name + " numreads=" + str(contig.nreads)
  conlength = len(contig.sequence)
  align = Alignment(Gapped(IUPAC.ambiguous_dna, "*"))
  for readn in range(len(contig.reads)):
    start = contig.af[readn].padded_start # position rel to consensus
    if start < 1:
      # If 'start' is negative or zero we need to ignore bases
      readseq =  contig.reads[readn].rd.sequence[-1 * start+1:]
    else:
      # If it's larger then the start needs to be padded with gaps
      readseq =  (start-1) * '*' + contig.reads[readn].rd.sequence
    #Finally, pad the end then cut to size
    readseq = readseq + (conlength-len(readseq)) * '*'
    readseq = readseq[:conlength]
    align.add_sequence(readn+1, readseq)
  condata = conname, align
  alignments.append(condata)

-- 
PhD Student
Allan Wilson Centre 
Department of Zoology
University of Otago, PO Box 56, Dunedin 9054

p: +64-3-4798459
m: +64-27-3326815 
e: winda002 at student.otago.ac.nz


From winda002 at student.otago.ac.nz  Tue Mar  3 21:40:08 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 04 Mar 2009 15:40:08 +1300
Subject: [BioPython] ACE contig to alignment (found my error)
In-Reply-To: <49ADA938.80408@student.otago.ac.nz>
References: <49ADA938.80408@student.otago.ac.nz>
Message-ID: <49ADEA08.8070400@student.otago.ac.nz>

Hi again all,

After digging around a little more I realised the dumb mistake I made. 
In case anyone was interested and to prevent future suffering by getting 
the answer on to google:

The code as written is adding the entirety of each read to the alignment 
but when the assembly was made some reads were clipped on either side 
for quality. Including the low quality bases from each read makes some 
of the alignments nasty.  In my case "contig.reads[readn].qa" contains 
the start and end clipping points needed to get just the 'good' bases of 
each read into the alignment.
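
For the archive, the clipping step can be sketched as below. This is a
minimal sketch under stated assumptions, not the final script: the helper
name padded_row is hypothetical, and it assumes that the AF padded start
and the QA clipping points are both 1-based, with the clipping points
given in read coordinates (with the Ace parser they would come from
contig.af[readn].padded_start and from the qual_clipping_start and
qual_clipping_end attributes of contig.reads[readn].qa).

```python
def padded_row(read_seq, padded_start, clip_start, clip_end,
               consensus_length, gap="*"):
    """Return one alignment row that is as long as the consensus.

    padded_start: 1-based consensus position of the first read base
                  (may be zero or negative).
    clip_start, clip_end: 1-based, inclusive limits of the 'good' bases
                  within the read (assumed convention).
    """
    offset = padded_start - 1  # 0-based consensus column of read base 0
    row = []
    for column in range(consensus_length):
        i = column - offset  # index into the read for this column
        if clip_start - 1 <= i < clip_end and 0 <= i < len(read_seq):
            row.append(read_seq[i])
        else:
            row.append(gap)  # outside the read, or in a clipped region
    return "".join(row)


# The read starts two columns before the consensus; only bases 3..6 are good.
row = padded_row("NNACGTNN", -1, 3, 6, 6)
```

Each row built this way has the consensus length, so it can be passed
to align.add_sequence() in place of the padded full read.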

Cheers,
David


From rodrigo_faccioli at uol.com.br  Wed Mar  4 23:04:07 2009
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Thu, 5 Mar 2009 01:04:07 -0300
Subject: [BioPython] Bio.Entez - Help
Message-ID: <3715adb70903042004h27ac6f03oeb384d3c89777226@mail.gmail.com>

I want to know where I can find examples about Bio.Entrez. Specifically, I'm
developing a program which has a protein primary sequence and I need to
search for its conserved domains and show them to the user.

I'm reading this link
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc64 . However,
I don't understand it very well. I know that I will work with the CDD database.

I made a simple example which is below.

from Bio import Entrez
Entrez.email = "rodrigo.faccioli at gmail.com" # Always tell NCBI who you are
handle = Entrez.esearch(db="cdd",
term="TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN")
record = Entrez.read(handle)
print record["IdList"]

Thanks for any help.


-- 
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218

From biopython at maubp.freeserve.co.uk  Thu Mar  5 05:42:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 10:42:02 +0000
Subject: [BioPython] Bio.Entez - Help
In-Reply-To: <3715adb70903042004h27ac6f03oeb384d3c89777226@mail.gmail.com>
References: <3715adb70903042004h27ac6f03oeb384d3c89777226@mail.gmail.com>
Message-ID: <320fb6e00903050242v63a2f38cgc6eddfa3819814e4@mail.gmail.com>

On Thu, Mar 5, 2009 at 4:04 AM, Rodrigo faccioli
<rodrigo_faccioli at uol.com.br> wrote:
> I want to know where I can find examples about Bio.Entrez. Specifically, I'm
> developing a program which has a protein primary sequence and I need to
> search for its conserved domains and show them to the user.
>
> I'm reading this link
> http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc64 . However,
> I don't understand it very well. I know that I will work with the CDD database.

The CDD database is one of several protein motif databases the NCBI
make available for use with their tool RPS-BLAST.  CDD is a composite
database which includes domains from PFAM, SMART, KOG etc.

Have a look at  http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
with your example and you'll get a hit to pfam00321.

It sounds like what you want is a script which runs RPS-BLAST using
your query protein against the CDD motif database.

You can run BLASTN, BLASTP etc online at the NCBI using a script, but
as far as I know, the NCBI do not make RPS-BLAST (or PSI-BLAST)
available in this way.  I haven't checked this in recent months.

However, I have done this task myself using standalone BLAST installed on
my computer, i.e. the tool rpsblast from the NCBI.  You'll also need
to install the databases (which are big - you'll need plenty of disk
space and RAM).  Once this is installed and working, you can call rpsblast
from Biopython using the Bio.Blast.NCBIStandalone.rpsblast(...)
function.
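
As a rough sketch of the standalone route, the snippet below only builds
the command line for a local search and does not run it. It assumes the
newer NCBI BLAST+ rpsblast program and its -query, -db, -evalue and
-outfmt options rather than the Bio.Blast.NCBIStandalone wrapper; the
helper name build_rpsblast_command and the file and database names are
made up for illustration.

```python
def build_rpsblast_command(query_fasta, database, evalue=0.01):
    """Build an argument list for a local rpsblast search with XML output."""
    return [
        "rpsblast",
        "-query", query_fasta,    # FASTA file holding the protein query
        "-db", database,          # name of the locally installed database
        "-evalue", str(evalue),   # expectation value threshold
        "-outfmt", "5",           # 5 selects BLAST XML output
    ]


# Hypothetical file and database names, for illustration only.
command = build_rpsblast_command("query.fasta", "Cdd")
```

The resulting list could be handed to the subprocess module, and the XML
output should then be readable with Bio.Blast.NCBIXML.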

> I made a simple example which is below.
>
> from Bio import Entrez
> Entrez.email = "rodrigo.faccioli at gmail.com" # Always tell NCBI who you are
> handle = Entrez.esearch(db="cdd",
> term="TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN")
> record = Entrez.read(handle)
> print record["IdList"]
>
> Thanks for any help.

I think if you use Entrez to access the CDD database, you can just
access the domains themselves (using their names - not searching by
sequence), e.g.

>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here at example.com"
>>> handle = Entrez.esearch(db="cdd", term="pfam00321", retmode="XML")
>>> record = Entrez.read(handle)
>>> print record["IdList"]
['109381']

You can check this ID works via their website:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=109381

I've tried a few variations but efetch doesn't seem to support the CDD
database (yet).

Peter

From biopython at maubp.freeserve.co.uk  Thu Mar  5 07:26:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 12:26:13 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
Message-ID: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>

Hi All,

As the following examples show, and the python string method's
docstring clearly states, the python string's count method uses a
non-overlapping search:

>>> "AAA".count("A")
3
>>> "AAA".count("AA") # you might expect 2
1
>>> "BBBB".count("BB") # you might expect 3
2

Up until Biopython 1.44, the Seq object's count method only worked for
single characters.  From Biopython 1.45 onwards it accepted longer
strings and followed the built in python string count behaviour.
However, as Noel pointed out on Bug 2779 our docstring does not make
it clear that this does a non-overlapping search.  In fact, as
Leighton suggests, one might expect the Seq object's count method to
use an overlapping search.
http://bugzilla.open-bio.org/show_bug.cgi?id=2779
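
To make the difference concrete, here is one way an overlapping count
could be written for plain strings. It is only an illustrative sketch
(count_overlapping is a hypothetical helper, not anything in Biopython)
and it uses a regular expression lookahead, which is just one of several
possible approaches.

```python
import re


def count_overlapping(sequence, sub):
    """Count occurrences of sub in sequence, allowing overlaps."""
    # A zero-width lookahead matches at every start position without
    # consuming characters, so overlapping hits are all found.
    return len(re.findall("(?=%s)" % re.escape(sub), sequence))


plain = "AAA".count("AA")                 # built-in, non-overlapping
overlapping = count_overlapping("AAA", "AA")
```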

We should either:

(a) stick with the python string compatible behaviour (which has been
a general principle for the Seq class), but document this issue more
clearly as a non-overlapping search does run counter to some potential
biological uses.

or,

(b) Or change the behaviour as Leighton suggests to do an overlapping
search.  This could break any code relying on the old python
string-like behaviour.

What do people here think?  Any preferences?

[I don't want to get into details about the implementation here on the
main list]

Peter

From baoilleach at gmail.com  Thu Mar  5 08:11:31 2009
From: baoilleach at gmail.com (Noel O'Boyle)
Date: Thu, 5 Mar 2009 13:11:31 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
Message-ID: <a882e48b0903050511u2495a27eu8fcd619e9a1a94ff@mail.gmail.com>

+1 for (b)

Seq.count() should behave like a biological sequence.

Here's an example in the wild of this type of analysis:
http://www.computational-genomics.net/case_studies/haemophilus_demo.html#14

It's from a bioinformatics textbook with example code in Matlab. I was
helping a colleague who was trying to reproduce the analysis with
BioPython. Everything was fine until the dimer frequencies were found
to disagree. After implementing the count ourselves, we were able to
reproduce the results. It was then we realised that BioPython was
behaving in an unexpected and non-useful way.
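
A sliding-window dimer count of this kind can be sketched as follows.
This is an illustration only, not the code from that analysis, and
dimer_counts is a hypothetical name: it moves a window of width two
along the sequence one base at a time, so overlapping dimers are all
counted.

```python
from collections import defaultdict


def dimer_counts(sequence):
    """Count every dimer in sequence, including overlapping ones."""
    counts = defaultdict(int)
    for i in range(len(sequence) - 1):
        counts[sequence[i:i + 2]] += 1  # window covers bases i and i+1
    return dict(counts)


counts = dimer_counts("AAAC")
```

Here "AAAC" contains the dimers AA, AA and AC, whereas
"AAAC".count("AA") reports only one AA.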

- Noel

From biopython at maubp.freeserve.co.uk  Thu Mar  5 08:26:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 13:26:10 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <a882e48b0903050511u2495a27eu8fcd619e9a1a94ff@mail.gmail.com>
References: <a882e48b0903050511u2495a27eu8fcd619e9a1a94ff@mail.gmail.com>
Message-ID: <320fb6e00903050526r688eadcfv440602c32d294ee8@mail.gmail.com>

On Thu, Mar 5, 2009 at 1:11 PM, Noel O'Boyle <baoilleach at gmail.com> wrote:
> +1 for (b)
>
> Seq.count() should behave like a biological sequence.
>
> Here's an example in the wild of this type of analysis:
> http://www.computational-genomics.net/case_studies/haemophilus_demo.html#14
>
> It's from a bioinformatics textbook with example code in Matlab. I was
> helping a colleague who was trying to reproduce the analysis with
> BioPython. Everything was fine until the dimer frequencies were found
> to disagree. After implementing the count ourselves, we were able to
> reproduce the results. It was then we realised that BioPython was
> behaving in an unexpected and non-useful way.

I agree that in this context it is not useful to have the Seq object
count do a non-overlapping search.

However, calling it "unexpected" is debatable, and could probably
depend on the user's background.  If you already know
Python before using Biopython, I would argue that the non-overlapping
search is expected because that is what python strings do.  On the
other hand, I'm sure many Biopython users learn Python and Biopython
together - and one might still argue having strings and Seq objects do
different things is unexpected.

Overall between options (a) and (b), I'd pick consistency with the
python string (a), even if it isn't ideal.

There is another idea, let's call this option (c).  Give the Seq
object's count method an optional boolean argument to enable an
overlapping search (which I would want to default to matching the
python string behaviour).  This makes switching between string and Seq
objects easier, and makes the more useful (but probably slower)
overlap aware count option quite accessible and discoverable.

Peter

From bartek at rezolwenta.eu.org  Thu Mar  5 08:28:14 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 5 Mar 2009 14:28:14 +0100
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
Message-ID: <8b34ec180903050528m7a3815c8l3048046e42f0ce00@mail.gmail.com>

On Thu, Mar 5, 2009 at 1:26 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> (a) stick with the python string compatible behaviour (which has been
> a general principle for the Seq class), but document this issue more
> clearly as a non-overlapping search does run counter to some potential
> biological uses.
>
> or,
>
> (b) Or change the behaviour as Leighton suggests to do an overlapping
> search.  This could break any code relying on the old python
> string-like behaviour.
>
> What do people here think?  Any preferences?
>
> [I don't want to get into details about the implementation here on the
> main list]
>

I don't use the count method much, so I don't have a strong opinion on that.

As Leighton pointed out, searching for sequences looks like  a good
job for Bio.Motif

It's currently doable, but (since Bio.Motif mostly deals with more
complex motifs than a single sequence)
the interface is not polished and it's not optimized for performance.

Currently the code to do this would look like this:

import Bio.Motif
from Bio.Seq import Seq

m = Bio.Motif.Motif()
m.add_instance(Seq("GG", m.alphabet))
for i in m.search_instances(your_long_sequence):
    print "found GG at position", i

If there is a need to keep backwards compatibility for .count(), I can
make changes to Bio.Motif to make it easier for people to use it.

-- 
Bartek


From lpritc at scri.ac.uk  Thu Mar  5 08:34:03 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Thu, 05 Mar 2009 13:34:03 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
Message-ID: <C5D5854B.1E785%lpritc@scri.ac.uk>

Hi,

On 05/03/2009 12:26, "Peter" <biopython at maubp.freeserve.co.uk> wrote:

> We should either:
> 
> (a) stick with the python string compatible behaviour (which has been
> a general principle for the Seq class), but document this issue more
> clearly as a non-overlapping search does run counter to some potential
> biological uses.
> 
> or,
> 
> (b) Or change the behaviour as Leighton suggests to do an overlapping
> search.  This could break any code relying on the old python
> string-like behaviour.
> 
> What do people here think?  Any preferences?

Not surprisingly, I favour (b).

The intended domain of use for Seq is as a proxy for a biological entity and
I think that, just as we extend methods to reflect useful
biologically-themed operations, we should also override methods as
appropriate to reflect those same themes.

I can think of a number of run-of-the-mill use cases where we would want to
know about the count of (potentially) overlapping matches of a subsequence
in a biological sequence, for short sequence repeats (SSRs), restriction
sites, protein sequence motifs, and so on.  Also, if we want simply to test
the expected number of occurrences of the dimer 'AA' in a larger sequence
with a given base composition, a non-overlapping count() method will give a
misleading answer, as it will underreport occurrences of 'AA' in runs of
three or more consecutive 'A's.  I think that the overlapping approach (b) should
at least be a default setting, even if we choose to make overlap/non-overlap
an argument to the method.

For some searches that potentially could have overlaps we might want to know
what biological question is being asked before choosing which approach to
take.  We may, for example, desire different behaviour from query sequences
like 'AGCCAG' depending on circumstances.  This query on 'AGCCAGCCAG' will
return 1 if no overlap is allowed, and 2 if an overlap is allowed.
The same query on 'AGCCAGAGCCAG' will return 2 in both cases.  If we care
about 'AGCCAG' as a restriction site, then we would want an overlapping
search.  If we care about 'AGCCAG' as a simple repeat unit, then we might
want a non-overlapping search instead (assuming that the circumstances of
the search are such that this is a sensible answer).  Having the option
might be useful.
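
The two example sequences can be checked with a small helper that lists
match positions rather than just counting them. This is a sketch for
plain Python strings; find_all is a hypothetical name and its overlap
argument is an assumption for illustration, not an existing Seq option.

```python
def find_all(sequence, sub, overlap=True):
    """Return the 0-based start positions of sub in sequence."""
    if not sub:
        raise ValueError("sub must not be empty")
    positions = []
    # Step one base on for overlapping hits, a full match length otherwise.
    step = 1 if overlap else len(sub)
    start = sequence.find(sub)
    while start != -1:
        positions.append(start)
        start = sequence.find(sub, start + step)
    return positions


shared = find_all("AGCCAGCCAG", "AGCCAG")                    # [0, 4]
separate = find_all("AGCCAGCCAG", "AGCCAG", overlap=False)   # [0]
tandem = find_all("AGCCAGAGCCAG", "AGCCAG", overlap=False)   # [0, 6]
```

In the first sequence the two sites share the bases 'AG' at positions 4
and 5, which is why the non-overlapping search reports only one of them.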

A non-overlapping search might also be useful in those cases where existing
code already corrects for nonintuitive behaviour of count().  This is only
going to apply to code that has been produced since release 1.45, so may
only have limited impact, if any.  I would argue that, since a correction
was needed, by parsimony the original behaviour was probably what required
the change.

On the whole, I think that an overlapping count() is the most intuitive and
most likely use case.  I see that there's an argument for consistency with
string.count(), in that dyed-in-the-wool programmers might find it hard to
shift mental gears from one to the other, but I'm not sure that it's a good
argument, for the following reason.

The following statements are true:

A String is a Python sequence type.  Its count() method returns a
non-overlapping count of the query substring.

A List is a Python sequence type.  Its count() method returns the number of
elements that match the query.

A Tuple is a Python sequence type.  Before Python 2.6 it didn't have a
count() method, although you might imagine that it could stand to have one.

There isn't any cross-sequence object consistency regarding count().  Should
we choose String-like or List-like behaviour when dealing with a MutableSeq?
I don't think that we should seek consistency with String at the expense of
utility or biological intuition, when:

A Seq/MutableSeq is a (Bio)Python sequence type.  Its count() method returns
the overlapping count of the query substring.

Fits nicely with the other three statements, in that none of them are
consistent with any other ;)

L.


-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405



From mjldehoon at yahoo.com  Thu Mar  5 09:49:10 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 5 Mar 2009 06:49:10 -0800 (PST)
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
Message-ID: <418103.38901.qm@web62405.mail.re1.yahoo.com>


I vote (b).
Another option is to continue to use count() for a Python-style count, and to add a new method that does an overlapping-type count. For this new method we'd need a clear but short name, and I can't think of anything now.

--Michiel.



From biopython at maubp.freeserve.co.uk  Thu Mar  5 10:05:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 15:05:39 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <418103.38901.qm@web62405.mail.re1.yahoo.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
	<418103.38901.qm@web62405.mail.re1.yahoo.com>
Message-ID: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>

On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
>
> I vote (b).
> Another option is to continue to use count() for a Python-style count,
> and to add a new method that does an overlapping-type count. For this
> new method we'd need a clear but short name, and I can't think of
> anything now.
>
> --Michiel.

Did you like plan (c), which preserves the Python string style count
as the default but offers the overlapping count via an optional
argument?

i.e.
>>> from Bio.Seq import Seq
>>> nuc = Seq("AAAA")
>>> nuc.count("AA") #default is non-overlapping
2
>>> nuc.count("AA", overlap=True)
3
>>> nuc.count("AA", overlap=False)
2
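
One way such an optional argument could behave is sketched below on a
plain str subclass. SeqLike is a hypothetical stand-in used only for
illustration; it is not the Bio.Seq.Seq class, and the overlap keyword
is the proposal under discussion rather than an existing option.

```python
class SeqLike(str):
    """Toy string subclass with an optional overlapping count."""

    def count(self, sub, overlap=False):
        if not overlap:
            # Default: same answer as the built-in string method.
            return str.count(self, sub)
        # Overlap-aware: test every possible start position.
        last_start = len(self) - len(sub)
        return sum(1 for i in range(last_start + 1)
                   if self.startswith(sub, i))


nuc = SeqLike("AAAA")
default_count = nuc.count("AA")
overlap_count = nuc.count("AA", overlap=True)
```

Keeping the default equal to str.count means existing code sees no
change, while the overlap-aware count is one keyword away.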

Peter

From dalloliogm at gmail.com  Thu Mar  5 10:10:59 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 5 Mar 2009 16:10:59 +0100
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
	<418103.38901.qm@web62405.mail.re1.yahoo.com>
	<320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <5aa3b3570903050710hb407258k6fca86cf1bf9520f@mail.gmail.com>

On Thu, Mar 5, 2009 at 4:05 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>
>>
>> I vote (b).
>> Another option is to continue to use count() for a Python-style count,
>> and to add a new method that does an overlapping-type count. For this
>> new method we'd need a clear but short name, and I can't think of
>> anything now.
>>
>> --Michiel.
>
> Did you like plan (c), which preserves the Python string style count
> as the default but offers the overlapping count via an optional
> argument?
>
> i.e.
> >>> from Bio.Seq import Seq
> >>> nuc = Seq("AAAA")
> >>> nuc.count("AA") #default is non-overlapping
> 2
> >>> nuc.count("AA", overlap=True)
> 3
> >>> nuc.count("AA", overlap=False)
> 2


Imho this is the best solution. If I can say, I expect a .count()
method to act like the homonymous method in python strings.

A good doctest example (similar to the existing one) would be nice, too.


>
> Peter


-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it


From baoilleach at gmail.com  Thu Mar  5 10:23:42 2009
From: baoilleach at gmail.com (Noel O'Boyle)
Date: Thu, 5 Mar 2009 15:23:42 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
	<418103.38901.qm@web62405.mail.re1.yahoo.com>
	<320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <a882e48b0903050723m9532fd5lb81dd26e85925c62@mail.gmail.com>

2009/3/5 Peter <biopython at maubp.freeserve.co.uk>:
> On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>
>>
>> I vote (b).
>> Another option is to continue to use count() for a Python-style count,
>> and to add a new method that does an overlapping-type count. For this
>> new method we'd need a clear but short name, and I can't think of
>> anything now.
>>
>> --Michiel.
>
> Did you like plan (c), which preserves the Python string style count
> as the default but offers the overlapping count via an optional
> argument?
>
> i.e.
> >>> from Bio.Seq import Seq
> >>> nuc = Seq("AAAA")
> >>> nuc.count("AA") #default is non-overlapping
> 2
> >>> nuc.count("AA", overlap=True)
> 3
> >>> nuc.count("AA", overlap=False)
> 2
>
> Peter

I think we are arguing here over which should be the default value.

Several people here believe that behaviour analogous to Python's
string.count will reduce bug reports and user confusion. However,
no-one except Leighton has been able to come up with a single use case
where the current behaviour is useful (and even that example, with
respect, was flimsy). So we end up with a method which adheres
magnificently to the principle of least surprise, but which is of no
use to users. Aren't you trying to provide methods which are useful
for biological analysis? Isn't that the purpose of wrapping the string
in the first place?

Noel (getting far too excited over painting this bikeshed)

From bsouthey at gmail.com  Thu Mar  5 11:28:11 2009
From: bsouthey at gmail.com (Bruce Southey)
Date: Thu, 5 Mar 2009 10:28:11 -0600
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <a882e48b0903050723m9532fd5lb81dd26e85925c62@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
	<418103.38901.qm@web62405.mail.re1.yahoo.com>
	<320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
	<a882e48b0903050723m9532fd5lb81dd26e85925c62@mail.gmail.com>
Message-ID: <bbcd77d00903050828y5cd3ac34n2036375505439aa9@mail.gmail.com>

Hi,
This is a little deja vu as I feel this type of thing has come up
before. While I can not speak for anyone else, if I sound different
from last time, then I was obviously convinced by those arguments, as
that sounds better than saying I forgot :-)

More seriously, ignoring the reading frame or the genetic code when
counting is rather bad form!

I can not think of a relevant case involving a protein sequence -
although counting pairs of cysteines in insulin-like sequences could
be a situation of importance (related to disulphide bonds).

An example for nucleic sequences, counting 'TTT' in the made-up
sequence  'TTTTTTTGG' can be two in frames 1 and 2 but only one in
frame 3.
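
The frame-by-frame numbers can be reproduced with a short sketch that
steps through the sequence one codon at a time. The helper name
count_codon is hypothetical, and frames are numbered 0, 1 and 2 here
(matching frames 1, 2 and 3 above).

```python
def count_codon(sequence, codon, frame):
    """Count codon in one reading frame (frame is 0, 1 or 2)."""
    hits = 0
    # Stop before any trailing, incomplete codon.
    for i in range(frame, len(sequence) - 2, 3):
        if sequence[i:i + 3] == codon:
            hits += 1
    return hits


per_frame = [count_codon("TTTTTTTGG", "TTT", frame) for frame in range(3)]
```

For this sequence per_frame comes out as two, two and one.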

Also, a weaker concern is that a sum of counts greater than or equal
to the length of the sequence is not a desirable property unless the
user is informed that duplicates were found.
In the above case, an overlapping count of five sounds rather wrong when
one says that a sequence of nine DNA bases can produce five phenylalanines!

Yes, context is everything because 3 different results is not nice.

Don't get me wrong, I know that finding duplicates is important, just
that it should not be here - there must be different functions.

Thus, I vote for (a) and I also prefer that default syntax is
consistent with Python language.

If this change is done, then all of Biopython must be revised to be
consistent - like reading frames and similar discussion...

Bruce


On Thu, Mar 5, 2009 at 9:23 AM, Noel O'Boyle <baoilleach at gmail.com> wrote:
> 2009/3/5 Peter <biopython at maubp.freeserve.co.uk>:
>> On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>>
>>>
>>> I vote (b).
>>> Another option is to continue to use count() for a Python-style count,
>>> and to add a new method that does a overlapping-type count. For this
>>> new method we'd need a clear but short name, and I can't think of
>>> anything now.
>>>
>>> --Michiel.
>>
>> Did you like plan (c), which preserves the Python string style count
>> as the default but offers the overlapping count via an optional
>> argument?
>>
>> i.e.
>>>>> from Bio.Seq import Seq
>>>>> nuc = Seq("AAAA")
>>>>> nuc.count("AA") #default is non-overlapping
>> 2
>>>>> nuc.count("AA", overlap=True)
>> 3
>>>>> nuc.count("AA", overlap=False)
>> 2
>>
>> Peter
>
> I think we are arguing here over which should be the default value.
>
> Several people here believe that behaviour analogous to Python's
> string.count will reduce bug reports and user confusion. However,
> no-one except Leighton has been able to come up with a single use case
> where the current behaviour is useful (and even that example, with
> respect, was flimsy). So we end up with a method which adheres
> magnificently to the principle of least surprise, but which is of no
> use to users. Aren't you trying to provide methods which are useful
> for biological analysis? Isn't that the purpose of wrapping the string
> in the first place?
>
> Noel (getting far too excited over painting this bikeshed)


From biopython at maubp.freeserve.co.uk  Thu Mar  5 11:34:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 16:34:37 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <bbcd77d00903050828y5cd3ac34n2036375505439aa9@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
	<418103.38901.qm@web62405.mail.re1.yahoo.com>
	<320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
	<a882e48b0903050723m9532fd5lb81dd26e85925c62@mail.gmail.com>
	<bbcd77d00903050828y5cd3ac34n2036375505439aa9@mail.gmail.com>
Message-ID: <320fb6e00903050834i32bd8d64w672e53b6ef1dbf56@mail.gmail.com>

On Thu, Mar 5, 2009 at 4:28 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> Hi,
> This is a little deja vu as I feel this type of thing has come up
> before. While I can not speak for anyone else, if I sound different to
> that, then I was obviously convinced by those arguments as that
> sounds better than I forgot :-)
>
> More seriously, ignoring the reading frame or the genetic code when
> counting is rather bad form!

Why?  In many situations they are irrelevant.  Consider counting
restriction enzyme digest sites, for example, plus counting in any
protein sequence.

> I can not think of a relevant case involving a protein sequence -
> although counting pairs of cysteines in insulin-like sequences could
> be a situation of importance (related to disulphide bonds).
>
> An example for nucleic sequences, counting 'TTT' in the made-up
> sequence 'TTTTTTTGG' can be two in frames 1 and 2 but only one in
> frame 3.

An answer of 2 (using a non-overlapping search like the python
string method) and an answer of 5 (using an overlapping search) are
both valid expected outcomes for "TTT" in "TTTTTTTGG".

Here you seem to want to count codons - which is by its nature a
frame-dependent task.
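
To make the two answers concrete, here is a minimal sketch on plain
Python strings (not Seq objects). The helper name overlapping_count is
only an illustration and is not an existing Biopython method:

```python
def overlapping_count(text, sub):
    """Count occurrences of sub in text, allowing matches to overlap."""
    if not sub:
        raise ValueError("search string must not be empty")
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Advance by one character only, so overlapping hits are found.
        start = text.find(sub, start + 1)
    return count


seq = "TTTTTTTGG"
non_overlapping = seq.count("TTT")           # built-in str.count
overlapping = overlapping_count(seq, "TTT")  # hypothetical helper above
print(non_overlapping, overlapping)
```

Neither number depends on a reading frame; both simply scan the whole
string from left to right.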

Peter


From biopython at maubp.freeserve.co.uk  Thu Mar  5 11:35:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 16:35:10 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <a882e48b0903050723m9532fd5lb81dd26e85925c62@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
	<418103.38901.qm@web62405.mail.re1.yahoo.com>
	<320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
	<a882e48b0903050723m9532fd5lb81dd26e85925c62@mail.gmail.com>
Message-ID: <320fb6e00903050835h2c548083jda67b5f50fcfc842@mail.gmail.com>

On Thu, Mar 5, 2009 at 3:23 PM, Noel O'Boyle <baoilleach at gmail.com> wrote:
> I think we are arguing here over which should be the default value.
>
> Several people here believe that behaviour analogous to Python's
> string.count will reduce bug reports and user confusion. However,
> no-one except Leighton has been able to come up with a single use case
> where the current behaviour is useful (and even that example, with
> respect, was flimsy). So we end up with a method which adheres
> magnificently to the principle of least surprise, but which is of no
> use to users. Aren't you trying to provide methods which are useful
> for biological analysis? Isn't that the purpose of wrapping the string
> in the first place?
>
> Noel (getting far too excited over painting this bikeshed)

If we hadn't been shipping Biopython with the old non-overlapping
python-string-like count method for the last year, I would
probably have been more willing to agree that the Seq count method
could differ from the python-string and use an overlapping search.
However, changing it now also breaks backwards compatibility, which
shouldn't be done lightly.  We could still do this (implementation
discussion on the dev list or on Bug 2779), but we would have to make
this change very clear in the release notes.

Peter

From mjldehoon at yahoo.com  Fri Mar  6 06:52:58 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 6 Mar 2009 03:52:58 -0800 (PST)
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <791065.98994.qm@web62403.mail.re1.yahoo.com>


> > Another option is to continue to use count() for a Python-style count,
> > and to add a new method that does an overlapping-type count. For this
> > new method we'd need a clear but short name, and I can't think of
> > anything now.
> >
> Did you like plan (c), which preserves the Python string style count
> as the default but offers the overlapping count via an optional
> argument?
> 
It's also OK, but if we use a different method name we can leave count() untouched altogether.

--Michiel.


From biopython at maubp.freeserve.co.uk  Fri Mar  6 07:07:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 12:07:57 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <791065.98994.qm@web62403.mail.re1.yahoo.com>
References: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
	<791065.98994.qm@web62403.mail.re1.yahoo.com>
Message-ID: <320fb6e00903060407u7383545fp80fc8b81899a33a7@mail.gmail.com>

On Fri, Mar 6, 2009 at 11:52 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> > Another option is to continue to use count() for a Python-style count,
>> > and to add a new method that does an overlapping-type count. For this
>> > new method we'd need a clear but short name, and I can't think of
>> > anything now.
>>
>> Did you like plan (c), which preserves the Python string style count
>> as the default but offers the overlapping count via an optional
>> argument?
>
> It's also OK, but if we use a different method name we can leave count() untouched altogether.

Looking back, Sebastian Bassi raised this issue back in 2003 on this
mailing list, and his overlap-aware-count implementation is used
internally by Bio.SeqUtils.MeltingTemp, see:
http://lists.open-bio.org/pipermail/biopython/2003-November/001741.html
http://lists.open-bio.org/pipermail/biopython/2003-November/001742.html
etc

Sebastian also posted an enhancement request for adding an overlap
aware counting method to the python base string, with "overcount" as a
possible name.   I don't know what happened to his bug report, it
seems to have been marked private:
http://mail.python.org/pipermail/python-bugs-list/2003-November/021239.html

I don't really like the name "overcount", but as another suggestion
how about "count_ol" which is short for count-with-overlaps?

Peter

From lpritc at scri.ac.uk  Fri Mar  6 07:15:59 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 06 Mar 2009 12:15:59 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <791065.98994.qm@web62403.mail.re1.yahoo.com>
Message-ID: <C5D6C47F.1E880%lpritc@scri.ac.uk>

In the spirit of being blindingly obvious, how about:

Seq.overlapping_count()

;)

L.


On 06/03/2009 11:52, "Michiel de Hoon" <mjldehoon at yahoo.com> wrote:

> 
>>> Another option is to continue to use count() for a Python-style count,
>>> and to add a new method that does an overlapping-type count. For this
>>> new method we'd need a clear but short name, and I can't think of
>>> anything now.
>>>
>> Did you like plan (c), which preserves the Python string style count
>> as the default but offers the overlapping count via an optional
>> argument?
>> 
> It's also OK, but if we use a different method name we can leave count()
> untouched altogether.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From chapmanb at 50mail.com  Fri Mar  6 08:14:04 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 6 Mar 2009 08:14:04 -0500
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <C5D6C47F.1E880%lpritc@scri.ac.uk>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
	<C5D6C47F.1E880%lpritc@scri.ac.uk>
Message-ID: <20090306131404.GJ69627@sobchak.mgh.harvard.edu>

Hey all;
Great discussion on this. My preference is for a new function,
and I like Leighton's naming suggestion.

Also, unless someone has a use case for the current count()
function, we should deprecate and eventually remove it. Overriding
the string API where it makes sense is good, but here it seems to be
creating confusion and not solving a problem. If someone needs the
real string count, they can always do str(your_seq).count("GG").

Brad

> In the spirit of being blindingly obvious, how about:
> 
> Seq.overlapping_count()
> 
> ;)
> 
> L.
> 
> 
> On 06/03/2009 11:52, "Michiel de Hoon" <mjldehoon at yahoo.com> wrote:
> 
> > 
> >>> Another option is to continue to use count() for a Python-style count,
> >>> and to add a new method that does an overlapping-type count. For this
> >>> new method we'd need a clear but short name, and I can't think of
> >>> anything now.
> >>>
> >> Did you like plan (c), which preserves the Python string style count
> >> as the default but offers the overlapping count via an optional
> >> argument?
> >> 
> > It's also OK, but if we use a different method name we can leave count()
> > untouched altogether.
> 
> -- 
> Dr Leighton Pritchard MRSC
> D131, Plant Pathology Programme, SCRI
> Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
> e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
> gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
> 
> 

From biopython at maubp.freeserve.co.uk  Fri Mar  6 09:13:42 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 14:13:42 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <20090306131404.GJ69627@sobchak.mgh.harvard.edu>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
	<C5D6C47F.1E880%lpritc@scri.ac.uk>
	<20090306131404.GJ69627@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>

On Fri, Mar 6, 2009 at 1:14 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hey all;
> Great discussion on this. My preference is for a new function,
> and I like Leighton's naming suggestion.

Yes, "overlapping_count" is a reasonable choice.  It's a bit long, but
it is clear.

> Also, unless someone has a use case for the current count()
> function, we should deprecate and eventually remove it. Overriding
> the string API where it makes sense is good, but here it seems to be
> creating confusion and not solving a problem. If someone needs the
> real string count, they can always do str(your_seq).count("GG").

There is the very common use case of my_seq.count("A"), or similar,
with single character search strings, and lots of code does this (both
in Biopython and I'm sure users' scripts).  For single letters of
course, a non-overlapping count and an overlapping count do the same
thing - deprecating the count method would cause a lot of unnecessary
upheaval.

Ignoring that, given we want the Seq to generally behave like a python
string, I think removing the count method would still be a bad idea.

[As a compromise, assuming we add an overlapping_count method and do a
Biopython 1.50 beta release, the beta release could include a warning
in the count method when used with a multi-character search string,
suggesting the user might in fact need an overlapping count.  Or is
this a bit too crazy?]
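
A rough sketch of what such a warning could look like, written as a
stand-alone function on plain strings. The function name and the
warning text are made up for illustration; nothing here is existing
Biopython behaviour:

```python
import warnings


def count_with_warning(seq_text, sub):
    """Non-overlapping count that warns for multi-character searches."""
    if len(sub) > 1:
        # Hypothetical wording; single letters never trigger this,
        # because both kinds of count agree for them.
        warnings.warn(
            "count() is non-overlapping; an overlapping count may give "
            "a different answer for multi-character search strings",
            UserWarning,
            stacklevel=2,
        )
    return seq_text.count(sub)


print(count_with_warning("AAAA", "A"))   # no warning
print(count_with_warning("AAAA", "AA"))  # warns, then counts
```

Because single-letter searches stay silent, the very common
my_seq.count("A") style of call would be unaffected.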

Peter

From bsouthey at gmail.com  Fri Mar  6 10:06:07 2009
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 06 Mar 2009 09:06:07 -0600
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>	<C5D6C47F.1E880%lpritc@scri.ac.uk>	<20090306131404.GJ69627@sobchak.mgh.harvard.edu>
	<320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
Message-ID: <49B13BDF.9030908@gmail.com>

Peter wrote:
> On Fri, Mar 6, 2009 at 1:14 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>   
>> Hey all;
>> Great discussion on this. My preference is for a new function,
>> and I like Leighton's naming suggestion.
>>     
>
> Yes, "overlapping_count" is a reasonable choice.  It's a bit long, but
> it is clear.
>
>   
>> Also, unless someone has a use case for the current count()
>> function, we should deprecate and eventually remove it. Overriding
>> the string API where it makes sense is good, but here it seems to be
>> creating confusion and not solving a problem. If someone needs the
>> real string count, they can always do str(your_seq).count("GG").
>>     
I have already given one use case where overlapping counts are totally
inappropriate! Unique codon counting is extremely important in many
areas including gene prediction (possible splicing sites) and molecular
evolution (like codon usage).

Another valid case given was DNA restriction sites, where you may want
both overlapping and unique counts. For example, DNA may be digested by
one enzyme that has unique sites in the sequence, followed by a
second enzyme that has unique sites in the digested product but possibly
duplicates in the original sequence.

I just do not understand your logic of requiring a conversion when the
Seq object is designed to 'behave like a python string'.

>
> There is the very common use case of my_seq.count("A"), or similar,
> with single character search strings, and lots of code does this (both
> in Biopython and I'm sure user's scripts).  For single letters of
> course, a non-overlapping count and an overlapping count do the same
> thing - deprecating the count method would cause a lot of unnecessary
> upheaval.
>
> Ignoring that, given we want the Seq to generally behave like a python
> string, I think removing the count method would still be a bad idea.
>   
I agree.
> [As a compromise, assuming we add an overlapping_count method and do a
> Biopython 1.50 beta release, the beta release could include a warning
> in the count method when used with a multi-character search string,
> suggesting the user might in fact need an overlapping count.  Or is
> this a bit too crazy?]
>   
Yes it is too crazy and does not fit into the current established 
behavior of Biopython.

Bruce

From biopython at maubp.freeserve.co.uk  Fri Mar  6 10:15:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 15:15:24 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <49B13BDF.9030908@gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
	<C5D6C47F.1E880%lpritc@scri.ac.uk>
	<20090306131404.GJ69627@sobchak.mgh.harvard.edu>
	<320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
	<49B13BDF.9030908@gmail.com>
Message-ID: <320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>

On Fri, Mar 6, 2009 at 3:06 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> I have already given one use case where overlapping counts are totally
> inappropriate! Unique codon counting is extremely important in many areas
> including gene prediction (possible splicing sites) and molecular evolution
> (like codon usage).

For codon counting NEITHER the current non-overlapping count nor the
suggested overlapping count would be suitable.  So this doesn't really
affect the overlapping versus non-overlapping debate.

Peter

From bsouthey at gmail.com  Fri Mar  6 10:34:42 2009
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 06 Mar 2009 09:34:42 -0600
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>	
	<C5D6C47F.1E880%lpritc@scri.ac.uk>	
	<20090306131404.GJ69627@sobchak.mgh.harvard.edu>	
	<320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>	
	<49B13BDF.9030908@gmail.com>
	<320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>
Message-ID: <49B14292.6080806@gmail.com>

Peter wrote:
> On Fri, Mar 6, 2009 at 3:06 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>   
>> I have already given one use case where overlapping counts are totally
>> inappropriate! Unique codon counting is extremely important in many areas
>> including gene prediction (possible splicing sites) and molecular evolution
>> (like codon usage).
>>     
>
> For codon counting NEITHER the current non-overlapping count nor the
> suggested overlapping count would be suitable.  So this doesn't really
> affect the overlapping versus non-overlapping debate.
>
> Peter
>   
With due respect, this does not make any sense.

If it is a cDNA then I can count, say, the different Lysine codons to find
any usage bias using seq.count('AAA') /
(seq.count('AAA') + seq.count('AAG')). (Actually I am more interested in
the occurrence of specific multiple codons than single codons.)
If you want the forward frames then just use seq[0:].count('AAA'),
seq[1:].count('AAA') and seq[2:].count('AAA') for frames 1, 2, and 3,
respectively.

As you pointed out single characters are not relevant so what is relevant?

Bruce

From biopython at maubp.freeserve.co.uk  Fri Mar  6 10:46:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 15:46:19 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <49B14292.6080806@gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
	<C5D6C47F.1E880%lpritc@scri.ac.uk>
	<20090306131404.GJ69627@sobchak.mgh.harvard.edu>
	<320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
	<49B13BDF.9030908@gmail.com>
	<320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>
	<49B14292.6080806@gmail.com>
Message-ID: <320fb6e00903060746r309216e7t36d00434993a8cfb@mail.gmail.com>

On Fri, Mar 6, 2009 at 3:34 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>>
>> For codon counting NEITHER the current non-overlapping count nor the
>> suggested overlapping count would be suitable.  So this doesn't really
>> affect the overlapping versus non-overlapping debate.
>>
>> Peter
>
> With due respect, this does not make any sense.
>
> If it is a cDNA then I can count say the different Lysine codons to find any
> usage bias using seq.count('AAA') / (seq.count('AAA') + seq.count('AAG')).
> (Actually I am more interested in the occurrence of specific multiple codons
> than single codons.)

If you have the (short) CDS "TAAAAAAAAAAG" which codes for "LKKK",
then the codon count for "AAA" is 2 and the codon count for "AAG" is
1.

Using the (standard python) non overlapping count method,
"TAAAAAAAAAAG".count("AAA") = 3 and "TAAAAAAAAAAG".count("AAG") = 1
which does not do what you want.

Using a hypothetical overlapping count method,
"TAAAAAAAAAAG".overlapping_count("AAA") = 8 and
"TAAAAAAAAAAG".overlapping_count("AAG") = 1 which does not do what you
want.

i.e. As I said, for codon counting NEITHER the current non-overlapping
count nor the suggested overlapping count would be suitable.

You seem to be asking for something different - a codon counting
method, which is a special case of a non-overlapping count.
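
For comparison, a frame-aware codon count might look something like
the sketch below. It works on plain strings, and the name codon_count
and its frame argument are assumptions made for this example rather
than anything in Biopython:

```python
def codon_count(cds, codon, frame=0):
    """Count a codon by stepping through cds three letters at a time."""
    if len(codon) != 3:
        raise ValueError("a codon must be exactly three letters")
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    return sum(
        1
        # Only positions frame, frame + 3, frame + 6, ... are examined,
        # and a trailing partial codon is ignored.
        for i in range(frame, len(cds) - 2, 3)
        if cds[i:i + 3] == codon
    )


cds = "TAAAAAAAAAAG"
print(codon_count(cds, "AAA"), codon_count(cds, "AAG"))
print(cds.count("AAA"))  # plain non-overlapping count, for contrast
```

The difference from str.count is that matches are only accepted at
codon boundaries, which is why the frame has to be supplied.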

Peter


From lpritc at scri.ac.uk  Fri Mar  6 10:47:37 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 06 Mar 2009 15:47:37 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <49B13BDF.9030908@gmail.com>
Message-ID: <C5D6F619.1E8B4%lpritc@scri.ac.uk>

On 06/03/2009 15:06, "Bruce Southey" <bsouthey at gmail.com> wrote:

> Peter wrote:
>> On Fri, Mar 6, 2009 at 1:14 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>>> Also, unless someone has a use case for the current count()
>>> function, we should deprecate and eventually remove it. Overriding
>>> the string API where it makes sense is good, but here it seems to be
>>> creating confusion and not solving a problem. If someone needs the
>>> real string count, they can always do str(your_seq).count("GG").
>>>     
> I have already given one use case where overlapping counts are totally
> inappropriate! Unique codon counting is extremely important in many
> areas including gene prediction (possible splicing sites) and molecular
> evolution (like codon usage).

We're not discussing codon counting though, we're discussing counting
occurrences of an arbitrary substring in a sequence.  They're not the same
operation, even though they both involve counting.

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405



From chapmanb at 50mail.com  Fri Mar  6 17:46:39 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 6 Mar 2009 17:46:39 -0500
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
	<C5D6C47F.1E880%lpritc@scri.ac.uk>
	<20090306131404.GJ69627@sobchak.mgh.harvard.edu>
	<320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
Message-ID: <20090306224639.GM69627@sobchak.mgh.harvard.edu>

Me:
> > Also, unless someone has a use case for the current count()
> > function, we should deprecate and eventually remove it. Overriding
> > the string API where it makes sense is good, but here it seems to be
> > creating confusion and not solving a problem. If someone needs the
> > real string count, they can always do str(your_seq).count("GG").

Bruce:
> I have already given one use case where overlapping counts are totally
> inappropriate! Unique codon counting

Sorry, I was a bit terse in my previous e-mail. My thought on
deprecation was actually based on your and Noel's emails; both of
you presented cases where you had biological expectations for count
which are not met by the standard string count behaviour. 

For Noel, this is handled by the proposed overlapping_count
function. For your example, I think it would be better handled by
functionality that returned a list of codons, like:

Seq("ATGGAACAT").codon_list(phase=0)
["ATG", "GAA", "CAT"]
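
As a rough sketch of that idea, written as a stand-alone function on
plain strings (codon_list is not an existing Seq method, and treating
phase as a simple offset of 0, 1 or 2 is my assumption):

```python
from collections import Counter


def codon_list(seq_text, phase=0):
    """Split a sequence string into complete codons starting at phase."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    # Any incomplete codon at the end is dropped.
    return [seq_text[i:i + 3] for i in range(phase, len(seq_text) - 2, 3)]


codons = codon_list("ATGGAACAT", phase=0)
print(codons)
print(Counter(codons))  # codon usage falls out of the list directly
```

With a list of codons in hand, the ordinary list count method or a
Counter gives codon usage without any overlap question arising.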

Bruce:
> I just do not understand your logic of requiring a conversion when the
> Seq object is designed to 'behave like a python string'.

This is representing a biological sequence, so I think that where a biologist
user's intuition opposes what a standard python string does, we
should look for an option that is more in line with expectations.
My point about the string was just that if you are thinking as a python programmer
and really want python string behavior, it is pretty easy to get.

Peter:
> There is the very common use case of my_seq.count("A"), or similar,
> with single character search strings, and lots of code does this (both
> in Biopython and I'm sure user's scripts).  For single letters of
> course, a non-overlapping count and an overlapping count do the same
> thing - deprecating the count method would cause a lot of unnecessary
> upheaval.

Good point; I totally overlooked that. I retract my suggestion. I do
like your warning idea, but maybe we can get by here with
documentation and by highlighting the alternative functions.
It looks like you're already all over the documentation, so
hopefully the new functionality will fix up any confusion.

Thanks all,
Brad

From chapmanb at 50mail.com  Sun Mar  8 12:29:41 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 8 Mar 2009 12:29:41 -0400
Subject: [BioPython] Initial work on a GFF parser
Message-ID: <20090308162941.GA99653@kunkel>

Hi all;
Generic Feature Format (GFF) is a nice tab delimited file format
that we don't have full support for in Biopython. Michael Hoffman
contributed code to work with GFF MySQL databases (in Bio.GFF), but
we don't have a GFF parser for the flatfiles. Looking back over the
list archives, this has come up a couple of times without a finished
solution being implemented. GFF suffers from the curse of being too easy
to hack together a solution for parsing a very specific problem, while
generating a good standard parser takes more work.
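
To illustrate the "easy to hack together" end of that spectrum, here is
a minimal sketch that splits one GFF3 feature line into its nine
tab-separated columns. It is not the parser from the blog post; the
function name and the example line are made up, and it skips
validation and directives entirely:

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for comments or blanks."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d"
                         % len(fields))
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    if record["attributes"] != ".":
        # Column 9 holds key=value pairs separated by semicolons; a
        # value may itself be a comma-separated list.
        for pair in record["attributes"].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[unquote(key)] = [unquote(v)
                                            for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Made-up example line, not taken from any real annotation.
example = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=demo"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"])
print(feature["attributes"])
```

What this leaves out (parent/child links between features, grouping of
scattered features per record, GFF2 dialects) is where the real work
lies.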

Recently, Peter brought up GFF on the BioSQL mailing list, which
made me interested in digging into GFF as an input and output flat
file format for BioSQL databases. Towards this end I put together an
initial implementation of a GFF (version 3) parser for Biopython. A
write up and the code are here:

http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/

As described in the post, the GFF interface will be a bit different
from the standard SeqIO interface, since GFF stores features
separately from the sequences and also doesn't require features for
a record to be grouped together.

As a result, the interface is up for discussion and the best path is to
start with an implementation and see where it takes us. I'd be grateful
for any feedback and code from those who are interested. We can discuss
on the development mailing list or on the blog, and move towards getting
stable full featured GFF parsing in Biopython.

Brad

From biopython at maubp.freeserve.co.uk  Mon Mar  9 06:14:55 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 9 Mar 2009 10:14:55 +0000
Subject: [BioPython] Initial work on a GFF parser
In-Reply-To: <20090308162941.GA99653@kunkel>
References: <20090308162941.GA99653@kunkel>
Message-ID: <320fb6e00903090314q19d64af2m4e37918fc3f5f164@mail.gmail.com>

On Sun, Mar 8, 2009 at 4:29 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi all;
> Generic Feature Format (GFF) is a nice tab delimited file format
> that we don't have full support for in Biopython. Michael Hoffman
> contributed code to work with GFF MySQL databases (in Bio.GFF), but
> we don't have a GFF parser for the flatfiles. Looking back over the
> list archives, this has come up a couple of times without a finished
> solution being implemented. GFF suffers from the curse of being too easy
> to hack together a solution for parsing a very specific problem, while
> generating a good standard parser takes more work.

You're right about creating a good general parser taking more work ;)

See also enhancement Bug 2762, GFF capability in SeqIO, which has some
discussion.

Also, it wasn't clear from your blog if you are thinking about just
GFF version 3, or something more general, coping with the assorted
comparatively ill defined GFF2 variants.

> Recently, Peter brought up GFF on the BioSQL mailing list, which
> made me interested in digging into GFF as an input and output flat
> file format for BioSQL databases. Towards this end I put together an
> initial implementation of a GFF (version 3) parser for Biopython. A
> write up and the code are here:
>
> http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/
>
> As described in the post, the GFF interface will be a bit different
> from the standard SeqIO interface, since GFF stores features
> separately from the sequences and also doesn't require features for
> a record to be grouped together.

Regarding where to put this code, if it isn't going to support the
Bio.SeqIO interface then it shouldn't really go in Bio.SeqIO, but
maybe Bio.GFF or Bio.GFF3 instead.

However, you could still fit gff(3) files into Bio.SeqIO, it's just
that the sequence may not be present.  This would be similar to GenBank
files, which usually have a long list of features plus the full sequence,
but the sequence itself may be missing - for example if there is just a
CONTIG line.  Or QUAL files from sequencing where there is never a
sequence.

As with GenBank files for a large genome/chromosome, for a typical GFF
file for Bio.SeqIO we'd just return a single SeqRecord containing all
the features - within the SeqIO API there is no way to offer memory
efficient iteration over the features themselves.

Maybe we need to invent Bio.FeatureIO for this?  You could consider
GenBank/EMBL feature tables, GFF files, NCBI protein tables, and
probably a few other formats too.

> As a result, the interface is up for discussion and the best path is to
> start with an implementation and see where it takes us. I'd be grateful
> for any feedback and code from those who are interested. We can discuss
> on the development mailing list or on the blog, and move towards getting
> stable full featured GFF parsing in Biopython.

From the blog post it sounds like you are using sub-features to store
the parent/child relationship between say mRNAs and genes.  This is
elegant, but as I wrote on Bug 2762 comment 1, this isn't enough to
cope with the general parent (part-of) relationships allowed in GFF
files - for example an exon may have multiple parents.

There is also the complication that when parsing GenBank files, a gene
or CDS feature with a join-location ends up represented using
sub-features (which probably would be represented with an explicit
intron/exon structure in GFF files) [This is something I don't really
like with the current object structure].  We'd want things to be
fairly uniform between the parsers - for one thing our BioSQL code
currently records a feature with subfeatures as a single feature in
the database.

Peter

From chapmanb at 50mail.com  Mon Mar  9 18:42:24 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 9 Mar 2009 18:42:24 -0400
Subject: [BioPython] Initial work on a GFF parser
In-Reply-To: <320fb6e00903090314q19d64af2m4e37918fc3f5f164@mail.gmail.com>
References: <20090308162941.GA99653@kunkel>
	<320fb6e00903090314q19d64af2m4e37918fc3f5f164@mail.gmail.com>
Message-ID: <20090309224224.GA4481@sobchak.mgh.harvard.edu>

Peter;
Thanks much for the feedback.

> See also enhancement Bug 2762, GFF capability in SeqIO, which has some
> discussion.
> 
> Also, it wasn't clear from your blog if you are thinking about just
> GFF version 3, or something more general, coping with the assorted
> comparatively ill defined GFF2 variants.

Bug 2762 had a lot of good background and ideas which helped in
getting started. I did take the sub_feature route instead of the
flattened method Leighton suggested there.

Right now this tackles GFF3. The hard part is going to be
getting a framework in place, and then GFF2 or GTF (or GFF2.5 or
whatever they call it) support could be added.

> Regarding where to put this code, if it isn't going to support the
> Bio.SeqIO interface then it shouldn't really go in Bio.SeqIO, but
> maybe Bio.GFF or Bio.GFF3 instead.
> 
> However, you could still fit gff(3) files into Bio.SeqIO, it's just
> that the sequence may not be present.  This would be similar to GenBank
> files, which usually have a long list of features plus the full sequence,
> but the sequence itself may be missing - for example if there is just a
> CONTIG line.  Or QUAL files from sequencing where there is never a
> sequence.

Yes, where it lives is a good topic for debate. For GFF files, you'd
at least like the option to add new features to an existing sequence
record, which is what I do here. It would be easy enough to create
new blank records if one is not present initially. The difficult
thing with adding this to the existing syntax is that the GFF files
are not ordered for efficient iteration. You essentially have to
parse the whole file, so something like this would handle the
syntax:

seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
final_seq_dict = SeqIO.add_features(gff_handle, "gff3", initial_dict=seq_dict)

Along these lines, I liked the way you did a sequence/quality dual
iterator for quality output and think that works well when ordering of
the records in multiple files is stable.

> As with GenBank files for large genome/chromosome, for a typical GFF
> file for Bio.SeqIO we'd just return a single SeqRecord containing all
> the features - within the SeqIO API there is no way to offer memory
> efficient iteration over the features themselves.
> 
> Maybe we need to invent Bio.FeatureIO for this?  You could consider
> GenBank/EMBL feature tables, GFF files, NCBI protein tables, and
> probably a few other formats too.

FeatureIO is something BioPerl has; this page describes the status
of GFF in BioPerl but is over a year old so things may have changed:

http://www.bioperl.org/wiki/GFF_code_audit

The iteration model still falls apart because of the undefined
ordering of the file. That is why I settled on the filter approach
to limit what you get to a reasonable memory size but still guarantee
you've pulled all relevant features before building the parent/child
relationships and features. This could also apply to data that comes off
cluster runs where the output order will not necessarily correlate with
the inputs.

The filtering approach could also be useful for large GenBank files,
as you could skip adding features and parsing locations for elements you
are not interested in. If others find this approach intuitive, it
would be worth looking at there as well.
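
To make the filtering idea concrete, here is a minimal sketch in plain
Python. It is not the parser from the blog post: the function name
filter_gff3_lines and its seqids/types arguments are made up for
illustration. It relies only on GFF3 feature lines having nine
tab-separated columns, with the sequence ID in column 1 and the feature
type in column 3.

```python
def filter_gff3_lines(lines, seqids=None, types=None):
    """Yield (seqid, type, start, end, strand, attributes) for matching lines.

    Hypothetical helper: 'seqids' and 'types' are optional sets used to
    skip unwanted records before any further parsing is done.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        parts = line.split("\t")
        if len(parts) != 9:
            continue  # not a feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = parts
        if seqids is not None and seqid not in seqids:
            continue
        if types is not None and ftype not in types:
            continue
        yield seqid, ftype, int(start), int(end), strand, attrs


# Made-up example lines; note that the chr1 features are not adjacent.
example = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr2\tdemo\tgene\t50\t400\t.\t-\t.\tID=gene2",
    "chr1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
]
wanted = list(filter_gff3_lines(example, seqids={"chr1"}, types={"gene", "mRNA"}))
```

Because the whole file is still scanned, every chr1 feature is seen
before any parent/child relationships are built, but only the filtered
subset has to be held in memory.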

> From the blog post it sounds like you are using sub-features to store
> the parent/child relationship between say mRNAs and genes.  This is
> elegant, but as I wrote on Bug 2762 comment 1, this isn't enough to
> cope with the general parent (part-of) relationships allowed in GFF
> files - for example an exon may have multiple parents.

For these the exon is added as a sub_feature to all of its parents. The
shared feature is the same one in memory. t_nested_multiparent_features
in the test code demonstrates this. How we output it to BioSQL is up for
debate but we should also be able to do some sharing there; duplication
is also not too bad of an option if it makes it cleaner since these are
not likely to be deeply nested.
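
As a small illustration of what "the same one in memory" means, the
sketch below uses a stand-in Feature class (hypothetical, not Biopython's
SeqFeature, and not the test code mentioned above) and attaches a single
exon object to two parent transcripts.

```python
class Feature(object):
    """Stand-in for a feature with nested sub_features (illustration only)."""

    def __init__(self, feature_id, feature_type):
        self.id = feature_id
        self.type = feature_type
        self.sub_features = []


def link_children(features, parents_of):
    """Append each child to every parent listed for it in 'parents_of'."""
    by_id = dict((f.id, f) for f in features)
    for child_id, parent_ids in parents_of.items():
        for parent_id in parent_ids:
            by_id[parent_id].sub_features.append(by_id[child_id])


mrna_a = Feature("mRNA_a", "mRNA")
mrna_b = Feature("mRNA_b", "mRNA")
exon = Feature("exon_1", "exon")
link_children([mrna_a, mrna_b, exon], {"exon_1": ["mRNA_a", "mRNA_b"]})

# Both parents hold a reference to the same exon object, not a copy.
shared = mrna_a.sub_features[0] is mrna_b.sub_features[0]
```

A change made to the exon through one parent is therefore visible through
the other, which is also why writing it out twice to a database would be
a duplication rather than a faithful copy of the in-memory structure.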

> There is also the complication that when parsing GenBank files, a gene
> or CDS feature with a join-location ends up represented using
> sub-features (which probably would be represented with an explicit
> intron/exon structure in GFF files) [This is something I don't really
> like with the current object structure].  We'd want things to be
> fairly uniform between the parsers - for one thing our BioSQL code
> currently records a feature with subfeatures as a single feature in
> the database.

BioSQL definitely needs work to handle sub_features more generally.
The seqfeature_relationship table in BioSQL can handle these but it
needs to be coded. I agree with you that the way we do it now is a
little too GenBank specific. This is a bit of a larger project since
we should coordinate with the other projects, but as long as we
continue to support the same location mechanism they use currently
it will be back-compatible with older code.

Thanks again for the thoughts,
Brad

From hlapp at gmx.net  Mon Mar  9 23:36:30 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 9 Mar 2009 23:36:30 -0400
Subject: [BioPython] Google Summer of Code: Call for Bio* Volunteers
In-Reply-To: <D9B5E438-4FBA-4073-B740-FB0D72C5790B@gmx.net>
References: <D9B5E438-4FBA-4073-B740-FB0D72C5790B@gmx.net>
Message-ID: <BE9FC557-510B-4A41-99AD-E0DA1C102CB3@gmx.net>

You may recall my message to the developer lists of several O|B|F  
projects in February about the idea of O|B|F applying to Google Summer  
of Code as a mentoring organization [1].

I felt that the response to this was very positive and encouraging.  
Although late (sorry, been swamped too much), I've now put up the  
skeleton of an ideas page at

http://open-bio.org/wiki/Google_Summer_Code_2009

I basically modeled (in fact, largely copied) this page after the  
NESCent Phyloinformatics Summer of Code ideas pages, which I think  
worked pretty well. We can completely rework this, though - any  
feedback and suggestions are very much welcome.

In the meantime, I need all developers to double check the information  
under 'Contact'. Would the open-bio-l mailing list indeed reach the  
prospective mentors and other devs? Will you be fine with students  
asking for feedback to their applications on the developers (i.e.,  
this) list? Is there a blessed IRC where at least some of the  
prospective mentors hang out for students to ask questions during the  
time they apply?

I also need space for the reference information for all projects that  
will participate with at least one project idea (I would hope that  
that's all projects) to be added in the 'Open-Bio projects involved'  
section.

*****
Most important of all, if you can volunteer to mentor a project,  
please post a project idea to the page in the respective section,  
using the idea template that's there already (copy, paste, and edit).
*****

The deadline for organization applications is Friday this week, Mar  
13, which is very soon. The ideas page is a major factor and component  
in how Google scores new mentoring organizations - the more we can  
show the resourcefulness and diversity of our member projects the more  
competitive I think we'll be. So all those who responded with ideas or  
willingness to help out as primary or secondary mentors earlier, I  
need you to think about and put up your idea(s) now.

Cheers,

	-hilmar

[1] http://tinyurl.com/ck7tqe

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From dalloliogm at gmail.com  Tue Mar 10 13:06:27 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 10 Mar 2009 18:06:27 +0100
Subject: [BioPython] can biopython query KEGG directly?
Message-ID: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com>

Hi,
is it possible to query the KEGG database with biopython?

Actually, I can do it with KEGG's WSDL API and the Python suds
library and it works very well, but I was wondering whether there is
something more integrated with biopython.
For example, if there is something similar to Entrez, that can
automatically retrieve a sequence from ncbi and transform it to a
SeqRecord object.


-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Tue Mar 10 14:08:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 10 Mar 2009 18:08:01 +0000
Subject: [BioPython] can biopython query KEGG directly?
In-Reply-To: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com>
References: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com>
Message-ID: <320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com>

On Tue, Mar 10, 2009 at 5:06 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> is it possible to query the KEGG database with biopython?

I don't think there is any wrapper for the KEGG online API (yet).  See:
http://www.genome.jp/kegg/soap/doc/keggapi_manual.html

This does sound like a worthwhile addition (especially if the SOAP
stuff can be done using only core python libraries included in Python
2.4+)

> .. and transform it to a SeqRecord object.

We still need a Bio.KEGG gene parser, see also:
http://bioperl.org/wiki/KEGG_sequence_format
http://lists.open-bio.org/pipermail/biopython/2008-January/004000.html
Once that is done, a KEGG wrapper in Bio.SeqIO would make sense.

Peter

From matzke at berkeley.edu  Tue Mar 10 21:18:12 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Tue, 10 Mar 2009 18:18:12 -0700
Subject: [BioPython] GSoC project: Biogeographical and community
	phylogenetics for BioPython
Message-ID: <49B71154.5060109@berkeley.edu>

On the advice of Mauricio & Hilmar, I have posted a draft proposal for a 
Google Summer of Code project: Biogeographical and community 
phylogenetics for BioPython.

http://open-bio.org/wiki/Google_Summer_Code_2009#Biogeographical_and_community_phylogenetics_for_BioPython

Comments welcome on- or off-list.  Cheers!

PS: Also, additional suggestions for pertinent members would be appreciated.

Nick


-- 
====================================================
Nicholas J. Matzke
Ph.D. student, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page: 
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people 
thought the earth was spherical, they were wrong. But if you think that 
thinking the earth is spherical is just as wrong as thinking the earth 
is flat, then your view is wronger than both of them put together."

Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================

From dalloliogm at gmail.com  Thu Mar 12 08:33:04 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 12 Mar 2009 13:33:04 +0100
Subject: [BioPython] can biopython query KEGG directly?
In-Reply-To: <320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com>
References: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com>
	<320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com>
Message-ID: <5aa3b3570903120533w1d6bad6fy12b70ebf769deef2@mail.gmail.com>

On Tue, Mar 10, 2009 at 7:08 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Tue, Mar 10, 2009 at 5:06 PM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > Hi,
> > is it possible to query the KEGG database with biopython?
>
> I don't think there is any wrapper for the KEGG online API (yet).  See:
> http://www.genome.jp/kegg/soap/doc/keggapi_manual.html


Well, if someone is in a hurry to query KEGG with SOAP, I have some scripts
(but they use the suds library).


>
>
> This does sound like a worthwhile addition (especially if the SOAP
> stuff can be done using only core python libraries included in Python
> 2.4+)


I am not sure whether the SOAPpy library is included in the core Python
libraries, or whether it has been since Python 2.4.
As far as I know, SOAPpy has not been developed since 2005 (see
http://pywebsvcs.sourceforge.net/).
I couldn't test this library, because I still haven't managed to get it
working behind an HTTP proxy :-(.


>
>
> > .. and transform it to a SeqRecord object.
>
> We still need a Bio.KEGG gene parser, see also:
> http://bioperl.org/wiki/KEGG_sequence_format
> http://lists.open-bio.org/pipermail/biopython/2008-January/004000.html
> Once that is done, a KEGG wrapper in Bio.SeqIO would make sense.
>

I am just curious, but into which object would a KEGG gene file be converted?
A SeqRecord? And how, exactly? I suppose all the features would go in
SeqRecord.features... but is there any standard convention for doing so?
For example, the codon usage table, class, dblinks, and all the other
fields... how would they be stored?


>
> Peter
>


-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Thu Mar 12 10:15:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 12 Mar 2009 14:15:06 +0000
Subject: [BioPython] can biopython query KEGG directly?
In-Reply-To: <5aa3b3570903120533w1d6bad6fy12b70ebf769deef2@mail.gmail.com>
References: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com>
	<320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com>
	<5aa3b3570903120533w1d6bad6fy12b70ebf769deef2@mail.gmail.com>
Message-ID: <320fb6e00903120715n7ad57282h529150e22da826e9@mail.gmail.com>

On Thu, Mar 12, 2009 at 12:33 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
>> We still need a Bio.KEGG gene parser, see also:
>> http://bioperl.org/wiki/KEGG_sequence_format
>> http://lists.open-bio.org/pipermail/biopython/2008-January/004000.html
>> Once that is done, a KEGG wrapper in Bio.SeqIO would make sense.
>
> I am just curious, but into which object would a KEGG gene file be converted?
> A SeqRecord? And how, exactly? I suppose all the features would go in
> SeqRecord.features... but is there any standard convention for doing so?
> For example, the codon usage table, class, dblinks, and all the other
> fields... how would they be stored?

Bio.SeqIO only deals with SeqRecord objects.  If we had a KEGG gene
parser in Bio.KEGG (written in the same style as the rest of Bio.KEGG
ideally), then it would make sense to add a KEGG gene format to
Bio.SeqIO, where the KEGG gene records would be parsed using Bio.KEGG
and then converted into SeqRecord objects.  At a minimum this would
mean their id/name/description and sequence - even just that would
still be useful I feel.  For any richer annotation, the convention is
to mimic the GenBank parser as closely as possible.  See
http://biopython.org/wiki/SeqIO_dev
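
To show what "at a minimum" could look like, here is a rough sketch in
plain Python. It is not an existing Bio.KEGG function: the name
parse_kegg_gene_minimal is made up, the sample entry is invented, and the
code assumes the usual KEGG flat-file layout (keywords starting in the
first column, indented continuation lines, an AASEQ field whose first
value is the length, and "///" ending the entry).

```python
def parse_kegg_gene_minimal(lines):
    """Collect ENTRY, NAME, DEFINITION and AASEQ from one KEGG-style entry.

    Assumes keywords start in the first column and continuation lines
    are indented; every other field in the entry is kept but ignored.
    """
    fields = {}
    keyword = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("///"):
            break  # end of entry
        if line and not line.startswith(" "):
            keyword, _sep, value = line.partition(" ")
        else:
            value = line  # continuation of the previous keyword
        fields.setdefault(keyword, []).append(value.strip())
    entry = fields.get("ENTRY", [""])[0].split()
    aaseq = fields.get("AASEQ", [])
    return {
        "id": entry[0] if entry else "",
        "name": " ".join(fields.get("NAME", [])),
        "description": " ".join(fields.get("DEFINITION", [])),
        # The first AASEQ value is the length; the rest are sequence lines.
        "sequence": "".join(aaseq[1:]),
    }


# Invented entry, for illustration only.
demo_entry = [
    "ENTRY       demo0001          CDS       Made-up organism",
    "NAME        demoA",
    "DEFINITION  hypothetical protein used only",
    "            for this example",
    "AASEQ       10",
    "            MKTAYIAKQR",
    "///",
]
record = parse_kegg_gene_minimal(demo_entry)
```

The four values in the returned dictionary are the ones that would map
onto the id, name, description and seq of a SeqRecord; anything richer
would follow the GenBank conventions on the wiki page above.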

Peter

From matzke at berkeley.edu  Sat Mar 14 00:59:37 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Fri, 13 Mar 2009 21:59:37 -0700
Subject: [BioPython] Getting protein structure names from primary IDs
Message-ID: <49BB39B9.2080206@berkeley.edu>

Hi all,

This has got to be trivial, but I can't find a hint about the solution 
online.

I want to:

1. Search NCBI's structure database for structures from a certain group

from Bio import Entrez
handle = Entrez.einfo()
record = Entrez.read(handle)
print "Search the structure database on Organism = Drosophila"
Entrez.email = "A.N.Other at example.com"     # Always tell NCBI who you are
#handle = Entrez.esearch(db="structure", term="Drosophila")
handle = Entrez.esearch(db="structure", term="Drosophila[Orgn]")

pdb_record = Entrez.read(handle)
print pdb_record	#["IdList"]

pdblist = pdb_record["IdList"]


OK, now I have a list of primary IDs for the protein structures from 
Drosophila.


2. Download those structures.  Apparently I have to do this from RCSB 
and not NCBI? (NCBI efetch has no information on efetching from the 
structure database, and I tried a few obvious methods on analogy to 
other databases without result)

This will download from RCSB, but apparently you need the structure 
name, not the NCBI primary ID.


from Bio.PDB import *
pdbl=PDBList()
pdbl.retrieve_pdb_file('1FAT')


So, how do I get from primary ID to structure name?  I'm sure I'm 
missing something obvious.

Cheers,
Nick



From matzke at berkeley.edu  Sat Mar 14 01:05:47 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Fri, 13 Mar 2009 22:05:47 -0700
Subject: [BioPython] Getting protein structure names from primary IDs
In-Reply-To: <49BB39B9.2080206@berkeley.edu>
References: <49BB39B9.2080206@berkeley.edu>
Message-ID: <49BB3B2B.9080900@berkeley.edu>

Hi again -- Esummary was what I needed, so never mind!

Sorry for the trouble,
Nick


Nick Matzke wrote:
> Hi all,
> 
> This has got to be trivial, but I can't find a hint about the solution 
> online.
> 
> I want to:
> 
> 1. Search NCBI's structure database for structures from a certain group
> 
> from Bio import Entrez
> handle = Entrez.einfo()
> record = Entrez.read(handle)
> print "Search the structure database on Organism = Drosophila"
> Entrez.email = "A.N.Other at example.com"     # Always tell NCBI who you are
> #handle = Entrez.esearch(db="structure", term="Drosophila")
> handle = Entrez.esearch(db="structure", term="Drosophila[Orgn]")
> 
> pdb_record = Entrez.read(handle)
> print pdb_record    #["IdList"]
> 
> pdblist = pdb_record["IdList"]
> 
> 
> 
> OK, now I have a list of primary IDs for the protein structures from 
> Drosophila.
> 
> 
> 
> 2. Download those structures.  Apparently I have to do this from RCSB 
> and not NCBI? (NCBI efetch has no information on efetching from the 
> structure database, and I tried a few obvious methods on analogy to 
> other databases without result)
> 
> This will download from RCSB, but apparently you need the structure 
> name, not the NCBI primary ID.
> 
> 
> from Bio.PDB import *
> pdbl=PDBList()
> pdbl.retrieve_pdb_file('1FAT')
> 
> 
> So, how do I get from primary ID to structure name?  I'm sure I'm 
> missing something obvious.
> 
> Cheers,
> Nick
> 
> 
> 
> 


From hlapp at gmx.net  Sat Mar 14 18:59:57 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 14 Mar 2009 18:59:57 -0400
Subject: [BioPython] Google Summer of Code: application submitted,
	action needed
In-Reply-To: <BE9FC557-510B-4A41-99AD-E0DA1C102CB3@gmx.net>
References: <D9B5E438-4FBA-4073-B740-FB0D72C5790B@gmx.net>
	<BE9FC557-510B-4A41-99AD-E0DA1C102CB3@gmx.net>
Message-ID: <71A1E85A-2007-4FAE-A03B-475000C5CD38@gmx.net>

Hi all,

I submitted the application yesterday for O|B|F to participate in  
the 2009 Google Summer of Code as a mentoring organization. The  
application is at

http://docs.google.com/Doc?id=dhs98hzv_7zn8bxqjm

and is also linked to from the ideas page at

http://open-bio.org/wiki/Google_Summer_of_Code_2009

Now keep your fingers crossed, Google is slated to announce  
acceptances on March 18.

This is the last cross-project message re: Summer of Code that  
addresses mentors and our projects; future messages that I'll post  
across projects will be primarily for students such as announcing  
whether we are accepted or not and issuing calls for application.

**What we need most and right now is action from our projects'  
developers and from possible mentors.** Google admins will start  
reviewing organization applications on Monday. The ideas page has 6  
project ideas right now - though the ideas are good ones, the quantity  
won't be particularly impressive to Google.

Therefore, if you have an idea for a summer project for a student  
please use the C&P template (it is commented out now but you'll see it  
when you pull the Ideas section into the editor) and put it up there  
ASAP. If you're not sure yet who'll mentor, put tentative names there.  
We don't need a full commitment from mentors until the student  
application period starts (March 23).

Next, for all projects, the leads and/or volunteers should check the  
reference information for their project:

http://open-bio.org/wiki/Google_Summer_of_Code_2009#Open-Bio_projects_involved

I just culled these links from the various project websites - it'd be  
much appreciated if going forward everyone can lend a hand in this.  
Please review what's there and add or fix as you see fit. *These links  
must be correct and complete - otherwise potential students may not  
find you.*

Finally, all prospective mentors, primary or secondary, committed or  
not, and anyone else who would like to volunteer to help out, should  
subscribe themselves ASAP to the mailing list for communicating GSoC- 
related administrivia:

http://lists.open-bio.org/mailman/listinfo/gsoc

I will *not* cross-post all administrative announcements or requests  
for information, and so you *will* miss information if you don't  
subscribe yourself there. (Note: students will be subscribed there  
only *after* acceptance).

Those who are considering mentoring, as primary mentor or helping out, please  
also add yourselves to the Mentors section on the Ideas page (and  
check your link if you're already there):

http://open-bio.org/wiki/Google_Summer_of_Code_2009#Mentors

Cheers everyone, and fingers crossed!

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From mjldehoon at yahoo.com  Sun Mar 15 06:25:43 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 15 Mar 2009 03:25:43 -0700 (PDT)
Subject: [BioPython] Bio.SwissProt.SProt Dictionary, index_file
Message-ID: <653996.59295.qm@web62408.mail.re1.yahoo.com>


Hi everybody,

Does anybody use the Dictionary class or index_file function in Bio.SwissProt.SProt? As far as I can tell these functions are broken.
If there are no users, I suggest we deprecate the Dictionary class and the index_file function in Bio.SwissProt.SProt.

--Michiel


From biopython at maubp.freeserve.co.uk  Mon Mar 16 09:40:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 16 Mar 2009 13:40:13 +0000
Subject: [BioPython] List of publications citing or using Biopython
Message-ID: <320fb6e00903160640j73289abbl51d9f8935184a760@mail.gmail.com>

Hi all,

I've been working on a listing of journal publications citing or using
Biopython for the website:
http://biopython.org/wiki/Publications

If you've published anything that qualifies that isn't listed, this is
a wiki page so you should be able to add it.  If you are unsure if
something is appropriate, please ask here on the mailing list.  For
publications from 2008 onwards I have tried to add a short note
saying which part(s) of Biopython were used - this should be easy to
write for your own recent papers ;)

If you try editing the page you should see how to add extra entries  -
for anything in PubMed this is really easy.  See the discussion page
for more details:
http://biopython.org/wiki/Talk:Publications

Peter

From matzke at berkeley.edu  Mon Mar 16 15:31:57 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 16 Mar 2009 12:31:57 -0700
Subject: [BioPython] Entrez.einfo error?
Message-ID: <49BEA92D.7040905@berkeley.edu>

Hi all,

This exact code worked fine for me on Friday, I wonder if it could be a 
temporary problem at Entrez?  A similar problem seems to occur with 
other Entrez queries.

Running biopython 1.49 in IPython...

============
from Bio import Entrez

Entrez.email = "matzke at berkeley.edu"

handle = Entrez.einfo(db="structure")


---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)

/bioinformatics/pyeg/<ipython console> in <module>()

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/site-packages/Bio/Entrez/__init__.pyc 
in einfo(cgi, **keywds)
     195     variables = {}
     196     variables.update(keywds)
--> 197     return _open(cgi, variables)
     198
     199 def esummary(cgi=None, **keywds):

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/site-packages/Bio/Entrez/__init__.pyc 
in _open(cgi, params)
     320     options = urllib.urlencode(params, doseq=True)
     321     cgi += "?" + options
--> 322     handle = urllib.urlopen(cgi)
     323
     324     # Wrap the handle inside an UndoHandle.

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc 
in urlopen(url, data, proxies)
      80         opener = _urlopener
      81     if data is None:
---> 82         return opener.open(url)
      83     else:
      84         return opener.open(url, data)

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc 
in open(self, fullurl, data)
     188         try:
     189             if data is None:
--> 190                 return getattr(self, name)(url)
     191             else:
     192                 return getattr(self, name)(url, data)

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc 
in open_http(self, url, data)
     323         if realhost: h.putheader('Host', realhost)
     324         for args in self.addheaders: h.putheader(*args)
--> 325         h.endheaders()
     326         if data is not None:
     327             h.send(data)

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc 
in endheaders(self)
     858             raise CannotSendHeader()
     859
--> 860         self._send_output()
     861
     862     def request(self, method, url, body=None, headers={}):

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc 
in _send_output(self)
     730         msg = "\r\n".join(self._buffer)
     731         del self._buffer[:]
--> 732         self.send(msg)
     733
     734     def putrequest(self, method, url, skip_host=0, 
skip_accept_encoding=0):

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc 
in send(self, str)
     697         if self.sock is None:
     698             if self.auto_open:
--> 699                 self.connect()
     700             else:
     701                 raise NotConnected()

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc 
in connect(self)
     665         msg = "getaddrinfo returns an empty list"
     666         for res in socket.getaddrinfo(self.host, self.port, 0,
--> 667                                       socket.SOCK_STREAM):
     668             af, socktype, proto, canonname, sa = res
     669             try:

IOError: [Errno socket error] (7, 'No address associated with nodename')
 > 
/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.py(667)connect()
     666         for res in socket.getaddrinfo(self.host, self.port, 0,
--> 667                                       socket.SOCK_STREAM):
     668             af, socktype, proto, canonname, sa = res


ipdb> record = Entrez.read(handle)
*** NameError: name 'Entrez' is not defined

============


-- 
====================================================
Nicholas J. Matzke
Ph.D. student, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page: 
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people 
thought the earth was spherical, they were wrong. But if you think that 
thinking the earth is spherical is just as wrong as thinking the earth 
is flat, then your view is wronger than both of them put together."

Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================

From matzke at berkeley.edu  Mon Mar 16 15:42:22 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 16 Mar 2009 12:42:22 -0700
Subject: [BioPython] Entrez.einfo error?
In-Reply-To: <49BEA92D.7040905@berkeley.edu>
References: <49BEA92D.7040905@berkeley.edu>
Message-ID: <49BEAB9E.7070707@berkeley.edu>

Looks like PubMed is down at the moment also, so it's all an NCBI 
problem.  Cheers!
Nick


Nick Matzke wrote:
> Hi all,
> 
> This exact code worked fine for me on Friday, I wonder if it could be a 
> temporary problem at Entrez?  A similar problem seems to occur with 
> other Entrez queries.
> 
> Running biopython 1.49 in IPython...
> 
> ============
> from Bio import Entrez
> 
> Entrez.email = "matzke at berkeley.edu"
> 
> handle = Entrez.einfo(db="structure")
> 
> 
> ---------------------------------------------------------------------------
> IOError                                   Traceback (most recent call last)
> ...


From biopython at maubp.freeserve.co.uk  Mon Mar 16 15:52:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 16 Mar 2009 19:52:30 +0000
Subject: [BioPython] Entrez.einfo error?
In-Reply-To: <49BEA92D.7040905@berkeley.edu>
References: <49BEA92D.7040905@berkeley.edu>
Message-ID: <320fb6e00903161252s9f41eecx56853a0cc9a76882@mail.gmail.com>

On Mon, Mar 16, 2009 at 7:31 PM, Nick Matzke <matzke at berkeley.edu> wrote:
> Hi all,
>
> This exact code worked fine for me on Friday, I wonder if it could be a
> temporary problem at Entrez?  A similar problem seems to occur with other
> Entrez queries.
>
> Running biopython 1.49 in IPython...
>
> ============
> from Bio import Entrez
> Entrez.email = "matzke at berkeley.edu"
> handle = Entrez.einfo(db="structure")
> ---------------------------------------------------------------------------
> IOError                                   Traceback (most recent call last)
> ...

Yes, I think you were experiencing a temporary problem, either at the
NCBI or somewhere else on the network.  It's working on my machine
right now.  In general an IOError in Bio.Entrez is a good sign of a
network issue, and for any complex task you may want to explicitly
catch these exceptions.
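
As a minimal sketch of catching these exceptions: the helper name
call_with_retries, the retry count, the delay and the placeholder email
address are illustrative assumptions, not part of Bio.Entrez. Only
Entrez.einfo and Entrez.read are real Biopython calls here.

```python
import time


def call_with_retries(func, attempts=3, delay=5.0, sleep=time.sleep):
    """Call func(); on IOError wait and try again, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except IOError:  # on Python 3 this is an alias of OSError
            if attempt == attempts:
                raise
            sleep(delay)


def fetch_einfo(db="structure"):
    """Fetch and parse Entrez einfo for one database, retrying on network errors."""
    # Imported here so the helper above can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = "you@example.org"  # placeholder address (assumption)
    handle = call_with_retries(lambda: Entrez.einfo(db=db))
    try:
        return Entrez.read(handle)
    finally:
        handle.close()
```

With this in place a temporary outage shows up as a short pause and a
retry, and the IOError only propagates if every attempt fails.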

Peter

From mgenome at gmail.com  Tue Mar 17 08:02:42 2009
From: mgenome at gmail.com (mgenome)
Date: Tue, 17 Mar 2009 21:02:42 +0900
Subject: [BioPython] How can I draw genome comparison figure to publish?
Message-ID: <c3c5c8950903170502h54d1ea7bv6c6631b8c88debdb@mail.gmail.com>

I have the whole genome sequence of a phage and want to compare its ORFs to
those of other related phages. I want to draw a comparison figure of two or
more genomes.
Two genomes should be compared by their ORF similarities, calculated by
BLASTP, stretcher, etc.

If there is a table like this

ORF1, start, stop, strand, ORF2, start, stop, strand, similarity,
genome1_ORF1, 1, 200, +, genome2_ORF1, 1,  300,  -, 50
genome1_ORF2, 201, 400, +, genome2_ORF3,  320, 500, -, 90
....

the program or library should draw something like this:
===>   ===> ....
 |          |
 |          |
 |          |
<===  <===  ....
Different similarities should be represented by different colors of
linker lines.

I examined several programs, but I didn't find one good enough to
use for publication.
ACT (Artemis) can draw a comparison figure but it cannot show ORFs well.
inGeno is the program closest to what I want, but it cannot compare multiple
genomes and I want to draw ORFs as arrows. I know that GenomeDiagram in Python
does not support comparison of ORFs at the genome level.

Does anybody know of a program or library for drawing a genome comparison
figure showing ORF comparisons?  I know that it is unrealistic to want a
perfect program that fulfils all my requirements, but I would like to find a
program or library that fulfils at least part of them.

Thank you in advance.

Kyoung-Ho Kim, Korea.

From lpritc at scri.ac.uk  Tue Mar 17 08:42:36 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 17 Mar 2009 12:42:36 +0000
Subject: [BioPython] How can I draw genome comparison figure to publish?
In-Reply-To: <c3c5c8950903170502h54d1ea7bv6c6631b8c88debdb@mail.gmail.com>
Message-ID: <C5E54B3C.1F237%lpritc@scri.ac.uk>

Hi Kyoung-Ho,

On 17/03/2009 12:02, "mgenome" <mgenome at gmail.com> wrote:

> Two genomes should be compared by their ORF similarities, calculated by
> BLASTP, stretcher, etc.
> 
> If there is a table like this
> 
> ORF1, start, stop, strand, ORF2, start, stop, strand, similarity,
> genome1_ORF1, 1, 200, +, genome2_ORF1, 1,  300,  -, 50
> genome1_ORF2, 201, 400, +, genome2_ORF3,  320, 500, -, 90
> ....
> 
> the programs or library should draw as follows;
> ===>   ===> ....
>  |          |
>  |          |
>  |          |
> <===  <===  ....
> Different similarities should be represented by different colors of
> linker lines.
> 
> I examined several programs, but I didn't find one good enough to
> use for publication.
> ACT (Artemis) can draw a comparison figure but it cannot show ORFs well.
> inGeno is the program closest to what I want, but it cannot compare multiple
> genomes and I want to draw ORFs as arrows. I know that GenomeDiagram in Python
> does not support comparison of ORFs at the genome level.

GenomeDiagram does not draw the linker lines you require, I'm afraid.  The
package I would use to do so is ACT, and I have published diagrams created
using ACT (figure 3 in http://dx.doi.org/10.1073/pnas.0402424101).  There is
also M-GCAT (http://alggen.lsi.upc.es/recerca/align/mgcat/intro-mgcat.html),
which is very similar to ACT, and perhaps so similar that it will have the
same problems when generating publication-quality images to your liking.
GCV (http://zamov.online.fr/projects/gct/) I've never tried.
 
> Does anybody know of a program or library for drawing a genome comparison
> figure showing ORF comparisons?  I know that it is unrealistic to want a
> perfect program that fulfils all my requirements, but I would like to find a
> program or library that fulfils at least part of them.

GenomeDiagram does not currently have a facility to indicate synteny in the
way that you require using linker lines, so it may not be the tool you need
just yet.  However, it has been used to indicate the results of comparisons
between ORFs on the whole-genome level, using the colours of the compared
features to indicate the sequence identities of the matches (e.g. Figure 2
in http://dx.doi.org/10.1146/annurev.phyto.44.070505.143444 and
http://apsjournals.apsnet.org/doi/abs/10.1094).

Cheers,

L.


-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405



From biopython at maubp.freeserve.co.uk  Tue Mar 17 08:51:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 17 Mar 2009 12:51:18 +0000
Subject: [BioPython] How can I draw genome comparison figure to publish?
In-Reply-To: <c3c5c8950903170502h54d1ea7bv6c6631b8c88debdb@mail.gmail.com>
References: <c3c5c8950903170502h54d1ea7bv6c6631b8c88debdb@mail.gmail.com>
Message-ID: <320fb6e00903170551u284b1f20v4a77fedd7bdbfbed@mail.gmail.com>

On Tue, Mar 17, 2009 at 12:02 PM, mgenome <mgenome at gmail.com> wrote:
> ... I examined several programs, but I didn't find one good enough
> to use for publication.
> ACT (Artemis) can draw a comparison figure but it cannot show ORFs well.
> inGeno is the program closest to what I want, but it cannot compare multiple
> genomes and I want to draw ORFs as arrows. I know that GenomeDiagram in
> Python does not support comparison of ORFs at the genome level.

Based on your description, I was going to suggest ACT (Artemis), but
you have already considered this.

GenomeDiagram has been integrated into Biopython and will be part of
Biopython 1.50, and as part of this work it does now support drawing
features (e.g. ORFs) as simple arrows.  GenomeDiagram is very good at
comparative genomics plots - but not the kind you are interested in.
It wouldn't be very elegant, but you might be able to use
GenomeDiagram to draw two linear genome diagrams, and then combine
them and add the comparison lines yourself with extra code using
ReportLab directly.  This would probably be quite a lot of work...
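
As a rough sketch of the data-preparation half of that idea: the code
below only parses a table in the format given in the original question
and works out where each linker line would attach and what colour it
would get. The function names, the blue-to-red colour ramp and the use
of ORF midpoints are all assumptions made for illustration; the actual
drawing with GenomeDiagram or ReportLab is not shown.

```python
import csv
import io


def parse_orf_table(text):
    """Yield one dict per data row of the comma-separated ORF pair table."""
    reader = csv.reader(io.StringIO(text.strip()), skipinitialspace=True)
    next(reader, None)  # skip the header row
    for row in reader:
        fields = [field.strip() for field in row if field.strip()]
        if len(fields) != 9:
            continue  # ignore blank or placeholder lines such as "...."
        yield {
            "orf1": fields[0],
            "span1": (int(fields[1]), int(fields[2])),
            "strand1": fields[3],
            "orf2": fields[4],
            "span2": (int(fields[5]), int(fields[6])),
            "strand2": fields[7],
            "similarity": float(fields[8]),
        }


def similarity_colour(similarity):
    """Map a 0-100 similarity to an (r, g, b) tuple, blue (low) to red (high)."""
    fraction = max(0.0, min(1.0, similarity / 100.0))
    return (fraction, 0.0, 1.0 - fraction)


def link_endpoints(pair):
    """Return the midpoints of the two ORFs, where a linker line would attach."""
    (start1, stop1), (start2, stop2) = pair["span1"], pair["span2"]
    return ((start1 + stop1) / 2.0, (start2 + stop2) / 2.0)


TABLE = """\
ORF1, start, stop, strand, ORF2, start, stop, strand, similarity,
genome1_ORF1, 1, 200, +, genome2_ORF1, 1,  300,  -, 50
genome1_ORF2, 201, 400, +, genome2_ORF3,  320, 500, -, 90
....
"""

for pair in parse_orf_table(TABLE):
    print(pair["orf1"], pair["orf2"], link_endpoints(pair),
          similarity_colour(pair["similarity"]))
```

The midpoints are in genome coordinates, so they would still need to be
scaled to page coordinates before any lines are drawn.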

Peter

From biopython at maubp.freeserve.co.uk  Tue Mar 17 12:52:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 17 Mar 2009 16:52:23 +0000
Subject: [BioPython] Biopython contributors and participants listings
Message-ID: <320fb6e00903170952t329332aer310906da64f49cb6@mail.gmail.com>

Hi all,

We're starting to prepare for the release of Biopython 1.50, so it
seems a good occasion to update the Biopython contributors and
participants listing.  I've just changed the formatting for the wiki
page, and to me at least this looks much nicer now - you can look at
the history and decide for yourselves:
http://biopython.org/wiki/Participants

I see some of you aren't on this participants wiki page and probably
should be (e.g. Tiago), so could I encourage relevant people to add
themselves.  Likewise if you have contributed to the project and think
you have been left out of the contributors file, please let us know:
http://biopython.org/SRC/biopython/CONTRIB
or:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/CONTRIB?cvsroot=biopython

Peter

From biopython at maubp.freeserve.co.uk  Tue Mar 17 13:38:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 17 Mar 2009 17:38:58 +0000
Subject: [BioPython] [Biopython-dev] PDB Parser error
In-Reply-To: <3715adb70903171034s2124de04k7e4ee719c188902a@mail.gmail.com>
References: <3715adb70903170830x61bb6e3bl4412a8cf1504d80c@mail.gmail.com>
	<320fb6e00903170901v6533910bl57ddd534dc05cf51@mail.gmail.com>
	<3715adb70903171034s2124de04k7e4ee719c188902a@mail.gmail.com>
Message-ID: <320fb6e00903171038m72127569m279801556e5b9551@mail.gmail.com>

On Tue, Mar 17, 2009 at 5:34 PM, Rodrigo faccioli
<rodrigo_faccioli at uol.com.br> wrote:
> Peter,
>
> Your suspicion was correct. When I retrieved a value from the database, it
> was stored in a tuple. The solution was to convert the values to string
> objects, for which I used str().
>
> Now, I can proceed with my tests.
>
> Thanks for your help.

OK, good luck.

Peter

From mitlox at op.pl  Wed Mar 18 05:05:58 2009
From: mitlox at op.pl (mitlox)
Date: Wed, 18 Mar 2009 19:05:58 +1000
Subject: [BioPython] protein-ligand interactions
Message-ID: <49C0B976.1020005@op.pl>

Hello,
I have a solved structure (1E8W) with a ligand and I would like to know 
which residues are within 3A of the ligand. The 3A cutoff should be 
applied to just the C-alpha atom of each residue, and it would be great 
to know which residue each C-alpha belongs to.

I am a newbie in Biopython/Python; does anyone know of an example showing 
how this can be done?

Thank you in advance.

Best regards


From p.j.a.cock at googlemail.com  Wed Mar 18 05:31:14 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 18 Mar 2009 09:31:14 +0000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <49C0B976.1020005@op.pl>
References: <49C0B976.1020005@op.pl>
Message-ID: <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>

On Wed, Mar 18, 2009 at 9:05 AM, mitlox <mitlox at op.pl> wrote:
>
> Hello,
> I have a solved structure (1E8W) with a ligand and I would like to
> know which residues are within 3A of the ligand. The 3A cutoff
> should be applied to just the C-alpha atom of each residue, and it
> would be great to know which residue each C-alpha belongs to.
>
> I am a newbie in Biopython/Python; does anyone know of an
> example showing how this can be done?

Hi,

I've got a couple of PDB examples on my personal website, and although
they need a little update to use NumPy instead of Numeric, I think the
page on doing protein contact maps would be very informative:
http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/

In your case, for each residue in the protein you'll want to use just
the C-alpha atom (in the residue's atom dictionary under the key
"CA"), but I think you should loop over all the atoms in the ligand
in order to find the least distance.
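
A minimal sketch of that approach, under some assumptions: the function
names are made up for illustration, the ligand is identified by a
residue name you would have to look up in the PDB file yourself, and
only the first model is used. The distance search works on plain
coordinate tuples; load_coords uses the Bio.PDB parser to collect them.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y, z) points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def residues_near_ligand(ca_coords, ligand_coords, cutoff=3.0):
    """Return {residue label: least distance} for C-alpha atoms within cutoff."""
    if not ligand_coords:
        raise ValueError("no ligand atoms supplied")
    near = {}
    for label, ca in ca_coords.items():
        least = min(distance(ca, atom) for atom in ligand_coords)
        if least <= cutoff:
            near[label] = least
    return near


def load_coords(pdb_file, ligand_resname):
    """Collect C-alpha and ligand atom coordinates from the first model."""
    from Bio.PDB import PDBParser  # imported here; needs Biopython

    structure = PDBParser().get_structure("query", pdb_file)
    ca_coords, ligand_coords = {}, []
    for chain in structure[0]:
        for residue in chain:
            hetflag = residue.id[0]  # " " for standard residues
            if residue.get_resname().strip() == ligand_resname:
                ligand_coords.extend(tuple(atom.coord) for atom in residue)
            elif hetflag == " " and "CA" in residue:
                # The label records which residue each C-alpha belongs to.
                label = (chain.id, residue.get_resname(), residue.id)
                ca_coords[label] = tuple(residue["CA"].coord)
    return ca_coords, ligand_coords
```

Typical use would be to call load_coords on the downloaded 1E8W file with
the ligand's residue name, then pass both results to
residues_near_ligand; the keys of the returned dictionary tell you the
chain, residue name and residue id of each hit.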

Peter

From p.j.a.cock at googlemail.com  Wed Mar 18 08:36:06 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 18 Mar 2009 12:36:06 +0000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
References: <49C0B976.1020005@op.pl>
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
Message-ID: <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>

> Hi,
>
> I've got a couple of PDB examples on my personal website, and although
> they need a little update to use NumPy instead of Numeric, I think the
> page on doing protein contact maps would be very informative:
> http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/

I've updated those pages to use NumPy instead of Numeric - all very
straightforward (apart from some issue with rpy for the graphics which
isn't relevant to Biopython):

http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/
http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/

Peter

From dalke at dalkescientific.com  Wed Mar 18 11:34:59 2009
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 18 Mar 2009 16:34:59 +0100
Subject: [BioPython] Fwd: Available: 2 Bioinformatics positions in
	AstraZeneca
References: <45988AB300A3B1468F7CF2F9EF207579C6A3D7@SEMLRDEMBX01.rd.astrazeneca.net>
Message-ID: <F68EF5B8-CFDD-448E-911B-42CDCD7BEB62@dalkescientific.com>

For those interested, there are a couple of temporary bioinformatics  
positions at AstraZeneca/Mölndal (near Gothenburg). Reading the  
announcements, which are in Swedish, I see it's more biomedical  
informatics than sequence analysis (text mining, workflows, a  
decision system for medical researchers).

> Ads:
> https://www.poolia.se/sok-jobb/webcv/JobAd.aspx?jobadid=19008
> http://annonsoversikt.monster.se/getjob.aspx?JobID=79909293&cy=se&where=L%c3%a4n%3aV%c3%a4stra+G%c3%b6taland&lid=1398&re=95&pg=1&dv=1&AVSDM=2009-03-13+11%3a17%3a00&seq=11&fseo=1&isjs=1&re=1000
> https://sjobs.brassring.com/1053/ASP/TG/cim_jobdetail.asp?SID=%5edUuKAW_slp_rhc_DOlGOwdxDn_slp_rhc_PthlP/WlgiP85aWAkz/xRYSIbMXcsvZrHO0fJu5/PZdH3vw1QoLQAr5X3A_C_R__L_F_lA_slp_rhc_0Q7alykZpdfns2LzK3W8x8tde_slp_rhc_tU=&jobId=275215&type=search&partnerid=20054&siteid=5036

Also, if you are doing Python in the Gothenburg area, join us for  
GothPy, the Gothenburg Python user's group:
http://groups.google.com/group/gothpy


				Andrew
				dalke at dalkescientific.com


From n.j.loman at bham.ac.uk  Wed Mar 18 13:59:09 2009
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Wed, 18 Mar 2009 17:59:09 +0000
Subject: [BioPython] [Fwd: Bioinformatician wanted]
Message-ID: <49C1366D.7070105@bham.ac.uk>

Hi all,

I hope biopython'ers will excuse me posting this job advert for a 
Research Fellow at University of Birmingham - the project referenced 
makes heavy use of Biopython. The position holder would interact with 
Biopython on a daily basis, and potentially be able to help the 
Biopython open source effort should they wish.

Cheers,

Nick.


Please pass  this advert on to anyone who might be interested and suitable.

http://www.jobs.ac.uk/jobs/BO446/

Research Fellow

School of Immunity and Infection

*Fixed term for 33 months*

We are looking for a talented bioinformatician to assist in the
development, maintenance and exploitation of an internationally renowned
web-based microbial genomics facility, xBASE. The post holder will build
on our existing achievements with xBASE (http://xbase.ac.uk;
Chaudhuri RR, Loman NJ, Snyder LA, Bailey CM, Stekel DJ, Pallen MJ.
Nucleic Acids Res. 2008 36:D543-6).

The work will be carried out under the supervision of Professor Mark
Pallen (Medical School) in collaboration with Dr Dov Stekel
(Biosciences). The post holder will work within an attractive modern
research environment in the University's newly established
inter-disciplinary Centre for Systems Biology.

All candidates must have proficiency in programming within the
Unix/Linux environment, including web-linked database design,
development and management and use of languages such as Perl, PHP, C++,
Python, Ruby or JAVA. Familiarity with BioPerl, BioSQL and MySQL is
highly desirable.

Applicants must possess the critical thinking skills needed to devise
and carry out research projects and should have experience of analysing
macromolecular sequence data. A PhD in a relevant subject area is
desirable and will be required for appointment to a research fellowship.

A flair for design, particularly as applied to web-based resources, good
team-working skills and an ability to work under their own initiative
will provide an advantage, as will experience of research in molecular
bacteriology, comparative genomics, molecular evolution and/or pathogenesis.

Informal enquiries may be addressed to Professor Mark Pallen on 0121 414
7163 or m.pallen at bham.ac.uk

Starting salary £27,183 a year, in the range of £27,183 to £35,469 a
year (potential progression on performance once in post to £37,651). The
post will be offered on a fixed-term contract for a period up to two
years and nine months, starting on or shortly after May 1st 2009.

Interviews will be held in the week beginning Monday 30 March 2009.

Closing date: 23 March 2009   Reference: 39855
To download the details and submit an electronic application online
visit: www.hr.bham.ac.uk/jobs
Alternatively, information can be obtained from 0121 415 9000.

A University of Fairness and Diversity.

Mark

Professor Mark Pallen
Professor of Microbial Genomics
Centre for Systems Biology
Biosciences
University of Birmingham, BIRMINGHAM, B15 2TT
m.pallen at bham.ac.uk
tel ++44(0)121 414 7163

Author: The Rough Guide to Evolution
http://www.amazon.co.uk/Rough-Guide-Evolution-Science-Phenomena/dp/1858289467/

Blog
http://roughguidetoevolution.blogspot.com
feed://roughguidetoevolution.blogspot.com/feeds/posts/default

"There is grandeur in this view of life, with its several powers, having
been originally breathed into a few forms or into one; and that, whilst
this planet has gone cycling on according to the fixed law of gravity,
from so simple a beginning endless forms most beautiful and most
wonderful have been, and are being evolved."
Charles Darwin



From hlapp at gmx.net  Wed Mar 18 14:45:50 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Mar 2009 14:45:50 -0400
Subject: [BioPython] OBF application for Summer of Code has been rejected
Message-ID: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net>

I hope to find out later why, but our Google Summer of Code  
application as an umbrella org has been rejected.

However, NESCent has been accepted. If you can give your project idea  
a phylogenetics/phyloinformatics focus, go and put it up on the  
NESCent ideas page at

http://hackathon.nescent.org/Phyloinformatics_Summer_of_Code_2009

Do so pretty much **now** - we will start broadcasting and reaching  
out to students tonight and tomorrow. If someone comes to the site and  
they don't see a Bio* project that they would have been interested in,  
they may not check back for updates.

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From Yvan.Strahm at bccs.uib.no  Wed Mar 18 14:47:58 2009
From: Yvan.Strahm at bccs.uib.no (Yvan.Strahm at bccs.uib.no)
Date: Wed, 18 Mar 2009 19:47:58 +0100
Subject: [BioPython] How can I get a more explicite error
Message-ID: <20090318194758.pgs14nxoowww4gck@webmail.uib.no>

Hello List,

I am trying to get to grips with Biopython and followed chapter 6 of the  
tutorial (http://www.biopython.org/DIST/docs/tutorial/Tutorial.html).

I ran this script:

from Bio.Blast import NCBIStandalone
import re
import sys

my_blast_db = "/export/scratch/yvans/BEE/Apis_mellifera_ligustica_complete_mitochondrial_genome.fasta"
my_blast_file = sys.argv[1]
my_blast_exe = "/Home/lundalm/yvans/src/blast-2.2.19/bin/blastall"

result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, "blastn",
                                                       my_blast_db, my_blast_file,
                                                       gap_open=5,
                                                       gap_extend=2,
                                                       filter='F',
                                                       expectation=1000)

blast_results = result_handle.read()

my_results=sys.argv[1]+".xml"

save_file = open(my_results, "w")
save_file.write(blast_results)
save_file.close()

I got this error

[yvans at lundalm BEE]$ python bioblast.py s_1_2_eland_extended.8000000.fta
Traceback (most recent call last):
   File "bioblast.py", line 16, in <module>
     blast_results = result_handle.read()
SystemError: Objects/stringobject.c:4271: bad argument to internal function

if the number of sequences BLASTed against the db is greater than 500000.
The sequences are short reads from a Solexa sequencing project.

Is there a size limitation?

And should I save (keep) only the sequences I am interested in into  
my_results instead of saving everything?

And is there a way of running some tests before doing the result_handle.read()?

Now I am trying keep_hits=1 as a BLAST parameter in order to reduce  
the size of my_results; we will see.

Thanks for your time and help

Cheers,
yvan


From cjfields at illinois.edu  Wed Mar 18 15:08:48 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 18 Mar 2009 14:08:48 -0500
Subject: [BioPython] [BioSQL-l] OBF application for Summer of Code has
	been rejected
In-Reply-To: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net>
References: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net>
Message-ID: <B306969B-29F9-48C3-9860-F9E30757F0E7@illinois.edu>

Hilmar,

The idea was floated on the google SOC list that language-specific  
organizations that have been accepted may potentially take  
bioinformatics-related  applications.  Specifically, Jonathan Leto  
(from The Perl Foundation) indicated that bioinformatics-related  
projects using BioPerl might be able to apply through them.  Not sure  
about others (Python Software Foundation, etc) but might be worth  
checking into.

Any idea who's been accepted beyond NESCent?

chris

On Mar 18, 2009, at 1:45 PM, Hilmar Lapp wrote:

> I hope to find out later why, but our Google Summer of Code  
> application as an umbrella org has been rejected.
>
> However, NESCent has been accepted. If you can give your project  
> idea a phylogenetics/phyloinformatics focus, go and put it up on the  
> NESCent ideas page at
>
> http://hackathon.nescent.org/Phyloinformatics_Summer_of_Code_2009
>
> Do so pretty much **now** - we will start broadcasting and reaching  
> out to students tonight and tomorrow. If someone comes to the site  
> and they don't see a Bio* project that they would have been  
> interested in, they may not check back for updates.
>
> 	-hilmar
>
> -- 
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>
>
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l


From chapmanb at 50mail.com  Wed Mar 18 17:20:07 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 18 Mar 2009 17:20:07 -0400
Subject: [BioPython] How can I get a more explicite error
In-Reply-To: <20090318194758.pgs14nxoowww4gck@webmail.uib.no>
References: <20090318194758.pgs14nxoowww4gck@webmail.uib.no>
Message-ID: <20090318212007.GM57054@sobchak.mgh.harvard.edu>

Hi Yvan;

> I try to get a grip on Biopython and followed the chapter 6 form the  
> tutorial (http://www.biopython.org/DIST/docs/tutorial/Tutorial.html)
> 
> I run this script:
[...]
> blast_results = result_handle.read()
[...]
> [yvans at lundalm BEE]$ python bioblast.py s_1_2_eland_extended.8000000.fta
> Traceback (most recent call last):
>    File "bioblast.py", line 16, in <module>
>      blast_results = result_handle.read()
> SystemError: Objects/stringobject.c:4271: bad argument to internal function
> 
> if the number of sequence blasted agianst the db is greater than 500000.
> The sequence are small reads from a solexa sequencing project.

The result_handle.read() line is pulling the entire large BLAST result
file into memory as a string. You will run out of memory with huge files,
leading to the errors you are seeing.

To limit the problem, run BLAST initially at the command line,
and then process the resulting XML file with the BLAST parser
as described here:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc56

This iterates over 1 record at a time, avoiding the memory issue.
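
As a rough illustration of the record-at-a-time idea (not the Biopython
parser itself, which the tutorial section above covers), the sketch below
streams a BLAST XML file with the standard library only. The element names
(Iteration, Iteration_query-def, Iteration_hits, Hit) are assumed from the
NCBI BLAST XML output, and count_hits_per_query and SAMPLE are made-up
names for this example.

```python
import io
import xml.etree.ElementTree as ET


def count_hits_per_query(handle):
    """Yield (query_def, number_of_hits) one BLAST iteration at a time."""
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            query = elem.findtext("Iteration_query-def")
            hits = elem.findall("Iteration_hits/Hit")
            yield query, len(hits)
            # Drop this record's children so hit data does not pile up
            elem.clear()


# Tiny hand-written stand-in for a real BLAST XML report (assumed layout)
SAMPLE = b"""<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>read_1</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_id>ref_A</Hit_id></Hit>
        <Hit><Hit_id>ref_B</Hit_id></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>read_2</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""

for query, n_hits in count_hits_per_query(io.BytesIO(SAMPLE)):
    print("%s: %d hits" % (query, n_hits))
```

Each Iteration element is cleared once it has been reported, so the hit
data for earlier reads is not kept around while later ones are parsed.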

However, you should be using a short read aligner to map these reads
to the genome. BLAST is not the right tool for this particular
application; massive BLAST report files are going to be one of many
problems you will run into analyzing the data. Here are a couple of
popular aligners designed for the exact problem you are tackling:

Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
Maq: http://maq.sourceforge.net/

Hope this helps,
Brad

From hlapp at gmx.net  Wed Mar 18 18:50:26 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Mar 2009 18:50:26 -0400
Subject: [BioPython] [BioSQL-l] OBF application for Summer of Code has
	been rejected
In-Reply-To: <B306969B-29F9-48C3-9860-F9E30757F0E7@illinois.edu>
References: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net>
	<B306969B-29F9-48C3-9860-F9E30757F0E7@illinois.edu>
Message-ID: <D013D9EA-8509-43B7-988C-3CBA94A580F2@gmx.net>

Yes, thanks for mentioning that, was going to do so too. The Perl  
Foundation and the Python foundation have been accepted.

I guess there isn't a Java Foundation, and if there is a Ruby one it  
hasn't been accepted or hasn't applied. However, Ruby on Rails has  
been accepted. Don't know how open they would be to a BioRuby project.

	-hilmar

On Mar 18, 2009, at 3:08 PM, Chris Fields wrote:

> Hilmar,
>
> The idea was floated on the google SOC list that language-specific  
> organizations that have been accepted may potentially take  
> bioinformatics-related  applications.  Specifically, Jonathan Leto  
> (from The Perl Foundation) indicated that bioinformatics-related  
> projects using BioPerl might be able to apply through them.  Not  
> sure about others (Python Software Foundation, etc) but might be  
> worth checking into.
>
> Any idea on who's been accepted beyond NEScent?
>
> chris
>
> On Mar 18, 2009, at 1:45 PM, Hilmar Lapp wrote:
>
>> I hope to find out later why, but our Google Summer of Code  
>> application as an umbrella org has been rejected.
>>
>> However, NESCent has been accepted. If you can give your project  
>> idea a phylogenetics/phyloinformatics focus, go and put it up on  
>> the NESCent ideas page at
>>
>> http://hackathon.nescent.org/Phyloinformatics_Summer_of_Code_2009
>>
>> Do so pretty much **now** - we will start broadcasting and reaching  
>> out to students tonight and tomorrow. If someone comes to the site  
>> and they don't see a Bio* project that they would have been  
>> interested in, they may not check back for updates.
>>
>> 	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From yvan.strahm at bccs.uib.no  Thu Mar 19 04:10:17 2009
From: yvan.strahm at bccs.uib.no (Yvan Strahm)
Date: Thu, 19 Mar 2009 09:10:17 +0100
Subject: [BioPython] How can I get a more explicite error
In-Reply-To: <20090318212007.GM57054@sobchak.mgh.harvard.edu>
References: <20090318194758.pgs14nxoowww4gck@webmail.uib.no>
	<20090318212007.GM57054@sobchak.mgh.harvard.edu>
Message-ID: <49C1FDE9.20305@bccs.uib.no>

Hello Brad,

Thanks for the help, much appreciated.
I will look at Bowtie and Maq. In fact I am interested in the reads which are not in the reference and 
   how they differ from the reference: how many reads have 1, 2, 3, ... indels/mismatches.
Cheers,
yvan

Brad Chapman wrote:
> Hi Yvan;
> 
>> I try to get a grip on Biopython and followed the chapter 6 form the  
>> tutorial (http://www.biopython.org/DIST/docs/tutorial/Tutorial.html)
>>
>> I run this script:
> [...]
>> blast_results = result_handle.read()
> [...]
>> [yvans at lundalm BEE]$ python bioblast.py s_1_2_eland_extended.8000000.fta
>> Traceback (most recent call last):
>>    File "bioblast.py", line 16, in <module>
>>      blast_results = result_handle.read()
>> SystemError: Objects/stringobject.c:4271: bad argument to internal function
>>
>> if the number of sequence blasted agianst the db is greater than 500000.
>> The sequence are small reads from a solexa sequencing project.
> 
> The result_handle.read() line is pulling the entire large BLAST result
> file into memory as a string. You will run out of memory with huge files,
> leading to the errors you are seeing.
> 
> To limit the problem, run BLAST initially at the command line,
> and then process the resulting XML file with the BLAST parser
> as described here:
> 
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc56
> 
> This iterates over 1 record at a time, avoiding the memory issue.
> 
> However, you should be using a short read aligner to map these reads
> to the genome. BLAST is not the right tool for this particular
> application; massive BLAST report files are going to be one of many
> problems you will run into analyzing the data. Here are a couple of
> popular aligners designed for the exact problem you are tackling:
> 
> Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
> Maq: http://maq.sourceforge.net/
> 
> Hope this helps,
> Brad
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From biopython at maubp.freeserve.co.uk  Thu Mar 19 06:47:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 19 Mar 2009 10:47:05 +0000
Subject: [BioPython] [Fwd: Bioinformatician wanted]
In-Reply-To: <49C1366D.7070105@bham.ac.uk>
References: <49C1366D.7070105@bham.ac.uk>
Message-ID: <320fb6e00903190347v3a70b6b0w46033c5769b38aa5@mail.gmail.com>

On Wed, Mar 18, 2009 at 5:59 PM, Nick Loman <n.j.loman at bham.ac.uk> wrote:
> Hi all,
>
> I hope biopython'ers will excuse me posting this job advert for a Research
> Fellow at University of Birmingham - the project referenced makes heavy use
> of Biopython. The position holder would interact with Biopython on a daily
> basis, and potentially be able to help the Biopython open source effort
> should they wish.
>
> Cheers,
>
> Nick.

I have no objections to posting targeted and directly relevant
academic job adverts here - in fact I rather like it.  I would point
out the job advert text itself doesn't actually mention Biopython -
perhaps you can get HR to update the copy linked to from the University
job page to mention experience of Biopython, BioPerl or BioSQL
being desirable?

Peter

P.S. Could you add links to Biopython, BioPerl and BioSQL to the xBase
website, maybe on the about page? http://xbase.bham.ac.uk/about.pl

P.P.S. Did you have a chance to try out the patch on Bug 2738 for
speeding up loading GenBank files into BioSQL?
http://bugzilla.open-bio.org/show_bug.cgi?id=2738

Cheers!

From biopython at maubp.freeserve.co.uk  Thu Mar 19 06:52:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 19 Mar 2009 10:52:30 +0000
Subject: [BioPython] How can I get a more explicite error
In-Reply-To: <20090318212007.GM57054@sobchak.mgh.harvard.edu>
References: <20090318194758.pgs14nxoowww4gck@webmail.uib.no>
	<20090318212007.GM57054@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00903190352rcbca60bi4d703dbf65bcd3b0@mail.gmail.com>

On Wed, Mar 18, 2009 at 9:20 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> The result_handle.read() line is pulling the entire large BLAST result
> file into memory as a string. You will run out of memory with huge files,
> leading to the errors you are seeing.

I think Brad is probably right about the memory issue - it is certainly
something to be careful of.  Instead of this:

blast_results = result_handle.read()
my_results=sys.argv[1]+".xml"
save_file = open(my_results, "w")
save_file.write(blast_results)
save_file.close()

You could try keeping only one line in memory:

my_results=sys.argv[1]+".xml"
save_file = open(my_results, "w")
for line in result_handle :
    save_file.write(line)
save_file.close()
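
A variation on the same idea, as a minimal sketch: the standard library's
shutil.copyfileobj copies between two handles in fixed-size chunks, so only
one chunk is held in memory at a time. The save_stream helper, the chunk
size and the StringIO stand-in for the BLAST result handle are all made up
for this example.

```python
import io
import os
import shutil
import tempfile


def save_stream(result_handle, out_path, chunk_size=1024 * 1024):
    """Copy a text handle to out_path, reading chunk_size characters at a time."""
    with open(out_path, "w") as save_file:
        shutil.copyfileobj(result_handle, save_file, chunk_size)


# Stand-in for the handle returned by the BLAST call (assumption)
fake_results = io.StringIO("<BlastOutput>\n</BlastOutput>\n")
out_path = os.path.join(tempfile.mkdtemp(), "reads.xml")
save_stream(fake_results, out_path)
```

The with block also closes the output file if the copy fails part way.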

Or, we should get round to fixing Bug 2654 which would let you tell
the BLAST tool to save the file itself, which would be much more
elegant.  Do you want to add yourself as a CC to this bug, so you'll
automatically be informed of any updates:
http://bugzilla.open-bio.org/show_bug.cgi?id=2654

Peter

From mitlox at op.pl  Thu Mar 19 08:55:06 2009
From: mitlox at op.pl (mitlox)
Date: Thu, 19 Mar 2009 22:55:06 +1000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>
References: <49C0B976.1020005@op.pl>	
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
	<320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>
Message-ID: <49C240AA.908@op.pl>

I wrote this code:
------------------------------------------------------------------------------------------------
import Bio.PDB
import numpy

pdb_code = "1E8W"
pdb_filename = "1E8W.pdb" #not the full cage!

structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)

backBoneAtomNames = "N","CA","C","0", "CB"

tempBackbone = [0,0,0,0,0]
Backbone = []
backboneNo = 0

for atom in structure.get_atoms():
    if (atom.get_name() == backBoneAtomNames[backboneNo]) and (backboneNo < len(backBoneAtomNames)):
        tempBackbone[backboneNo] = atom   
        backboneNo+=1
    elif atom.get_name() != backBoneAtomNames[backboneNo]:
        backboneNo = 0
    elif len(backBoneAtomNames) == backboneNo:
        Backbone.extend(tempBackbone)
        for a in tempBackbone:
            print a
------------------------------------------------------------------------------------------------
to identify the backbone, but unfortunately it does not work.

Maybe there is already a way to identify the backbone in Biopython?

Thank you in advance

Peter Cock wrote:
>> Hi,
>>
>> I've got a couple of PDB examples on my personal website, and although
>> they need a little update to use NumPy instead of Numeric, I think the
>> page on doing protein contact maps would be very informative:
>> http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/
>>     
>
> I've updated those pages to use NumPy instead of Numeric - all very
> straight forward (apart from some issue with rpy for the graphics which
> isn't relevant to Biopython):
>
> http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/
> http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/
>
> Peter
>
>   


From p.j.a.cock at googlemail.com  Thu Mar 19 09:31:30 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 19 Mar 2009 13:31:30 +0000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <49C240AA.908@op.pl>
References: <49C0B976.1020005@op.pl>
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
	<320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>
	<49C240AA.908@op.pl>
Message-ID: <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com>

On Thu, Mar 19, 2009 at 12:55 PM, mitlox <mitlox at op.pl> wrote:
> I wrote this code:
> ------------------------------------------------------------------------------------------------
> import Bio.PDB
> import numpy
>
> pdb_code = "1E8W"
> pdb_filename = "1E8W.pdb" #not the full cage!

That comment was about the fact that the PDB file 1XI4 only contains
part of the full clathrin cage.

> structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
> backBoneAtomNames = "N","CA","C","0", "CB"
> ...
> ------------------------------------------------------------------------------------------------
> to identified the backbone, but unfortunately it does not work.
>
> Maybe exist already to identified backbone in Biopython?

I don't understand what you were trying to do. Have you read the
Bio.PDB documentation about the hierarchy of structures, models,
chains, residues and atoms?
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

This is how I would solve the original question, finding the distance
from each residue's C-alpha carbon to the closest atom in the ligand:

import Bio.PDB
import numpy

pdb_code = "1E8W"
pdb_filename = "1E8W.pdb"

structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
chainA = model["A"]

def residue_dist_to_ligand(protein_residue, ligand_residue) :
    """Returns distance from the protein C-alpha to the closest ligand atom."""
    distances = []
    for atom in ligand_residue :
        diff_vector  = protein_residue["CA"].coord - atom.coord
        distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector)))
    return min(distances)

#From looking at the PDB file, ligand is last residue in chain A, named QUE
ligand_res = chainA.child_list[-1]
assert ligand_res.resname == "QUE"
for protein_res in chainA.child_list[:-1] :
    dist = residue_dist_to_ligand(protein_res, ligand_res)
    if dist < 5.0 :
        print protein_res.resname, protein_res.id[1], dist

This gives the following output:

ILE 881 3.64203
VAL 882 3.58559
ALA 885 4.62673
THR 886 4.95211
ILE 963 4.64252
ASP 964 3.08788

If you wanted to, it should be simple to change this to find the closest
distance between any part of each residue and any part of the ligand,
which I expect should give some distances less than 3A.
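
A minimal sketch of that any-atom-to-any-atom variant, written against
plain coordinate tuples so it runs without a PDB file. The function name
and the toy coordinates are made up; with Bio.PDB the two lists would
presumably be built from atom.coord for every atom in the residue and in
the ligand.

```python
import math


def closest_atom_distance(residue_coords, ligand_coords):
    """Return the smallest distance between any residue atom and any ligand atom."""
    best = None
    for a in residue_coords:
        for b in ligand_coords:
            dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            if best is None or dist < best:
                best = dist
    return best


# Toy coordinates in Angstroms (hypothetical, not taken from 1E8W)
residue_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ligand_atoms = [(4.0, 0.0, 0.0), (1.0, 3.0, 0.0)]
print(closest_atom_distance(residue_atoms, ligand_atoms))
```

This compares every pair of atoms, which is fine for one small ligand but
grows with the product of the two atom counts.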

Peter

From mitlox at op.pl  Fri Mar 20 08:18:48 2009
From: mitlox at op.pl (mitlox)
Date: Fri, 20 Mar 2009 22:18:48 +1000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com>
References: <49C0B976.1020005@op.pl>	
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>	
	<320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>	
	<49C240AA.908@op.pl>
	<320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com>
Message-ID: <49C389A8.5090703@op.pl>

Thank you very much for your code, it works and the output is exactly 
what I was looking for.

I am trying to get a structureCA object so that I can write the results 
to a PDB file (outCA.pdb) like this:
ATOM   5275  CA  ILE A 881      17.242  57.141  22.062  1.00 38.49           C
ATOM   5283  CA  VAL A 882      16.292  57.880  25.678  1.00 38.90           C
....

The second reason for wanting a structureCA object is that I do not want 
to have to read the file back in with:
structureCA = Bio.PDB.PDBParser().get_structure("outCA", "outCA.pdb")

Unfortunately I get this error with my extension of the script:
ILE 881 3.64203
VAL 882 3.58559
ALA 885 4.62673
THR 886 4.95211
ILE 963 4.64252
ASP 964 3.08788
Traceback (most recent call last):
  File "interaction.py", line 31, in ?
    io.save('out.pdb')
  File "/usr/lib/python2.4/site-packages/biopython-1.49-py2.4-linux-i686.egg/Bio/PDB/PDBIO.py", line 121, in save
    for model in self.structure.get_list():
AttributeError: 'list' object has no attribute 'get_list'

Here is the code:
import Bio.PDB
import numpy

pdb_code = "1E8W"
pdb_filename = "1E8W.pdb"

structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
chainA = model["A"]
structureCA = []

def residue_dist_to_ligand(protein_residue, ligand_residue) :
    """Returns distance from the protein C-alpha to the closest ligand atom."""
    distances = []
    for atom in ligand_residue :
        diff_vector  = protein_residue["CA"].coord - atom.coord
        distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector)))
    return min(distances)

#From looking at the PDB file, ligand is last residue in chain A, named QUE
ligand_res = chainA.child_list[-1]
assert ligand_res.resname == "QUE"
for protein_res in chainA.child_list[:-1] :
    dist = residue_dist_to_ligand(protein_res, ligand_res)
    if dist < 5.0 :
        print protein_res.resname, protein_res.id[1], dist
    structureCA.append(protein_res)

io=Bio.PDB.PDBIO()
io.set_structure(structureCA)
io.save('outCA.pdb')

How can I get a structureCA object of the results?

Thank you in advance.

Best regards


From p.j.a.cock at googlemail.com  Fri Mar 20 09:36:47 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Mar 2009 13:36:47 +0000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <49C389A8.5090703@op.pl>
References: <49C0B976.1020005@op.pl>
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
	<320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>
	<49C240AA.908@op.pl>
	<320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com>
	<49C389A8.5090703@op.pl>
Message-ID: <320fb6e00903200636oe48bb71u4cc72bf385ac8e9b@mail.gmail.com>

On Fri, Mar 20, 2009 at 12:18 PM, mitlox <mitlox at op.pl> wrote:
> Thank you very much for your code, it works and the output is exactly for
> what I was looking for.
>
> I try to get a structureCA object to write out the results in a PDB file
> (outCA.pdb) like this:
> ATOM   5275  CA  ILE A 881      17.242  57.141  22.062  1.00 38.49           C
> ATOM   5283  CA  VAL A 882      16.292  57.880  25.678  1.00 38.90           C
> ....
>
> Unfortunately I get this error with ... Here is the code:
> ...
> structureCA = []
> ...
> io=Bio.PDB.PDBIO()
> io.set_structure(structureCA)
> io.save('outCA.pdb')

Your structureCA object is just a python list, containing Residue objects.
Instead you need to create a new object with the partial chain - which
can be done by creating structure, model and chain objects manually.

However, I suggest you re-read pages 5 and 6 of the Bio.PDB
documentation for the recommended approach:
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf
In your case, you'll want to write your own selection class using the
residue distance to the ligand.  I recognise this might seem rather
complicated for a python novice as you have to create your own
class - so here is my solution:

import Bio.PDB
import numpy

pdb_code = "1E8W"
pdb_filename = "1E8W.pdb"

structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
chainA = model["A"]

def residue_dist_to_ligand(protein_residue, ligand_residue) :
    """Returns distance from the protein C-alpha to the closest ligand atom."""
    distances = []
    for atom in ligand_residue :
        diff_vector  = protein_residue["CA"].coord - atom.coord
        distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector)))
    return min(distances)

class NearLigandSelect(Bio.PDB.Select):
    def __init__(self, distance_threshold, ligand_residue) :
        self.threshold = distance_threshold
        self.ligand_res = ligand_residue

    def accept_residue(self, residue):
        if residue == self.ligand_res :
            return True #change this to False if you don't want the ligand
        else :
            dist = residue_dist_to_ligand(residue, self.ligand_res)
            return dist < self.threshold

io=Bio.PDB.PDBIO()
io.set_structure(structure)
#From looking at the PDB file, ligand is last residue in chain A
ligand_res = chainA.child_list[-1]
#Going to use a distance threshold of 4A
io.save("near_ligand.pdb", NearLigandSelect(4, ligand_res))
print "Done"

Peter


From mitlox at op.pl  Fri Mar 20 19:45:56 2009
From: mitlox at op.pl (mitlox)
Date: Sat, 21 Mar 2009 09:45:56 +1000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <320fb6e00903200636oe48bb71u4cc72bf385ac8e9b@mail.gmail.com>
References: <49C0B976.1020005@op.pl>	
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>	
	<320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>	
	<49C240AA.908@op.pl>	
	<320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com>	
	<49C389A8.5090703@op.pl>
	<320fb6e00903200636oe48bb71u4cc72bf385ac8e9b@mail.gmail.com>
Message-ID: <49C42AB4.7050404@op.pl>

Thank you very much for your solution.

Additionally, it would be nice to have a structure object with the same 
information as in "near_ligand.pdb", so that I do not need to read the new 
PDB file back in:
structureMOD = Bio.PDB.PDBParser().get_structure("near", "near_ligand.pdb")

Is it possible to have both "near_ligand.pdb" and the equivalent structure 
object?

Thank you in advance.

Best regards

Peter Cock wrote:
> Your structureCA object is just a python list, containing Residue objects.
> Instead you need to create a new object with the partial chain - which
> can be done by creating structure, model and chain objects manually.
>
> However, I suggest you re-read pages 5 and 6 of the Bio.PDB
> documentation for the recommend approach:
> http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf
> In your case, you'll want to write your own selection class using the
> residue distance to the ligand.  I recognise this might seem rather
> complicated for a python novice as you have to create your own
> class - so here is my solution:
>
> import Bio.PDB
> import numpy
>
> pdb_code = "1E8W"
> pdb_filename = "1E8W.pdb"
>
> structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
> model = structure[0]
> chainA = model["A"]
>
> def residue_dist_to_ligand(protein_residue, ligand_residue) :
>     """Returns distance from the protein C-alpha to the closest ligand atom."""
>     distances = []
>     for atom in ligand_residue :
>         diff_vector  = protein_residue["CA"].coord - atom.coord
>         distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector)))
>     return min(distances)
>
> class NearLigandSelect(Bio.PDB.Select):
>     def __init__(self, distance_threshold, ligand_residue) :
>         self.threshold = distance_threshold
>         self.ligand_res = ligand_residue
>
>     def accept_residue(self, residue):
>         if residue == self.ligand_res :
>             return True #change this to False if you don't want the ligand
>         else :
>             dist = residue_dist_to_ligand(residue, self.ligand_res)
>             return dist < self.threshold
>
> io=Bio.PDB.PDBIO()
> io.set_structure(structure)
> #From looking at the PDB file, ligand is last residue in chain A
> ligand_res = chainA.child_list[-1]
> #Going to use a distance theshold of 4A
> io.save("near_ligand.pdb", NearLigandSelect(4, ligand_res))
> print "Done"
>
> Peter
>
>   


From mjldehoon at yahoo.com  Sat Mar 21 00:54:08 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 20 Mar 2009 21:54:08 -0700 (PDT)
Subject: [BioPython] Bio.Enzyme (was: Re: [Biopython-dev] Bio.ExPASy)
In-Reply-To: <76595.11423.qm@web62404.mail.re1.yahoo.com>
Message-ID: <517737.76119.qm@web62403.mail.re1.yahoo.com>


I've created a simplified version of the parser in Bio.Enzyme in Bio.ExPASy.Enzyme. The idea behind it is to collect all parsers related to ExPASy databases in Bio.ExPASy so that they can be found more easily by users.

Bio.ExPASy.Enzyme works essentially the same as Bio.Enzyme, but I've done a few things a bit differently. The biggest change is probably that Bio.Enzyme stores information as attributes to a record, whereas Bio.ExPASy.Enzyme has a Record derived from a dictionary, and stores information in the dictionary (same as Bio.Medline). Does anybody have any objection if Bio.ExPASy.Enzyme becomes the "official" parser for ExPASy's Enzyme database? If not, I'll modify the documentation and tests accordingly, and start the deprecation process for Bio.Enzyme.

--Michiel

--- On Sun, 3/15/09, Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> From: Michiel de Hoon <mjldehoon at yahoo.com>
> Subject: [Biopython-dev] Bio.ExPASy
> To: biopython-dev at biopython.org
> Date: Sunday, March 15, 2009, 6:24 AM
> Hi everybody,
> 
> As discussed previously, I have moved the Bio.Prosite code
> to Bio.ExPASy, and I've added a ScanProsite module to
> Bio.ExPASy. I guess Bio.Enzyme should also move to
> Bio.ExPASy. See
> 
> http://biopython.org/DIST/docs/tutorial/Tutorial.proposal.html
> 
> for the documentation of Biopython as currently in CVS.
> 
> --Michiel.
> 
> 
>       
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev


From lueck at ipk-gatersleben.de  Tue Mar 24 05:34:19 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Tue, 24 Mar 2009 10:34:19 +0100
Subject: [BioPython] Emboss eprimer3
Message-ID: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>

Hi!

I have some questions about eprimer3 from EMBOSS, which I call from Python to design primers in batch mode:

1) I'm using the GCclamp option (value=1). Is it possible to limit the number of G's or C's at the end to a maximum of one? 

2) Is there a setting to get the original primer3 output? The EMBOSS output is not very useful for hundreds of primers, and a lot of information is missing.

The primer 3 file looks like this:

PRIMER_SEQUENCE_ID=HF15E08r
SEQUENCE=GCATGTAATAATGCCAAAGCTCACAGCTGCAGTTGAATCTTGGGACCCGCGGAGCGAGAATGTACCAATCCATGTATGGGTACACCCATGGCTGCCAACTCTAGGGCAAAGGATAGATACACTGTGCCACTCTATCCGGTACAAGCTGAGTAGTGTCCTCCAATTATGGCAAGCTCACGATTCATCAGCTTATGCTGTGCTATCTCCATGGAAGGGTGTATTTGATCCAGCAAGTTGGGAAGACTTGATAGTGCGTTATATCATTCCTAAACTGAAAATGGCACTCCAGGAGTTCCAGATTAACCCAGCAAGCCAAAAGTTTGACCAGTTTAACTGGGTTATGATCTGGGCTTCTGCTGTCCCGGTACACCATATGGTCCATATGTTGGAAGTTGATTTCTTTAGCAAGTGGCAGCTGGTTTTGTACCATTGGCTGAGCTCACCAAATCCTGATTTCAATGAGATAATGAATTGGTAT
PRIMER_PRODUCT_SIZE_RANGE=500-1000 450-500 400-450 350-400 300-350 250-300 200-250 150-200
PRIMER_OPT_TM=60.0
PRIMER_MIN_TM=58.0
PRIMER_MAX_TM=65.0
PRIMER_MAX_DIFF_TM=3.0
PRIMER_DNA_CONC=420
PRIMER_NUM_RETURN=1
PRIMER_PAIR_PENALTY=0.8691
PRIMER_LEFT_PENALTY=0.708329
PRIMER_RIGHT_PENALTY=0.160746
PRIMER_LEFT_SEQUENCE=GCATGTAATAATGCCAAAGC
PRIMER_RIGHT_SEQUENCE=TTGAAATCAGGATTTGGTGA
PRIMER_LEFT=0,20
PRIMER_RIGHT=458,20
PRIMER_LEFT_TM=59.292
PRIMER_RIGHT_TM=60.161
PRIMER_LEFT_GC_PERCENT=40.000
PRIMER_RIGHT_GC_PERCENT=35.000
PRIMER_LEFT_SELF_ANY=7.00
PRIMER_RIGHT_SELF_ANY=8.00
PRIMER_LEFT_SELF_END=2.00
PRIMER_RIGHT_SELF_END=2.00
PRIMER_LEFT_END_STABILITY=8.5000
PRIMER_RIGHT_END_STABILITY=7.9000
PRIMER_PAIR_COMPL_ANY=5.00
PRIMER_PAIR_COMPL_END=3.00
PRIMER_PRODUCT_SIZE=459

Thanks in advance!
Stefanie

From biopython at maubp.freeserve.co.uk  Tue Mar 24 06:00:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Mar 2009 10:00:46 +0000
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>

2009/3/24 Stefanie Lück <lueck at ipk-gatersleben.de>:
> Hi!
>
> I have some questions about eprimer3 from Emboss which I use over Python to design primers in a batch mode:
>
> 1) I'm using the GCclamp function (value=1). Is it possible to limit the G or C's at the end to maximum of one G or C?

OK, you're using the gcclamp argument (i.e. GC clamp), which is
supported by the Bio.Emboss.Applications wrapper.
http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/eprimer3.html

I don't know if there is a primer3 argument for limiting the G or C's
at the end - have you asked on the EMBOSS mailing list?

> 2) Is there a setting to get the original primer3 output? The emboss output is for hundrets of primers not very usefull and many informations are missing.

Judging from the documentation, there is a "fformat1" argument which
*might* do what you want - you could try this out on the command line
and see.  Note that this argument is not currently supported in the
Bio.Emboss.Applications wrapper, but that would be easy to add.  If
this argument doesn't do what you want, you'd have to ask the EMBOSS
people about alternative output formats. Alternatively, you might
investigate the original Whitehead version of primer3.

Note that if you do succeed in changing the output format, you may
need a new parser to read it.
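
For what it is worth, the original primer3 output quoted above is a simple
TAG=VALUE listing, so a parser for it can be quite short. The sketch below
is only an illustration: parse_boulder_io is a made-up name, and the
assumption that each record ends with a line containing only "=" comes
from primer3's Boulder-IO convention rather than from the excerpt in this
thread.

```python
def parse_boulder_io(lines):
    """Yield one dict of TAG -> value strings per primer3 record."""
    record = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "=":
            # Assumed record terminator: a line holding only "="
            if record:
                yield record
            record = {}
        elif "=" in line:
            # Split on the first "=" only, values may contain more
            key, value = line.split("=", 1)
            record[key] = value
    if record:
        # Last record, in case the final terminator is missing
        yield record


# A few lines taken from the output quoted earlier in this thread
example = [
    "PRIMER_SEQUENCE_ID=HF15E08r\n",
    "PRIMER_LEFT_SEQUENCE=GCATGTAATAATGCCAAAGC\n",
    "PRIMER_LEFT=0,20\n",
    "PRIMER_PRODUCT_SIZE=459\n",
    "=\n",
]
records = list(parse_boulder_io(example))
print(records[0]["PRIMER_LEFT_SEQUENCE"])
```

All values are left as strings here; converting melting temperatures or
positions to numbers would be a separate step.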

Peter


From mitlox at op.pl  Tue Mar 24 07:12:36 2009
From: mitlox at op.pl (mitlox)
Date: Tue, 24 Mar 2009 21:12:36 +1000
Subject: [BioPython] Superimposer
Message-ID: <49C8C024.60403@op.pl>

Hello,
I read that the Superimposer works only with two lists of atoms 
which contain the same number of atoms.

So I decided to use "Combinatorial Extension (CE)". This program returns 
a rotation matrix and a translation vector.

After the execution of CE I took the matrix and vector and tried to use 
it with Superimposer:
------------------------------------------------------------------------------
import sys
import numpy
from Bio.PDB import *


pdb_fix = "../files/1z9g.pdb"
pdb_mov = "../files/1z9g90.pdb"
p=PDBParser()
s1=p.get_structure("FIXED", pdb_fix)
fixed=Selection.unfold_entities(s1, "A")

s2=p.get_structure("MOVING", pdb_mov)
moving=Selection.unfold_entities(s2, "A")

rot=numpy.identity(3).astype('f')
tran=numpy.array((1.0, 2.0, 3.0), 'f')

tran[0] = -0.99996603; tran[1] = -2.00002559; tran[2] = -2.99998285
rot[0][0] = 0.19411441; rot[0][1] = -0.85385353; rot[0][2] = 0.48296351
rot[1][0] = 0.94858827; rot[1][1] = 0.28884874; rot[1][2] = 0.12940907
rot[2][0] = -0.24999979; rot[2][1] = 0.43301335; rot[2][2] = 0.86602514

for atom in moving:
    atom.transform(rot, tran)
   
sup=Superimposer()

sup.set_atoms(fixed, moving)

print sup.rotran
print sup.rms

sup.apply(moving)


print "Saving aligned structure as PDB file %s" % pdb_mov
io=PDBIO()
io.set_structure(s2)
io.save(pdb_mov)

print "Done"
------------------------------------------------------------------------------

Unfortunately "print sup.rotran" returns this:
(array([[ 0.19411383,  0.94858824, -0.25000035],
       [-0.85385389,  0.28884841,  0.43301285],
       [ 0.4829631 ,  0.12940999,  0.86602523]]), array([-0.06470776,  1.91446435,  3.21412203]))

but this matrix and vector are not the same as the ones above.

What am I doing wrong?

Thank you in advance.

Best regards,


From biopython at maubp.freeserve.co.uk  Tue Mar 24 07:43:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Mar 2009 11:43:05 +0000
Subject: [BioPython] Superimposer
In-Reply-To: <49C8C024.60403@op.pl>
References: <49C8C024.60403@op.pl>
Message-ID: <320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com>

On Tue, Mar 24, 2009 at 11:12 AM, mitlox <mitlox at op.pl> wrote:
> Hello,
> I read that the Superimposer works only with the two lists of atoms which
> contain the same amount of atoms.
>
> So I decided to use "Combinatorial Extension (CE)". This program returns a
> rotation matrix and a translation vector.
>
> After the execution of CE I took the matrix and vector and tried to use it
> with Superimposer:

Why?  Once you know the transformation, why do you need to try and
recreate it with the superimposer?  Are you just doing this as a check?

> ------------------------------------------------------------------------------
> import sys
> import numpy
> from Bio.PDB import *
>
>
> pdb_fix = "../files/1z9g.pdb"
> pdb_mov = "../files/1z9g90.pdb"
> p=PDBParser()
> s1=p.get_structure("FIXED", pdb_fix)
> fixed=Selection.unfold_entities(s1, "A")
>
> s2=p.get_structure("MOVING", pdb_mov)
> moving=Selection.unfold_entities(s2, "A")

You should be loading in the ORIGINAL pdb file here, as the moved one
won't exist yet, and if it did, you'd apply the transformation twice.

Note you should expect slight differences due to floating point
calculations.  Your input was:

array([[ 0.19411442, -0.85385352,  0.4829635 ],
       [ 0.94858825,  0.28884873,  0.12940907],
       [-0.24999979,  0.43301335,  0.86602515]], dtype=float32)
array([-0.99996603, -2.00002551, -2.99998283], dtype=float32),

The output was:

array([[ 0.19411439,  0.94858827, -0.24999978],
       [-0.85385353,  0.28884871,  0.43301335],
       [ 0.4829635 ,  0.12940907,  0.86602514]]),
array([-0.06473777,  1.91448618,  3.21410633])

The rotation looks transposed (backwards).  The translation does look
different... however, if you switch this line:
sup.set_atoms(fixed, moving)
to:
sup.set_atoms(moving, fixed)
then things agree.  I suspect something is flipped in the logic of
your script regarding the frames of reference.
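
The transposed rotation and the changed translation are what one would
expect if the Superimposer is reporting the inverse of the transformation
that was applied by hand. Below is a minimal sketch in plain Python (no
Bio.PDB needed), assuming the row-vector convention new = old . rot + tran
that Atom.transform uses. The helper names transform() and invert() are
made up for this illustration and are not part of Bio.PDB.

```python
# Rotation and translation taken from the script in this thread.
ROT = [[0.19411441, -0.85385353, 0.48296351],
       [0.94858827, 0.28884874, 0.12940907],
       [-0.24999979, 0.43301335, 0.86602514]]
TRAN = [-0.99996603, -2.00002559, -2.99998285]


def transform(coord, rot, tran):
    """Row-vector convention: new = coord . rot + tran."""
    return [sum(coord[k] * rot[k][j] for k in range(3)) + tran[j]
            for j in range(3)]


def invert(rot, tran):
    """Return (rot_inv, tran_inv) that undoes transform(coord, rot, tran).

    For a pure rotation the inverse matrix is the transpose, and the
    inverse translation is -(tran . rot_inv).
    """
    rot_inv = [[rot[j][i] for j in range(3)] for i in range(3)]
    tran_inv = [-sum(tran[k] * rot_inv[k][j] for k in range(3))
                for j in range(3)]
    return rot_inv, tran_inv


rot_inv, tran_inv = invert(ROT, TRAN)

# Round trip on an arbitrary test point: move it, then move it back.
moved = transform([10.0, 20.0, 30.0], ROT, TRAN)
restored = transform(moved, rot_inv, tran_inv)
```

With these numbers rot_inv is the transposed matrix and tran_inv comes
out close to the (-0.065, 1.914, 3.214) printed by sup.rotran, up to
floating point differences.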

Also, at the end you do sup.apply(moving), but you have already
manually moved these atoms, so won't your PDB file have them moved
twice?

Peter

From mitlox at op.pl  Tue Mar 24 08:18:32 2009
From: mitlox at op.pl (mitlox)
Date: Tue, 24 Mar 2009 22:18:32 +1000
Subject: [BioPython] Superimposer
In-Reply-To: <320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com>
References: <49C8C024.60403@op.pl>
	<320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com>
Message-ID: <49C8CF98.30809@op.pl>

Thank you for your email.  I would only like to rotate and translate a
pdb file so that I can see the result in a pdb viewer.

Maybe I do not need the Superimposer object to rotate and translate a
pdb file with a known rotation matrix and translation vector?

Do you know how I could rotate and translate a pdb file?

Thank you in advance.



From biopython at maubp.freeserve.co.uk  Tue Mar 24 08:41:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Mar 2009 12:41:53 +0000
Subject: [BioPython] Superimposer
In-Reply-To: <49C8CF98.30809@op.pl>
References: <49C8C024.60403@op.pl>
	<320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com>
	<49C8CF98.30809@op.pl>
Message-ID: <320fb6e00903240541p5fa8e043wc3363b18b34af37b@mail.gmail.com>

On Tue, Mar 24, 2009 at 12:18 PM, mitlox <mitlox at op.pl> wrote:
> Thank you for you email.  I would like only rotate and translate a pdb file
> that I can see the result in a pdb viewer.

I see.

> Maybe I do not need the Superimposer object to rotate and translate a pdb
> file with known rotation matrix and translation vector?

Correct.

> Do you know how could I rotate and translate a pdb file?

You've got most of the steps already.  This is my suggestion:

import numpy
from Bio import PDB

pdb_fix = "1z9g.pdb"
pdb_mov = "1z9g_moved.pdb"

structure = PDB.PDBParser().get_structure("FIXED", pdb_fix)

tran=numpy.array((-0.99996603, -2.00002559, -2.99998285))
rot=numpy.array(((+0.19411441, -0.85385353, +0.48296351),
                 (+0.94858827, +0.28884874, +0.12940907),
                 (-0.24999979, +0.43301335, +0.86602514)))

print "Applying transformation..."
for atom in structure.get_atoms() :
    atom.transform(rot, tran)

print "Saving transformed structure as PDB file %s" % pdb_mov
io=PDB.PDBIO()
io.set_structure(structure)
io.save(pdb_mov)
print "Done"

NOTE - When a transformation is given as a translation vector and a
rotation matrix, there is some ambiguity about the order in which to
apply them.  If the results using Bio.PDB don't match what you expect,
you may want to double check this first.
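
To make the ambiguity concrete: rotating and then translating generally
gives a different point from translating and then rotating. A small
sketch in plain Python follows. The 90 degree rotation, the translation
and the test point are made-up values for illustration, and the helpers
rotate() and translate() are hypothetical names, not Bio.PDB functions.

```python
def rotate(coord, rot):
    """Row-vector convention: coord . rot."""
    return [sum(coord[k] * rot[k][j] for k in range(3)) for j in range(3)]


def translate(coord, tran):
    """Add the translation vector to the coordinate."""
    return [c + t for c, t in zip(coord, tran)]


# 90 degree rotation about the z axis (illustrative values only).
ROT_Z90 = [[0.0, 1.0, 0.0],
           [-1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0]]
TRAN = [1.0, 2.0, 3.0]
point = [1.0, 0.0, 0.0]

# Same rotation and translation, applied in the two possible orders.
rotate_then_translate = translate(rotate(point, ROT_Z90), TRAN)
translate_then_rotate = rotate(translate(point, TRAN), ROT_Z90)
```

Atom.transform in Bio.PDB rotates first and then translates, so a
matrix and vector from a program that uses the other order (or a
column-vector convention) would need converting before use.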

Peter


From cjfields at illinois.edu  Tue Mar 24 12:51:32 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 24 Mar 2009 11:51:32 -0500
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>
	<320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>
Message-ID: <656D2F16-80DD-4976-90FE-2BCB8802093E@illinois.edu>

On Mar 24, 2009, at 5:00 AM, Peter wrote:

> ...
> From reading the documentation there is a "fformat1" argument which
> *might* do what you want - you could try this out on the command line
> and see.  Note that this argument is not currently supported in the
> Bio.Emboss.Applications wrapper, but that would be easy to add.  If
> this argument doesn't do what you want, you'd have to ask the EMBOSS
> people about alternative output formats. Alternatively, you might
> investigate the original Whitehead version of primer3.

Peter,

Not sure if this will be a problem for the BioPython wrapper for  
primer3, but the latest Primer3 version on Sourceforge (v2.0.0a)  
radically changes the various input parameters.  I had to rewrite a  
bunch of code to handle those as well as older (v1) primer3 params.

> Note that if you do succeed in changing the output format, you may
> need a new parser to read it.
>
> Peter

primer3 input and output is BoulderIO (which I think is an essentially  
obsolete format Lincoln Stein wrote up many years ago).  It's very  
easy to parse, just simple key-value pairings.
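
As a rough illustration of how simple the format is, here is a minimal
sketch of a Boulder-IO reader in plain Python. It assumes the usual
layout of one TAG=value pair per line with a line holding only "="
ending each record. The function name parse_boulder() and the sample
values are made up for this example; the two tags shown are primer3 v1
style input tags and may differ in other versions.

```python
def parse_boulder(text):
    """Split Boulder-IO text into a list of {tag: value} dictionaries."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "=":
            # A lone '=' terminates the current record.
            records.append(current)
            current = {}
            continue
        # Split on the first '=' only, so values may contain '='.
        tag, _, value = line.partition("=")
        current[tag] = value
    if current:
        # Tolerate a final record with no terminator line.
        records.append(current)
    return records


example = """\
PRIMER_SEQUENCE_ID=demo
SEQUENCE=ACGTACGTACGT
=
"""
records = parse_boulder(example)
```

Every value is kept as a string here; converting positions or
temperatures to numbers is left to the caller.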

chris

From nir at rosettadesigngroup.com  Wed Mar 25 12:18:24 2009
From: nir at rosettadesigngroup.com (Nir London)
Date: Wed, 25 Mar 2009 18:18:24 +0200
Subject: [BioPython] Rosetta Academic Training Webinar
Message-ID: <88F0F36A-FC4D-4A9C-AC31-5B883C3F92CB@rosettadesigngroup.com>

The Rosetta Design Group is proud to present the first webinar in the  
Rosetta Academic Workshop Series. For the first webinar, we have  
selected to focus on Protein-Protein Docking based on the answers to  
the interest poll. We hope this will be the first in a line of helpful  
and inspiring webinars to kick-off our Rosetta Academic Workshop Series.

What: Protein-Protein Docking
When: May 4th 2009, 0800-1000 AM EST

Where: Your office!

For more details and registration, see:
http://rosettadesigngroup.com/RDGLS/index.php?sid=54479&lang=en

Please note: This is not a promotional webinar. Rosetta is open-source
and freeware for academic and non-profit organizations and can be
downloaded from the University of Washington's TechTransfer Digital
Ventures. The majority of the webinar is concerned with Rosetta 2.3.0.
Rosetta 3.0 is still a beta version.

Hope to see you there,

Nir London.

Rosetta Design Group | http://rosettadesigngroup.com/

From biopython.chen at gmail.com  Wed Mar 25 22:59:04 2009
From: biopython.chen at gmail.com (chen Ku)
Date: Wed, 25 Mar 2009 19:59:04 -0700
Subject: [BioPython] how to retrieve data from PDB
Message-ID: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>

Dear all,
I need your help in writing code to retrieve some of the PDB structures.

Problem definition
I just want to use some PDB files, not all 50,000.

I want to write Python code that picks out only the structures of
transcription factors bound to DNA from all the PDB data. Please guide
me on how to proceed. I have read some published articles on this
dataset and want to do this with Python rather than manually. This is
part of our course work in structural biology, so I am trying it on my
own with some help from you all. I need general code where I can check
this kind of thing by changing the field name. I would be grateful for
any help, as I am a beginner in Python.


Regards
Chen

From lueck at ipk-gatersleben.de  Thu Mar 26 05:42:42 2009
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Thu, 26 Mar 2009 10:42:42 +0100
Subject: [BioPython] Emboss eprimer3
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>
	<320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>
Message-ID: <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>

Hi!

I got a patch to add a '-originalformat' argument. If someone is interested 
too, I could send it to him or the mailing list.

>>>Note that if you do succeed in changing the output format, you may need a 
>>>new parser to read it.

This is no problem. I just need the data ;-)

>>> I don't know if there is a primer3 argument for limiting the G or C's at 
>>> the end - have you asked on the EMBOSS mailing list?

Yes, no answer yet.

Kind regards
Stefanie


----- Original Message ----- 
From: "Peter" <biopython at maubp.freeserve.co.uk>
To: "Stefanie Lück" <lueck at ipk-gatersleben.de>
Cc: <biopython at lists.open-bio.org>
Sent: Tuesday, March 24, 2009 11:00 AM
Subject: Re: [BioPython] Emboss eprimer3


2009/3/24 Stefanie Lück <lueck at ipk-gatersleben.de>:
> Hi!
>
> I have some questions about eprimer3 from Emboss which I use over Python 
> to design primers in a batch mode:
>
> 1) I'm using the GCclamp function (value=1). Is it possible to limit the G 
> or C's at the end to maximum of one G or C?

OK, you're using the gcclamp argument (i.e. GC clamp), which is
supported by the Bio.Emboss.Applications wrapper.
http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/eprimer3.html

I don't know if there is a primer3 argument for limiting the G or C's
at the end - have you asked on the EMBOSS mailing list?

> 2) Is there a setting to get the original primer3 output? The emboss 
> output is for hundrets of primers not very usefull and many informations 
> are missing.

From reading the documentation there is a "fformat1" argument which
*might* do what you want - you could try this out on the command line
and see.  Note that this argument is not currently supported in the
Bio.Emboss.Applications wrapper, but that would be easy to add.  If
this argument doesn't do what you want, you'd have to ask the EMBOSS
people about alternative output formats. Alternatively, you might
investigate the original Whitehead version of primer3.

Note that if you do succeed in changing the output format, you may
need a new parser to read it.

Peter


From biopython at maubp.freeserve.co.uk  Thu Mar 26 06:23:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Mar 2009 10:23:01 +0000
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>
	<320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>
	<005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00903260323p50f80c50w1ab07c8892518190@mail.gmail.com>

On Thu, Mar 26, 2009 at 9:42 AM, Stefanie Lück <lueck at ipk-gatersleben.de> wrote:
> Hi!
>
> I got a patch to add a '-originalformat' argument. If someone is interested
> too, I could send it to him or the mailing list.

Could you file a bug on Bugzilla please, and then (after the bug is
filed) you can attach the patch.  I'll look at this (if Brad doesn't
first) - if you can also include a short example that would be
excellent.

Thank you,

Peter


From biopython at maubp.freeserve.co.uk  Thu Mar 26 07:04:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Mar 2009 11:04:29 +0000
Subject: [BioPython] how to retrieve data from PDB
In-Reply-To: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>
References: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>
Message-ID: <320fb6e00903260404y7f5d5606jc46b4d3e87eeb9bb@mail.gmail.com>

On Thu, Mar 26, 2009 at 2:59 AM, chen Ku <biopython.chen at gmail.com> wrote:
> Dear all,
> I need your help in writing code to retrieve some of the pdb
> structures.
>
> Problem definition
>  I just want to use some PDB file not all 50,000.
>
>> I want to apply one python code so that I can know transcription factor
> binding to DNA only out of all pdb data. So please guide me how to proceed
> for this.

According to the website, there are about 2250 protein structures in
complex with nucleotides - and I assume some of these are for
transcription factors with DNA:
http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=molType-protein-nucleic-complex&seqid=100

I assume you'll want to search these PDB entries for those which are
transcription factors binding to DNA, but I don't know enough about
the PDB search options to advise you.

Peter


From jblanca at btc.upv.es  Thu Mar 26 07:48:02 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Thu, 26 Mar 2009 12:48:02 +0100
Subject: [BioPython] about the SeqRecord slicing
Message-ID: <200903261248.02279.jblanca@btc.upv.es>

Hi:
I'm working with the SeqRecord slicing from cvs and I think that the behaviour 
could be slightly changed. In fact that same opinion is written in the 
__getitem__ method:

        if isinstance(index, int) :
            #NOTE - The sequence level annotation like the id, name, etc
            #do not really apply to a single character.  However, should
            #we try and expose any per-letter-annotation here?  If so how?
            return self.seq[index]

I don't like the fact that the SeqRecord returns different classes depending 
on the index type. I think it is better to always return a SeqRecord because:
- It simplifies the interface. It's easier to deal with the SeqRecord class if 
its behaviour is simple. Otherwise we have to check in the code that uses the 
SeqRecord whether it's returning a str or a SeqRecord.
- Returning a str loses the per-letter-annotation. I'm working with qualities 
and I'm interested in keeping them.
- Returning a str is redundant, because if we want to index the seq property 
we can do it with: seqrec.seq[index]
Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk  Thu Mar 26 08:05:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Mar 2009 12:05:25 +0000
Subject: [BioPython] about the SeqRecord slicing
In-Reply-To: <200903261248.02279.jblanca@btc.upv.es>
References: <200903261248.02279.jblanca@btc.upv.es>
Message-ID: <320fb6e00903260505j387279b7kfa4c69c33efe5487@mail.gmail.com>

On Thu, Mar 26, 2009 at 11:48 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Hi:
> I'm working with the SeqRecord slicing from cvs and I think that the behaviour
> could be sligthly changed. In fact that same opinion is written in the
> __getitem__ method:
>
>         if isinstance(index, int) :
>             #NOTE - The sequence level annotation like the id, name, etc
>             #do not really apply to a single character.  However, should
>             #we try and expose any per-letter-annotation here?  If so how?
>             return self.seq[index]
>
> I don't like the fact that the SeqRecord returns different classes depending
> on the index type. I think is better to return always a SeqRecord because:
> - It simplifies the interface. It's easier to deal with the SeqRecord class if
> its behaviour is simple. Otherwise we have to check in the code that uses the
> SeqRecord if it's returning an str or a SeqRecord.
> - It looses the per-letter-annotation. I'm working with qualities and I'm
> interested in keeping them.
> - It's redundant because if we want to slice the seq property we can do it
> with: seqrec.seq[index]
> Best regards,

Hi Jose,

As we are talking about the CVS code, maybe this could have been on
the dev mailing list, but as it's of general interest let's carry on
here for now.

You note that (currently in CVS) the new SeqRecord slicing returns a
SeqRecord for a slice, but a single letter string for a single integer
index.

This isn't so different from the Seq object - it returns a new Seq
object for a slice, but a single letter string for a single integer
index:
>>> from Bio.Seq import Seq
>>> s = Seq("ACGT")
>>> s
Seq('ACGT', Alphabet())
>>> s[0]
'A'
>>> s[0:3]
Seq('ACG', Alphabet())

More generally, consider lists in Python:
>>> x = [1,2,3,4,5]
>>> x[0]
1
>>> x[0:3]
[1, 2, 3]

So I don't agree with this expectation that slicing and indexing a
SeqRecord should automatically both give a SeqRecord.  You really want
a SeqRecord for a single character string?
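
The indexing behaviour under discussion can be mimicked with a toy
class. This is only a sketch in plain Python: MiniRecord and its
qualities attribute are made up for illustration and are not the
SeqRecord implementation in CVS. An integer index returns one letter,
while a slice returns a new record whose per-letter qualities are
sliced to match.

```python
class MiniRecord(object):
    """Toy stand-in for a sequence record with per-letter qualities."""

    def __init__(self, seq, qualities):
        if len(seq) != len(qualities):
            raise ValueError("need exactly one quality per letter")
        self.seq = seq
        self.qualities = list(qualities)

    def __getitem__(self, index):
        if isinstance(index, int):
            # Integer index: just the letter, as with str or list.
            return self.seq[index]
        # Slice: a new record, with the qualities sliced to match.
        return MiniRecord(self.seq[index], self.qualities[index])


record = MiniRecord("ACGT", [30, 31, 32, 20])
first_letter = record[0]
sub_record = record[1:3]

# Letters and qualities can still be walked through together.
pairs = list(zip(record.seq, record.qualities))
```

A caller who wants to keep the quality of a single position can ask
for a one-letter slice such as record[0:1] instead of record[0].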

Can you give me an example of where you want to pull out a single
character from a SeqRecord, and its quality?  I would consider things
like this quite elegant:

for letter, quality in zip(record.seq,
                           record.letter_annotations["phred_quality"]) :
    #do stuff

Peter


From chapmanb at 50mail.com  Thu Mar 26 08:40:45 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 26 Mar 2009 08:40:45 -0400
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>
	<320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>
	<005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
Message-ID: <20090326124045.GD21577@sobchak.mgh.harvard.edu>

Hi all;

Stefanie:
> I got a patch to add a '-originalformat' argument. If someone is interested 
> too, I could send it to him or the mailing list.

Is this a patch to EMBOSS itself? If so, did the developers indicate
it would be in future versions of EMBOSS?

If that's the case, we can easily add this option to the commandline
interface. You need a:

           _Option(["-originalformat"], ["input"], None, 0),

line in Bio.Emboss.Applications.Primer3Commandline.

> >>>Note that if you do succeed in changing the output format, you may need a 
> >>>new parser to read it.
> 
> This is no problem. I just need the data ;-)

Out of curiosity, what parameter did you find useful from that
output that is not in the eprimer3 format output?

> >>> I don't know if there is a primer3 argument for limiting the G or C's at 
> >>> the end - have you asked on the EMBOSS mailing list?
> 
> Yes, no answer yet.

What I do in cases like this is ask for more primers
(-numreturn) and then post-parse them to pull out the ones that
satisfy my additional criteria. The output is ordered by primer3's
ranking, so the first one that passes the criteria would move on.
If none are satisfactory, then you can also build in a logic to
decide if any are good enough for your use (for example, 2 G/Cs at
the end) and pick one from this remaining group with less stringency.
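
A minimal sketch of that post-filtering idea in plain Python. It
assumes the primers are already in a list in primer3's ranked order and
are written 5' to 3', and that the gcclamp setting has already ensured
at least one G/C at the 3' end, so only an upper limit is checked. The
function names, the default limits and the example sequences are all
made up for illustration.

```python
def gc_run_at_3prime(primer):
    """Count consecutive G/C bases at the 3' end (the end of the string)."""
    count = 0
    for base in reversed(primer.upper()):
        if base not in "GC":
            break
        count += 1
    return count


def pick_primer(ranked_primers, strict_max=1, relaxed_max=2):
    """Return the best-ranked primer passing the strict limit.

    If none passes, retry with the relaxed limit. Returns None when
    nothing is acceptable even then.
    """
    for limit in (strict_max, relaxed_max):
        for primer in ranked_primers:
            if gc_run_at_3prime(primer) <= limit:
                return primer
    return None


ranked = ["ACGTACGTGC", "ACGTACGTAC", "ACGTTACGGG"]
best = pick_primer(ranked)
```

Here the top-ranked primer ends in two G/C bases, so the second one is
chosen; if it were missing, the relaxed pass would accept the first.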

Brad



From biopython at maubp.freeserve.co.uk  Fri Mar 27 08:18:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 27 Mar 2009 12:18:04 +0000
Subject: [BioPython] how to retrieve data from PDB
In-Reply-To: <4c2163890903261953k2f73613cvdc5d4bb497474f43@mail.gmail.com>
References: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>
	<320fb6e00903260404y7f5d5606jc46b4d3e87eeb9bb@mail.gmail.com>
	<4c2163890903261953k2f73613cvdc5d4bb497474f43@mail.gmail.com>
Message-ID: <320fb6e00903270518g4eb5150pc1ae6de65da1a72c@mail.gmail.com>

On Fri, Mar 27, 2009 at 2:53 AM, chen Ku <biopython.chen at gmail.com> wrote:
> Thank you so much for the guidance but I need the coding part in python to
> retrieve the data.
>
> Any help will be helpful for me.

Have a look at the Bio.PDB.PDBList module in Biopython - this may do
what you want.

Peter

From p.j.a.cock at googlemail.com  Fri Mar 27 13:31:55 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 27 Mar 2009 17:31:55 +0000
Subject: [BioPython] Biopython application note published
Message-ID: <320fb6e00903271031k2bd31464k8aaa075f8de39c82@mail.gmail.com>

Dear all,

An Application Note describing Biopython has recently been accepted
for publication in the Oxford Journal Bioinformatics. An advance copy
of the Open Access article is available online:

P.J.A. Cock, T. Antao, J.T. Chang, B.A. Chapman, C.J. Cox, A. Dalke,
I. Friedberg, T. Hamelryck, F. Kauff, B. Wilczynski and M.J.L. de Hoon
(2009) Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics,
doi:10.1093/bioinformatics/btp163
http://dx.doi.org/10.1093/bioinformatics/btp163

This was announced at the start of the week on our news page (to which
you can subscribe using the RSS or Atom feeds), but was worth
repeating for the mailing lists.  See
http://news.open-bio.org/news/2009/03/biopython-paper-published/

Peter

From biopython at maubp.freeserve.co.uk  Tue Mar 31 06:08:08 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 31 Mar 2009 11:08:08 +0100
Subject: [BioPython] how to retrieve data from PDB
In-Reply-To: <4c2163890903310245oda7390bm829aee6f4f369478@mail.gmail.com>
References: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>
	<320fb6e00903260404y7f5d5606jc46b4d3e87eeb9bb@mail.gmail.com>
	<4c2163890903261953k2f73613cvdc5d4bb497474f43@mail.gmail.com>
	<320fb6e00903270518g4eb5150pc1ae6de65da1a72c@mail.gmail.com>
	<4c2163890903310245oda7390bm829aee6f4f369478@mail.gmail.com>
Message-ID: <320fb6e00903310308q38168dbfx447c78c6da5454ee@mail.gmail.com>

On Tue, Mar 31, 2009 at 10:45 AM, chen Ku <biopython.chen at gmail.com> wrote:
> Dear peter,
> thanks for the idea.I think I need to download all the pdb
> files first and then can use command on python mode. Can you please write
> one syntax to start with or give me the practical documentation so that I
> can try out and play with this PDBList.

Hi Chen,

To learn about the PDBList functionality, see page 4 of "The Biopython
Structural Bioinformatics FAQ" - this has some examples:
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

You can also read about PDBList from the built in help,
>>> from Bio import PDB
>>> help(PDB.PDBList)
Or online at http://biopython.org/DIST/docs/api/Bio.PDB.PDBList%27.PDBList-class.html

If you really do want to download all 56,000+ PDB files (and I don't
think this is a good idea), instead of using Python, you might also
consider using the command line tool rsync, see:
http://www.pdb.org/pdb/general_information/news_publications/newsletters/2003q3/focus_rsync.html

However, as I said before, you only want transcription factors with
DNA, so at most you'll need to download the 2250 protein structures in
complex with nucleotides.  I strongly urge you to find out more about
searching the PDB in order to get a list of just the few PDB reference
codes that you'll actually need - and download just those.
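
Once such a list of codes exists, fetching just those entries might
look like the sketch below. It assumes Biopython is installed and a
network connection is available for the download step. The helper
names, the target directory and the way the codes are supplied as text
are all made up for illustration; 1z9g is the entry used earlier in
this archive and the second code is only a placeholder.

```python
def read_pdb_codes(text):
    """Extract unique, lower-cased four-character codes, keeping order."""
    codes = []
    for token in text.replace(",", " ").split():
        code = token.strip().lower()
        if len(code) == 4 and code not in codes:
            codes.append(code)
    return codes


def download_structures(codes, target_dir="pdb_files"):
    """Fetch each code with Bio.PDB.PDBList (needs Biopython and network)."""
    # Imported here so that read_pdb_codes() works without Biopython.
    from Bio.PDB import PDBList
    pdb_list = PDBList()
    for code in codes:
        pdb_list.retrieve_pdb_file(code, pdir=target_dir)


codes = read_pdb_codes("1Z9G, 1z9g 1TUP")
# download_structures(codes)  # uncomment to actually fetch the files
```

The format of the files that retrieve_pdb_file saves may depend on the
Biopython version, so check the built in help before parsing them.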

Peter


From hermifi at yahoo.com  Tue Mar 31 23:56:22 2009
From: hermifi at yahoo.com (Hermella Woldemdihin)
Date: Tue, 31 Mar 2009 20:56:22 -0700 (PDT)
Subject: [BioPython] HELP!
Message-ID: <513066.92437.qm@web111011.mail.gq1.yahoo.com>

Hi everyone,
I am trying to write a Biopython script that uses SwissProt accession
numbers to download sequence objects and then runs a remote BLAST with
the sequences. It should then download the good hit sequences listed in
the BLAST results and print their sequences. I am using a Windows based
system with bio-python 2.5. If someone could help me out with some sample
code or something, I would really appreciate it. I just started learning
Python and have tried to follow the documentation and cookbook without
much success; my programming experience is virtually non-existent. Thanks.
Hermi


From winda002 at student.otago.ac.nz  Tue Mar  3 22:03:36 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 04 Mar 2009 11:03:36 +1300
Subject: [BioPython] ACE contig to alignment
Message-ID: <49ADA938.80408@student.otago.ac.nz>

Hi all,

I'd like to start by thanking everyone that's contributed to biopython 
and especially the cookbook/tutorial - it's been a great help to this 
empiricist getting into some (decidedly amateur) bioinformatics. However, 
for the first time I've run into a problem the available docs can't help 
me with.

I want to be able to represent all of the reads that contribute to a 454 
sequencing contig as a generic biopython alignment. I've written some 
code that I thought would pad/cut the reads to size and add them to an 
alignment but when I run it a significant minority of the contigs in the 
files I'm working with have misalignments. I was wondering if someone 
more familiar with the ace parser or generic alignment class could tell 
me if I'm making some elementary mistake (it is possible that original 
alignment was bad, just seems more likely I did something dumb). I can 
send along an ACE file if you want to run the script (didn't want to 
spam the list with attachments).

Thanks in advance for any pointers and I'm sorry to force people to read 
what I'm sure is inelegant code:
 
from Bio.Sequencing import Ace
from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC, Gapped

ace_handle = open('eldoni.ace', 'r')
contigs = Ace.parse(ace_handle)
alignments = [] #start the list to which we'll add the contig data

for contig in contigs:    
  conname = contig.name + " numreads=" + str(contig.nreads)
  conlength = len(contig.sequence)
  align = Alignment(Gapped(IUPAC.ambiguous_dna, "*"))
  for readn in range(len(contig.reads)):
    start = contig.af[readn].padded_start # position rel to consensus
    if start < 1:
      # If 'start' is negative or zero we need to ignore bases
      readseq =  contig.reads[readn].rd.sequence[-1 * start+1:]
    else:
      # If it's larger then the start needs to be padded with gaps
      readseq =  (start-1) * '*' + contig.reads[readn].rd.sequence
    #Finally, pad the end then cut to size
    readseq = readseq + (conlength-len(readseq)) * '*'
    readseq = readseq[:conlength]
    align.add_sequence(readn+1, readseq)
  condata = conname, align
  alignments.append(condata)

-- 
PhD Student
Allan Wilson Centre 
Department of Zoology
University of Otago, PO Box 56, Dunedin 9054

p: +64-3-4798459
m: +64-27-3326815 
e: winda002 at student.otago.ac.nz


From winda002 at student.otago.ac.nz  Wed Mar  4 02:40:08 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 04 Mar 2009 15:40:08 +1300
Subject: [BioPython] ACE contig to alignment (found my error)
In-Reply-To: <49ADA938.80408@student.otago.ac.nz>
References: <49ADA938.80408@student.otago.ac.nz>
Message-ID: <49ADEA08.8070400@student.otago.ac.nz>

Hi again all,

After digging around a little more I realised the dumb mistake I made. 
In case anyone was interested and to prevent future suffering by getting 
the answer onto Google:

The code as written is adding the entirety of each read to the alignment 
but when the assembly was made some reads were clipped on either side 
for quality. Including the low quality bases from each read makes some 
of the alignments nasty.  In my case "contig.reads[readn].qa" contains 
the start and end clipping points needed to get just the 'good' bases of 
each read into the alignment.
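
A sketch of that fix, written as a plain function so it can be tried
without an ACE file. It assumes 1-based, inclusive positions as in the
ACE format, with the arguments taken from contig.af[readn].padded_start,
contig.reads[readn].rd.sequence and the clipping start and end stored
in contig.reads[readn].qa. The function name place_read() and the toy
read below are made up for illustration.

```python
def place_read(sequence, padded_start, clip_start, clip_end,
               contig_length, gap="*"):
    """Return one alignment row holding only the unclipped bases."""
    # Keep just the 'good' part of the read (1-based, inclusive).
    good = sequence[clip_start - 1:clip_end]
    # 0-based consensus column of the first kept base.
    offset = padded_start + clip_start - 2
    if offset < 0:
        # Kept bases that fall before the consensus are dropped.
        good = good[-offset:]
        offset = 0
    row = gap * offset + good
    # Pad the end, then cut to the consensus length.
    row = row + gap * (contig_length - len(row))
    return row[:contig_length]


# Toy read: lower case marks the low quality ends that get clipped.
row = place_read("ttACGTACgg", padded_start=2, clip_start=3,
                 clip_end=8, contig_length=12)
```

In the original loop this would replace the two readseq branches, with
the result passed to align.add_sequence() as before.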

Cheers,
David



From rodrigo_faccioli at uol.com.br  Thu Mar  5 04:04:07 2009
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Thu, 5 Mar 2009 01:04:07 -0300
Subject: [BioPython] Bio.Entez - Help
Message-ID: <3715adb70903042004h27ac6f03oeb384d3c89777226@mail.gmail.com>

I want to know where I can find examples of using Bio.Entrez. Specifically,
I'm developing a program which takes a protein primary sequence, and I need
to search for its conserved domains and display them to the user.

I'm reading this link
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc64 . However,
I don't understand it very well. I know that I will need to work with the
CDD database.

I made a simple example which is below.

from Bio import Entrez
Entrez.email = "rodrigo.faccioli at gmail.com" # Always tell NCBI who you are
handle = Entrez.esearch(db="cdd",
term="TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN")
record = Entrez.read(handle)
print(record["IdList"])

Thanks for any help.


-- 
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218


From biopython at maubp.freeserve.co.uk  Thu Mar  5 10:42:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 10:42:02 +0000
Subject: [BioPython] Bio.Entez - Help
In-Reply-To: <3715adb70903042004h27ac6f03oeb384d3c89777226@mail.gmail.com>
References: <3715adb70903042004h27ac6f03oeb384d3c89777226@mail.gmail.com>
Message-ID: <320fb6e00903050242v63a2f38cgc6eddfa3819814e4@mail.gmail.com>

On Thu, Mar 5, 2009 at 4:04 AM, Rodrigo faccioli
<rodrigo_faccioli at uol.com.br> wrote:
> I want to know where I can find examples about Bio.Entez. Specifically, I'm
> developing a program which has a protein primary sequence and I need to
> search its conserved domain and read it to show for user.
>
> I'm reading this link
> http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc64 . However,
> I'm not understanding very well. I know that I will work with CDD database.

The CDD database is one of several protein motif databases the NCBI
make available for use with their tool RPS-BLAST.  CDD is a composite
database which includes domains from PFAM, SMART, KOG etc.

Have a look at  http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
with your example and you'll get a hit to pfam00321.

It sounds like what you want is a script which runs RPS-BLAST using
your query protein against the CDD motif database.

You can run BLASTN, BLASTP etc online at the NCBI using a script, but
as far as I know, the NCBI do not make RPS-BLAST (or PSI-BLAST)
available in this way.  I haven't checked this in recent months.

However, I have done this task myself using standalone BLAST installed on
my computer, i.e. the tool rpsblast from the NCBI.  You'll also need
to install the databases (which are big - you'll need plenty of disk
space and RAM).  Once this is installed and working, you can call rpsblast
from Biopython using the Bio.Blast.NCBIStandalone.rpsblast(...)
function.

> I made a simple example which is below.
>
> from Bio import Entrez
> Entrez.email = "rodrigo.faccioli at gmail.com" # Always tell NCBI who you are
> handle = Entrez.esearch(db="cdd",
> term="TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN")
> record = Entrez.read(handle)
> print record["IdList"]
>
> Thanks for any helps.

I think if you use Entrez to access the CDD database, you can just
access the domains themselves (using their names - not searching by
sequence), e.g.

>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here at example.com"
>>> handle = Entrez.esearch(db="cdd", term="pfam00321", retmode="XML")
>>> record = Entrez.read(handle)
>>> print(record["IdList"])
['109381']

You can check this ID works via their website:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=109381

I've tried a few variations but efetch doesn't seem to support the CDD
database (yet).

Peter


From biopython at maubp.freeserve.co.uk  Thu Mar  5 12:26:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 12:26:13 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
Message-ID: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>

Hi All,

As the following examples show, and the python string method's
docstring clearly states, the python string's count method uses a
non-overlapping search:

>>> "AAA".count("A")
3
>>> "AAA".count("AA") # you might expect 2
1
>>> "BBBB".count("BB") # you might expect 3
2

Up until Biopython 1.44, the Seq object's count method only worked for
single characters.  From Biopython 1.45 onwards it accepted longer
strings and followed the built in python string count behaviour.
However, as Noel pointed out on Bug 2779 our docstring does not make
it clear that this does a non-overlapping search.  In fact, as
Leighton suggests, one might expect the Seq object's count method to
use an overlapping search.
http://bugzilla.open-bio.org/show_bug.cgi?id=2779
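
For comparison, an overlapping count can be sketched on plain Python strings
with a loop over str.find. This is only an illustration of the two
behaviours; count_overlapping is a made-up helper name, not an existing
Biopython function:

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing matches to overlap."""
    if not sub:
        raise ValueError("empty search string")
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Resume one position after the last hit rather than after its end
        start = text.find(sub, start + 1)
    return count


print(count_overlapping("AAA", "AA"))   # 2
print(count_overlapping("BBBB", "BB"))  # 3
print("BBBB".count("BB"))               # 2 (built-in, non-overlapping)
```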

We should either:

(a) stick with the python string compatible behaviour (which has been
a general principle for the Seq class), but document this issue more
clearly as a non-overlapping search does run counter to some potential
biological uses.

or,

(b) change the behaviour as Leighton suggests to do an overlapping
search.  This could break any code relying on the old python
string-like behaviour.

What do people here think?  Any preferences?

[I don't want to get into details about the implementation here on the
main list]

Peter


From baoilleach at gmail.com  Thu Mar  5 13:11:31 2009
From: baoilleach at gmail.com (Noel O'Boyle)
Date: Thu, 5 Mar 2009 13:11:31 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
Message-ID: <a882e48b0903050511u2495a27eu8fcd619e9a1a94ff@mail.gmail.com>

+1 for (b)

Seq.count() should behave like a biological sequence.

Here's an example in the wild of this type of analysis:
http://www.computational-genomics.net/case_studies/haemophilus_demo.html#14

It's from a bioinformatics textbook with example code in Matlab. I was
helping a colleague who was trying to reproduce the analysis with
BioPython. Everything was fine until the dimer frequencies were found
to disagree. After implementing the count ourselves, we were able to
reproduce the results. It was then we realised that BioPython was
behaving in an unexpected and non-useful way.
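
A rough sketch of the kind of dimer count described above, using a sliding
window of width two so that every position is considered. The function name
dimer_counts is made up for this example, it works on a plain string rather
than a Seq object, and it needs a Python version that provides
collections.Counter:

```python
from collections import Counter


def dimer_counts(sequence):
    """Count every dinucleotide, stepping one base at a time (overlapping)."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))


counts = dimer_counts("AAAGC")
print(counts["AA"])          # 2, whereas "AAAGC".count("AA") gives 1
print(sum(counts.values()))  # 4, i.e. len(sequence) - 1 windows
```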

- Noel


From biopython at maubp.freeserve.co.uk  Thu Mar  5 13:26:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 13:26:10 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <a882e48b0903050511u2495a27eu8fcd619e9a1a94ff@mail.gmail.com>
References: <a882e48b0903050511u2495a27eu8fcd619e9a1a94ff@mail.gmail.com>
Message-ID: <320fb6e00903050526r688eadcfv440602c32d294ee8@mail.gmail.com>

On Thu, Mar 5, 2009 at 1:11 PM, Noel O'Boyle <baoilleach at gmail.com> wrote:
> +1 for (b)
>
> Seq.count() should behave like a biological sequence.
>
> Here's an example in the wild of this type of analysis:
> http://www.computational-genomics.net/case_studies/haemophilus_demo.html#14
>
> It's from a bioinformatics textbook with example code in Matlab. I was
> helping a colleague who was trying to reproduce the analysis with
> BioPython. Everything was fine until the dimer frequencies were found
> to disagree. After implementing the count ourselves, we were able to
> reproduce the results. It was then we realised that BioPython was
> behaving in an unexpected and non-useful way.

I agree that in this context it is not useful to have the Seq object's
count do a non-overlapping search.

However, calling it "unexpected" is debatable, and probably
depends on the user's background.  If you already know
Python before using Biopython, I would argue that the non-overlapping
search is expected because that is what python strings do.  On the
other hand, I'm sure many Biopython users learn Python and Biopython
together - and one might still argue having strings and Seq objects do
different things is unexpected.

Overall between options (a) and (b), I'd pick consistency with the
python string (a), even if it isn't ideal.

There is another idea, let's call this option (c).  Give the Seq
object's count method an optional boolean argument to enable an
overlapping search (which I would want to default to matching the
python string behaviour).  This makes switching between string and Seq
objects easier, and makes the more useful (but probably slower)
overlap aware count option quite accessible and discoverable.

Peter


From bartek at rezolwenta.eu.org  Thu Mar  5 13:28:14 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 5 Mar 2009 14:28:14 +0100
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
Message-ID: <8b34ec180903050528m7a3815c8l3048046e42f0ce00@mail.gmail.com>

On Thu, Mar 5, 2009 at 1:26 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> (a) stick with the python string compatible behaviour (which has been
> a general principle for the Seq class), but document this issue more
> clearly as a non-overlapping search does run counter to some potential
> biological uses.
>
> or,
>
> (b) Or change the behaviour as Leighton suggests to do an overlapping
> search.  This could break any code relying on the old python
> string-like behaviour.
>
> What do people here think?  Any preferences?
>
> [I don't want to get into details about the implementation here on the
> main list]
>

I don't use the count method much, so I don't have a strong opinion on that.

As Leighton pointed out, searching for sequences looks like a good
job for Bio.Motif.

It's currently doable, but (since Bio.Motif mostly deals with more
complex motifs than a single sequence) the interface is not polished
and it's not optimized for performance.

Currently the code to do this would look like this:

import Bio.Motif
from Bio.Seq import Seq

m = Bio.Motif.Motif()
m.add_instance(Seq("GG", m.alphabet))
for i in m.search_instances(your_long_sequence):
    print "found GG at position", i

If there is a need to keep backwards compatibility for .count(), I can
make changes to Bio.Motif to make it easier for people to use it.

-- 
Bartek


From lpritc at scri.ac.uk  Thu Mar  5 13:34:03 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Thu, 05 Mar 2009 13:34:03 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
Message-ID: <C5D5854B.1E785%lpritc@scri.ac.uk>

Hi,

On 05/03/2009 12:26, "Peter" <biopython at maubp.freeserve.co.uk> wrote:

> We should either:
> 
> (a) stick with the python string compatible behaviour (which has been
> a general principle for the Seq class), but document this issue more
> clearly as a non-overlapping search does run counter to some potential
> biological uses.
> 
> or,
> 
> (b) Or change the behaviour as Leighton suggests to do an overlapping
> search.  This could break any code relying on the old python
> string-like behaviour.
> 
> What do people here think?  Any preferences?

Not surprisingly, I favour (b).

The intended domain of use for Seq is as a proxy for a biological entity and
I think that, just as we extend methods to reflect useful
biologically-themed operations, we should also override methods as
appropriate to reflect those same themes.

I can think of a number of run-of-the-mill use cases where we would want to
know about the count of (potentially) overlapping matches of a subsequence
in a biological sequence, for short sequence repeats (SSRs), restriction
sites, protein sequence motifs, and so on.  Also, if we want simply to test
the expected number of occurrences of the dimer 'AA' in a larger sequence
with a given base composition, a non-overlapping count() method will give a
misleading answer, as it will underreport occurrences of 'AA' in any run
of three or more consecutive 'A's.  I think that the overlapping approach (b) should
at least be a default setting, even if we choose to make overlap/non-overlap
an argument to the method.

For some searches that potentially could have overlaps we might want to know
what biological question is being asked before choosing which approach to
take.  We may, for example, desire different behaviour from query sequences
like 'AGCCAG' depending on circumstances.  This query on 'AGCCAGCCAG' will
return 1 if no overlap is allowed, and 2 if an overlap is allowed.
The same query on 'AGCCAGAGCCAG' will return 2 in both cases.  If we care
about 'AGCCAG' as a restriction site, then we would want an overlapping
search.  If we care about 'AGCCAG' as a simple repeat unit, then we might
want a non-overlapping search instead (assuming that the circumstances of
the search are such that this is a sensible answer).  Having the option
might be useful.
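
The figures quoted above can be checked with a short sketch (Python 3
syntax) that uses a regular expression lookahead for the overlapping case.
The helper count_sites and its overlap argument are invented for this
illustration and are not part of Biopython:

```python
import re


def count_sites(sequence, site, overlap=True):
    """Count literal occurrences of site, with or without overlapping hits."""
    pattern = re.escape(site)
    if overlap:
        # A zero-width lookahead matches at every start position
        pattern = "(?=" + pattern + ")"
    return len(re.findall(pattern, sequence))


for seq in ("AGCCAGCCAG", "AGCCAGAGCCAG"):
    print(seq, count_sites(seq, "AGCCAG", overlap=False), count_sites(seq, "AGCCAG"))
```

The first sequence gives 1 without overlap and 2 with it; the second gives 2
either way.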

A non-overlapping search might also be useful in those cases where existing
code already corrects for nonintuitive behaviour of count().  This is only
going to apply to code that has been produced since release 1.45, so may
only have limited impact, if any.  I would argue that, since a correction
was needed, by parsimony the original behaviour was probably what required
the change.

On the whole, I think that an overlapping count() is the most intuitive and
most likely use case.  I see that there's an argument for consistency with
string.count(), in that dyed-in-the-wool programmers might find it hard to
shift mental gears from one to the other, but I'm not sure that it's a good
argument, for the following reason.

The following statements are true:

A String is a Python sequence type.  Its count() method returns a
non-overlapping count of the query substring.

A List is a Python sequence type.  Its count() method returns the number of
elements that match the query.

A Tuple is a Python sequence type.  Before Python 2.6 it had no count()
method at all, although you might imagine that it could stand to have one.

There isn't any cross-sequence object consistency regarding count().  Should
we choose String-like or List-like behaviour when dealing with a MutableSeq?
I don't think that we should seek consistency with String at the expense of
utility or biological intuition, when:

A Seq/MutableSeq is a (Bio)Python sequence type.  Its count() method returns
the overlapping count of the query substring.

Fits nicely with the other three statements, in that none of them are
consistent with any other ;)

L.


-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405




From mjldehoon at yahoo.com  Thu Mar  5 14:49:10 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 5 Mar 2009 06:49:10 -0800 (PST)
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
Message-ID: <418103.38901.qm@web62405.mail.re1.yahoo.com>


I vote (b).
Another option is to continue to use count() for a Python-style count, and to add a new method that does an overlapping count. For this new method we'd need a clear but short name, and I can't think of anything now.

--Michiel.




From biopython at maubp.freeserve.co.uk  Thu Mar  5 15:05:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 15:05:39 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <418103.38901.qm@web62405.mail.re1.yahoo.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
	<418103.38901.qm@web62405.mail.re1.yahoo.com>
Message-ID: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>

On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
>
> I vote (b).
> Another option is to continue to use count() for a Python-style count,
> and to add a new method that does a overlapping-type count. For this
> new method we'd need a clear but short name, and I can't think of
> anything now.
>
> --Michiel.

Did you like plan (c), which preserves the Python string style count
as the default but offers the overlapping count via an optional
argument?

i.e.
>>> from Bio.Seq import Seq
>>> nuc = Seq("AAAA")
>>> nuc.count("AA") #default is non-overlapping
2
>>> nuc.count("AA", overlap=True)
3
>>> nuc.count("AA", overlap=False)
2
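
One way the proposed behaviour could be sketched as a standalone function
(this is not the Seq method itself; seq_count is a hypothetical name, and
the overlap keyword simply follows the proposal above):

```python
def seq_count(seq, sub, overlap=False):
    """Count sub in seq; overlap=False mirrors the built-in str.count."""
    seq, sub = str(seq), str(sub)
    if not overlap:
        return seq.count(sub)
    # Test every possible start position, so hits may share bases
    return sum(seq.startswith(sub, i) for i in range(len(seq) - len(sub) + 1))


print(seq_count("AAAA", "AA"))                # 2
print(seq_count("AAAA", "AA", overlap=True))  # 3
```

Defaulting to overlap=False means existing code that relies on the
string-like behaviour would see no change.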

Peter


From dalloliogm at gmail.com  Thu Mar  5 15:10:59 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 5 Mar 2009 16:10:59 +0100
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
	<418103.38901.qm@web62405.mail.re1.yahoo.com>
	<320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <5aa3b3570903050710hb407258k6fca86cf1bf9520f@mail.gmail.com>

On Thu, Mar 5, 2009 at 4:05 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>
>>
>> I vote (b).
>> Another option is to continue to use count() for a Python-style count,
>> and to add a new method that does a overlapping-type count. For this
>> new method we'd need a clear but short name, and I can't think of
>> anything now.
>>
>> --Michiel.
>
> Did you like plan (c), which preserves the Python string style count
> as the default but offers the overlapping count via an optional
> argument?
>
> i.e.
>>>> from Bio.Seq import Seq
>>>> nuc = Seq("AAAA")
>>>> nuc.count("AA") #default is non-overlapping
> 2
>>>> nuc.count("AA", overlap=True)
> 3
>>>> nuc.count("AA", overlap=False)
> 2


Imho this is the best solution. If I may say so, I would expect a .count()
method to act like the method of the same name on python strings.

A good doctest example (similar to the existing one) would be nice, too.




-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it


From baoilleach at gmail.com  Thu Mar  5 15:23:42 2009
From: baoilleach at gmail.com (Noel O'Boyle)
Date: Thu, 5 Mar 2009 15:23:42 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
	<418103.38901.qm@web62405.mail.re1.yahoo.com>
	<320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <a882e48b0903050723m9532fd5lb81dd26e85925c62@mail.gmail.com>

2009/3/5 Peter <biopython at maubp.freeserve.co.uk>:
> On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>
>>
>> I vote (b).
>> Another option is to continue to use count() for a Python-style count,
>> and to add a new method that does a overlapping-type count. For this
>> new method we'd need a clear but short name, and I can't think of
>> anything now.
>>
>> --Michiel.
>
> Did you like plan (c), which preserves the Python string style count
> as the default but offers the overlapping count via an optional
> argument?
>
> i.e.
>>>> from Bio.Seq import Seq
>>>> nuc = Seq("AAAA")
>>>> nuc.count("AA") #default is non-overlapping
> 2
>>>> nuc.count("AA", overlap=True)
> 3
>>>> nuc.count("AA", overlap=False)
> 2
>
> Peter

I think we are arguing here over which should be the default value.

Several people here believe that behaviour analogous to Python's
string.count will reduce bug reports and user confusion. However,
no-one except Leighton has been able to come up with a single use case
where the current behaviour is useful (and even that example, with
respect, was flimsy). So we end up with a method which adheres
magnificently to the principle of least surprise, but which is of no
use to users. Aren't you trying to provide methods which are useful
for biological analysis? Isn't that the purpose of wrapping the string
in the first place?

Noel (getting far too excited over painting this bikeshed)


From bsouthey at gmail.com  Thu Mar  5 16:28:11 2009
From: bsouthey at gmail.com (Bruce Southey)
Date: Thu, 5 Mar 2009 10:28:11 -0600
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <a882e48b0903050723m9532fd5lb81dd26e85925c62@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
	<418103.38901.qm@web62405.mail.re1.yahoo.com>
	<320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
	<a882e48b0903050723m9532fd5lb81dd26e85925c62@mail.gmail.com>
Message-ID: <bbcd77d00903050828y5cd3ac34n2036375505439aa9@mail.gmail.com>

Hi,
This is a little deja vu, as I feel this type of thing has come up
before. While I cannot speak for anyone else, if I sound different from
last time, then I was obviously convinced by those arguments, as that
sounds better than saying I forgot :-)

More seriously, ignoring the reading frame or the genetic code when
counting is rather bad form!

I can not think of a relevant case involving a protein sequence -
although counting pairs of cysteines in insulin-like sequences could
be a situation of importance (related to disulphide bonds).

An example for nucleic sequences: counting 'TTT' in the made-up
sequence 'TTTTTTTGG' can give two in frames 1 and 2 but only one in
frame 3.

Also, a weaker concern is that a sum of counts greater than or
equal to the length of the sequence is not a desirable property unless
the user is informed that duplicates were found.
In the above case, five sounds rather wrong when one says that a DNA
sequence of nine DNA bases can produce five phenylalanines!

Yes, context is everything because 3 different results is not nice.

Don't get me wrong, I know that finding duplicates is important, just
that it should not be here - there must be different functions.

Thus, I vote for (a) and I also prefer that the default syntax is
consistent with the Python language.

If this change is done, then all of Biopython must be revised to be
consistent - like reading frames and similar discussion...

Bruce




From biopython at maubp.freeserve.co.uk  Thu Mar  5 16:34:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 16:34:37 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <bbcd77d00903050828y5cd3ac34n2036375505439aa9@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
	<418103.38901.qm@web62405.mail.re1.yahoo.com>
	<320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
	<a882e48b0903050723m9532fd5lb81dd26e85925c62@mail.gmail.com>
	<bbcd77d00903050828y5cd3ac34n2036375505439aa9@mail.gmail.com>
Message-ID: <320fb6e00903050834i32bd8d64w672e53b6ef1dbf56@mail.gmail.com>

On Thu, Mar 5, 2009 at 4:28 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> Hi,
> This is a little deja vu as I feel this type of thing has come up
> before. While I can not speak for anyone else, if I sound different to
> that, then I was obviously convinced by those arguments as  that
> sounds better than I forgot :-)
>
> More seriously, ignoring the reading fame or the genetic code when
> counting is rather bad form!

Why?  In many situations they are irrelevant.  Consider counting
restriction enzyme digest sites for example, plus of course counting in
any protein sequence.

> I can not think of a relevant case involving a protein sequence -
> although counting pairs of cysteines in insulin-like sequences could
> be a situation of importance (related to disulphide bonds).
>
> An example for nucleic sequences, counting 'TTT' in the madeup
> sequence  'TTTTTTTGG' can be two in frames 1 and 2 but only one in
> frame 3.

Giving an answer of 2 (using a non overlapping search like the python
string method) or 5 (using an overlapping search) are valid expected
outcomes for "TTT" in "TTTTTTTGG".

Here you seem to want to count codons - which is by its nature a frame
dependent task.
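
To make the distinction concrete, here is a small sketch (Python 3 syntax)
of a frame-aware codon count on a plain string. The function count_codon is
hypothetical, written just for this example, and frames are numbered 1 to 3
on the forward strand only:

```python
def count_codon(sequence, codon, frame):
    """Count codon in one reading frame (frame is 1, 2 or 3)."""
    offset = frame - 1
    # Step through complete codons only; a trailing partial codon is ignored
    codons = [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]
    return codons.count(codon)


for frame in (1, 2, 3):
    print(frame, count_codon("TTTTTTTGG", "TTT", frame))
```

This gives 2, 2 and 1 for frames 1, 2 and 3, alongside the frame-independent
answers of 2 (non-overlapping) and 5 (overlapping) mentioned above.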

Peter


From biopython at maubp.freeserve.co.uk  Thu Mar  5 16:35:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 16:35:10 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <a882e48b0903050723m9532fd5lb81dd26e85925c62@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
	<418103.38901.qm@web62405.mail.re1.yahoo.com>
	<320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
	<a882e48b0903050723m9532fd5lb81dd26e85925c62@mail.gmail.com>
Message-ID: <320fb6e00903050835h2c548083jda67b5f50fcfc842@mail.gmail.com>

On Thu, Mar 5, 2009 at 3:23 PM, Noel O'Boyle <baoilleach at gmail.com> wrote:
> I think we are arguing here over which should be the default value.
>
> Several people here believe that behaviour analogous to Python's
> string.count will reduce bug reports and user confusion. However,
> no-one except Leighton has been able to come up with a single use case
> where the current behaviour is useful (and even that example, with
> respect, was flimsy). So we end up with a method which adheres
> magnificently to the principle of least surprise, but which is of no
> use to users. Aren't you trying to provide methods which are useful
> for biological analysis? Isn't that the purpose of wrapping the string
> in the first place?
>
> Noel (getting far too excited over painting this bikeshed)

If we hadn't been shipping Biopython with the old non-overlapping
python-string-like count method for the last year, I would probably
have been more willing to agree that the Seq count method could differ
from the python string and use an overlapping search.  However,
changing it now also breaks backwards compatibility, which shouldn't
be done lightly.  We could still do this (implementation discussion on
the dev list or on Bug 2779), but we would have to make this change
very clear in the release notes.

Peter


From mjldehoon at yahoo.com  Fri Mar  6 11:52:58 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 6 Mar 2009 03:52:58 -0800 (PST)
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <791065.98994.qm@web62403.mail.re1.yahoo.com>


> > Another option is to continue to use count() for a Python-style count,
> > and to add a new method that does an overlapping-type count. For this
> > new method we'd need a clear but short name, and I can't think of
> > anything now.
> >
> Did you like plan (c), which preserves the Python string style count
> as the default but offers the overlapping count via an optional
> argument?
> 
It's also OK, but if we use a different method name we can leave count() untouched altogether.

--Michiel.


From biopython at maubp.freeserve.co.uk  Fri Mar  6 12:07:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 12:07:57 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <791065.98994.qm@web62403.mail.re1.yahoo.com>
References: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
	<791065.98994.qm@web62403.mail.re1.yahoo.com>
Message-ID: <320fb6e00903060407u7383545fp80fc8b81899a33a7@mail.gmail.com>

On Fri, Mar 6, 2009 at 11:52 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> > Another option is to continue to use count() for a Python-style count,
>> > and to add a new method that does an overlapping-type count. For this
>> > new method we'd need a clear but short name, and I can't think of
>> > anything now.
>>
>> Did you like plan (c), which preserves the Python string style count
>> as the default but offers the overlapping count via an optional
>> argument?
>
> It's also OK, but if we use a different method name we can leave count() untouched altogether.

Looking back, Sebastian Bassi raised this issue back in 2003 on this
mailing list, and his overlap-aware-count implementation is used
internally by Bio.SeqUtils.MeltingTemp, see:
http://lists.open-bio.org/pipermail/biopython/2003-November/001741.html
http://lists.open-bio.org/pipermail/biopython/2003-November/001742.html
etc

Sebastian also posted an enhancement request for adding an overlap
aware counting method to the python base string, with "overcount" as a
possible name.   I don't know what happened to his bug report, it
seems to have been marked private:
http://mail.python.org/pipermail/python-bugs-list/2003-November/021239.html

I don't really like the name "overcount", but as another suggestion
how about "count_ol" which is short for count-with-overlaps?

Peter


From lpritc at scri.ac.uk  Fri Mar  6 12:15:59 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 06 Mar 2009 12:15:59 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <791065.98994.qm@web62403.mail.re1.yahoo.com>
Message-ID: <C5D6C47F.1E880%lpritc@scri.ac.uk>

In the spirit of being blindingly obvious, how about:

Seq.overlapping_count()

;)

L.


On 06/03/2009 11:52, "Michiel de Hoon" <mjldehoon at yahoo.com> wrote:

> 
>>> Another option is to continue to use count() for a Python-style count,
>>> and to add a new method that does an overlapping-type count. For this
>>> new method we'd need a clear but short name, and I can't think of
>>> anything now.
>>> 
>> Did you like plan (c), which preserves the Python string style count
>> as the default but offers the overlapping count via an optional
>> argument?
>> 
> It's also OK, but if we use a different method name we can leave count()
> untouched altogether.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views
expressed by the sender are not necessarily the views of SCRI and its
subsidiaries.  This email and any files transmitted with it are
confidential to the intended recipient at the e-mail address to which
it has been addressed.  It may not be disclosed or used by any other
than that addressee.  If you are not the intended recipient you are
requested to preserve this confidentiality and you must not use,
disclose, copy, print or rely on this e-mail in any way. Please notify
postmaster at scri.ac.uk quoting the name of the sender and delete the
email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________


From chapmanb at 50mail.com  Fri Mar  6 13:14:04 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 6 Mar 2009 08:14:04 -0500
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <C5D6C47F.1E880%lpritc@scri.ac.uk>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
	<C5D6C47F.1E880%lpritc@scri.ac.uk>
Message-ID: <20090306131404.GJ69627@sobchak.mgh.harvard.edu>

Hey all;
Great discussion on this. My preference is for a new function,
and I like Leighton's naming suggestion.

Also, unless someone has a use case for the current count()
function, we should deprecate and eventually remove it. Overriding
the string API where it makes sense is good, but here it seems to be
creating confusion and not solving a problem. If someone needs the
real string count, they can always do str(your_seq).count("GG").

Brad

> In the spirit of being blindingly obvious, how about:
> 
> Seq.overlapping_count()
> 
> ;)
> 
> L.
> 
> 
> On 06/03/2009 11:52, "Michiel de Hoon" <mjldehoon at yahoo.com> wrote:
> 
> > 
> >>> Another option is to continue to use count() for a Python-style count,
> >>> and to add a new method that does an overlapping-type count. For this
> >>> new method we'd need a clear but short name, and I can't think of
> >>> anything now.
> >>> 
> >> Did you like plan (c), which preserves the Python string style count
> >> as the default but offers the overlapping count via an optional
> >> argument?
> >> 
> > It's also OK, but if we use a different method name we can leave count()
> > untouched altogether.
> 


From biopython at maubp.freeserve.co.uk  Fri Mar  6 14:13:42 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 14:13:42 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <20090306131404.GJ69627@sobchak.mgh.harvard.edu>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
	<C5D6C47F.1E880%lpritc@scri.ac.uk>
	<20090306131404.GJ69627@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>

On Fri, Mar 6, 2009 at 1:14 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hey all;
> Great discussion on this. My preference is for a new function,
> and I like Leighton's naming suggestion.

Yes, "overlapping_count" is a reasonable choice.  It's a bit long, but
it is clear.

> Also, unless someone has a use case for the current count()
> function, we should deprecate and eventually remove it. Overriding
> the string API where it makes sense is good, but here it seems to be
> creating confusion and not solving a problem. If someone needs the
> real string count, they can always do str(your_seq).count("GG").

There is the very common use case of my_seq.count("A"), or similar,
with single character search strings, and lots of code does this (both
in Biopython and, I'm sure, users' scripts).  For single letters of
course, a non-overlapping count and an overlapping count do the same
thing - deprecating the count method would cause a lot of unnecessary
upheaval.

Ignoring that, given we want the Seq to generally behave like a python
string, I think removing the count method would still be a bad idea.

[As a compromise, assuming we add an overlapping_count method and do a
Biopython 1.50 beta release, the beta release could include a warning
in the count method when used with a multi-character search string,
suggesting the user might in fact need an overlapping count.  Or is
this a bit too crazy?]
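
As a rough sketch of what such a warning could look like (an
illustration only: CountWarningSeq is a made-up stand-in built on str,
not Biopython's Seq class, and the message wording is an assumption):

```python
import warnings


class CountWarningSeq(str):
    """Hypothetical stand-in for a sequence class; not Biopython's Seq."""

    def count(self, sub, *args):
        # Single letters give the same answer either way, so only warn
        # for multi-character search strings.
        if len(sub) > 1:
            warnings.warn(
                "count() is non-overlapping; an overlapping count "
                "may be what you need for multi-letter searches",
                UserWarning,
                stacklevel=2,
            )
        return str.count(self, sub, *args)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    single = CountWarningSeq("GATTACA").count("A")  # no warning issued
    multi = CountWarningSeq("AAAA").count("AA")     # one warning issued

print(single, multi, len(caught))
```

The return values are unchanged from str.count, so existing scripts
would keep working and only see the extra message.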

Peter


From bsouthey at gmail.com  Fri Mar  6 15:06:07 2009
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 06 Mar 2009 09:06:07 -0600
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>	<C5D6C47F.1E880%lpritc@scri.ac.uk>	<20090306131404.GJ69627@sobchak.mgh.harvard.edu>
	<320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
Message-ID: <49B13BDF.9030908@gmail.com>

Peter wrote:
> On Fri, Mar 6, 2009 at 1:14 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>   
>> Hey all;
>> Great discussion on this. My preference is for a new function,
>> and I like Leighton's naming suggestion.
>>     
>
> Yes, "overlapping_count" is a reasonable choice.  It's a bit long, but
> it is clear.
>
>   
>> Also, unless someone has a use case for the current count()
>> function, we should deprecate and eventually remove it. Overriding
>> the string API where it makes sense is good, but here it seems to be
>> creating confusion and not solving a problem. If someone needs the
>> real string count, they can always do str(your_seq).count("GG").
>>     
I have already given one use case where overlapping counts are totally
inappropriate! Unique codon counting is extremely important in many
areas including gene prediction (possible splicing sites) and molecular
evolution (like codon usage).

Another valid case given was DNA restriction sites, where you may want
both overlapping and unique counts. For example, DNA may be digested by
one enzyme that has unique sites in the sequence, followed by a second
enzyme that has unique sites in the digested product but possibly
duplicates in the original sequence.

I just do not understand your logic of requiring a conversion when the
Seq object is designed to 'behave like a python string'.

>
> There is the very common use case of my_seq.count("A"), or similar,
> with single character search strings, and lots of code does this (both
> in Biopython and, I'm sure, users' scripts).  For single letters of
> course, a non-overlapping count and an overlapping count do the same
> thing - deprecating the count method would cause a lot of unnecessary
> upheaval.
>
> Ignoring that, given we want the Seq to generally behave like a python
> string, I think removing the count method would still be a bad idea.
>   
I agree.
> [As a compromise, assuming we add an overlapping_count method and do a
> Biopython 1.50 beta release, the beta release could include a warning
> in the count method when used with a multi-character search string,
> suggesting the user might in fact need an overlapping count.  Or is
> this a bit too crazy?]
>   
Yes it is too crazy and does not fit into the current established 
behavior of Biopython.

Bruce


From biopython at maubp.freeserve.co.uk  Fri Mar  6 15:15:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 15:15:24 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <49B13BDF.9030908@gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
	<C5D6C47F.1E880%lpritc@scri.ac.uk>
	<20090306131404.GJ69627@sobchak.mgh.harvard.edu>
	<320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
	<49B13BDF.9030908@gmail.com>
Message-ID: <320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>

On Fri, Mar 6, 2009 at 3:06 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> I have already given one use case where overlapping counts are totally
> inappropriate! Unique codon counting is extremely important in many areas
> including gene prediction (possible splicing sites) and molecular evolution
> (like codon usage).

For codon counting NEITHER the current non-overlapping count nor the
suggested overlapping count would be suitable.  So this doesn't really
affect the overlapping versus non-overlapping debate.

Peter


From bsouthey at gmail.com  Fri Mar  6 15:34:42 2009
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 06 Mar 2009 09:34:42 -0600
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>	
	<C5D6C47F.1E880%lpritc@scri.ac.uk>	
	<20090306131404.GJ69627@sobchak.mgh.harvard.edu>	
	<320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>	
	<49B13BDF.9030908@gmail.com>
	<320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>
Message-ID: <49B14292.6080806@gmail.com>

Peter wrote:
> On Fri, Mar 6, 2009 at 3:06 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>   
>> I have already given one use case where overlapping counts are totally
>> inappropriate! Unique codon counting is extremely important in many areas
>> including gene prediction (possible splicing sites) and molecular evolution
>> (like codon usage).
>>     
>
> For codon counting NEITHER the current non-overlapping count nor the
> suggested overlapping count would be suitable.  So this doesn't really
> affect the overlapping versus non-overlapping debate.
>
> Peter
>   
With due respect, this does not make any sense.

If it is a cDNA then I can count, say, the different lysine codons to
find any usage bias using
seq.count('AAA') / (seq.count('AAA') + seq.count('AAG')).
(Actually I am more interested in the occurrence of specific multiple
codons than single codons.)
If you want the forward frames then just use seq[0:].count('AAA'),
seq[1:].count('AAA') and seq[2:].count('AAA') for frames 1, 2, and 3,
respectively.

As you pointed out single characters are not relevant so what is relevant?

Bruce


From biopython at maubp.freeserve.co.uk  Fri Mar  6 15:46:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 15:46:19 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <49B14292.6080806@gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
	<C5D6C47F.1E880%lpritc@scri.ac.uk>
	<20090306131404.GJ69627@sobchak.mgh.harvard.edu>
	<320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
	<49B13BDF.9030908@gmail.com>
	<320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>
	<49B14292.6080806@gmail.com>
Message-ID: <320fb6e00903060746r309216e7t36d00434993a8cfb@mail.gmail.com>

On Fri, Mar 6, 2009 at 3:34 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>>
>> For codon counting NEITHER the current non-overlapping count nor the
>> suggested overlapping count would be suitable.  So this doesn't really
>> affect the overlapping versus non-overlapping debate.
>>
>> Peter
>
> With due respect, this does not make any sense.
>
> If it is a cDNA then I can count say the different Lysine codons to find any
> usage bias using seq.count('AAA') / (seq.count('AAA') + seq.count('AAG')).
> (Actually I am more interested in the occurrence of specific multiple codons
> than single codons.)

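If you have the (short) CDS "TAAAAAAAAAAG", which splits in frame into
the codons TAA AAA AAA AAG (strictly TAA is a stop codon rather than
leucine, but it will do for illustration), then the codon count for
"AAA" is 2 and the codon count for "AAG" is 1.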

Using the (standard python) non overlapping count method,
"TAAAAAAAAAAG".count("AAA") = 3 and "TAAAAAAAAAAG".count("AAG") = 1
which does not do what you want.

Using a hypothetical overlapping count method,
"TAAAAAAAAAAG".overlapping_count("AAA") = 8 and
"TAAAAAAAAAAG".overlapping_count("AAG") = 1 which does not do what you
want.

i.e. As I said, for codon counting NEITHER the current non-overlapping
count nor the suggested overlapping count would be suitable.

You seem to be asking for something different - a codon counting
method, which is a special case of a non-overlapping count.
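
A frame-aware count along those lines might look like the sketch below.
The name codon_count and its frame argument are assumptions made for
this example; nothing like it is being claimed to exist in Biopython.

```python
def codon_count(seq, codon, frame=0):
    """Count codon in seq, looking only at triplets that start at
    offsets frame, frame + 3, frame + 6, ...

    Hypothetical helper for illustration only.
    """
    return sum(
        1
        for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] == codon
    )


cds = "TAAAAAAAAAAG"
print(cds.count("AAA"))         # 3, plain non-overlapping string count
print(codon_count(cds, "AAA"))  # 2, in-frame codons only
print(codon_count(cds, "AAG"))  # 1
```

Slicing first, as in cds[1:].count("AAA"), only moves the starting
point of the string search; it does not restrict matches to codon
boundaries, which is why an explicit step of three is used here.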

Peter


From lpritc at scri.ac.uk  Fri Mar  6 15:47:37 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 06 Mar 2009 15:47:37 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <49B13BDF.9030908@gmail.com>
Message-ID: <C5D6F619.1E8B4%lpritc@scri.ac.uk>

On 06/03/2009 15:06, "Bruce Southey" <bsouthey at gmail.com> wrote:

> Peter wrote:
>> On Fri, Mar 6, 2009 at 1:14 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>>> Also, unless someone has a use case for the current count()
>>> function, we should deprecate and eventually remove it. Overriding
>>> the string API where it makes sense is good, but here it seems to be
>>> creating confusion and not solving a problem. If someone needs the
>>> real string count, they can always do str(your_seq).count("GG").
>>>     
> I have already given one use case where overlapping counts are totally
> inappropriate! Unique codon counting is extremely important in many
> areas including gene prediction (possible splicing sites) and molecular
> evolution (like codon usage).

We're not discussing codon counting though, we're discussing counting
occurrences of an arbitrary substring in a sequence.  They're not the same
operation, even though they both involve counting.

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405




From chapmanb at 50mail.com  Fri Mar  6 22:46:39 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 6 Mar 2009 17:46:39 -0500
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
	<C5D6C47F.1E880%lpritc@scri.ac.uk>
	<20090306131404.GJ69627@sobchak.mgh.harvard.edu>
	<320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
Message-ID: <20090306224639.GM69627@sobchak.mgh.harvard.edu>

Me:
> > Also, unless someone has a use case for the current count()
> > function, we should deprecate and eventually remove it. Overriding
> > the string API where it makes sense is good, but here it seems to be
> > creating confusion and not solving a problem. If someone needs the
> > real string count, they can always do str(your_seq).count("GG").

Bruce:
> I have already given one use case where overlapping counts are totally
> inappropriate! Unique codon counting

Sorry, I was a bit terse in my previous e-mail. My thought on
deprecation was actually based on your and Noel's emails; both of
you presented cases where you had biological expectations for count
which are not met by the standard string count behaviour. 

For Noel, this is handled by the proposed overlapping_count
function. For your example, I think it would be better handled by
functionality that returned a list of codons, like:

Seq("ATGGAACAT").codon_list(phase=0)
["ATG", "GAA", "CAT"]
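
Since no such method exists yet, here is a minimal standalone sketch of
the idea. The function name and the phase argument simply mirror the
suggestion above and are assumptions, not an agreed API.

```python
def codon_list(seq, phase=0):
    """Split seq into complete codons, starting at offset phase (0, 1 or 2).

    Hypothetical helper; a trailing partial codon is dropped.
    """
    seq = str(seq)
    return [seq[i:i + 3] for i in range(phase, len(seq) - 2, 3)]


print(codon_list("ATGGAACAT"))               # ['ATG', 'GAA', 'CAT']
print(codon_list("ATGGAACAT", phase=1))      # ['TGG', 'AAC']
print(codon_list("ATGGAACAT").count("GAA"))  # 1
```

Once the codons are in a list, the ordinary list count method gives a
frame-aware codon count without any change to the sequence's own count.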

Bruce:
> I just do not understand your logic of requiring a conversion when the 
> Seq object is designed to 'behave like a python string'.

This is representing a biological sequence, so I think where a biologist
user's intuition opposes what a standard python string does we
should look for an option that is more in line with expectations.
My point about the string was just that if you are thinking as a
python programmer and really want python string behavior, it is
pretty easy to get.

Peter:
> There is the very common use case of my_seq.count("A"), or similar,
> with single character search strings, and lots of code does this (both
> in Biopython and, I'm sure, users' scripts).  For single letters of
> course, a non-overlapping count and an overlapping count do the same
> thing - deprecating the count method would cause a lot of unnecessary
> upheaval.

Good point; I totally overlooked that. I retract my suggestion. I do
like your warning idea, but maybe we can get by here with
documentation and by highlighting the alternative functions.
It looked like you're already all over the documentation, so
hopefully the new functionality will fix up any confusion.

Thanks all,
Brad


From chapmanb at 50mail.com  Sun Mar  8 16:29:41 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 8 Mar 2009 12:29:41 -0400
Subject: [BioPython] Initial work on a GFF parser
Message-ID: <20090308162941.GA99653@kunkel>

Hi all;
Generic Feature Format (GFF) is a nice tab delimited file format
that we don't have full support for in Biopython. Michael Hoffman
contributed code to work with GFF MySQL databases (in Bio.GFF), but
we don't have a GFF parser for the flatfiles. Looking back over the
list archives, this has come up a couple of times without a finished
solution being implemented. GFF suffers from the curse of being too easy
to hack together a solution for parsing a very specific problem, while
generating a good standard parser takes more work.

Recently, Peter brought up GFF on the BioSQL mailing list, which
made me interested in digging into GFF as an input and output flat
file format for BioSQL databases. Towards this end I put together an
initial implementation of a GFF (version 3) parser for Biopython. A
write up and the code are here:

http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/

As described in the post, the GFF interface will be a bit different
from the standard SeqIO interface, since GFF stores features
separately from the sequences and also doesn't require features for
a record to be grouped together.

As a result, the interface is up for discussion and the best path is to
start with an implementation and see where it takes us. I'd be grateful
for any feedback and code from those who are interested. We can discuss
on the development mailing list or on the blog, and move towards getting
stable full featured GFF parsing in Biopython.

Brad


From biopython at maubp.freeserve.co.uk  Mon Mar  9 10:14:55 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 9 Mar 2009 10:14:55 +0000
Subject: [BioPython] Initial work on a GFF parser
In-Reply-To: <20090308162941.GA99653@kunkel>
References: <20090308162941.GA99653@kunkel>
Message-ID: <320fb6e00903090314q19d64af2m4e37918fc3f5f164@mail.gmail.com>

On Sun, Mar 8, 2009 at 4:29 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi all;
> Generic Feature Format (GFF) is a nice tab delimited file format
> that we don't have full support for in Biopython. Michael Hoffman
> contributed code to work with GFF MySQL databases (in Bio.GFF), but
> we don't have a GFF parser for the flatfiles. Looking back over the
> list archives, this has come up a couple of times without a finished
> solution being implemented. GFF suffers from the curse of being too easy
> to hack together a solution for parsing a very specific problem, while
> generating a good standard parser takes more work.

You're right about creating a good general parser taking more work ;)

See also enhancement Bug 2762, GFF capability in SeqIO, which has some
discussion.

Also, it wasn't clear from your blog if you are thinking about just
GFF version 3, or something more general, coping with the assorted
comparatively ill defined GFF2 variants.

> Recently, Peter brought up GFF on the BioSQL mailing list, which
> made me interested in digging into GFF as an input and output flat
> file format for BioSQL databases. Towards this end I put together an
> initial implementation of a GFF (version 3) parser for Biopython. A
> write up and the code are here:
>
> http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/
>
> As described in the post, the GFF interface will be a bit different
> from the standard SeqIO interface, since GFF stores features
> separately from the sequences and also doesn't require features for
> a record to be grouped together.

Regarding where to put this code, if it isn't going to support the
Bio.SeqIO interface then it shouldn't really go in Bio.SeqIO, but
maybe Bio.GFF or Bio.GFF3 instead.

However, you could still fit gff(3) files into Bio.SeqIO, it's just
that the sequence may not be present.  This would be similar to GenBank
files, which usually have a long list of features plus the full
sequence, but where the sequence itself may be missing - for example
if there is just a CONTIG line.  Or QUAL files from sequencing, where
there is never a sequence.

As with GenBank files for a large genome/chromosome, for a typical GFF
file Bio.SeqIO would just return a single SeqRecord containing all
the features - within the SeqIO API there is no way to offer memory
efficient iteration over the features themselves.

Maybe we need to invent Bio.FeatureIO for this?  You could consider
GenBank/EMBL feature tables, GFF files, NCBI protein tables, and
probably a few other formats too.

> As a result, the interface is up for discussion and the best path is to
> start with an implementation and see where it takes us. I'd be grateful
> for any feedback and code from those who are interested. We can discuss
> on the development mailing list or on the blog, and move towards getting
> stable full featured GFF parsing in Biopython.

From the blog post it sounds like you are using sub-features to store
the parent/child relationship between say mRNAs and genes.  This is
elegant, but as I wrote on Bug 2762 comment 1, this isn't enough to
cope with the general parent (part-of) relationships allowed in GFF
files - for example an exon may have multiple parents.
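
To illustrate the multiple-parent case, here is a small sketch of
reading the ninth (attributes) column of a GFF3 line, where tag=value
pairs are separated by semicolons and multiple values by commas. The
function name is made up for this example and is not taken from the
parser under discussion.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Parse the GFF3 attributes column into a dict of tag -> list of values.

    Hypothetical helper for illustration only.
    """
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Commas separate multiple values; percent-escapes are decoded
        # only after splitting, so an escaped comma stays inside a value.
        attributes[tag] = [unquote(v) for v in value.split(",")]
    return attributes


# One exon shared by two transcripts.
attrs = parse_gff3_attributes("ID=exon00003;Parent=mRNA00001,mRNA00003")
print(attrs["Parent"])  # ['mRNA00001', 'mRNA00003']
```

A strict tree of sub-features has room for only one parent per feature,
whereas a list of Parent IDs like this describes a more general graph.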

There is also the complication that when parsing GenBank files, a gene
or CDS feature with a join-location ends up represented using
sub-features (which probably would be represented with an explicit
intron/exon structure in GFF files) [This is something I don't really
like with the current object structure].  We'd want things to be
fairly uniform between the parsers - for one thing our BioSQL code
currently records a feature with subfeatures as a single feature in
the database.

Peter


From chapmanb at 50mail.com  Mon Mar  9 22:42:24 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 9 Mar 2009 18:42:24 -0400
Subject: [BioPython] Initial work on a GFF parser
In-Reply-To: <320fb6e00903090314q19d64af2m4e37918fc3f5f164@mail.gmail.com>
References: <20090308162941.GA99653@kunkel>
	<320fb6e00903090314q19d64af2m4e37918fc3f5f164@mail.gmail.com>
Message-ID: <20090309224224.GA4481@sobchak.mgh.harvard.edu>

Peter;
Thanks much for the feedback.

> See also enhancement Bug 2762, GFF capability in SeqIO, which has some
> discussion.
> 
> Also, it wasn't clear from your blog if you are thinking about just
> GFF version 3, or something more general, coping with the assorted
> comparatively ill defined GFF2 variants.

Bug 2762 had a lot of good background and ideas which helped in
getting started. I did take the sub_feature route instead of the
flattened method Leighton suggested there.

Right now this tackles GFF3. The hard part is going to be
getting a framework in place, and then GFF2 or GTF (or GFF2.5 or
whatever they call it) support could be added.

> Regarding where to put this code, if it isn't going to support the
> Bio.SeqIO interface then it shouldn't really go in Bio.SeqIO, but
> maybe Bio.GFF or Bio.GFF3 instead.
> 
> However, you could still fit gff(3) files into Bio.SeqIO, it's just
> that the sequence may not be present.  This would be similar to GenBank
> files, which usually have a long list of features plus the full
> sequence, but where the sequence itself may be missing - for example
> if there is just a CONTIG line.  Or QUAL files from sequencing, where
> there is never a sequence.

Yes, where it lives is a good topic for debate. For GFF files, you'd
at least like the option to add new features to an existing sequence
record, which is what I do here. It would be easy enough to create
new blank records if one is not present initially. The difficult
thing with adding this to the existing syntax is that the GFF files
are not ordered for efficient iteration. You essentially have to
parse the whole file, so something like this would handle the
syntax:

seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
final_seq_dict = SeqIO.add_features(gff_handle, "gff3", initial_dict=seq_dict)

Along these lines, I liked the way you did a sequence/quality dual
iterator for quality output and think that works well when ordering of
the records in multiple files is stable.
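
To illustrate why the whole file has to be read first, here is a minimal
sketch using only the standard library. It is not the parser from my blog
post: parse_gff3_line and group_features_by_seqid are hypothetical helper
names, and the three example lines are made up. Because records for one
sequence can be scattered anywhere in the file, features are collected per
sequence ID before anything is attached to a record.

```python
from collections import defaultdict


def parse_gff3_line(line):
    """Turn one GFF3 feature line into a dict; return None for comments/blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    attributes = {}
    for pair in attr_text.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 allows comma-separated multiple values (e.g. several Parents)
            attributes[key.strip()] = value.strip().split(",")
    return {"seqid": seqid, "type": ftype, "start": int(start),
            "end": int(end), "strand": strand, "attributes": attributes}


def group_features_by_seqid(gff_lines):
    """Read every line, then return {seqid: [feature, ...]}."""
    grouped = defaultdict(list)
    for line in gff_lines:
        feature = parse_gff3_line(line)
        if feature is not None:
            grouped[feature["seqid"]].append(feature)
    return dict(grouped)


# Made-up example: chr2 records appear both before and after the chr1 record.
example_gff = [
    "##gff-version 3",
    "chr2\ttest\tgene\t1\t50\t.\t+\t.\tID=geneB",
    "chr1\ttest\tgene\t10\t90\t.\t+\t.\tID=geneA",
    "chr2\ttest\tmRNA\t1\t50\t.\t+\t.\tID=mrnaB;Parent=geneB",
]
features_by_seq = group_features_by_seqid(example_gff)
```

Each list in features_by_seq could then be attached to the matching entry
of a sequence dictionary, or to a new blank record if there is none.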

> As with GenBank files for large genome/chromosome, for a typical GFF
> file for Bio.SeqIO we'd just return a single SeqRecord containing all
> the features - within the SeqIO API there is no way to offer memory
> efficient iteration over the features themselves.
> 
> Maybe we need to invent Bio.FeatureIO for this?  You could consider
> GenBank/EMBL feature tables, GFF files, NCBI protein tables, and
> probably a few other formats too.

FeatureIO is something BioPerl has; this page describes the status
of GFF in BioPerl but is over a year old so things may have changed:

http://www.bioperl.org/wiki/GFF_code_audit

The iteration model still falls apart because of the undefined
ordering of the file. That is why I settled on the filter approach
to limit what you get to a reasonable memory size but still guarantee
you've pulled all relevant features before building the parent/child
relationships and features. This could also apply to data that comes off
cluster runs where the output order will not necessarily correlate with
the inputs.
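
A rough sketch of the filter-then-nest idea, again with the standard
library only. The dictionary layout ('id', 'type', 'parents') and the
function name nest_features are assumptions made for illustration, not
the actual parser code. Unwanted feature types are dropped first, and
parent/child links are built only once every kept feature is known, so
the input order does not matter.

```python
def nest_features(features, wanted_types=None):
    """Filter features by type, then link children to parents.

    features: iterable of dicts with 'id', 'type' and 'parents' keys
              (a hypothetical layout used only for this sketch).
    Returns the list of kept features that have no kept parent.
    """
    kept = {}
    for feat in features:
        if wanted_types is not None and feat["type"] not in wanted_types:
            continue  # skipped early, so it never costs memory later
        node = dict(feat, sub_features=[])
        kept[node["id"]] = node

    top_level = []
    for node in kept.values():
        parents = [kept[p] for p in node["parents"] if p in kept]
        if parents:
            for parent in parents:
                parent["sub_features"].append(node)
        else:
            top_level.append(node)
    return top_level


# The child is listed before its parent, as can happen in a real file.
records = [
    {"id": "mrna1", "type": "mRNA", "parents": ["gene1"]},
    {"id": "cds1", "type": "CDS", "parents": ["mrna1"]},
    {"id": "gene1", "type": "gene", "parents": []},
]
top = nest_features(records, wanted_types={"gene", "mRNA"})
```

One design point to note: in this sketch a feature whose parent was
filtered out ends up at the top level rather than being discarded.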

The filtering approach could also be useful for large GenBank files,
as you could skip adding features and parsing locations for elements you
are not interested in. If others find this approach intuitive, it
would be worth looking at there as well.

> From the blog post it sounds like you are using sub-features to store
> the parent/child relationship between say mRNAs and genes.  This is
> elegant, but as I wrote on Bug 2762 comment 1, this isn't enough to
> cope with the general parent (part-of) relationships allowed in GFF
> files - for example an exon may have multiple parents.

For these the exon is added as a sub_feature to all of its parents. The
shared feature is the same one in memory. t_nested_multiparent_features
in the test code demonstrates this. How we output it to BioSQL is up for
debate but we should also be able to do some sharing there; duplication
is also not too bad of an option if it makes it cleaner since these are
not likely to be deeply nested.
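
The sharing can be pictured with a toy class. This is only a sketch: the
Feature class and attach function below are hypothetical stand-ins, not
Biopython's SeqFeature or the test code mentioned above.

```python
class Feature(object):
    """Toy stand-in for a feature object that can hold sub-features."""

    def __init__(self, feature_id, feature_type):
        self.id = feature_id
        self.type = feature_type
        self.sub_features = []


def attach(child, parents):
    """Add the same child object to every parent (no copying)."""
    for parent in parents:
        parent.sub_features.append(child)


mrna_a = Feature("mrna_a", "mRNA")
mrna_b = Feature("mrna_b", "mRNA")
shared_exon = Feature("exon1", "exon")
attach(shared_exon, [mrna_a, mrna_b])

# A change made through one parent is visible through the other,
# because both lists refer to the same object in memory.
mrna_a.sub_features[0].note = "edited once"
```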

> There is also the complication that when parsing GenBank files, a gene
> or CDS feature with a join-location ends up represented using
> sub-features (which probably would be represented with an explicit
> intron/exon structure in GFF files) [This is something I don't really
> like with the current object structure].  We'd want things to be
> fairly uniform between the parsers - for one thing our BioSQL code
> currently records a feature with subfeatures as a single feature in
> the database.

BioSQL definitely needs work to handle sub_features more generally.
The seqfeature_relationship table in BioSQL can handle these but it
needs to be coded. I agree with you that the way we do it now is a
little too GenBank specific. This is a bit of a larger project since
we should coordinate with the other projects, but as long as we
continue to support the same location mechanism they use currently
it will be back-compatible with older code.

Thanks again for the thoughts,
Brad


From hlapp at gmx.net  Tue Mar 10 03:36:30 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 9 Mar 2009 23:36:30 -0400
Subject: [BioPython] Google Summer of Code: Call for Bio* Volunteers
In-Reply-To: <D9B5E438-4FBA-4073-B740-FB0D72C5790B@gmx.net>
References: <D9B5E438-4FBA-4073-B740-FB0D72C5790B@gmx.net>
Message-ID: <BE9FC557-510B-4A41-99AD-E0DA1C102CB3@gmx.net>

You may recall my message to the developer lists of several O|B|F  
projects in February about the idea of O|B|F applying to Google Summer  
of Code as a mentoring organization [1].

I felt that the response to this was very positive and encouraging.  
Although late (sorry, been swamped too much), I've now put up the  
skeleton of an ideas page at

http://open-bio.org/wiki/Google_Summer_Code_2009

I basically modeled (in fact, largely copied) this page after the  
NESCent Phyloinformatics Summer of Code ideas pages, which I think  
worked pretty well. We can completely rework this, though - any  
feedback and suggestions are very much welcome.

In the meantime, I need all developers to double check the information  
under 'Contact'. Would the open-bio-l mailing list indeed reach the  
prospective mentors and other devs? Will you be fine with students  
asking for feedback on their applications on the developers (i.e.,  
this) list? Is there a blessed IRC where at least some of the  
prospective mentors hang out for students to ask questions during the  
time they apply?

I also need space for the reference information for all projects that  
will participate with at least one project idea (I would hope that  
that's all projects) to be added in the 'Open-Bio projects involved'  
section.

*****
Most important of all, if you can volunteer to mentor a project,  
please post a project idea to the page in the respective section,  
using the idea template that's there already (copy, paste, and edit).
*****

The deadline for organization applications is Friday this week, Mar  
13, which is very soon. The ideas page is a major factor and component  
in how Google scores new mentoring organizations - the more we can  
show the resourcefulness and diversity of our member projects the more  
competitive I think we'll be. So all those who responded with ideas or  
willingness to help out as primary or secondary mentors earlier, I  
need you to think about and put up your idea(s) now.

Cheers,

	-hilmar

[1] http://tinyurl.com/ck7tqe

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From dalloliogm at gmail.com  Tue Mar 10 17:06:27 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 10 Mar 2009 18:06:27 +0100
Subject: [BioPython] can biopython query KEGG directly?
Message-ID: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com>

Hi,
is it possible to query the KEGG database with biopython?

I can already do it with KEGG's WSDL API and the Python suds
library, and it works very well, but I was wondering whether there is
something more integrated with biopython.
For example, is there something similar to Entrez that can
automatically retrieve a sequence from NCBI and transform it to a
SeqRecord object?


-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Tue Mar 10 18:08:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 10 Mar 2009 18:08:01 +0000
Subject: [BioPython] can biopython query KEGG directly?
In-Reply-To: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com>
References: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com>
Message-ID: <320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com>

On Tue, Mar 10, 2009 at 5:06 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> is it possible to query the KEGG database with biopython?

I don't think there is any wrapper for the KEGG online API (yet).  See:
http://www.genome.jp/kegg/soap/doc/keggapi_manual.html

This does sound like a worthwhile addition (especially if the SOAP
stuff can be done using only core python libraries included in Python
2.4+)

> .. and transform it to a SeqRecord object.

We still need a Bio.KEGG gene parser, see also:
http://bioperl.org/wiki/KEGG_sequence_format
http://lists.open-bio.org/pipermail/biopython/2008-January/004000.html
Once that is done, a KEGG wrapper in Bio.SeqIO would make sense.

Peter


From matzke at berkeley.edu  Wed Mar 11 01:18:12 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Tue, 10 Mar 2009 18:18:12 -0700
Subject: [BioPython] GSoC project: Biogeographical and community
	phylogenetics for BioPython
Message-ID: <49B71154.5060109@berkeley.edu>

On the advice of Mauricio & Hilmar, I have posted a draft proposal for a 
Google Summer of Code project: Biogeographical and community 
phylogenetics for BioPython.

http://open-bio.org/wiki/Google_Summer_Code_2009#Biogeographical_and_community_phylogenetics_for_BioPython

Comments welcome on- or off-list.  Cheers!

PS: Also, additional suggestions for pertinent members would be appreciated.

Nick


-- 
====================================================
Nicholas J. Matzke
Ph.D. student, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page: 
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people 
thought the earth was spherical, they were wrong. But if you think that 
thinking the earth is spherical is just as wrong as thinking the earth 
is flat, then your view is wronger than both of them put together."

Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================


From dalloliogm at gmail.com  Thu Mar 12 12:33:04 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 12 Mar 2009 13:33:04 +0100
Subject: [BioPython] can biopython query KEGG directly?
In-Reply-To: <320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com>
References: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com>
	<320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com>
Message-ID: <5aa3b3570903120533w1d6bad6fy12b70ebf769deef2@mail.gmail.com>

On Tue, Mar 10, 2009 at 7:08 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Tue, Mar 10, 2009 at 5:06 PM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > Hi,
> > is it possible to query the KEGG database with biopython?
>
> I don't think there is any wrapper for the KEGG online API (yet).  See:
> http://www.genome.jp/kegg/soap/doc/keggapi_manual.html


Well, if someone is in a hurry to query KEGG with SOAP, I have some scripts
(but they use the suds library).


>
>
> This does sound like a worthwhile addition (especially if the SOAP
> stuff can be done using only core python libraries included in Python
> 2.4+)


I am not sure whether SOAPpy is included in the core Python libraries,
or whether it has been since Python 2.4.
As far as I know, SOAPpy development ceased in 2005 (see
http://pywebsvcs.sourceforge.net/).
I couldn't test this library, because I still haven't managed to get it
working behind an HTTP proxy :-(.


>
>
> > .. and transform it to a SeqRecord object.
>
> We still need a Bio.KEGG gene parser, see also:
> http://bioperl.org/wiki/KEGG_sequence_format
> http://lists.open-bio.org/pipermail/biopython/2008-January/004000.html
> Once that is done, a KEGG wrapper in Bio.SeqIO would make sense.
>

I am just curious, but into which object would a KEGG gene file be converted?
A SeqRecord? And how, exactly? I suppose all the features will go in
SeqRecord.features... but is there any standard convention to do so?
For example, the codon usage table, class, dblinks, and all the other
fields... how would they be stored?


>
> Peter
>




From biopython at maubp.freeserve.co.uk  Thu Mar 12 14:15:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 12 Mar 2009 14:15:06 +0000
Subject: [BioPython] can biopython query KEGG directly?
In-Reply-To: <5aa3b3570903120533w1d6bad6fy12b70ebf769deef2@mail.gmail.com>
References: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com>
	<320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com>
	<5aa3b3570903120533w1d6bad6fy12b70ebf769deef2@mail.gmail.com>
Message-ID: <320fb6e00903120715n7ad57282h529150e22da826e9@mail.gmail.com>

On Thu, Mar 12, 2009 at 12:33 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
>> We still need a Bio.KEGG gene parser, see also:
>> http://bioperl.org/wiki/KEGG_sequence_format
>> http://lists.open-bio.org/pipermail/biopython/2008-January/004000.html
>> Once that is done, a KEGG wrapper in Bio.SeqIO would make sense.
>
> I am just curious, but into which object would a KEGG gene file be converted?
> A SeqRecord? And how, exactly? I suppose all the features will go in
> SeqRecord.features... but is there any standard convention to do so?
> For example, the codon usage table, class, dblinks, and all the other
> fields... how would they be stored?

Bio.SeqIO only deals with SeqRecord objects.  If we had a KEGG gene
parser in Bio.KEGG (written in the same style as the rest of Bio.KEGG
ideally), then it would make sense to add a KEGG gene format to
Bio.SeqIO, where the KEGG gene records would be parsed using Bio.KEGG
and then converted into SeqRecord objects.  At a minimum this would
mean their id/name/description and sequence - even just that would
still be useful I feel.  For any richer annotation, the convention is
to mimic the GenBank parser as closely as possible.  See
http://biopython.org/wiki/SeqIO_dev
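
As a rough sketch of that minimum (id/name/description and sequence),
here is a standard-library-only parser for one KEGG-style flat-file
entry. It is not existing Bio.KEGG code. It assumes the usual KEGG
layout of a keyword in the first 12 columns, data from column 13,
continuation lines with a blank keyword, and "///" ending the entry.
The function names and the example entry are made up for illustration.

```python
def parse_kegg_entry(lines):
    """Collect the data strings of one flat-file entry by keyword."""
    fields = {}
    keyword = None
    for line in lines:
        if line.startswith("///"):
            break  # end-of-entry marker
        head, data = line[:12].strip(), line[12:].strip()
        if head:
            keyword = head
            fields[keyword] = []
        if keyword is not None and data:
            fields[keyword].append(data)
    return fields


def minimal_record(fields):
    """Reduce the parsed fields to id, name, description and sequence."""
    entry_id = fields["ENTRY"][0].split()[0]
    names = fields.get("NAME") or [entry_id]
    # Assumption: the first AASEQ item is the length, the rest is sequence.
    sequence = "".join(fields.get("AASEQ", [])[1:])
    return {"id": entry_id,
            "name": names[0],
            "description": " ".join(fields.get("DEFINITION", [])),
            "sequence": sequence}


def kegg_line(keyword, data):
    """Format a line with the keyword padded to 12 columns."""
    return "%-12s%s" % (keyword, data)


# Made-up entry, not a real KEGG record.
example_entry = [
    kegg_line("ENTRY", "x0001             CDS       Example"),
    kegg_line("NAME", "exmA"),
    kegg_line("DEFINITION", "hypothetical example protein"),
    kegg_line("AASEQ", "12"),
    kegg_line("", "MKRISTTITTTI"),
    "///",
]
record = minimal_record(parse_kegg_entry(example_entry))
```

A dictionary like this holds what a SeqRecord would need at a minimum;
richer fields (class, dblinks, codon usage) are left unparsed here.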

Peter


From matzke at berkeley.edu  Sat Mar 14 04:59:37 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Fri, 13 Mar 2009 21:59:37 -0700
Subject: [BioPython] Getting protein structure names from primary IDs
Message-ID: <49BB39B9.2080206@berkeley.edu>

Hi all,

This has got to be trivial, but I can't find a hint about the solution 
online.

I want to:

1. Search NCBI's structure database for structures from a certain group

from Bio import Entrez
Entrez.email = "A.N.Other at example.com"     # Always tell NCBI who you are (set before any request)
handle = Entrez.einfo()
record = Entrez.read(handle)    # summary of the available Entrez databases
print "Search the structure database on Organism = Drosophila"
#handle = Entrez.esearch(db="structure", term="Drosophila")
handle = Entrez.esearch(db="structure", term="Drosophila[Orgn]")

pdb_record = Entrez.read(handle)
print pdb_record	#["IdList"]

pdblist = pdb_record["IdList"]


OK, now I have a list of primary IDs for the protein structures from 
Drosophila.


2. Download those structures.  Apparently I have to do this from RCSB 
and not NCBI? (The NCBI efetch documentation has no information on 
fetching from the structure database, and I tried a few obvious methods 
by analogy with other databases without result.)

This will download from RCSB, but apparently you need the structure 
name, not the NCBI primary ID.


from Bio.PDB import *
pdbl=PDBList()
pdbl.retrieve_pdb_file('1FAT')


So, how do I get from primary ID to structure name?  I'm sure I'm 
missing something obvious.

Cheers,
Nick




From matzke at berkeley.edu  Sat Mar 14 05:05:47 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Fri, 13 Mar 2009 22:05:47 -0700
Subject: [BioPython] Getting protein structure names from primary IDs
In-Reply-To: <49BB39B9.2080206@berkeley.edu>
References: <49BB39B9.2080206@berkeley.edu>
Message-ID: <49BB3B2B.9080900@berkeley.edu>

Hi again -- Esummary was what I needed, so never mind!

Sorry for the trouble,
Nick
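
For the archive, a minimal sketch of the ESummary route. Two things here
are assumptions rather than something confirmed in this thread: that each
structure summary carries its PDB code under the key "PdbAcc", and the
helper names themselves. The Bio import is kept inside the fetching
function so the pure-Python part can be tried without a network
connection.

```python
def pdb_codes_from_summaries(summaries, key="PdbAcc"):
    """Pull the PDB code out of each ESummary record that has one.

    The "PdbAcc" key is an assumption about the structure database's
    summary fields; print one record to check before relying on it.
    """
    return [summary[key] for summary in summaries if key in summary]


def fetch_pdb_codes(uids, email):
    """Ask NCBI ESummary for the given structure UIDs (needs network access)."""
    from Bio import Entrez  # imported lazily; only needed for the online call
    Entrez.email = email
    handle = Entrez.esummary(db="structure", id=",".join(uids))
    try:
        summaries = Entrez.read(handle)
    finally:
        handle.close()
    return pdb_codes_from_summaries(summaries)


# Offline illustration with made-up records (the Id values are invented).
fake_summaries = [{"Id": "12345", "PdbAcc": "1FAT"}, {"Id": "67890"}]
codes = pdb_codes_from_summaries(fake_summaries)
```

The resulting codes could then be passed one at a time to
PDBList().retrieve_pdb_file(), as in the original question.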





From hlapp at gmx.net  Sat Mar 14 22:59:57 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 14 Mar 2009 18:59:57 -0400
Subject: [BioPython] Google Summer of Code: application submitted,
	action needed
In-Reply-To: <BE9FC557-510B-4A41-99AD-E0DA1C102CB3@gmx.net>
References: <D9B5E438-4FBA-4073-B740-FB0D72C5790B@gmx.net>
	<BE9FC557-510B-4A41-99AD-E0DA1C102CB3@gmx.net>
Message-ID: <71A1E85A-2007-4FAE-A03B-475000C5CD38@gmx.net>

Hi all,

I submitted the application yesterday for O|B|F to participate in  
the 2009 Google Summer of Code as a mentoring organization. The  
application is at

http://docs.google.com/Doc?id=dhs98hzv_7zn8bxqjm

and is also linked to from the ideas page at

http://open-bio.org/wiki/Google_Summer_of_Code_2009

Now keep your fingers crossed, Google is slated to announce  
acceptances on March 18.

This is the last cross-project message re: Summer of Code that  
addresses mentors and our projects; future messages that I'll post  
across projects will be primarily for students such as announcing  
whether we are accepted or not and issuing calls for application.

**What we need most and right now is action from our projects'  
developers and from possible mentors.** Google admins will start  
reviewing organization applications on Monday. The ideas page has 6  
project ideas right now - though the ideas are good ones, the quantity  
won't be particularly impressive to Google.

Therefore, if you have an idea for a summer project for a student  
please use the C&P template (it is commented out now but you'll see it  
when you pull the Ideas section into the editor) and put it up there  
ASAP. If you're not sure yet who'll mentor, put tentative names there.  
We don't need a full commitment from mentors until the student  
application period starts (March 23).

Next, for all projects, the leads and/or volunteers should check the  
reference information for their project:

http://open-bio.org/wiki/Google_Summer_of_Code_2009#Open-Bio_projects_involved

I just culled these links from the various project websites - it'd be  
much appreciated if going forward everyone can lend a hand in this.  
Please review what's there and add or fix as you see fit. *These links  
must be correct and complete - otherwise potential students may not  
find you.*

Finally, all prospective mentors, primary or secondary, committed or  
not, and anyone else who would like to volunteer to help out, should  
subscribe themselves ASAP to the mailing list for communicating GSoC- 
related administrivia:

http://lists.open-bio.org/mailman/listinfo/gsoc

I will *not* cross-post all administrative announcements or requests  
for information, and so you *will* miss information if you don't  
subscribe yourself there. (Note: students will be subscribed there  
only *after* acceptance).

Those who are considering mentoring, as primary mentor or helping out, please  
also add yourselves to the Mentors section on the Ideas page (and  
check your link if you're already there):

http://open-bio.org/wiki/Google_Summer_of_Code_2009#Mentors

Cheers everyone, and fingers crossed!

	-hilmar



From mjldehoon at yahoo.com  Sun Mar 15 10:25:43 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 15 Mar 2009 03:25:43 -0700 (PDT)
Subject: [BioPython] Bio.SwissProt.SProt Dictionary, index_file
Message-ID: <653996.59295.qm@web62408.mail.re1.yahoo.com>


Hi everybody,

Does anybody use the Dictionary class or index_file function in Bio.SwissProt.SProt? As far as I can tell these functions are broken.
If there are no users, I suggest we deprecate the Dictionary class and the index_file function in Bio.SwissProt.SProt.

--Michiel


From biopython at maubp.freeserve.co.uk  Mon Mar 16 13:40:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 16 Mar 2009 13:40:13 +0000
Subject: [BioPython] List of publications citing or using Biopython
Message-ID: <320fb6e00903160640j73289abbl51d9f8935184a760@mail.gmail.com>

Hi all,

I've been working on a listing of journal publications citing or using
Biopython for the website:
http://biopython.org/wiki/Publications

If you've published anything that qualifies that isn't listed, this is
a wiki page so you should be able to add it.  If you are unsure if
something is appropriate, please ask here on the mailing list.  For
publications from 2008 onwards I have tried to add a short note
saying which part(s) of Biopython were used - this should be easy to
write for your own recent papers ;)

If you try editing the page you should see how to add extra entries -
for anything in PubMed this is really easy.  See the discussion page
for more details:
http://biopython.org/wiki/Talk:Publications

Peter


From matzke at berkeley.edu  Mon Mar 16 19:31:57 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 16 Mar 2009 12:31:57 -0700
Subject: [BioPython] Entrez.einfo error?
Message-ID: <49BEA92D.7040905@berkeley.edu>

Hi all,

This exact code worked fine for me on Friday; I wonder if it could be a 
temporary problem at Entrez?  A similar problem seems to occur with 
other Entrez queries.

Running biopython 1.49 in IPython...

============
from Bio import Entrez

Entrez.email = "matzke at berkeley.edu"

handle = Entrez.einfo(db="structure")


---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)

/bioinformatics/pyeg/<ipython console> in <module>()

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/site-packages/Bio/Entrez/__init__.pyc 
in einfo(cgi, **keywds)
     195     variables = {}
     196     variables.update(keywds)
--> 197     return _open(cgi, variables)
     198
     199 def esummary(cgi=None, **keywds):

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/site-packages/Bio/Entrez/__init__.pyc 
in _open(cgi, params)
     320     options = urllib.urlencode(params, doseq=True)
     321     cgi += "?" + options
--> 322     handle = urllib.urlopen(cgi)
     323
     324     # Wrap the handle inside an UndoHandle.

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc 
in urlopen(url, data, proxies)
      80         opener = _urlopener
      81     if data is None:
---> 82         return opener.open(url)
      83     else:
      84         return opener.open(url, data)

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc 
in open(self, fullurl, data)
     188         try:
     189             if data is None:
--> 190                 return getattr(self, name)(url)
     191             else:
     192                 return getattr(self, name)(url, data)

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc 
in open_http(self, url, data)
     323         if realhost: h.putheader('Host', realhost)
     324         for args in self.addheaders: h.putheader(*args)
--> 325         h.endheaders()
     326         if data is not None:
     327             h.send(data)

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc 
in endheaders(self)
     858             raise CannotSendHeader()
     859
--> 860         self._send_output()
     861
     862     def request(self, method, url, body=None, headers={}):

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc 
in _send_output(self)
     730         msg = "\r\n".join(self._buffer)
     731         del self._buffer[:]
--> 732         self.send(msg)
     733
     734     def putrequest(self, method, url, skip_host=0, 
skip_accept_encoding=0):

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc 
in send(self, str)
     697         if self.sock is None:
     698             if self.auto_open:
--> 699                 self.connect()
     700             else:
     701                 raise NotConnected()

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc 
in connect(self)
     665         msg = "getaddrinfo returns an empty list"
     666         for res in socket.getaddrinfo(self.host, self.port, 0,
--> 667                                       socket.SOCK_STREAM):
     668             af, socktype, proto, canonname, sa = res
     669             try:

IOError: [Errno socket error] (7, 'No address associated with nodename')
 > 
/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.py(667)connect()
     666         for res in socket.getaddrinfo(self.host, self.port, 0,
--> 667                                       socket.SOCK_STREAM):
     668             af, socktype, proto, canonname, sa = res


ipdb> record = Entrez.read(handle)
*** NameError: name 'Entrez' is not defined

============




From matzke at berkeley.edu  Mon Mar 16 19:42:22 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 16 Mar 2009 12:42:22 -0700
Subject: [BioPython] Entrez.einfo error?
In-Reply-To: <49BEA92D.7040905@berkeley.edu>
References: <49BEA92D.7040905@berkeley.edu>
Message-ID: <49BEAB9E.7070707@berkeley.edu>

Looks like PubMed is down at the moment also, so it's all an NCBI 
problem.  Cheers!
Nick
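
Since the failure was a temporary lookup problem on the network side, one
way to make scripts more tolerant is to retry with increasing pauses.
This is only a sketch with a hypothetical helper, call_with_retries; it
is not part of Bio.Entrez. It retries on IOError, which is the exception
type shown in the traceback above.

```python
import time


def call_with_retries(func, attempts=3, initial_delay=1.0, sleep=time.sleep):
    """Call func(); on IOError wait and try again, doubling the wait each time.

    The last failure is re-raised so the caller still sees the real error.
    The sleep argument exists so the waiting can be replaced when testing.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except IOError:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2


# Offline illustration: a function that fails twice, then succeeds.
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("temporary failure")
    return "ok"


waits = []
result = call_with_retries(flaky, attempts=3, initial_delay=0.5,
                           sleep=waits.append)
```

With Biopython this could wrap a query, for example
call_with_retries(lambda: Entrez.einfo(db="structure")). If NCBI is down
for longer than the total wait, the original IOError is still raised.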


Nick Matzke wrote:
> Hi all,
> 
> This exact code worked fine for me on Friday, I wonder if it could be a 
> temporary problem at Entrez?  A similar problem seems to occur with 
> other Entrez queries.
> 
> Running biopython 1.49 in IPython...
> 
> ============
> from Bio import Entrez
> 
> Entrez.email = "matzke at berkeley.edu"
> 
> handle = Entrez.einfo(db="structure")
> 
> 
> ---------------------------------------------------------------------------
> IOError                                   Traceback (most recent call last)
> 
> /bioinformatics/pyeg/<ipython console> in <module>()
> 
> /Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/site-packages/Bio/Entrez/__init__.pyc 
> in einfo(cgi, **keywds)
>     195     variables = {}
>     196     variables.update(keywds)
> --> 197     return _open(cgi, variables)
>     198
>     199 def esummary(cgi=None, **keywds):
> 
> /Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/site-packages/Bio/Entrez/__init__.pyc 
> in _open(cgi, params)
>     320     options = urllib.urlencode(params, doseq=True)
>     321     cgi += "?" + options
> --> 322     handle = urllib.urlopen(cgi)
>     323
>     324     # Wrap the handle inside an UndoHandle.
> 
> /Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc 
> in urlopen(url, data, proxies)
>      80         opener = _urlopener
>      81     if data is None:
> ---> 82         return opener.open(url)
>      83     else:
>      84         return opener.open(url, data)
> 
> /Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc 
> in open(self, fullurl, data)
>     188         try:
>     189             if data is None:
> --> 190                 return getattr(self, name)(url)
>     191             else:
>     192                 return getattr(self, name)(url, data)
> 
> /Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc 
> in open_http(self, url, data)
>     323         if realhost: h.putheader('Host', realhost)
>     324         for args in self.addheaders: h.putheader(*args)
> --> 325         h.endheaders()
>     326         if data is not None:
>     327             h.send(data)
> 
> /Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc 
> in endheaders(self)
>     858             raise CannotSendHeader()
>     859
> --> 860         self._send_output()
>     861
>     862     def request(self, method, url, body=None, headers={}):
> 
> /Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc 
> in _send_output(self)
>     730         msg = "\r\n".join(self._buffer)
>     731         del self._buffer[:]
> --> 732         self.send(msg)
>     733
>     734     def putrequest(self, method, url, skip_host=0, 
> skip_accept_encoding=0):
> 
> /Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc 
> in send(self, str)
>     697         if self.sock is None:
>     698             if self.auto_open:
> --> 699                 self.connect()
>     700             else:
>     701                 raise NotConnected()
> 
> /Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc 
> in connect(self)
>     665         msg = "getaddrinfo returns an empty list"
>     666         for res in socket.getaddrinfo(self.host, self.port, 0,
> --> 667                                       socket.SOCK_STREAM):
>     668             af, socktype, proto, canonname, sa = res
>     669             try:
> 
> IOError: [Errno socket error] (7, 'No address associated with nodename')
>  > 
> /Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.py(667)connect() 
> 
>     666         for res in socket.getaddrinfo(self.host, self.port, 0,
> --> 667                                       socket.SOCK_STREAM):
>     668             af, socktype, proto, canonname, sa = res
> 
> 
> 
> 
> 
> ipdb> record = Entrez.read(handle)
> *** NameError: name 'Entrez' is not defined
> 
> ============
> 
> 
> 



From biopython at maubp.freeserve.co.uk  Mon Mar 16 19:52:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 16 Mar 2009 19:52:30 +0000
Subject: [BioPython] Entrez.einfo error?
In-Reply-To: <49BEA92D.7040905@berkeley.edu>
References: <49BEA92D.7040905@berkeley.edu>
Message-ID: <320fb6e00903161252s9f41eecx56853a0cc9a76882@mail.gmail.com>

On Mon, Mar 16, 2009 at 7:31 PM, Nick Matzke <matzke at berkeley.edu> wrote:
> Hi all,
>
> This exact code worked fine for me on Friday, I wonder if it could be a
> temporary problem at Entrez?  A similar problem seems to occur with other
> Entrez queries.
>
> Running biopython 1.49 in IPython...
>
> ============
> from Bio import Entrez
> Entrez.email = "matzke at berkeley.edu"
> handle = Entrez.einfo(db="structure")
> ---------------------------------------------------------------------------
> IOError                                   Traceback (most recent call last)
> ...

Yes, I think you were experiencing a temporary problem, either at the
NCBI or somewhere else on the network.  It's working on my machine
right now.  In general an IOError in Bio.Entrez is a good sign of a
network issue, and for any complex task you may want to explicitly
catch these exceptions.
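
As a rough illustration of catching these exceptions, here is a minimal
sketch of a retry wrapper.  The helper name call_with_retries and its
parameters are made up for this example and are not part of Biopython;
the commented usage assumes Biopython is installed and the network is
reachable.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0):
    """Call func(), retrying when it raises IOError (e.g. a network failure)."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except IOError as err:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            print("Attempt %d failed (%s), retrying..." % (attempt, err))
            time.sleep(delay)


# Hypothetical usage with Bio.Entrez (needs Biopython and network access):
#   from Bio import Entrez
#   Entrez.email = "you@example.org"
#   handle = call_with_retries(lambda: Entrez.einfo(db="structure"))
```

Re-raising on the final attempt keeps the original traceback visible, so
a persistent outage is still reported rather than silently swallowed.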

Peter


From mgenome at gmail.com  Tue Mar 17 12:02:42 2009
From: mgenome at gmail.com (mgenome)
Date: Tue, 17 Mar 2009 21:02:42 +0900
Subject: [BioPython] How can I draw genome comparison figure to publish?
Message-ID: <c3c5c8950903170502h54d1ea7bv6c6631b8c88debdb@mail.gmail.com>

I have the whole genome sequence of a phage and want to compare its ORFs
to those of other related phages. I want to draw a comparison figure of
two or more genomes.
The genomes should be compared by their ORF similarities, calculated by
BLASTP, stretcher, etc.

If there is a table like this

ORF1, start, stop, strand, ORF2, start, stop, strand, similarity,
genome1_ORF1, 1, 200, +, genome2_ORF1, 1,  300,  -, 50
genome1_ORF2, 201, 400, +, genome2_ORF3,  320, 500, -, 90
....

the program or library should draw something like this:
===>   ===> ....
 |          |
 |          |
 |          |
<===  <===  ....
The different similarities should be represented by different colors of
linker lines.

I examined several programs, but I didn't find one good enough to
use for publication.
ACT (Artemis) can draw a comparison figure but it cannot show ORFs well.
inGeno is the program closest to what I want, but it cannot compare multiple
genomes and I want to draw ORFs as arrows. I know that GenomeDiagram in
Python does not support comparison of ORFs at the genomic level.

Does anybody know a program or library for drawing genome comparison figures
that show ORF comparisons?  I know that it is stupid to want a perfect program
that fulfills all my requirements, but I would like to find a program or
library that fulfills at least part of them.

Thank you in advance.

Kyoung-Ho Kim, Korea.


From lpritc at scri.ac.uk  Tue Mar 17 12:42:36 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 17 Mar 2009 12:42:36 +0000
Subject: [BioPython] How can I draw genome comparison figure to publish?
In-Reply-To: <c3c5c8950903170502h54d1ea7bv6c6631b8c88debdb@mail.gmail.com>
Message-ID: <C5E54B3C.1F237%lpritc@scri.ac.uk>

Hi Kyoung-Ho,

On 17/03/2009 12:02, "mgenome" <mgenome at gmail.com> wrote:

> Two genomes should be compared by their ORFs similarities calculated by
> BLASTP or stretcher etc.
> 
> If there is a table like this
> 
> ORF1, start, stop, strand, ORF2, start, stop, strand, similarity,
> genome1_ORF1, 1, 200, +, genome2_ORF1, 1,  300,  -, 50
> genome1_ORF2, 201, 400, +, genome2_ORF3,  320, 500, -, 90
> ....
> 
> the programs or library should draw as follows;
> ===>   ===> ....
>  |          |
>  |          |
>  |          |
> <===  <===  ....
> Their different similarities should be represented by different colors of
> linker lines.
> 
> I examined several programs, but I didn't find one good enough to
> use for publication.
> ACT (Artemis) can draw a comparison figure but it cannot show ORFs well.
> inGeno is the program closest to what I want, but it cannot compare multiple
> genomes and I want to draw ORFs as arrows. I know that GenomeDiagram in
> Python does not support comparison of ORFs at the genomic level.

GenomeDiagram does not draw the linker lines you require, I'm afraid.  The
package I would use to do so is ACT, and I have published diagrams created
using ACT (figure 3 in http://dx.doi.org/10.1073/pnas.0402424101).  There is
also M-GCAT (http://alggen.lsi.upc.es/recerca/align/mgcat/intro-mgcat.html),
which is very similar to ACT, and perhaps so similar that it will have the
same problems when generating publication-quality images to your liking.
GCV (http://zamov.online.fr/projects/gct/) I've never tried.
 
> Does anybody know a program or library for drawing genome comparison figures
> that show ORF comparisons?  I know that it is stupid to want a perfect program
> that fulfills all my requirements, but I would like to find a program or
> library that fulfills at least part of them.

GenomeDiagram does not currently have a facility to indicate synteny in the
way that you require using linker lines, so it may not be the tool you need
just yet.  However, it has been used to indicate the results of comparisons
between ORFs on the whole-genome level, using the colours of the compared
features to indicate the sequence identities of the matches (e.g. Figure 2
in http://dx.doi.org/10.1146/annurev.phyto.44.070505.143444 and
http://apsjournals.apsnet.org/doi/abs/10.1094).

Cheers,

L.


-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405




From biopython at maubp.freeserve.co.uk  Tue Mar 17 12:51:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 17 Mar 2009 12:51:18 +0000
Subject: [BioPython] How can I draw genome comparison figure to publish?
In-Reply-To: <c3c5c8950903170502h54d1ea7bv6c6631b8c88debdb@mail.gmail.com>
References: <c3c5c8950903170502h54d1ea7bv6c6631b8c88debdb@mail.gmail.com>
Message-ID: <320fb6e00903170551u284b1f20v4a77fedd7bdbfbed@mail.gmail.com>

On Tue, Mar 17, 2009 at 12:02 PM, mgenome <mgenome at gmail.com> wrote:
> ... I examined several programs, but I didn't find one good enough
> to use for publication.
> ACT (Artemis) can draw a comparison figure but it cannot show ORFs well.
> inGeno is the program closest to what I want, but it cannot compare multiple
> genomes and I want to draw ORFs as arrows. I know that GenomeDiagram in
> Python does not support comparison of ORFs at the genomic level.

Based on your description, I was going to suggest ACT (Artemis), but
you have already considered this.

GenomeDiagram has been integrated into Biopython and will be part of
Biopython 1.50, and as part of this work it does now support drawing
features (e.g. ORFs) as simple arrows.  GenomeDiagram is very good at
comparative genomics plots - but not the kind you are interested in.
It wouldn't be very elegant, but you might be able to use
GenomeDiagram to draw two linear genome diagrams, and then combine
this and add the comparison lines on yourself with extra code using
ReportLab directly.  This would probably be quite a lot of work...
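
As a starting point for the "extra code" part, here is a minimal sketch
that only prepares the data for the linker lines: it parses a comparison
table in the format given in the original question and maps each
similarity value to a colour.  The function names, the dictionary keys
and the blue-to-red colour scale are all assumptions made for this
illustration; the actual drawing with GenomeDiagram and ReportLab is not
shown.

```python
import csv
import io


def parse_comparisons(text):
    """Parse rows of: ORF1, start, stop, strand, ORF2, start, stop, strand, similarity."""
    links = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        row = [field.strip() for field in row if field.strip()]
        if len(row) != 9 or not row[1].isdigit():
            continue  # skip the header, blank lines and malformed lines
        links.append({
            "orf1": row[0], "start1": int(row[1]), "stop1": int(row[2]),
            "strand1": row[3],
            "orf2": row[4], "start2": int(row[5]), "stop2": int(row[6]),
            "strand2": row[7],
            "similarity": float(row[8]),
        })
    return links


def similarity_to_rgb(similarity, low=0.0, high=100.0):
    """Blend linearly from blue (low similarity) to red (high similarity)."""
    fraction = (similarity - low) / (high - low)
    fraction = max(0.0, min(1.0, fraction))  # clamp out-of-range values
    return (fraction, 0.0, 1.0 - fraction)


table = """ORF1, start, stop, strand, ORF2, start, stop, strand, similarity,
genome1_ORF1, 1, 200, +, genome2_ORF1, 1,  300,  -, 50
genome1_ORF2, 201, 400, +, genome2_ORF3,  320, 500, -, 90
"""
for link in parse_comparisons(table):
    print(link["orf1"], link["orf2"], similarity_to_rgb(link["similarity"]))
```

Each parsed link carries the coordinates of both ORFs, so a linker line
could be drawn between the midpoints of the two features in whatever
coordinate system the final figure uses.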

Peter


From biopython at maubp.freeserve.co.uk  Tue Mar 17 16:52:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 17 Mar 2009 16:52:23 +0000
Subject: [BioPython] Biopython contributors and participants listings
Message-ID: <320fb6e00903170952t329332aer310906da64f49cb6@mail.gmail.com>

Hi all,

We're starting to prepare for the release of Biopython 1.50, so it
seems a good occasion to update the Biopython contributors and
participants listing.  I've just changed the formatting for the wiki
page, and to me at least this looks much nicer now - you can look at
the history and decide for yourselves:
http://biopython.org/wiki/Participants

I see some of you aren't on this participants wiki page and probably
should be (e.g. Tiago), so could I encourage relevant people to add
themselves.  Likewise if you have contributed to the project and think
you have been left out of the contributors file, please let us know:
http://biopython.org/SRC/biopython/CONTRIB
or:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/CONTRIB?cvsroot=biopython

Peter


From biopython at maubp.freeserve.co.uk  Tue Mar 17 17:38:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 17 Mar 2009 17:38:58 +0000
Subject: [BioPython] [Biopython-dev] PDB Parser error
In-Reply-To: <3715adb70903171034s2124de04k7e4ee719c188902a@mail.gmail.com>
References: <3715adb70903170830x61bb6e3bl4412a8cf1504d80c@mail.gmail.com>
	<320fb6e00903170901v6533910bl57ddd534dc05cf51@mail.gmail.com>
	<3715adb70903171034s2124de04k7e4ee719c188902a@mail.gmail.com>
Message-ID: <320fb6e00903171038m72127569m279801556e5b9551@mail.gmail.com>

On Tue, Mar 17, 2009 at 5:34 PM, Rodrigo faccioli
<rodrigo_faccioli at uol.com.br> wrote:
> Peter,
>
> Your suspicion was correct. When I received a database value, it was stored
> in a tuple. The solution was to convert the values to string objects, for
> which I used the str command.
>
> Now, I can proceed with my tests.
>
> Thanks for your help.

OK, good luck.

Peter


From mitlox at op.pl  Wed Mar 18 09:05:58 2009
From: mitlox at op.pl (mitlox)
Date: Wed, 18 Mar 2009 19:05:58 +1000
Subject: [BioPython] protein-ligand interactions
Message-ID: <49C0B976.1020005@op.pl>

Hello,
I have a solved structure (1E8W) with a ligand and I would like to know 
which residues are within 3 Å of the ligand. The 3 Å cutoff should be 
applied just to the C-alpha atom of each residue, and it would be great 
to know which residue each C-alpha belongs to.

I am a newbie in Biopython/Python; does anyone know of an example showing 
how to do this?

Thank you in advance.

Best regards


From p.j.a.cock at googlemail.com  Wed Mar 18 09:31:14 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 18 Mar 2009 09:31:14 +0000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <49C0B976.1020005@op.pl>
References: <49C0B976.1020005@op.pl>
Message-ID: <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>

On Wed, Mar 18, 2009 at 9:05 AM, mitlox <mitlox at op.pl> wrote:
>
> Hello,
> I have a solved structure (1E8W) with a ligand and I would like to
> know which residues are within 3 Å of the ligand. The 3 Å cutoff
> should be applied just to the C-alpha atom of each residue, and it
> would be great to know which residue each C-alpha belongs to.
>
> I am a newbie in Biopython/Python; does anyone know of an example
> showing how to do this?

Hi,

I've got a couple of PDB examples on my personal website, and although
they need a little update to use NumPy instead of Numeric, I think the
page on doing protein contact maps would be very informative:
http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/

In your case, for each residue in the protein you'll want to use just
the C-alpha atom (in the residue's atom dictionary under the key
"CA"), but I think you should loop over all the atoms in the ligand
in order to find the least distance.
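
A minimal sketch of that loop, written against plain coordinate tuples
so it runs without Biopython.  The function name residues_near_ligand
and the shape of its inputs are assumptions made for this illustration;
with Bio.PDB the C-alpha coordinates would come from something like
residue["CA"].coord, and the ligand atoms from the hetero residue(s).

```python
import math


def residues_near_ligand(ca_coords, ligand_coords, cutoff=3.0):
    """Return {residue_id: least distance} for residues whose C-alpha
    lies within `cutoff` of any ligand atom.

    ca_coords: dict mapping a residue identifier to its C-alpha (x, y, z).
    ligand_coords: non-empty list of (x, y, z) tuples, one per ligand atom.
    """
    close = {}
    for res_id, ca in ca_coords.items():
        # Least distance from this C-alpha to any atom of the ligand.
        least = min(math.dist(ca, atom) for atom in ligand_coords)
        if least <= cutoff:
            close[res_id] = least
    return close


# Made-up coordinates, purely to show the call:
ca = {("A", 10): (0.0, 0.0, 0.0), ("A", 11): (10.0, 0.0, 0.0)}
ligand = [(0.0, 2.0, 0.0), (0.0, 5.0, 0.0)]
print(residues_near_ligand(ca, ligand, cutoff=3.0))
```

Keying the result on the residue identifier keeps track of which residue
each C-alpha belongs to, as asked in the original question.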

Peter


From p.j.a.cock at googlemail.com  Wed Mar 18 12:36:06 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 18 Mar 2009 12:36:06 +0000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
References: <49C0B976.1020005@op.pl>
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
Message-ID: <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>

> Hi,
>
> I've got a couple of PDB examples on my personal website, and although
> they need a little update to use NumPy instead of Numeric, I think the
> page on doing protein contact maps would be very informative:
> http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/

I've updated those pages to use NumPy instead of Numeric - all very
straightforward (apart from some issue with rpy for the graphics which
isn't relevant to Biopython):

http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/
http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/

Peter


From dalke at dalkescientific.com  Wed Mar 18 15:34:59 2009
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 18 Mar 2009 16:34:59 +0100
Subject: [BioPython] Fwd: Available: 2 Bioinformatics positions in
	AstraZeneca
References: <45988AB300A3B1468F7CF2F9EF207579C6A3D7@SEMLRDEMBX01.rd.astrazeneca.net>
Message-ID: <F68EF5B8-CFDD-448E-911B-42CDCD7BEB62@dalkescientific.com>

For those interested, there are a couple of temporary bioinformatics  
positions at AstraZeneca/Mölndal (near Gothenburg). Reading the  
announcements, which are in Swedish, I see it's more biomedical  
informatics than sequence analysis (text mining, workflows, a  
decision system for medical researchers).

> Ads:
> https://www.poolia.se/sok-jobb/webcv/JobAd.aspx?jobadid=19008
> http://annonsoversikt.monster.se/getjob.aspx?JobID=79909293&cy=se&where=L%c3%a4n%3aV%c3%a4stra+G%c3%b6taland&lid=1398&re=95&pg=1&dv=1&AVSDM=2009-03-13+11%3a17%3a00&seq=11&fseo=1&isjs=1&re=1000
> https://sjobs.brassring.com/1053/ASP/TG/cim_jobdetail.asp?SID=%5edUuKAW_slp_rhc_DOlGOwdxDn_slp_rhc_PthlP/WlgiP85aWAkz/xRYSIbMXcsvZrHO0fJu5/PZdH3vw1QoLQAr5X3A_C_R__L_F_lA_slp_rhc_0Q7alykZpdfns2LzK3W8x8tde_slp_rhc_tU=&jobId=275215&type=search&partnerid=20054&siteid=5036

Also, if you are doing Python in the Gothenburg area, join us for  
GothPy, the Gothenburg Python user's group:
http://groups.google.com/group/gothpy


				Andrew
				dalke at dalkescientific.com


From n.j.loman at bham.ac.uk  Wed Mar 18 17:59:09 2009
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Wed, 18 Mar 2009 17:59:09 +0000
Subject: [BioPython] [Fwd: Bioinformatician wanted]
Message-ID: <49C1366D.7070105@bham.ac.uk>

Hi all,

I hope biopython'ers will excuse me posting this job advert for a 
Research Fellow at University of Birmingham - the project referenced 
makes heavy use of Biopython. The position holder would interact with 
Biopython on a daily basis, and potentially be able to help the 
Biopython open source effort should they wish.

Cheers,

Nick.


Please pass  this advert on to anyone who might be interested and suitable.

http://www.jobs.ac.uk/jobs/BO446/

Research Fellow

School of Immunity and Infection

*Fixed term for 33 months*

We are looking for a talented bioinformatician to assist in the
development, maintenance and exploitation of an internationally renowned
web-based microbial genomics facility, xBASE. The post holder will build
on our existing achievements with xBASE (http://xbase.ac.uk; Chaudhuri RR,
Loman NJ, Snyder LA, Bailey CM, Stekel DJ, Pallen MJ. Nucleic Acids Res.
2008 36:D543-6).

The work will be carried out under the supervision of Professor Mark
Pallen (Medical School) in collaboration with Dr Dov Stekel
(Biosciences). The post holder will work within an attractive modern
research environment in the University's newly established
inter-disciplinary Centre for Systems Biology.

All candidates must have proficiency in programming within the
Unix/Linux environment, including web-linked database design,
development and management and use of languages such as Perl, PHP, C++,
Python, Ruby or JAVA. Familiarity with BioPerl, BioSQL and MySQL is
highly desirable.

Applicants must possess the critical thinking skills needed to devise
and carry out research projects and should have experience of analysing
macromolecular sequence data. A PhD in a relevant subject area is
desirable and will be required for appointment to a research fellowship.

A flair for design, particularly as applied to web-based resources, good
team-working skills and an ability to work under their own initiative
will provide an advantage, as will experience of research in molecular
bacteriology, comparative genomics, molecular evolution and/or pathogenesis.

Informal enquiries may be addressed to Professor Mark Pallen on 0121 414
7163 or m.pallen at bham.ac.uk

Starting salary £27,183 a year, in the range of £27,183 to £35,469 a
year (potential progression on performance once in post to £37,651). The
post will be offered on a fixed-term contract for a period up to two
years and nine months, starting on or shortly after May 1st 2009.

Interviews will be held in the week beginning Monday 30 March 2009.

Closing date: 23 March 2009   Reference: 39855
To download the details and submit an electronic application online
visit: www.hr.bham.ac.uk/jobs
Alternatively, information can be obtained from 0121 415 9000.

A University of Fairness and Diversity.

Mark

Professor Mark Pallen
Professor of Microbial Genomics
Centre for Systems Biology
Biosciences
University of Birmingham, BIRMINGHAM, B15 2TT
m.pallen at bham.ac.uk
tel ++44(0)121 414 7163

Author: The Rough Guide to Evolution
http://www.amazon.co.uk/Rough-Guide-Evolution-Science-Phenomena/dp/1858289467/

Blog
http://roughguidetoevolution.blogspot.com
feed://roughguidetoevolution.blogspot.com/feeds/posts/default

"There is grandeur in this view of life, with its several powers, having
been originally breathed into a few forms or into one; and that, whilst
this planet has gone cycling on according to the fixed law of gravity,
from so simple a beginning endless forms most beautiful and most
wonderful have been, and are being evolved."
Charles Darwin




From hlapp at gmx.net  Wed Mar 18 18:45:50 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Mar 2009 14:45:50 -0400
Subject: [BioPython] OBF application for Summer of Code has been rejected
Message-ID: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net>

I hope to find out later why, but our Google Summer of Code  
application as an umbrella org has been rejected.

However, NESCent has been accepted. If you can give your project idea  
a phylogenetics/phyloinformatics focus, go and put it up on the  
NESCent ideas page at

http://hackathon.nescent.org/Phyloinformatics_Summer_of_Code_2009

Do so pretty much **now** - we will start broadcasting and reaching  
out to students tonight and tomorrow. If someone comes to the site and  
they don't see a Bio* project that they would have been interested in,  
they may not check back for updates.

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From Yvan.Strahm at bccs.uib.no  Wed Mar 18 18:47:58 2009
From: Yvan.Strahm at bccs.uib.no (Yvan.Strahm at bccs.uib.no)
Date: Wed, 18 Mar 2009 19:47:58 +0100
Subject: [BioPython] How can I get a more explicite error
Message-ID: <20090318194758.pgs14nxoowww4gck@webmail.uib.no>

Hello List,

I am trying to get to grips with Biopython and followed chapter 6 from the  
tutorial (http://www.biopython.org/DIST/docs/tutorial/Tutorial.html)

I ran this script:

from Bio.Blast import NCBIStandalone
import re
import sys

my_blast_db = "/export/scratch/yvans/BEE/Apis_mellifera_ligustica_complete_mitochondrial_genome.fasta"
my_blast_file = sys.argv[1]
my_blast_exe = "/Home/lundalm/yvans/src/blast-2.2.19/bin/blastall"

result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, "blastn",
                                                      my_blast_db, my_blast_file,
                                                      gap_open=5,
                                                      gap_extend=2,
                                                      filter='F',
                                                      expectation=1000)

blast_results = result_handle.read()

my_results = sys.argv[1] + ".xml"

save_file = open(my_results, "w")
save_file.write(blast_results)
save_file.close()

I got this error

[yvans at lundalm BEE]$ python bioblast.py s_1_2_eland_extended.8000000.fta
Traceback (most recent call last):
   File "bioblast.py", line 16, in <module>
     blast_results = result_handle.read()
SystemError: Objects/stringobject.c:4271: bad argument to internal function

if the number of sequences blasted against the db is greater than 500000.
The sequences are short reads from a Solexa sequencing project.

Is there a size limitation?

And should I save (keep) only the sequences I am interested in into  
my_results instead of saving everything?

And is there a way of running some tests before doing the result_handle.read()?

Now I am trying keep_hits=1 as a BLAST parameter in order to reduce  
the size of my_results; we will see.
Thanks for your time and help

Cheers,
yvan


From cjfields at illinois.edu  Wed Mar 18 19:08:48 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 18 Mar 2009 14:08:48 -0500
Subject: [BioPython] [BioSQL-l] OBF application for Summer of Code has
	been rejected
In-Reply-To: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net>
References: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net>
Message-ID: <B306969B-29F9-48C3-9860-F9E30757F0E7@illinois.edu>

Hilmar,

The idea was floated on the google SOC list that language-specific  
organizations that have been accepted may potentially take  
bioinformatics-related  applications.  Specifically, Jonathan Leto  
(from The Perl Foundation) indicated that bioinformatics-related  
projects using BioPerl might be able to apply through them.  Not sure  
about others (Python Software Foundation, etc) but might be worth  
checking into.

Any idea on who's been accepted beyond NESCent?

chris

On Mar 18, 2009, at 1:45 PM, Hilmar Lapp wrote:

> I hope to find out later why, but our Google Summer of Code  
> application as an umbrella org has been rejected.
>
> However, NESCent has been accepted. If you can give your project  
> idea a phylogenetics/phyloinformatics focus, go and put it up on the  
> NESCent ideas page at
>
> http://hackathon.nescent.org/Phyloinformatics_Summer_of_Code_2009
>
> Do so pretty much **now** - we will start broadcasting and reaching  
> out to students tonight and tomorrow. If someone comes to the site  
> and they don't see a Bio* project that they would have been  
> interested in, they may not check back for updates.
>
> 	-hilmar
>
> -- 
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================


From chapmanb at 50mail.com  Wed Mar 18 21:20:07 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 18 Mar 2009 17:20:07 -0400
Subject: [BioPython] How can I get a more explicite error
In-Reply-To: <20090318194758.pgs14nxoowww4gck@webmail.uib.no>
References: <20090318194758.pgs14nxoowww4gck@webmail.uib.no>
Message-ID: <20090318212007.GM57054@sobchak.mgh.harvard.edu>

Hi Yvan;

> I am trying to get to grips with Biopython and followed chapter 6 from the  
> tutorial (http://www.biopython.org/DIST/docs/tutorial/Tutorial.html)
> 
> I run this script:
[...]
> blast_results = result_handle.read()
[...]
> [yvans at lundalm BEE]$ python bioblast.py s_1_2_eland_extended.8000000.fta
> Traceback (most recent call last):
>    File "bioblast.py", line 16, in <module>
>      blast_results = result_handle.read()
> SystemError: Objects/stringobject.c:4271: bad argument to internal function
> 
> if the number of sequences blasted against the db is greater than 500000.
> The sequences are short reads from a Solexa sequencing project.

The result_handle.read() line is pulling the entire large BLAST result
file into memory as a string. You will run out of memory with huge files,
leading to the errors you are seeing.

To limit the problem, run BLAST initially at the command line,
and then process the resulting XML file with the BLAST parser
as described here:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc56

This iterates over 1 record at a time, avoiding the memory issue.
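
If you do still want to save the output from within Python, a minimal
sketch of avoiding the single big read() is to copy the handle to disk
in chunks.  The helper name save_handle and the chunk size are
assumptions made for this illustration; the commented lines show how
the saved XML might then be parsed one record at a time with Biopython.

```python
import shutil


def save_handle(result_handle, out_path, chunk_size=1024 * 1024):
    """Copy a text file-like handle to out_path in fixed-size chunks,
    so the whole report is never held in memory as one string."""
    with open(out_path, "w") as out_handle:
        shutil.copyfileobj(result_handle, out_handle, chunk_size)


# Parsing the saved XML one record at a time (needs Biopython):
#   from Bio.Blast import NCBIXML
#   for record in NCBIXML.parse(open("results.xml")):
#       print(record.query, len(record.alignments))
```

Only one chunk is in memory at any time, so the size of the BLAST
report no longer limits what can be written to disk.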

However, you should be using a short read aligner to map these reads
to the genome. BLAST is not the right tool for this particular
application; massive BLAST report files are going to be one of many
problems you will run into analyzing the data. Here are a couple of
popular aligners designed for the exact problem you are tackling:

Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
Maq: http://maq.sourceforge.net/

Hope this helps,
Brad


From hlapp at gmx.net  Wed Mar 18 22:50:26 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Mar 2009 18:50:26 -0400
Subject: [BioPython] [BioSQL-l] OBF application for Summer of Code has
	been rejected
In-Reply-To: <B306969B-29F9-48C3-9860-F9E30757F0E7@illinois.edu>
References: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net>
	<B306969B-29F9-48C3-9860-F9E30757F0E7@illinois.edu>
Message-ID: <D013D9EA-8509-43B7-988C-3CBA94A580F2@gmx.net>

Yes, thanks for mentioning that, I was going to do so too. The Perl  
Foundation and the Python Software Foundation have been accepted.

I guess there isn't a Java Foundation, and if there is a Ruby one it  
hasn't been accepted or hasn't applied. However, Ruby on Rails has  
been accepted. I don't know how open they would be to a BioRuby project.

	-hilmar

On Mar 18, 2009, at 3:08 PM, Chris Fields wrote:

> Hilmar,
>
> The idea was floated on the google SOC list that language-specific  
> organizations that have been accepted may potentially take  
> bioinformatics-related  applications.  Specifically, Jonathan Leto  
> (from The Perl Foundation) indicated that bioinformatics-related  
> projects using BioPerl might be able to apply through them.  Not  
> sure about others (Python Software Foundation, etc) but might be  
> worth checking into.
>
> Any idea on who's been accepted beyond NESCent?
>
> chris
>
> On Mar 18, 2009, at 1:45 PM, Hilmar Lapp wrote:
>
>> I hope to find out later why, but our Google Summer of Code  
>> application as an umbrella org has been rejected.
>>
>> However, NESCent has been accepted. If you can give your project  
>> idea a phylogenetics/phyloinformatics focus, go and put it up on  
>> the NESCent ideas page at
>>
>> http://hackathon.nescent.org/Phyloinformatics_Summer_of_Code_2009
>>
>> Do so pretty much **now** - we will start broadcasting and reaching  
>> out to students tonight and tomorrow. If someone comes to the site  
>> and they don't see a Bio* project that they would have been  
>> interested in, they may not check back for updates.
>>
>> 	-hilmar
>>
>> -- 
>> ===========================================================
>> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>> ===========================================================
>>
>>
>>
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biosql-l

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From yvan.strahm at bccs.uib.no  Thu Mar 19 08:10:17 2009
From: yvan.strahm at bccs.uib.no (Yvan Strahm)
Date: Thu, 19 Mar 2009 09:10:17 +0100
Subject: [BioPython] How can I get a more explicite error
In-Reply-To: <20090318212007.GM57054@sobchak.mgh.harvard.edu>
References: <20090318194758.pgs14nxoowww4gck@webmail.uib.no>
	<20090318212007.GM57054@sobchak.mgh.harvard.edu>
Message-ID: <49C1FDE9.20305@bccs.uib.no>

Hello Brad,

Thanks for the help, much appreciated.
I will look at Bowtie and Maq. In fact I am interested in reads which are not in the reference, and in
   how they differ from the reference: how many reads have 1, 2, 3, ... indels/mismatches.
Cheers,
yvan

Brad Chapman wrote:
> Hi Yvan;
> 
>> I try to get a grip on Biopython and followed the chapter 6 form the  
>> tutorial (http://www.biopython.org/DIST/docs/tutorial/Tutorial.html)
>>
>> I run this script:
> [...]
>> blast_results = result_handle.read()
> [...]
>> [yvans at lundalm BEE]$ python bioblast.py s_1_2_eland_extended.8000000.fta
>> Traceback (most recent call last):
>>    File "bioblast.py", line 16, in <module>
>>      blast_results = result_handle.read()
>> SystemError: Objects/stringobject.c:4271: bad argument to internal function
>>
>> if the number of sequence blasted agianst the db is greater than 500000.
>> The sequence are small reads from a solexa sequencing project.
> 
> The result_handle.read() line is pulling the entire large BLAST result
> file into memory as a string. You will run out of memory with huge files,
> leading to the errors you are seeing.
> 
> To limit the problem, run BLAST initially at the command line,
> and then process the resulting XML file with the BLAST parser
> as described here:
> 
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc56
> 
> This iterates over 1 record at a time, avoiding the memory issue.
> 
> However, you should be using a short read aligner to map these reads
> to the genome. BLAST is not the right tool for this particular
> application; massive BLAST report files are going to be one of many
> problems you will run into analyzing the data. Here are a couple of
> popular aligners designed for the exact problem you are tackling:
> 
> Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
> Maq: http://maq.sourceforge.net/
> 
> Hope this helps,
> Brad
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


From biopython at maubp.freeserve.co.uk  Thu Mar 19 10:47:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 19 Mar 2009 10:47:05 +0000
Subject: [BioPython] [Fwd: Bioinformatician wanted]
In-Reply-To: <49C1366D.7070105@bham.ac.uk>
References: <49C1366D.7070105@bham.ac.uk>
Message-ID: <320fb6e00903190347v3a70b6b0w46033c5769b38aa5@mail.gmail.com>

On Wed, Mar 18, 2009 at 5:59 PM, Nick Loman <n.j.loman at bham.ac.uk> wrote:
> Hi all,
>
> I hope biopython'ers will excuse me posting this job advert for a Research
> Fellow at University of Birmingham - the project referenced makes heavy use
> of Biopython. The position holder would interact with Biopython on a daily
> basis, and potentially be able to help the Biopython open source effort
> should they wish.
>
> Cheers,
>
> Nick.

I have no objections to posting targeted and directly relevant
academic job adverts here - in fact I rather like it.  I would point
out the job advert text itself doesn't actually mention Biopython -
perhaps you can get HR to amend the copy linked from the University
job page to mention that experience of Biopython, BioPerl or BioSQL
is desirable?

Peter

P.S. Could you add links to Biopython, BioPerl and BioSQL to the xBase
website, maybe on the about page? http://xbase.bham.ac.uk/about.pl

P.P.S. Did you have a chance to try out the patch on Bug 2738 for
speeding up loading GenBank files into BioSQL?
http://bugzilla.open-bio.org/show_bug.cgi?id=2738

Cheers!


From biopython at maubp.freeserve.co.uk  Thu Mar 19 10:52:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 19 Mar 2009 10:52:30 +0000
Subject: [BioPython] How can I get a more explicite error
In-Reply-To: <20090318212007.GM57054@sobchak.mgh.harvard.edu>
References: <20090318194758.pgs14nxoowww4gck@webmail.uib.no>
	<20090318212007.GM57054@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00903190352rcbca60bi4d703dbf65bcd3b0@mail.gmail.com>

On Wed, Mar 18, 2009 at 9:20 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> The result_handle.read() line is pulling the entire large BLAST result
> file into memory as a string. You will run out of memory with huge files,
> leading to the errors you are seeing.

I think Brad is probably right about the memory issue - it is certainly
something to be careful of.  Instead of this:

blast_results = result_handle.read()
my_results=sys.argv[1]+".xml"
save_file = open(my_results, "w")
save_file.write(blast_results)
save_file.close()

You could try keeping only one line in memory:

my_results=sys.argv[1]+".xml"
save_file = open(my_results, "w")
for line in result_handle :
    save_file.write(line)
save_file.close()

Or, we should get round to fixing Bug 2654 which would let you tell
the BLAST tool to save the file itself, which would be much more
elegant.  Do you want to add yourself as a CC to this bug, so you'll
automatically be informed of any updates:
http://bugzilla.open-bio.org/show_bug.cgi?id=2654
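
The line-by-line copy shown earlier in this message can also be written with the standard library's shutil.copyfileobj, which moves the data in fixed-size chunks. This is only a sketch: the helper name save_stream is hypothetical, and in-memory handles stand in for the BLAST result handle and the output file.

```python
import io
import shutil


def save_stream(result_handle, out_handle, chunk_size=64 * 1024):
    """Copy result_handle to out_handle in fixed-size chunks."""
    shutil.copyfileobj(result_handle, out_handle, chunk_size)


# In-memory handles stand in for the BLAST result handle and the output file.
source = io.StringIO("<BlastOutput>\n...\n</BlastOutput>\n")
target = io.StringIO()
save_stream(source, target)
print(len(target.getvalue()))
```

Chunked copying bounds memory use even when the input contains very long lines, which a line-based loop does not guarantee.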

Peter


From mitlox at op.pl  Thu Mar 19 12:55:06 2009
From: mitlox at op.pl (mitlox)
Date: Thu, 19 Mar 2009 22:55:06 +1000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>
References: <49C0B976.1020005@op.pl>	
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
	<320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>
Message-ID: <49C240AA.908@op.pl>

I wrote this code:
------------------------------------------------------------------------------------------------
import Bio.PDB
import numpy

pdb_code = "1E8W"
pdb_filename = "1E8W.pdb" #not the full cage!

structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)

backBoneAtomNames = "N","CA","C","0", "CB"

tempBackbone = [0,0,0,0,0]
Backbone = []
backboneNo = 0

for atom in structure.get_atoms():
    if (atom.get_name() == backBoneAtomNames[backboneNo]) and (backboneNo < len(backBoneAtomNames)):
        tempBackbone[backboneNo] = atom   
        backboneNo+=1
    elif atom.get_name() != backBoneAtomNames[backboneNo]:
        backboneNo = 0
    elif len(backBoneAtomNames) == backboneNo:
        Backbone.extend(tempBackbone)
        for a in tempBackbone:
            print a
------------------------------------------------------------------------------------------------
to identify the backbone, but unfortunately it does not work.

Is there already a way to identify the backbone in Biopython?

Thank you in advance

Peter Cock wrote:
>> Hi,
>>
>> I've got a couple of PDB examples on my personal website, and although
>> they need a little update to use NumPy instead of Numeric, I think the
>> page on doing protein contact maps would be very informative:
>> http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/
>>     
>
> I've updated those pages to use NumPy instead of Numeric - all very
> straight forward (apart from some issue with rpy for the graphics which
> isn't relevant to Biopython):
>
> http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/
> http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/
>
> Peter
>
>   


From p.j.a.cock at googlemail.com  Thu Mar 19 13:31:30 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 19 Mar 2009 13:31:30 +0000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <49C240AA.908@op.pl>
References: <49C0B976.1020005@op.pl>
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
	<320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>
	<49C240AA.908@op.pl>
Message-ID: <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com>

On Thu, Mar 19, 2009 at 12:55 PM, mitlox <mitlox at op.pl> wrote:
> I wrote this code:
> ------------------------------------------------------------------------------------------------
> import Bio.PDB
> import numpy
>
> pdb_code = "1E8W"
> pdb_filename = "1E8W.pdb" #not the full cage!

That comment was about the fact that the PDB file 1XI4 only contains
part of the full clathrin cage.

> structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
> backBoneAtomNames = "N","CA","C","0", "CB"
> ...
> ------------------------------------------------------------------------------------------------
> to identified the backbone, but unfortunately it does not work.
>
> Maybe exist already to identified backbone in Biopython?

I don't understand what you were trying to do. Have you read the
Bio.PDB documentation about the hierarchy of structures, models,
chains, residues and atoms?
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

This is how I would solve the original question, finding the distance
from each residue's C-alpha carbon to the closest atom in the ligand:

import Bio.PDB
import numpy

pdb_code = "1E8W"
pdb_filename = "1E8W.pdb"

structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
chainA = model["A"]

def residue_dist_to_ligand(protein_residue, ligand_residue) :
    """Returns distance from the protein C-alpha to the closest ligand atom."""
    distances = []
    for atom in ligand_residue :
        diff_vector  = protein_residue["CA"].coord - atom.coord
        distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector)))
    return min(distances)

#From looking at the PDB file, ligand is last residue in chain A, named QUE
ligand_res = chainA.child_list[-1]
assert ligand_res.resname == "QUE"
for protein_res in chainA.child_list[:-1] :
    dist = residue_dist_to_ligand(protein_res, ligand_res)
    if dist < 5.0 :
        print protein_res.resname, protein_res.id[1], dist

This gives the following output:

ILE 881 3.64203
VAL 882 3.58559
ALA 885 4.62673
THR 886 4.95211
ILE 963 4.64252
ASP 964 3.08788

If you wanted to, it should be simple to change this to find the closest
distance between any part of each residue and any part of the ligand,
which I expect should give some distances less than 3A.
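
A minimal sketch of that any-atom-to-any-atom variant, written against plain coordinate tuples so it runs without a PDB file. The function name min_atom_distance and the coordinates are invented for illustration; with Bio.PDB one would presumably pass in something like [atom.coord for atom in residue] for each side.

```python
import math


def min_atom_distance(residue_coords, ligand_coords):
    """Return the smallest distance between any residue/ligand atom pair."""
    return min(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(res_xyz, lig_xyz)))
        for res_xyz in residue_coords
        for lig_xyz in ligand_coords
    )


# Made-up coordinates, not taken from 1E8W.
residue = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ligand = [(4.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
print(min_atom_distance(residue, ligand))
```

This compares every atom pair, so the cost grows with the product of the two atom counts; for a single small ligand that should not matter.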

Peter


From mitlox at op.pl  Fri Mar 20 12:18:48 2009
From: mitlox at op.pl (mitlox)
Date: Fri, 20 Mar 2009 22:18:48 +1000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com>
References: <49C0B976.1020005@op.pl>	
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>	
	<320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>	
	<49C240AA.908@op.pl>
	<320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com>
Message-ID: <49C389A8.5090703@op.pl>

Thank you very much for your code, it works and the output is exactly
what I was looking for.

I am trying to build a structureCA object so that I can write the results
to a PDB file (outCA.pdb) like this:
ATOM   5275  CA  ILE A 881      17.242  57.141  22.062  1.00 38.49           C
ATOM   5283  CA  VAL A 882      16.292  57.880  25.678  1.00 38.90           C
....

The second reason for wanting a structureCA object is that I do not want to use:
structureCA = Bio.PDB.PDBParser().get_structure("outCA.pdb", "outCA.pdb")

Unfortunately, with this extension I get the following error:
ILE 881 3.64203
VAL 882 3.58559
ALA 885 4.62673
THR 886 4.95211
ILE 963 4.64252
ASP 964 3.08788
Traceback (most recent call last):
  File "interaction.py", line 31, in ?
    io.save('out.pdb')
  File "/usr/lib/python2.4/site-packages/biopython-1.49-py2.4-linux-i686.egg/Bio/PDB/PDBIO.py", line 121, in save
    for model in self.structure.get_list():
AttributeError: 'list' object has no attribute 'get_list'

Here is the code:
import Bio.PDB
import numpy

pdb_code = "1E8W"
pdb_filename = "1E8W.pdb"

structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
chainA = model["A"]
structureCA = []

def residue_dist_to_ligand(protein_residue, ligand_residue) :
    """Returns distance from the protein C-alpha to the closest ligand atom."""
    distances = []
    for atom in ligand_residue :
        diff_vector  = protein_residue["CA"].coord - atom.coord
        distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector)))
    return min(distances)

#From looking at the PDB file, ligand is last residue in chain A, named QUE
ligand_res = chainA.child_list[-1]
assert ligand_res.resname == "QUE"
for protein_res in chainA.child_list[:-1] :
    dist = residue_dist_to_ligand(protein_res, ligand_res)
    if dist < 5.0 :
        print protein_res.resname, protein_res.id[1], dist
    structureCA.append(protein_res)

io=Bio.PDB.PDBIO()
io.set_structure(structureCA)
io.save('outCA.pdb')

How can I get a structureCA object of the results?

Thank you in advance.

Best regards


From p.j.a.cock at googlemail.com  Fri Mar 20 13:36:47 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Mar 2009 13:36:47 +0000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <49C389A8.5090703@op.pl>
References: <49C0B976.1020005@op.pl>
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
	<320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>
	<49C240AA.908@op.pl>
	<320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com>
	<49C389A8.5090703@op.pl>
Message-ID: <320fb6e00903200636oe48bb71u4cc72bf385ac8e9b@mail.gmail.com>

On Fri, Mar 20, 2009 at 12:18 PM, mitlox <mitlox at op.pl> wrote:
> Thank you very much for your code, it works and the output is exactly for
> what I was looking for.
>
> I try to get a structureCA object to write out the results in a PDB file
> (outCA.pdb) like this:
> ATOM   5275  CA  ILE A 881      17.242  57.141  22.062  1.00 38.49           C
> ATOM   5283  CA  VAL A 882      16.292  57.880  25.678  1.00 38.90           C
> ....
>
> Unfortunately I get this error with ... Here is the code:
> ...
> structureCA = []
> ...
> io=Bio.PDB.PDBIO()
> io.set_structure(structureCA)
> io.save('outCA.pdb')

Your structureCA object is just a python list, containing Residue objects.
Instead you need to create a new object with the partial chain - which
can be done by creating structure, model and chain objects manually.

However, I suggest you re-read pages 5 and 6 of the Bio.PDB
documentation for the recommended approach:
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf
In your case, you'll want to write your own selection class using the
residue distance to the ligand.  I recognise this might seem rather
complicated for a python novice as you have to create your own
class - so here is my solution:

import Bio.PDB
import numpy

pdb_code = "1E8W"
pdb_filename = "1E8W.pdb"

structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
chainA = model["A"]

def residue_dist_to_ligand(protein_residue, ligand_residue) :
    """Returns distance from the protein C-alpha to the closest ligand atom."""
    distances = []
    for atom in ligand_residue :
        diff_vector  = protein_residue["CA"].coord - atom.coord
        distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector)))
    return min(distances)

class NearLigandSelect(Bio.PDB.Select):
    def __init__(self, distance_threshold, ligand_residue) :
        self.threshold = distance_threshold
        self.ligand_res = ligand_residue

    def accept_residue(self, residue):
        if residue == self.ligand_res :
            return True #change this to False if you don't want the ligand
        else :
            dist = residue_dist_to_ligand(residue, self.ligand_res)
            return dist < self.threshold

io=Bio.PDB.PDBIO()
io.set_structure(structure)
#From looking at the PDB file, ligand is last residue in chain A
ligand_res = chainA.child_list[-1]
#Going to use a distance threshold of 4A
io.save("near_ligand.pdb", NearLigandSelect(4, ligand_res))
print "Done"

Peter


From mitlox at op.pl  Fri Mar 20 23:45:56 2009
From: mitlox at op.pl (mitlox)
Date: Sat, 21 Mar 2009 09:45:56 +1000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <320fb6e00903200636oe48bb71u4cc72bf385ac8e9b@mail.gmail.com>
References: <49C0B976.1020005@op.pl>	
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>	
	<320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>	
	<49C240AA.908@op.pl>	
	<320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com>	
	<49C389A8.5090703@op.pl>
	<320fb6e00903200636oe48bb71u4cc72bf385ac8e9b@mail.gmail.com>
Message-ID: <49C42AB4.7050404@op.pl>

Thank you very much for your solution.

Additionally, it would be nice to have a structure object holding the same
information as "near_ligand.pdb", so that I do not need to read the new
pdb file back in:
structureMOD = Bio.PDB.PDBParser().get_structure("near", "near_ligand.pdb")

Is it possible to have both "near_ligand.pdb" and the equivalent structure
object?

Thank you in advance.

Best regards



From mjldehoon at yahoo.com  Sat Mar 21 04:54:08 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 20 Mar 2009 21:54:08 -0700 (PDT)
Subject: [BioPython] Bio.Enzyme (was: Re: [Biopython-dev] Bio.ExPASy)
In-Reply-To: <76595.11423.qm@web62404.mail.re1.yahoo.com>
Message-ID: <517737.76119.qm@web62403.mail.re1.yahoo.com>


I've created a simplified version of the parser in Bio.Enzyme in Bio.ExPASy.Enzyme. The idea behind it is to collect all parsers related to ExPASy databases in Bio.ExPASy so that they can be found more easily by users.

Bio.ExPASy.Enzyme works essentially the same as Bio.Enzyme, but I've done a few things a bit differently. The biggest change is probably that Bio.Enzyme stores information as attributes to a record, whereas Bio.ExPASy.Enzyme has a Record derived from a dictionary, and stores information in the dictionary (same as Bio.Medline). Does anybody have any objection if Bio.ExPASy.Enzyme becomes the "official" parser for ExPASy's Enzyme database? If not, I'll modify the documentation and tests accordingly, and start the deprecation process for Bio.Enzyme.

--Michiel

--- On Sun, 3/15/09, Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> From: Michiel de Hoon <mjldehoon at yahoo.com>
> Subject: [Biopython-dev] Bio.ExPASy
> To: biopython-dev at biopython.org
> Date: Sunday, March 15, 2009, 6:24 AM
> Hi everybody,
> 
> As discussed previously, I have moved the Bio.Prosite code
> to Bio.ExPASy, and I've added a ScanProsite module to
> Bio.ExPASy. I guess Bio.Enzyme should also move to
> Bio.ExPASy. See
> 
> http://biopython.org/DIST/docs/tutorial/Tutorial.proposal.html
> 
> for the documentation of Biopython as currently in CVS.
> 
> --Michiel.
> 
> 
>       
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev


From lueck at ipk-gatersleben.de  Tue Mar 24 09:34:19 2009
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Tue, 24 Mar 2009 10:34:19 +0100
Subject: [BioPython] Emboss eprimer3
Message-ID: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>

Hi!

I have some questions about eprimer3 from EMBOSS, which I call from Python to design primers in batch mode:

1) I'm using the GCclamp function (value=1). Is it possible to limit the G or C's at the end to a maximum of one G or C?

2) Is there a setting to get the original primer3 output? The EMBOSS output is not very useful for hundreds of primers, and a lot of information is missing.

The primer 3 file looks like this:

PRIMER_SEQUENCE_ID=HF15E08r
SEQUENCE=GCATGTAATAATGCCAAAGCTCACAGCTGCAGTTGAATCTTGGGACCCGCGGAGCGAGAATGTACCAATCCATGTATGGGTACACCCATGGCTGCCAACTCTAGGGCAAAGGATAGATACACTGTGCCACTCTATCCGGTACAAGCTGAGTAGTGTCCTCCAATTATGGCAAGCTCACGATTCATCAGCTTATGCTGTGCTATCTCCATGGAAGGGTGTATTTGATCCAGCAAGTTGGGAAGACTTGATAGTGCGTTATATCATTCCTAAACTGAAAATGGCACTCCAGGAGTTCCAGATTAACCCAGCAAGCCAAAAGTTTGACCAGTTTAACTGGGTTATGATCTGGGCTTCTGCTGTCCCGGTACACCATATGGTCCATATGTTGGAAGTTGATTTCTTTAGCAAGTGGCAGCTGGTTTTGTACCATTGGCTGAGCTCACCAAATCCTGATTTCAATGAGATAATGAATTGGTAT
PRIMER_PRODUCT_SIZE_RANGE=500-1000 450-500 400-450 350-400 300-350 250-300 200-250 150-200
PRIMER_OPT_TM=60.0
PRIMER_MIN_TM=58.0
PRIMER_MAX_TM=65.0
PRIMER_MAX_DIFF_TM=3.0
PRIMER_DNA_CONC=420
PRIMER_NUM_RETURN=1
PRIMER_PAIR_PENALTY=0.8691
PRIMER_LEFT_PENALTY=0.708329
PRIMER_RIGHT_PENALTY=0.160746
PRIMER_LEFT_SEQUENCE=GCATGTAATAATGCCAAAGC
PRIMER_RIGHT_SEQUENCE=TTGAAATCAGGATTTGGTGA
PRIMER_LEFT=0,20
PRIMER_RIGHT=458,20
PRIMER_LEFT_TM=59.292
PRIMER_RIGHT_TM=60.161
PRIMER_LEFT_GC_PERCENT=40.000
PRIMER_RIGHT_GC_PERCENT=35.000
PRIMER_LEFT_SELF_ANY=7.00
PRIMER_RIGHT_SELF_ANY=8.00
PRIMER_LEFT_SELF_END=2.00
PRIMER_RIGHT_SELF_END=2.00
PRIMER_LEFT_END_STABILITY=8.5000
PRIMER_RIGHT_END_STABILITY=7.9000
PRIMER_PAIR_COMPL_ANY=5.00
PRIMER_PAIR_COMPL_END=3.00
PRIMER_PRODUCT_SIZE=459

Thanks in advance!
Stefanie


From biopython at maubp.freeserve.co.uk  Tue Mar 24 10:00:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Mar 2009 10:00:46 +0000
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>

2009/3/24 Stefanie Lück <lueck at ipk-gatersleben.de>:
> Hi!
>
> I have some questions about eprimer3 from Emboss which I use over Python to design primers in a batch mode:
>
> 1) I'm using the GCclamp function (value=1). Is it possible to limit the G or C's at the end to maximum of one G or C?

OK, you're using the gcclamp argument (i.e. GC clamp), which is
supported by the Bio.Emboss.Applications wrapper.
http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/eprimer3.html

I don't know if there is a primer3 argument for limiting the G or C's
at the end - have you asked on the EMBOSS mailing list?
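
If no such option turns out to exist, one workaround would be to filter the designed primers afterwards in Python. The sketch below is an assumption about how that could look, not an eprimer3 or primer3 feature: the helper names are made up, and "the end" is taken to mean the 3' end of the primer as written.

```python
def trailing_gc_count(primer):
    """Count consecutive G/C bases at the 3' end of a primer."""
    count = 0
    for base in reversed(primer.upper()):
        if base not in "GC":
            break
        count += 1
    return count


def has_single_gc_clamp(primer):
    """True if the primer ends in exactly one G or C."""
    return trailing_gc_count(primer) == 1


# The first two primers are from the example output in this thread;
# the third is invented to show a passing case.
for primer in ("GCATGTAATAATGCCAAAGC", "TTGAAATCAGGATTTGGTGA", "ACGTACGTAC"):
    print(primer, trailing_gc_count(primer), has_single_gc_clamp(primer))
```

Rejected primers would then need to be redesigned, for example by asking for more candidates per sequence and keeping the first one that passes.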

> 2) Is there a setting to get the original primer3 output? The emboss output is for hundrets of primers not very usefull and many informations are missing.

From reading the documentation there is a "fformat1" argument which
*might* do what you want - you could try this out on the command line
and see.  Note that this argument is not currently supported in the
Bio.Emboss.Applications wrapper, but that would be easy to add.  If
this argument doesn't do what you want, you'd have to ask the EMBOSS
people about alternative output formats. Alternatively, you might
investigate the original Whitehead version of primer3.

Note that if you do succeed in changing the output format, you may
need a new parser to read it.
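
For the TAG=VALUE layout shown in the original question, such a parser could be quite small. The following is a sketch only, assuming the primer3 Boulder-IO convention that a line containing just "=" ends each record; parse_boulder_io is a hypothetical name, not an existing Biopython function, and all values are left as strings.

```python
import io


def parse_boulder_io(handle):
    """Yield one dict of TAG -> value per primer3 Boulder-IO record."""
    record = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if line == "=":
            # A lone "=" terminates the current record.
            yield record
            record = {}
        elif line:
            key, _, value = line.partition("=")
            record[key] = value
    if record:
        # Tolerate a final record with no terminator line.
        yield record


# A few tags copied from the example in this thread, plus a terminator.
SAMPLE = """PRIMER_SEQUENCE_ID=HF15E08r
PRIMER_LEFT_SEQUENCE=GCATGTAATAATGCCAAAGC
PRIMER_RIGHT_SEQUENCE=TTGAAATCAGGATTTGGTGA
PRIMER_PRODUCT_SIZE=459
=
"""

for rec in parse_boulder_io(io.StringIO(SAMPLE)):
    print(rec["PRIMER_SEQUENCE_ID"], rec["PRIMER_PRODUCT_SIZE"])
```

Because it is a generator, it reads one record at a time and should cope with batch output for hundreds of sequences.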

Peter


From mitlox at op.pl  Tue Mar 24 11:12:36 2009
From: mitlox at op.pl (mitlox)
Date: Tue, 24 Mar 2009 21:12:36 +1000
Subject: [BioPython] Superimposer
Message-ID: <49C8C024.60403@op.pl>

Hello,
I read that the Superimposer works only with two lists of atoms
that contain the same number of atoms.

So I decided to use "Combinatorial Extension (CE)". This program returns 
a rotation matrix and a translation vector.

After the execution of CE I took the matrix and vector and tried to use 
it with Superimposer:
------------------------------------------------------------------------------
import sys
import numpy
from Bio.PDB import *


pdb_fix = "../files/1z9g.pdb"
pdb_mov = "../files/1z9g90.pdb"
p=PDBParser()
s1=p.get_structure("FIXED", pdb_fix)
fixed=Selection.unfold_entities(s1, "A")

s2=p.get_structure("MOVING", pdb_mov)
moving=Selection.unfold_entities(s2, "A")

rot=numpy.identity(3).astype('f')
tran=numpy.array((1.0, 2.0, 3.0), 'f')

tran[0] = -0.99996603; tran[1] = -2.00002559; tran[2] = -2.99998285
rot[0][0] = 0.19411441; rot[0][1] = -0.85385353; rot[0][2] = 0.48296351
rot[1][0] = 0.94858827; rot[1][1] = 0.28884874; rot[1][2] = 0.12940907
rot[2][0] = -0.24999979; rot[2][1] = 0.43301335; rot[2][2] = 0.86602514

for atom in moving:
    atom.transform(rot, tran)
   
sup=Superimposer()

sup.set_atoms(fixed, moving)

print sup.rotran
print sup.rms

sup.apply(moving)


print "Saving aligned structure as PDB file %s" % pdb_mov
io=PDBIO()
io.set_structure(s2)
io.save(pdb_mov)

print "Done"
------------------------------------------------------------------------------

Unfortunately "print sup.rotran" returns this:
(array([[ 0.19411383,  0.94858824, -0.25000035],
       [-0.85385389,  0.28884841,  0.43301285],
       [ 0.4829631 ,  0.12940999,  0.86602523]]), array([-0.06470776,  1.91446435,  3.21412203]))

but this matrix and vector are not the same as the ones above.

What am I doing wrong?

Thank you in advance.

Best regards,


From biopython at maubp.freeserve.co.uk  Tue Mar 24 11:43:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Mar 2009 11:43:05 +0000
Subject: [BioPython] Superimposer
In-Reply-To: <49C8C024.60403@op.pl>
References: <49C8C024.60403@op.pl>
Message-ID: <320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com>

On Tue, Mar 24, 2009 at 11:12 AM, mitlox <mitlox at op.pl> wrote:
> Hello,
> I read that the Superimposer works only with the two lists of atoms which
> contain the same amount of atoms.
>
> So I decided to use "Combinatorial Extension (CE)". This program returns a
> rotation matrix and a translation vector.
>
> After the execution of CE I took the matrix and vector and tried to use it
> with Superimposer:

Why?  Once you know the transformation, why do you need to try and
recreate it with the superimposer?  Are you just doing this as a check?

> ------------------------------------------------------------------------------
> import sys
> import numpy
> from Bio.PDB import *
>
>
> pdb_fix = "../files/1z9g.pdb"
> pdb_mov = "../files/1z9g90.pdb"
> p=PDBParser()
> s1=p.get_structure("FIXED", pdb_fix)
> fixed=Selection.unfold_entities(s1, "A")
>
> s2=p.get_structure("MOVING", pdb_mov)
> moving=Selection.unfold_entities(s2, "A")

You should be loading in the ORIGINAL pdb file here, as the moved one
won't exist yet, and if it did, you'd apply the transformation twice.

Note you should expect slight differences due to floating point
calculations.  Your input was:

array([[ 0.19411442, -0.85385352,  0.4829635 ],
       [ 0.94858825,  0.28884873,  0.12940907],
       [-0.24999979,  0.43301335,  0.86602515]], dtype=float32)
array([-0.99996603, -2.00002551, -2.99998283], dtype=float32),

The output was:

array([[ 0.19411439,  0.94858827, -0.24999978],
       [-0.85385353,  0.28884871,  0.43301335],
       [ 0.4829635 ,  0.12940907,  0.86602514]]),
array([-0.06473777,  1.91448618,  3.21410633])

The rotation looks transposed (backwards).  The translation does look
different... however, if you switch this line:
sup.set_atoms(fixed, moving)
to:
sup.set_atoms(moving, fixed)
then things agree.  I suspect something is flipped in the logic of
your script regarding the frames of reference.

Also, at the end you do sup.apply(moving), but you have already
manually moved these atoms, so won't your PDB file have them moved
twice?
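
The numbers above are consistent with the Superimposer simply reporting the inverse of the transformation that was applied by hand. The sketch below checks that arithmetic in plain Python. It assumes coordinates are treated as row vectors multiplied on the right by the rotation matrix (x' = x.R + t), which is my understanding of what atom.transform does; the helper names transform and inverse are made up.

```python
def transform(point, rot, tran):
    """Apply x' = x . rot + tran, treating the point as a row vector."""
    return tuple(
        sum(point[k] * rot[k][j] for k in range(3)) + tran[j]
        for j in range(3)
    )


def inverse(rot, tran):
    """Return (rot_inv, tran_inv) that undoes transform(point, rot, tran)."""
    # For a pure rotation the inverse matrix is the transpose.
    rot_inv = [[rot[j][i] for j in range(3)] for i in range(3)]
    # From x = (x' - t) . R^T, the inverse translation is -t . R^T.
    tran_inv = tuple(
        -sum(tran[k] * rot_inv[k][j] for k in range(3)) for j in range(3)
    )
    return rot_inv, tran_inv


# Rotation and translation as typed into the original script.
ROT = [
    [0.19411441, -0.85385353, 0.48296351],
    [0.94858827, 0.28884874, 0.12940907],
    [-0.24999979, 0.43301335, 0.86602514],
]
TRAN = (-0.99996603, -2.00002559, -2.99998285)

rot_inv, tran_inv = inverse(ROT, TRAN)
print(rot_inv)
print(tran_inv)
```

The inverse rotation is the transpose of the input matrix, and the inverse translation comes out at about (-0.065, 1.914, 3.214), in line with the reported Superimposer output.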

Peter


From mitlox at op.pl  Tue Mar 24 12:18:32 2009
From: mitlox at op.pl (mitlox)
Date: Tue, 24 Mar 2009 22:18:32 +1000
Subject: [BioPython] Superimposer
In-Reply-To: <320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com>
References: <49C8C024.60403@op.pl>
	<320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com>
Message-ID: <49C8CF98.30809@op.pl>

Thank you for your email.  I only want to rotate and translate a pdb
file so that I can see the result in a pdb viewer.

Maybe I do not need the Superimposer object to rotate and translate a
pdb file with a known rotation matrix and translation vector?

Do you know how I could rotate and translate a pdb file?

Thank you in advance.


From biopython at maubp.freeserve.co.uk  Tue Mar 24 12:41:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Mar 2009 12:41:53 +0000
Subject: [BioPython] Superimposer
In-Reply-To: <49C8CF98.30809@op.pl>
References: <49C8C024.60403@op.pl>
	<320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com>
	<49C8CF98.30809@op.pl>
Message-ID: <320fb6e00903240541p5fa8e043wc3363b18b34af37b@mail.gmail.com>

On Tue, Mar 24, 2009 at 12:18 PM, mitlox <mitlox at op.pl> wrote:
> Thank you for your email.  I would only like to rotate and translate a PDB
> file so that I can see the result in a PDB viewer.

I see.

> Maybe I do not need the Superimposer object to rotate and translate a pdb
> file with known rotation matrix and translation vector?

Correct.

> Do you know how could I rotate and translate a pdb file?

You've got most of the steps already.  This is my suggestion:

import numpy
from Bio import PDB

pdb_fix = "1z9g.pdb"
pdb_mov = "1z9g_moved.pdb"

structure = PDB.PDBParser().get_structure("FIXED", pdb_fix)

# Translation vector and rotation matrix, as reported by CE
tran = numpy.array((-0.99996603, -2.00002559, -2.99998285))
rot = numpy.array(((+0.19411441, -0.85385353, +0.48296351),
                   (+0.94858827, +0.28884874, +0.12940907),
                   (-0.24999979, +0.43301335, +0.86602514)))

print("Applying transformation...")
for atom in structure.get_atoms():
    # Atom.transform right-multiplies the coordinates by rot, then adds tran
    atom.transform(rot, tran)

print("Saving transformed structure as PDB file %s" % pdb_mov)
io = PDB.PDBIO()
io.set_structure(structure)
io.save(pdb_mov)
print("Done")

NOTE - When a transformation is given as a translation vector and a
rotation matrix, there is some ambiguity about which order to apply them
in.  If the results using Bio.PDB don't match what you expect, you may
want to double check this first.
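
To make that ambiguity concrete, here is a minimal sketch in plain Python
(no Biopython or NumPy needed).  The 90 degree rotation, the vectors and
the helper function names are made up for illustration.  As far as I know,
Atom.transform in Bio.PDB follows convention A below (coordinate times
rotation matrix, then add the translation), but do check this against
your own data.

```python
# Illustrative values only: a 90 degree rotation about the z axis.
rot = [[0.0, -1.0, 0.0],
       [1.0,  0.0, 0.0],
       [0.0,  0.0, 1.0]]
tran = [1.0, 2.0, 3.0]
coord = [1.0, 0.0, 0.0]


def row_times_matrix(v, m):
    """Return v . m, treating v as a row vector."""
    return [sum(v[k] * m[k][j] for k in range(3)) for j in range(3)]


def matrix_times_column(m, v):
    """Return m . v, treating v as a column vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


# Convention A: rotate with the row vector on the left, then translate.
a = add(row_times_matrix(coord, rot), tran)
# Convention B: rotate with the column vector on the right, then translate.
b = add(matrix_times_column(rot, coord), tran)
# Convention C: translate first, then rotate as in A.
c = row_times_matrix(add(coord, tran), rot)

print(a, b, c)
```

All three results differ.  Convention B is the same as convention A with
the rotation matrix transposed, which is one way a rotation can end up
looking "backwards" when two programs are compared.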

Peter


From cjfields at illinois.edu  Tue Mar 24 16:51:32 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 24 Mar 2009 11:51:32 -0500
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>
	<320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>
Message-ID: <656D2F16-80DD-4976-90FE-2BCB8802093E@illinois.edu>

On Mar 24, 2009, at 5:00 AM, Peter wrote:

> ...
> From reading the documentation there is a "fformat1" argument which
> *might* do what you want - you could try this out on the command line
> and see.  Note that this argument is not currently supported in the
> Bio.Emboss.Applications wrapper, but that would be easy to add.  If
> this argument doesn't do what you want, you'd have to ask the EMBOSS
> people about alternative output formats. Alternatively, you might
> investigate the original Whitehead version of primer3.

Peter,

Not sure if this will be a problem for the BioPython wrapper for  
primer3, but the latest Primer3 version on Sourceforge (v2.0.0a)  
radically changes the various input parameters.  I had to rewrite a  
bunch of code to handle those as well as older (v1) primer3 params.

> Note that if you do succeed in changing the output format, you may
> need a new parser to read it.
>
> Peter

primer3 input and output are in BoulderIO format (which I think is an  
essentially obsolete format Lincoln Stein wrote up many years ago).  It's  
very easy to parse, just simple key-value pairings.
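
As a rough sketch of what I mean (not code from BioPerl or Biopython), a
parser can be as small as the function below.  It assumes each line is
TAG=VALUE and that a line holding only "=" ends a record.  The function
name and the tag names in the sample text are just for illustration, and
as noted above the real tag names differ between primer3 versions.

```python
def parse_boulder_io(text):
    """Parse BoulderIO-style text into a list of dicts, one per record.

    Each line is TAG=VALUE; a line containing only "=" ends a record.
    """
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "=":
            records.append(current)
            current = {}
            continue
        # Split on the first "=" only, so values may contain "=".
        tag, _, value = line.partition("=")
        current[tag] = value
    if current:
        # Tolerate a final record with no terminating "=" line.
        records.append(current)
    return records


# Illustrative input only; real primer3 tag names depend on the version.
example = """PRIMER_SEQUENCE_ID=example
SEQUENCE=ACGTACGTACGT
=
"""
records = parse_boulder_io(example)
print(records)
```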

chris


From nir at rosettadesigngroup.com  Wed Mar 25 16:18:24 2009
From: nir at rosettadesigngroup.com (Nir London)
Date: Wed, 25 Mar 2009 18:18:24 +0200
Subject: [BioPython] Rosetta Academic Training Webinar
Message-ID: <88F0F36A-FC4D-4A9C-AC31-5B883C3F92CB@rosettadesigngroup.com>

The Rosetta Design Group is proud to present the first webinar in the  
Rosetta Academic Workshop Series. For the first webinar, we have  
chosen to focus on Protein-Protein Docking based on the answers to  
the interest poll. We hope this will be the first in a line of helpful  
and inspiring webinars to kick-off our Rosetta Academic Workshop Series.

What: Protein-Protein Docking
When: May 4th 2009, 0800-1000 AM EST

Where: Your office!

Click here for more details and registration

(For non html emails: http://rosettadesigngroup.com/RDGLS/index.php?sid=54479&lang=en)

Please note: This is not a promotional webinar. Rosetta is open-source  
and freeware for academic and non-profit organizations and can be  
downloaded here from University of Washington's TechTransfer Digital  
Ventures. The majority of the webinar is concerned with Rosetta 2.3.0.  
Rosetta 3.0 is still a beta version.

Hope to see you there,

Nir London.

Rosetta Design Group | http://rosettadesigngroup.com/


From biopython.chen at gmail.com  Thu Mar 26 02:59:04 2009
From: biopython.chen at gmail.com (chen Ku)
Date: Wed, 25 Mar 2009 19:59:04 -0700
Subject: [BioPython] how to retrieve data from PDB
Message-ID: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>

Dear all,
I need your help in writing code to retrieve some of the PDB structures.

Problem definition:
I only want to use some of the PDB files, not all 50,000.

I want to write Python code that picks out, from all the PDB data, only
the structures of transcription factors bound to DNA. Please guide me on
how to proceed. I have read some published articles on this dataset and
want to do the selection with Python rather than manually. This is part of
our course work in structural biology, so I am trying it on my own with
some help from you all. I need general code where I can check this kind of
thing by changing the field name. I would be grateful for any help, as I
am a beginner in Python.


Regards
Chen


From lueck at ipk-gatersleben.de  Thu Mar 26 09:42:42 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Thu, 26 Mar 2009 10:42:42 +0100
Subject: [BioPython] Emboss eprimer3
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>
	<320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>
Message-ID: <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>

Hi!

I got a patch to add a '-originalformat' argument. If someone is interested 
too, I could send it to him or the mailing list.

>>>Note that if you do succeed in changing the output format, you may need a 
>>>new parser to read it.

This is no problem. I just need the data ;-)

>>> I don't know if there is a primer3 argument for limiting the G or C's at 
>>> the end - have you asked on the EMBOSS mailing list?

Yes, no answer yet.

Kind regards
Stefanie


----- Original Message ----- 
From: "Peter" <biopython at maubp.freeserve.co.uk>
To: "Stefanie Lück" <lueck at ipk-gatersleben.de>
Cc: <biopython at lists.open-bio.org>
Sent: Tuesday, March 24, 2009 11:00 AM
Subject: Re: [BioPython] Emboss eprimer3


2009/3/24 Stefanie Lück <lueck at ipk-gatersleben.de>:
> Hi!
>
> I have some questions about eprimer3 from Emboss which I use over Python 
> to design primers in a batch mode:
>
> 1) I'm using the GCclamp function (value=1). Is it possible to limit the G 
> or C's at the end to maximum of one G or C?

OK, you're using the gcclamp argument (i.e. GC clamp), which is
supported by the Bio.Emboss.Applications wrapper.
http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/eprimer3.html

I don't know if there is a primer3 argument for limiting the G or C's
at the end - have you asked on the EMBOSS mailing list?

> 2) Is there a setting to get the original primer3 output? The EMBOSS 
> output is not very useful for hundreds of primers, and a lot of 
> information is missing.

From reading the documentation there is a "fformat1" argument which
*might* do what you want - you could try this out on the command line
and see.  Note that this argument is not currently supported in the
Bio.Emboss.Applications wrapper, but that would be easy to add.  If
this argument doesn't do what you want, you'd have to ask the EMBOSS
people about alternative output formats. Alternatively, you might
investigate the original Whitehead version of primer3.

Note that if you do succeed in changing the output format, you may
need a new parser to read it.

Peter


From biopython at maubp.freeserve.co.uk  Thu Mar 26 10:23:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Mar 2009 10:23:01 +0000
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>
	<320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>
	<005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00903260323p50f80c50w1ab07c8892518190@mail.gmail.com>

On Thu, Mar 26, 2009 at 9:42 AM, Stefanie Lück <lueck at ipk-gatersleben.de> wrote:
> Hi!
>
> I got a patch to add a '-originalformat' argument. If someone is interested
> too, I could send it to him or the mailing list.

Could you file a bug on Bugzilla please, and then (after the bug is
filed) you can attach the patch.  I'll look at this (if Brad doesn't
first) - if you can also include a short example that would be
excellent.

Thank you,

Peter


From biopython at maubp.freeserve.co.uk  Thu Mar 26 11:04:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Mar 2009 11:04:29 +0000
Subject: [BioPython] how to retrieve data from PDB
In-Reply-To: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>
References: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>
Message-ID: <320fb6e00903260404y7f5d5606jc46b4d3e87eeb9bb@mail.gmail.com>

On Thu, Mar 26, 2009 at 2:59 AM, chen Ku <biopython.chen at gmail.com> wrote:
> Dear all,
> I need your help in writing code to retrieve some of the pdb
> structures.
>
> Problem definition
> I just want to use some PDB file not all 50,000.
>
> I want to apply one python code so that I can know transcription factor
> binding to DNA only out of all pdb data. So please guide me how to proceed
> for this.

According to the website, there are about 2250 protein structures in
complex with nucleotides - and I assume some of these are for
transcription factors with DNA:
http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=molType-protein-nucleic-complex&seqid=100

I assume you'll want to search the PDB for entries which are
transcription factors binding to DNA, but I don't know enough about
the PDB search options to advise you.

Peter


From jblanca at btc.upv.es  Thu Mar 26 11:48:02 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Thu, 26 Mar 2009 12:48:02 +0100
Subject: [BioPython] about the SeqRecord slicing
Message-ID: <200903261248.02279.jblanca@btc.upv.es>

Hi:
I'm working with the SeqRecord slicing from CVS and I think that the behaviour 
could be slightly changed. In fact, the same opinion is written in the 
__getitem__ method:

        if isinstance(index, int) :
            #NOTE - The sequence level annotation like the id, name, etc
            #do not really apply to a single character.  However, should
            #we try and expose any per-letter-annotation here?  If so how?
            return self.seq[index]

I don't like the fact that the SeqRecord returns different classes depending 
on the index type. I think it is better to always return a SeqRecord because:
- It simplifies the interface. It's easier to deal with the SeqRecord class if 
its behaviour is simple. Otherwise the code that uses the SeqRecord has to 
check whether it got back a str or a SeqRecord.
- Integer indexing loses the per-letter-annotation. I'm working with qualities 
and I'm interested in keeping them.
- It's redundant, because if we want to index the seq property we can do it 
with: seqrec.seq[index]
Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)


From biopython at maubp.freeserve.co.uk  Thu Mar 26 12:05:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Mar 2009 12:05:25 +0000
Subject: [BioPython] about the SeqRecord slicing
In-Reply-To: <200903261248.02279.jblanca@btc.upv.es>
References: <200903261248.02279.jblanca@btc.upv.es>
Message-ID: <320fb6e00903260505j387279b7kfa4c69c33efe5487@mail.gmail.com>

On Thu, Mar 26, 2009 at 11:48 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Hi:
> I'm working with the SeqRecord slicing from CVS and I think that the behaviour
> could be slightly changed. In fact, the same opinion is written in the
> __getitem__ method:
>
>         if isinstance(index, int) :
>             #NOTE - The sequence level annotation like the id, name, etc
>             #do not really apply to a single character.  However, should
>             #we try and expose any per-letter-annotation here?  If so how?
>             return self.seq[index]
>
> I don't like the fact that the SeqRecord returns different classes depending
> on the index type. I think it is better to always return a SeqRecord because:
> - It simplifies the interface. It's easier to deal with the SeqRecord class if
> its behaviour is simple. Otherwise the code that uses the SeqRecord has to
> check whether it got back a str or a SeqRecord.
> - Integer indexing loses the per-letter-annotation. I'm working with qualities
> and I'm interested in keeping them.
> - It's redundant, because if we want to index the seq property we can do it
> with: seqrec.seq[index]
> Best regards,

Hi Jose,

As we are talking about the CVS code, maybe this could have been on
the dev mailing list, but as it's of general interest let's carry on
here for now.

You note that (currently in CVS) the new SeqRecord slicing returns a
SeqRecord for a slice, but a single letter string for a single integer
index.

This isn't so different from the Seq object - it returns a new Seq
object for a slice, but a single letter string for a single integer
index:
>>> from Bio.Seq import Seq
>>> s = Seq("ACGT")
>>> s
Seq('ACGT', Alphabet())
>>> s[0]
'A'
>>> s[0:3]
Seq('ACG', Alphabet())

More generally, consider lists in Python:
>>> x = [1,2,3,4,5]
>>> x[0]
1
>>> x[0:3]
[1, 2, 3]

So I don't agree with this expectation that slicing and indexing a
SeqRecord should automatically both give a SeqRecord.  You really want
a SeqRecord for a single character string?

Can you give me an example of where you want to pull out a single
character from a SeqRecord, and its quality?  I would consider things
like this quite elegant:

for letter, quality in zip(record.seq,
                           record.letter_annotations["phred_quality"]):
    # do stuff
    pass
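
To show the convention I am describing without needing Biopython, here is
a small sketch using a made-up ToyRecord class (it is not the SeqRecord
implementation, and the qualities attribute is a hypothetical stand-in for
letter_annotations).  An integer index gives back the bare letter, while a
slice gives back the same kind of object with its per-letter data cut to
match.

```python
class ToyRecord(object):
    """Hypothetical stand-in for a record with per-letter qualities."""

    def __init__(self, seq, qualities):
        if len(seq) != len(qualities):
            raise ValueError("need one quality per letter")
        self.seq = seq
        self.qualities = list(qualities)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice keeps the container type and its annotation.
            return ToyRecord(self.seq[index], self.qualities[index])
        # A single integer gives just the letter, like str or list.
        return self.seq[index]


record = ToyRecord("ACGT", [30, 20, 40, 10])
sub = record[1:3]
pairs = list(zip(record.seq, record.qualities))
print(record[0], sub.seq, sub.qualities, pairs)
```

With this design a length-one slice such as record[2:3] is one way to keep
a single letter together with its quality, if that is really what is
needed.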

Peter


From chapmanb at 50mail.com  Thu Mar 26 12:40:45 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 26 Mar 2009 08:40:45 -0400
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de>
	<320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>
	<005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
Message-ID: <20090326124045.GD21577@sobchak.mgh.harvard.edu>

Hi all;

Stefanie:
> I got a patch to add a '-originalformat' argument. If someone is interested 
> too, I could send it to him or the mailing list.

Is this a patch to EMBOSS itself? If so, did the developers indicate
it would be in future versions of EMBOSS?

If that's the case, we can easily add this option to the commandline
interface. You need a:

           _Option(["-originalformat"], ["input"], None, 0),

line in Bio.Emboss.Applications.Primer3Commandline.

> >>>Note that if you do succeed in changing the output format, you may need a 
> >>>new parser to read it.
> 
> This is no problem. I just need the data ;-)

Out of curiosity, what parameter did you find useful from that
output that is not in the eprimer3 format output?

> >>> I don't know if there is a primer3 argument for limiting the G or C's at 
> >>> the end - have you asked on the EMBOSS mailing list?
> 
> Yes, no answer yet.

What I do in cases like this is ask for more primers
(-numreturn) and then post-parse them to pull out the ones that
satisfy my additional criteria. The output is ordered by primer3's
ranking, so the first one that passes the criteria would move on.
If none are satisfactory, then you can also build in logic to
decide if any are good enough for your use (for example, 2 G/Cs at
the end) and pick one from this remaining group with less stringency.
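
A minimal sketch of that post-filtering step is below.  It assumes you have
already parsed the primer sequences into a list in primer3's ranked order;
the function names, the two-base window and the sequences are all made up
for illustration.

```python
def gc_at_3prime_end(primer, window=2):
    """Count G/C bases in the last `window` bases of a primer."""
    return sum(1 for base in primer.upper()[-window:] if base in "GC")


def pick_primer(primers, max_gc=1, window=2):
    """Return the first primer with at most max_gc G/C in its 3' window.

    `primers` is assumed to be in primer3's ranked order, best first.
    Returns None if nothing passes.
    """
    for primer in primers:
        if gc_at_3prime_end(primer, window) <= max_gc:
            return primer
    return None


# Made-up sequences standing in for parsed eprimer3 output.
candidates = ["ACGTTGCAGG", "TTGACCATGA", "GGATCCAATC"]
strict = pick_primer(candidates, max_gc=1)
if strict is None:
    # Nothing passed, so fall back to a less stringent rule.
    strict = pick_primer(candidates, max_gc=2)
relaxed = pick_primer(candidates, max_gc=2)
print(strict, relaxed)
```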

Brad




From biopython at maubp.freeserve.co.uk  Fri Mar 27 12:18:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 27 Mar 2009 12:18:04 +0000
Subject: [BioPython] how to retrieve data from PDB
In-Reply-To: <4c2163890903261953k2f73613cvdc5d4bb497474f43@mail.gmail.com>
References: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>
	<320fb6e00903260404y7f5d5606jc46b4d3e87eeb9bb@mail.gmail.com>
	<4c2163890903261953k2f73613cvdc5d4bb497474f43@mail.gmail.com>
Message-ID: <320fb6e00903270518g4eb5150pc1ae6de65da1a72c@mail.gmail.com>

On Fri, Mar 27, 2009 at 2:53 AM, chen Ku <biopython.chen at gmail.com> wrote:
> Thank you so much for the guidance but I need the coding part in python to
> retrieve the data.
>
> Any help will be helpful for me.

Have a look at the Bio.PDB.PDBList module in Biopython - this may do
what you want.

Peter


From p.j.a.cock at googlemail.com  Fri Mar 27 17:31:55 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 27 Mar 2009 17:31:55 +0000
Subject: [BioPython] Biopython application note published
Message-ID: <320fb6e00903271031k2bd31464k8aaa075f8de39c82@mail.gmail.com>

Dear all,

An Application Note describing Biopython has recently been accepted
for publication in the Oxford Journal Bioinformatics. An advance copy
of the Open Access article is available online:

P.J.A. Cock, T. Antao, J.T. Chang, B.A. Chapman, C.J. Cox, A. Dalke,
I. Friedberg, T. Hamelryck, F. Kauff, B. Wilczynski and M.J.L. de Hoon
(2009) Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics,
doi:10.1093/bioinformatics/btp163
http://dx.doi.org/10.1093/bioinformatics/btp163

This was announced at the start of the week on our news page (to which
you can subscribe using the RSS or Atom feeds), but was worth
repeating for the mailing lists.  See
http://news.open-bio.org/news/2009/03/biopython-paper-published/

Peter


From biopython at maubp.freeserve.co.uk  Tue Mar 31 10:08:08 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 31 Mar 2009 11:08:08 +0100
Subject: [BioPython] how to retrieve data from PDB
In-Reply-To: <4c2163890903310245oda7390bm829aee6f4f369478@mail.gmail.com>
References: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>
	<320fb6e00903260404y7f5d5606jc46b4d3e87eeb9bb@mail.gmail.com>
	<4c2163890903261953k2f73613cvdc5d4bb497474f43@mail.gmail.com>
	<320fb6e00903270518g4eb5150pc1ae6de65da1a72c@mail.gmail.com>
	<4c2163890903310245oda7390bm829aee6f4f369478@mail.gmail.com>
Message-ID: <320fb6e00903310308q38168dbfx447c78c6da5454ee@mail.gmail.com>

On Tue, Mar 31, 2009 at 10:45 AM, chen Ku <biopython.chen at gmail.com> wrote:
> Dear peter,
> thanks for the idea. I think I need to download all the pdb
> files first and then can use command on python mode. Can you please write
> one syntax to start with or give me the practical documentation so that I
> can try out and play with this PDBList.

Hi Chen,

To learn about the PDBList functionality, see page 4 of "The Biopython
Structural Bioinformatics FAQ" - this has some examples:
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

You can also read about PDBList from the built-in help:
>>> from Bio import PDB
>>> help(PDB.PDBList)
Or online at http://biopython.org/DIST/docs/api/Bio.PDB.PDBList%27.PDBList-class.html

If you really do want to download all 56,000+ PDB files (and I don't
think this is a good idea), instead of using Python, you might also
consider using the command line tool rsync, see:
http://www.pdb.org/pdb/general_information/news_publications/newsletters/2003q3/focus_rsync.html

However, as I said before, you only want transcription factors with
DNA, so at most you'll need to download the 2250 protein structures in
complex with nucleotides.  I strongly urge you to find out more about
searching the PDB in order to get a list of just the few PDB reference
codes that you'll actually need - and download just those.
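
Once you have such a list, something along these lines might be a starting
point.  It is only a sketch: the helper names and the example codes are
placeholders, and it assumes a reasonably recent Biopython in which
PDBList.retrieve_pdb_file accepts the pdir and file_format arguments.  The
download call is left commented out because it needs network access.

```python
import re

# Classic PDB codes: a digit followed by three letters or digits.
PDB_CODE = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def clean_pdb_codes(text):
    """Return unique, lower-case PDB codes from space/comma separated text."""
    codes = []
    for token in re.split(r"[\s,]+", text.strip()):
        if PDB_CODE.match(token) and token.lower() not in codes:
            codes.append(token.lower())
    return codes


def download_structures(codes, pdir="pdb_files"):
    """Download each code with Bio.PDB.PDBList (needs Biopython and network)."""
    # Imported here so that clean_pdb_codes works without Biopython installed.
    from Bio.PDB import PDBList
    pdbl = PDBList()
    return [pdbl.retrieve_pdb_file(code, pdir=pdir, file_format="pdb")
            for code in codes]


# Placeholder codes; replace with the result of your own PDB search.
wanted = clean_pdb_codes("1ABC, 2xyz 1abc not-a-code")
print(wanted)
# download_structures(wanted)  # uncomment to actually download
```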

Peter