From mmokrejs at fold.natur.cuni.cz  Wed Mar  2 18:00:04 2011
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Thu, 03 Mar 2011 00:00:04 +0100
Subject: [Biopython] traditional NCBI blast vs. blast+
Message-ID: <4D6ECBF4.9050006@fold.natur.cuni.cz>

Hi,
  I needed to run and parse some blastn analyses. I had a look at the Tutorial
and followed the currently recommended blast+ approach. Somehow I was not
getting any results. It seems to me a formatdb-formatted database is not readable
by the blast+ tools. I had a look at what tools are installed on my Gentoo Linux
along with blastn, blastx and the other tools coming from the blast+ bundle, and from
the filenames I just could not guess what I am supposed to run over my fasta
target database to make it searchable by blastn. I would prefer if biopython
would throw some error if there are no appropriate files (whose names could
be guessed depending on the (t)blastn/x/p, etc.).
  The tutorial mentions that I should look up an older version of the Tutorial
for examples of the old NCBI blast usage via biopython. It took me a while but
I found some docs like that through Google. ;-)
  On Gentoo the ncbi-tools++ (aka blast+) package installs no documentation,
not a single README, HOWTO, Changes, just the binaries and libs. What is installed
on other Linux platforms? Would you mind sharing this with me? I just failed
to find through Google what tools I should use instead of formatdb. I found
some FAQ on the NCBI tools++ site but that talked just about the C++ API etc.,
nothing from the user perspective.
  On Gentoo, {asn2asn,rpsblast,test_regexp} from ncbi-tools++ are not
installed because they have the same names as the utilities from the "old" ncbi-tools
(hence overwriting their files). The ncbi-tools++ package is not allowed to be
installed on stable "systems" (lack of testing or open bug reports) so most people
using Gentoo do NOT have ncbi-tools++ and probably won't for a while.
  I propose to keep support for the "old" blast for a long while. Luckily, the
blastall -m 7 xml output seems to be parseable with Bio.Blast.NCBIXML.

  What do you think? Is the blast+ approach faster, more stable, or just newer
so we all like to "upgrade"? Where are some docs and what is the formatdb-like
tool in blast+? ;)
Thanks,
Martin

From nuin at genedrift.org  Wed Mar  2 18:06:17 2011
From: nuin at genedrift.org (Paulo Nuin)
Date: Wed, 2 Mar 2011 18:06:17 -0500
Subject: [Biopython] traditional NCBI blast vs. blast+
In-Reply-To: <4D6ECBF4.9050006@fold.natur.cuni.cz>
References: <4D6ECBF4.9050006@fold.natur.cuni.cz>
Message-ID: <4FC7BB7C-9E17-4699-850E-0A4F4E63521B@genedrift.org>

Hi 

Just answering your blast portion of the question:

- you have to run makeblastdb in order to create the database.
- you should be able to download the blast+ source and compile it yourself; it should compile just fine on your system
- and yes, it seems to be faster and more stable than the previous version, at least in the tests I ran
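
A minimal sketch of the makeblastdb step followed by a blastn search with XML output, driven from Python. The function names and the file names in the usage note are hypothetical; the flags (-in, -dbtype, -out, -query, -db, -outfmt 5) are the standard BLAST+ ones. It assumes the BLAST+ binaries are on the PATH.

```python
import shutil
import subprocess


def makeblastdb_command(fasta_path, db_name, dbtype="nucl"):
    """Build the makeblastdb argument list (the BLAST+ successor to formatdb)."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastn_command(query_path, db_name, xml_path):
    """Build a blastn argument list that writes XML output (-outfmt 5)."""
    return ["blastn", "-query", query_path, "-db", db_name,
            "-outfmt", "5", "-out", xml_path]


def run(command):
    """Run a BLAST+ tool, failing loudly if it is missing or exits non-zero."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(command[0] + " not found on PATH")
    subprocess.run(command, check=True)
```

Typical use would be run(makeblastdb_command("target.fasta", "target_db")) followed by run(blastn_command("query.fasta", "target_db", "results.xml")), where all three file names are placeholders.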

Paulo




From p.j.a.cock at googlemail.com  Thu Mar  3 05:27:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Mar 2011 10:27:54 +0000
Subject: [Biopython] traditional NCBI blast vs. blast+
In-Reply-To: <4D6ECBF4.9050006@fold.natur.cuni.cz>
References: <4D6ECBF4.9050006@fold.natur.cuni.cz>
Message-ID: <AANLkTikiaVSFELhCQd3++jPMrm9Q-Eo__E+pGahgft2c@mail.gmail.com>

On Wed, Mar 2, 2011 at 11:00 PM, Martin Mokrejs
<mmokrejs at fold.natur.cuni.cz> wrote:
> Hi,
>  I needed to run and parse some blastn analysis. I had a look into the Tutorial
> and followed the currently recommended blast+ approach. Somewhat I was not
> getting any results. It seems to me a formatdb-formatted database is not readable
> by the blast+ tools.

I think it is possible to get databases which will work with both legacy BLAST
and BLAST+ (since the NCBI only offer one set for NR etc) but I have not tried
to mix the two. As pointed out by Paulo, the successor to formatdb in BLAST+
is makeblastdb, so just use that instead.

> I had a look what tools are installed on my Gentoo Linux
> along with blastn, blastx and the other tools coming from blast+ bundle and from
> filenames I just could not guess what am I supposed to run over my fasta
> target database to make it searchable by blastn.

This is very clear in the BLAST+ documentation from the NCBI website
(link given below), and is arguably a Gentoo packaging issue.

> I would prefer if biopython
> would throw out some error if there are no appropriate files (which names could
> be guessed depending on the (t)blastn/x/p, etc.).

BLAST+ itself generally gives useful errors.
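
For anyone who still wants an early check on the Python side, here is a rough sketch. It assumes the usual core file extensions of a formatted database (.nhr/.nin/.nsq for nucleotide, .phr/.pin/.psq for protein) and the .nal/.pal alias file of a multi-volume database. The function name is hypothetical, and BLAST can also locate databases through the BLASTDB environment variable, so a miss here is only a hint, not proof that the search will fail.

```python
import os

# Which kind of database each BLAST+ search program reads.
PROGRAM_DBTYPE = {
    "blastn": "nucl", "tblastn": "nucl", "tblastx": "nucl",
    "blastp": "prot", "blastx": "prot",
}
# Core files of a single-volume database, and the alias file of a multi-volume one.
CORE_EXTENSIONS = {"nucl": (".nhr", ".nin", ".nsq"),
                   "prot": (".phr", ".pin", ".psq")}
ALIAS_EXTENSION = {"nucl": ".nal", "prot": ".pal"}


def missing_db_files(db_name, program):
    """Return the database files `program` would need but that do not exist."""
    dbtype = PROGRAM_DBTYPE[program]
    if os.path.exists(db_name + ALIAS_EXTENSION[dbtype]):
        return []  # multi-volume database: the alias file points at the volumes
    paths = [db_name + ext for ext in CORE_EXTENSIONS[dbtype]]
    return [path for path in paths if not os.path.exists(path)]
```

A wrapper could call missing_db_files(db, "blastn") before launching the search and raise an error listing whatever comes back.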

>  The tutorial mentions that I should lookup an older version of the Tutorial
> for examples on the old, NCBI blast usage via biopython. It took me a while but
> I found through Google some docs like that. ;-)

You could have just downloaded one of the old Biopython releases (the zip
or tar balls) and looked in the Doc subdirectory. I'll clarify the current text
in the tutorial to point people there.

>  On Gentoo the ncbi-tools++ (aka blast+) package installs no documentation,
> not a single README, HOWTO, Changes, just the binaries and libs.

File a bug with Gentoo?

> What is installed
> on other Linux platform, would you mind sharing this with me? I just failed
> to find by Google what tools should I use instead of the formatdb. I found
> some FAQ on the NCBI tools++ site but that talked just about C++ API etc.,
> nothing from the user perspective.

You are probably looking for this, linked to from the BLAST+ download page:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/user_manual.pdf

> On Gentoo, the {asn2asn,rpsblast,test_regexp} from ncbi-tools++ is not being
> installed because they have same name as the same utility from "old" ncbi-tools
> (hence overwriting their files). The ncbi-tools++ package is not allowed to be
> installed on stable "systems" (lack of testing or open bug reports) so most people
> using Gentoo do NOT have ncbi-tools++ and probably won't for a while.

I was aware of the name clash for rpsblast, and yes, this is a problem the
NCBI could have avoided.

You could just ignore the Gentoo package and get BLAST+ directly from
the NCBI.

>  I propose to keep support for the "old" blast for a long while.

We've already delayed deprecating the ``legacy'' BLAST wrappers,
but probably we should do that after releasing Biopython 1.57.

> Luckily, the
> blastall -m 7 xml output seems to be parseable with Bio.Blast.NCBIXML.

The NCBI kept the same XML output format, and in fact the plain text
output is close enough that our old text parser could be updated to cope.
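
To illustrate that the XML layout is the same whichever tool wrote it, here is a small standard-library sketch that walks a BLAST XML file. It is a stand-in for illustration only, not a replacement for Bio.Blast.NCBIXML; the function name is hypothetical and it assumes the usual element names (Iteration, Hit, Hit_def, Hsp_evalue) of the NCBI BLAST XML format.

```python
import xml.etree.ElementTree as ET


def iter_first_hsps(xml_source):
    """Yield (query, hit description, e-value) for the first HSP listed per hit.

    `xml_source` is a file name or a binary file object holding BLAST XML,
    e.g. from legacy `blastall -m 7` or from BLAST+ `-outfmt 5`.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "Iteration":
            continue
        query = elem.findtext("Iteration_query-def")
        for hit in elem.iterfind("Iteration_hits/Hit"):
            evalue = hit.findtext("Hit_hsps/Hsp/Hsp_evalue")
            if evalue is not None:
                yield query, hit.findtext("Hit_def"), float(evalue)
        elem.clear()  # free the finished iteration before reading the next one
```

In real code Bio.Blast.NCBIXML gives much richer record objects; the point here is only that one reader copes with both generations of output.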

>  What do you think? Is the blast+ approach faster, more stable, or just newer
> so we all like to "upgrade"?

I like BLAST+ for some new functionality (FASTA vs FASTA for example),
but since the NCBI is dropping the ``legacy'' BLAST you will have to
upgrade at some point.

> Where are some docs and what is the formatdb-like tool in blast+. ;)

I've given links to the docs above, they're linked to on the NCBI website.

Regards,

Peter


From p.j.a.cock at googlemail.com  Thu Mar  3 15:32:11 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Mar 2011 20:32:11 +0000
Subject: [Biopython] Fwd: [Bosc] Bioinformatics Open Source Conference (BOSC
 2011)--Call for Abstracts
In-Reply-To: <3922D2BE-5A99-4CDE-91AB-B311C42E10CE@lbl.gov>
References: <3922D2BE-5A99-4CDE-91AB-B311C42E10CE@lbl.gov>
Message-ID: <AANLkTimNQ=o6Unw461vjPPGUB4VKnB+Ww+qFhT-D2iXq@mail.gmail.com>

Dear Biopythoneers,

BOSC will be in Vienna, Austria this year.

Peter

---------- Forwarded message ----------
From: Nomi Harris <nlharris at lbl.gov>
Date: Thu, Mar 3, 2011 at 7:37 PM
Subject: [Bosc] Bioinformatics Open Source Conference (BOSC
2011)--Call for Abstracts
To: bosc-announce at lists.open-bio.org, members at open-bio.org, GMOD
Announcements List <gmod-announce at lists.sourceforge.net>, GMOD
Developers List <gmod-devel at lists.sourceforge.net>
Cc: Nomi Harris <nlharris at lbl.gov>


We invite you to submit an abstract to BOSC 2011!  Please forward this
message as appropriate, and forgive multiple postings.

Call for Abstracts for the 12th Annual Bioinformatics Open Source
Conference (BOSC 2011)
An ISMB 2011 Special Interest Group (SIG)

Dates: July 15-16, 2011
Location: Vienna, Austria
Web site: http://www.open-bio.org/wiki/BOSC_2011
Email: bosc at open-bio.org
BOSC announcements mailing list:
http://lists.open-bio.org/mailman/listinfo/bosc-announce

Important Dates:
April 18, 2011: Deadline for submitting abstracts to BOSC 2011
May 9, 2011: Notifications of accepted abstracts emailed to
corresponding authors
July 13-14, 2011: Codefest 2011 programming session (see
http://www.open-bio.org/wiki/Codefest_2011 for details)
July 15-16, 2011: BOSC 2011
July 17-19, 2011: ISMB 2011

The Bioinformatics Open Source Conference (BOSC) is sponsored by the
Open Bioinformatics Foundation (O|B|F), a non-profit group dedicated
to promoting the practice and philosophy of Open Source software
development within the biological research community. To be considered
for acceptance, software systems representing the central topic in a
presentation submitted to BOSC must be licensed with a recognized Open
Source License, and be freely available for download in source code
form.

We invite you to submit abstracts for talks and posters.  Sessions include:
- Approaches to parallel processing
- Cloud-based approaches to improving software and data accessibility
- The Semantic Web in open source bioinformatics
- Data visualization
- Tools for next-generation sequencing
- Other Open Source software

In addition to the above sessions, there will be a panel discussion
about "Meeting the challenges of inter-institutional collaboration".
We are also working to arrange a joint session with one of the other
ISMB SIGs.

Thanks to generous sponsorship from Eagle Genomics and an anonymous
donor, we are pleased to announce a competition for three Student
Travel Awards for BOSC 2011. Each winner will be awarded $250 to
defray the costs of travel to BOSC 2011.

For instructions on submitting your abstract, please visit
http://www.open-bio.org/wiki/BOSC_2011#Abstract_Submission_Information

BOSC 2011 Organizing Committee:
Nomi Harris and Peter Rice (co-chairs); Brad Chapman, Peter Cock,
Erwin Frise, Darin London, Ron Taylor


_______________________________________________
BOSC mailing list
BOSC at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bosc


From hlapp at drycafe.net  Fri Mar  4 18:26:25 2011
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Fri, 4 Mar 2011 18:26:25 -0500
Subject: [Biopython] Informatics job opportunity at NESCent
Message-ID: <1878F27F-000D-4C80-B9EA-A83F7887828F@drycafe.net>

(Apologies if you receive multiple copies, and also if you are not  
interested in job opportunities.  In my defense, quite a few people on  
Bio* lists might qualify for (let alone enjoy) the position. And if  
you know someone who might be interested please forward.)

===================================================
User Interface Design and Web Application Developer
===================================================

The National Evolutionary Synthesis Center (NESCent) seeks a creative  
and enthusiastic individual to design user interfaces and web  
applications for scientific applications. The incumbent will work as  
part of a small informatics team in close collaboration with domain  
scientists.

NESCent is an NSF-funded center dedicated to cross-disciplinary  
research in evolutionary science. Our informatics team works closely  
with visiting and resident scientists to support their custom software  
and database development needs. All NESCent software products are open- 
source, and the Center has a number of initiatives to actively promote  
collaborative development of community software resources  
(informatics.nescent.org). Above all, we are enthusiastic about our  
work, about the mission of the Center, and about the contribution of  
informatics to that mission.

Job description: The incumbent will design and develop user interfaces  
and web applications for databases and other software tools for  
sponsored scientists and staff. The job responsibilities include all  
stages of the software development process, including requirements  
gathering, design, implementation, release packaging and  
documentation, as part of a small team (typically 2-3 individuals)  
following project management best practices. We expect the incumbent  
to present their work at conferences and contribute to publications  
with scientific collaborators; interact regularly with visiting and  
resident scientists, other members of the informatics team and Center  
staff; and generally serve as an expert resource for Center personnel.  
The position provides opportunities for professional development. Most  
informatics staff work at our Durham NC offices, located adjacent to  
Duke University, but we do support a wide range of technologies for  
virtual communication with off-site staff and collaborators.

Required Qualifications:
* Demonstrated success collaborating with clients on custom software  
solutions
* Experience with various stages of the software development cycle
* Expertise in development and testing of user interface designs
* Excellent communication skills, both virtual and face-to-face
* A four-year college degree in Computer Science, Bioinformatics or a  
related field

Preferred Qualifications:
* M.S. or Ph.D. in Computer Science, Bioinformatics or related field  
along with demonstrated interest in science, particularly biology
* Expertise in rapid application development and respective  
programming technologies and languages (e.g., modern scripting  
languages and web-application frameworks such as Python/Django, Ruby/ 
Ruby-on-Rails, and Perl/Catalyst), fluency in Java programming, and  
prior experience in relational database programming (PostgreSQL or  
MySQL)
* Expertise in dynamic and interactive web technologies (JavaScript,  
CGI), web service (SOAP, REST, XML, JSON) and semantic web technologies
* Experience with open-source, and collaborative, software  
development, software usability design and assessment
* Expertise in graphic design, data visualization and/or scientific  
data integration

How to apply: Please send cover letter, resume and contact information  
for three references to Dr. Karen Cranston, Training Coordinator and  
Bioinformatics Project Manager (karen.cranston at nescent.org). Review of  
applications will begin March 21, 2011. Informal inquiries or requests  
for additional information may be directed to Dr. Cranston by email or  
phone (+1-919-613-2275).

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From p.j.a.cock at googlemail.com  Mon Mar  7 09:19:11 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 7 Mar 2011 14:19:11 +0000
Subject: [Biopython] Tutorial proofreading?
Message-ID: <AANLkTikn8kr5PqRRdG6ZAWsO3d6hm2E7so7dRQJNXT41@mail.gmail.com>

Hi all,

We're planning to do the Biopython 1.57 release soon, and
one thing some volunteer help would be useful for is
our documentation - in particular the tutorial.

These links are for the current tutorial, at the time of writing
that means Biopython 1.56:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

These links are for the latest in-progress tutorial (automatically
updated nightly from the git repository):
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf

I would like some volunteers to proofread this please and
report any problems, suggestions or additions.

Ideally I'd like people to check the examples work (although some
will need the latest Biopython installed from the source code).

Even reporting minor typos is useful, as fixing them will
make a better impression for newcomers reading this.

Thanks,

Peter

P.S. The tutorial source file is here, if you are interested,
https://github.com/biopython/biopython/blob/master/Doc/Tutorial.tex

From anaryin at gmail.com  Mon Mar  7 09:21:19 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 7 Mar 2011 15:21:19 +0100
Subject: [Biopython] Tutorial proofreading?
In-Reply-To: <AANLkTikn8kr5PqRRdG6ZAWsO3d6hm2E7so7dRQJNXT41@mail.gmail.com>
References: <AANLkTikn8kr5PqRRdG6ZAWsO3d6hm2E7so7dRQJNXT41@mail.gmail.com>
Message-ID: <AANLkTimyQL88YRh5knJVnB4cT3oA_OeOzs0bakLASaJB@mail.gmail.com>

Will have a look at it this week, I noticed some problems in the Bio.PDB
section (outdated code).

Cheers!

From rmb32 at cornell.edu  Mon Mar  7 11:37:32 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Mon, 07 Mar 2011 11:37:32 -0500
Subject: [Biopython] Google Summer of Code project ideas
Message-ID: <4D7509CC.3040604@cornell.edu>

Hi all,

I'm going to be OBF project admin again this year for Google Summer of 
code.  OBF's application is due later this week, and we need to update 
our project ideas on the OBF wiki page and on each project's individual 
wiki pages.

So, for each of the OBF projects that wants to do GSoC again this year, 
please:

a.) Update the list of project ideas on your project's GSoC page 
(BioPython, BioPerl, BioRuby, etc).  Add new ones, remove ones that have 
already been done or are no longer relevant, etc.

b.) Update the list of project ideas on the main OBF GSoC page 
(http://www.open-bio.org/wiki/Google_Summer_of_Code) to match.

c.) Let me know via email that you have done so and it's ready for 
Google to peruse.

Please have the updates done, if possible, by this Friday (March 11). 
The number and quality of the project ideas are part of the evaluation 
process for whether OBF is accepted as a Summer of Code organization 
again this year, so let's come up with some good ones.  :-)

Rob

----
Robert Buels
(prospective) 2011 OBF GSoC Organization Admin


From p.cherepanov at imperial.ac.uk  Mon Mar  7 21:42:26 2011
From: p.cherepanov at imperial.ac.uk (Peter Cherepanov)
Date: Tue, 8 Mar 2011 02:42:26 +0000
Subject: [Biopython] define circular DNA (?)
Message-ID: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>

is there an easy way to define a circular DNA sequence in BioPython? 

It would be useful to have something like: 

my_seq = Seq('ATGCATGC...ATGC', circular_dna)

am I missing something obvious?? 

Peter


From komalsnehal1991 at gmail.com  Tue Mar  8 02:58:11 2011
From: komalsnehal1991 at gmail.com (Komal S)
Date: Tue, 8 Mar 2011 13:28:11 +0530
Subject: [Biopython] Biopython Projects
Message-ID: <AANLkTi=s_bjATb4qWbu1kP6kpULEjABJ+7HOdbfL6_ka@mail.gmail.com>

Hi everyone,

I'm Komal, a Junior Undergraduate Student from India studying
Bioengineering. I'm a fan of Python and I love Computational Biology and I
plan to do my further studies in the same.
I went through the projects on the Biopython page. I was very much
interested in the RNA Structure project mentioned. Any contribution which I
make will help me a lot and the organisation too. In fact, I am currently
doing a project on RNA Editing. I'll be very happy to integrate my
knowledge.

Please help me on how I should proceed.

Komal

From p.j.a.cock at googlemail.com  Tue Mar  8 03:45:31 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 08:45:31 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
Message-ID: <AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>

On Tue, Mar 8, 2011 at 2:42 AM, Peter Cherepanov wrote:
> is there an easy way to define a circular DNA sequence in BioPython?
>
> It would be useful to have something like:
>
> my_seq = Seq('ATGCATGC...ATGC', circular_dna)
>
> am I missing something obvious??
>
> Peter

No, but how would you expect it to act? We've talked
about such an object before... I'd have to go through my
old emails but I recall there being some annoying corner
cases to consider with the slice method (__getitem__).

Peter

From p.j.a.cock at googlemail.com  Tue Mar  8 05:48:13 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 10:48:13 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <BDAC0EBF-88CD-4088-A919-CE394D87B886@imperial.ac.uk>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
	<AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>
	<BDAC0EBF-88CD-4088-A919-CE394D87B886@imperial.ac.uk>
Message-ID: <AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>

On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
<p.cherepanov at imperial.ac.uk> wrote:
> I suppose if a DNA sequence is kept as a simple Python string, there is
> no easy way to have it "circular". I am a beginner in Python (I use it only
> occasionally, to solve very specific and simple-minded tasks, when manual
> match/cut-and-paste operations become too much of a burden). Having
> spent an extra hour to hack out and debug a piece of code to match/extract
> to/from circular plasmid sequences kept as Python strings, I thought: hey,
> wait a minute, there is such a thing as BioPython, which should have made
> this task so much easier...
>
> Is there a way to "enhance" the Seq object? (or maybe I do not know what
> I am talking about...).
>
> thanks a lot for responding!
>
> with best wishes,
>
> Peter

What I had in mind was a new class, CircularSeq, which would subclass
the current Biopython Seq object, and still use a string internally for the
sequence.

We could then modify the slice behaviour so that, perhaps, this would
work by wrapping the origin:

c = CircularSeq('ACGTACGTACGT')
assert len(c)==12
print c[10:14]

It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
14 as wrapped to 2, returning the four bases GTAC.
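
A minimal sketch of that wrapping slice, assuming a plain str subclass as a stand-in for a Seq subclass so that it runs without Biopython. The class is hypothetical, and the choice to wrap only slices that span at most one full turn is an assumption made here, not settled behaviour.

```python
class CircularSeq(str):
    """Hypothetical circular sequence; str stands in for Bio.Seq.Seq here."""

    def __getitem__(self, index):
        n = len(self)
        if isinstance(index, slice) and index.step in (None, 1):
            start, stop = index.start, index.stop
            wraps = (isinstance(start, int) and isinstance(stop, int)
                     and 0 <= start < n < stop and stop - start <= n)
            if wraps:
                # Walk forward from start, passing through the origin once.
                return "".join(str.__getitem__(self, i % n)
                               for i in range(start, stop))
        # Everything else keeps the ordinary string behaviour.
        return str.__getitem__(self, index)
```

With c = CircularSeq('ACGTACGTACGT'), c[10:14] reads positions 10, 11, 0 and 1 and so gives 'GTAC', while c[10:12] is still just 'GT'.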

Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
same as 'ACGTACGTACGT'[10:] which is the last two letters only.
This means anyone (or more importantly, any code) expecting the
string like behaviour will get a nasty surprise (or a bug).
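
As a concrete case of code that leans on the plain string behaviour, consider a helper that chops a sequence into windows (the helper is made up for illustration). It quietly relies on the final slice being clamped at the end of the string.

```python
def windows(sequence, size):
    """Split into consecutive windows; the last one may be shorter.

    This relies on slices being clamped at the end of the sequence.
    """
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


seq = "ACGTACGTACGT"
clamped_tail = seq[10:14]   # same as seq[10:], the last two letters only
chunks = windows(seq, 5)    # the final window is seq[10:15]
```

On a plain string the final window is the two letters 'GT'. If slices wrapped the origin, the same code would get a five letter window that re-reads the start of the sequence.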

Another example, what about c[-2:]? For a plain string you'd
get the last two letters. For a circular sequence you might think
that should represent starting two before the origin, thus giving
the last two letters plus the whole sequence? Also, c[-2:2] could
mean the last two letters plus the first two letters, but for a
plain python string that returns an empty string.
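
One possible reading of these cases, sketched as a standalone function with a hypothetical name: treat start and stop as positions on the unrolled circle, so a negative start simply counts back from the origin. The stop has to be given explicitly, which sidesteps the question of what an open-ended c[-2:] should mean.

```python
def wrapped_slice(seq, start, stop):
    """Read seq from start up to (not including) stop on the unrolled circle.

    Negative positions count back from the origin; positions past the end
    wrap around. This is one interpretation, not agreed behaviour.
    """
    n = len(seq)
    if n == 0 or stop <= start:
        return ""
    return "".join(seq[i % n] for i in range(start, stop))
```

With 'ACGTACGTACGT', wrapped_slice(seq, -2, 2) gives the last two letters plus the first two, wrapped_slice(seq, -2, 0) gives just the last two, and wrapped_slice(seq, -2, 12) gives the last two letters followed by the whole sequence.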

Note that due to the way Python indexing works, single letter
access is fine for negative indices, c[-2] would give the second
last letter, 'G', which is consistent with wrapped counting back
from the origin. We could also make c[14] wrap round to c[2] in
this length 12 example (although there is a small risk of breaking
code expecting an IndexError in this case).
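
The single letter case is short enough to sketch with the modulo operator. The function name is hypothetical.

```python
def circular_base(seq, index):
    """Return the letter at index, wrapping around the origin either way."""
    if not seq:
        raise IndexError("empty sequence")
    return seq[index % len(seq)]
```

For the length 12 example, circular_base(seq, 14) is the same letter as seq[2], and circular_base(seq, -2) agrees with the ordinary seq[-2]. The plain string still raises IndexError for seq[14], which is the compatibility risk mentioned above.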

There would be lots of other things to implement, like "in" and the
find methods would need to check the substring across the origin.
Then (for nucleotides), we'd need to ensure reverse_complement
and complement also give a CircularSeq, likewise perhaps for the
transcribe and back_transcribe. The translate method is particularly
tricky as you can have an infinite reading frame, which might be
represented as a circular protein sequence?
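
For the "in" and find part, a common trick is to search the sequence with its own first few letters appended, so that a match spanning the origin becomes an ordinary substring match. A sketch with a hypothetical function name follows; refusing substrings longer than one full turn is an assumption made here.

```python
def circular_find(seq, sub):
    """Return the 0-based start of sub in circular seq, or -1 if absent."""
    if not sub:
        return 0
    if len(sub) > len(seq):
        return -1  # assumption: no matches longer than one turn
    # Appending len(sub) - 1 letters is enough for any match across the origin.
    extended = seq + seq[:len(sub) - 1]
    return extended.find(sub)


def circular_contains(seq, sub):
    """Circular counterpart of the `in` operator."""
    return circular_find(seq, sub) != -1
```

For example, 'GAAC' is not a substring of the linear 'ACGTTTGA', but read as a circle it starts at index 6 and runs through the origin.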

All in all, it is quite a lot of work, and there are several tricky bits
where the desired behaviour is not clear cut. Could we come up
with something useful or not?

Peter

P.S. Please CC the mailing list in your replies :)

From p.cherepanov at imperial.ac.uk  Tue Mar  8 05:30:08 2011
From: p.cherepanov at imperial.ac.uk (Peter Cherepanov)
Date: Tue, 8 Mar 2011 10:30:08 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
	<AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>
Message-ID: <503B48D3-61BA-4C77-A441-00942366FFB4@imperial.ac.uk>

I suppose if a DNA sequence is kept as a simple Python string, there is no easy way to have it "circular". I am a beginner in Python (I use it only occasionally, to solve very specific and simple-minded tasks, when manual match/cut-and-paste operations become too much of a burden). Having spent an extra hour to hack out and debug a piece of code to match/extract to/from circular plasmid sequences kept as Python strings, I thought: hey, wait a minute, there is such a thing as BioPython, which should have made this task so much easier... 

Is there a way to "enhance" the Seq object? (or maybe I do not know what I am talking about...). 

thanks a lot for responding!

with best wishes, 

Peter




From moritz.beber at googlemail.com  Tue Mar  8 06:32:44 2011
From: moritz.beber at googlemail.com (Moritz Beber)
Date: Tue, 08 Mar 2011 12:32:44 +0100
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>	<AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>	<BDAC0EBF-88CD-4088-A919-CE394D87B886@imperial.ac.uk>
	<AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>
Message-ID: <4D7613DC.2050506@googlemail.com>


If you just need circular behaviour in a small number of use cases, you
could consider wrapping the sequence in a cycle iterator
http://docs.python.org/release/2.6/library/itertools.html?highlight=cycle#itertools.cycle
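
A small sketch of what that could look like, using itertools.cycle together with itertools.islice. The function name is hypothetical. Note that cycle keeps its own copy of the items it has seen, and islice has to step through everything before the start position, so this suits short sequences better than whole genomes.

```python
from itertools import cycle, islice


def circular_window(seq, start, length):
    """Return `length` letters of seq beginning at 0-based `start`, wrapping."""
    if start < 0 or length < 0:
        raise ValueError("start and length must be non-negative")
    return "".join(islice(cycle(seq), start, start + length))
```

For the twelve letter example 'ACGTACGTACGT', circular_window(seq, 10, 4) reads the last two letters and then the first two.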

From p.j.a.cock at googlemail.com  Tue Mar  8 06:40:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 11:40:08 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <4D7613DC.2050506@googlemail.com>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
	<AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>
	<BDAC0EBF-88CD-4088-A919-CE394D87B886@imperial.ac.uk>
	<AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>
	<4D7613DC.2050506@googlemail.com>
Message-ID: <AANLkTi=mbk6sk2BgQOE0=2HDAwvjPiv82KQH3emeS1FE@mail.gmail.com>

On Tue, Mar 8, 2011 at 11:32 AM, Moritz Beber
<moritz.beber at googlemail.com> wrote:
>
> If you just need circular behaviour in a small number of use cases, you
> could consider wrapping the sequence in a cycle iterator
> http://docs.python.org/release/2.6/library/itertools.html?highlight=cycle#itertools.cycle
>

That might need a lot of memory if used on a long sequence like a
bacterial genome, but an interesting idea.
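
For a handful of short sequences, the idea might look roughly like the
sketch below. It uses only plain Python strings and the standard library;
wrapped_window is a made-up helper name, not anything in Biopython. Note
that cycle keeps its own copy of every item it has yielded, and islice has
to step through the first `start` items one by one, which is where the
cost on a genome-sized sequence would come from.

```python
from itertools import cycle, islice


def wrapped_window(seq, start, length):
    """Return `length` letters of `seq` from `start`, wrapping past the end.

    Hypothetical helper for illustration; `start` must be non-negative
    because islice does not accept negative indices.
    """
    return "".join(islice(cycle(seq), start, start + length))


# Four letters starting two before the end of a length 12 sequence
print(wrapped_window("ACGTACGTACGT", 10, 4))
```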

Peter

From p.cherepanov at imperial.ac.uk  Tue Mar  8 07:12:26 2011
From: p.cherepanov at imperial.ac.uk (Peter Cherepanov)
Date: Tue, 8 Mar 2011 12:12:26 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
	<AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>
	<BDAC0EBF-88CD-4088-A919-CE394D87B886@imperial.ac.uk>
	<AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>
Message-ID: <A14925F6-5CC2-4066-AECE-5D401DDFC3B1@imperial.ac.uk>

Ideally, it would be an object where the last letter is hard-linked to the first. For example, we should be able to define:

c = CircularSeq('ATGCGGGGA')

where:

c[1:9]  equals  ATGCGGGGA   (or, more awkwardly, c[0:9], if the original Python string numbering must be retained for some reasons)
c[8:7]  equals  GAATGCGGG
c[1:1] equals A  (on a python string it is c[0:1]  =  A, of course)

Ideally, we would want to number such sequences from 1; after all, these are the kind of objects we deal with in biology.

And, most importantly of all, it must be possible for
c.find('GGAATG') to return "7"

Peter


On 8 Mar 2011, at 10:48, Peter Cock wrote:

> On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
> <p.cherepanov at imperial.ac.uk> wrote:
>> I suppose if a DNA sequence is kept as a simple Python string, there is
>> no easy way to have it "circular". I am a beginner in Python (I use it only
>> occasionally, to solve very specific and simple-minded tasks, when manual
>> match/cut-and-paste operations become too much of a burden). Having
>> spent an extra hour to hack out and debug a piece of code to match/extract
>> to/from circular plasmid sequences kept as Python strings, I thought: hey,
>> wait a minute, there is such thing as BioPython, which should have made
>> this task so much easier...
>> 
>> Is there a way to "enhance" the Seq object? (or may be I do not know what
>> I am talking about...).
>> 
>> thanks a lot for responding!
>> 
>> with best wishes,
>> 
>> Peter
> 
> What I had in mind was a new class, CircularSeq, which would subclass
> the current Biopython Seq object, and still use a string internally for the
> sequence.
> 
> We could then modify the slice behaviour so that, perhaps, this would
> work by wrapping the origin:
> 
> c = CircularSeq('ACGTACGTACGT')
> assert len(c)==12
> print c[10:14]
> 
> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
> 14 as wrapped to 2, returning the four bases GTAC.
> 
> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
> This means anyone (or more importantly, any code) expecting the
> string like behaviour will get a nasty surprise (or a bug).
> 
> Another example, what about c[-2:]? For a plain string you'd
> get the last two letters. For a circular sequence you might think
> that should represent starting two before the origin, thus giving
> the last two letters plus the whole sequence? Also, c[-2:2] could
> mean the last two letters plus the first two letters, but for a
> plain python string that returns an empty string.
> 
> Note that due to the way Python indexing works, single letter
> access is fine for negative indices, c[-2] would give the second
> last letter, 'G', which is consistent with wrapped counting back
> from the origin. We could also make c[14] wrap round to c[2] in
> this length 12 example (although there is a small risk of breaking
> code expecting an IndexError in this case).
> 
> There would be lots of other things to implement, like "in" and the
> find methods would need to check the substring across the origin.
> Then (for nucleotides), we'd need to ensure reverse_complement
> and complement also give a CircularSeq, likewise perhaps for the
> transcribe and back_transcribe. The translate method is particularly
> tricky as you can have an infinite reading frame, which might be
> represented as a circular protein sequence?
> 
> All in all, it is quite a lot of work, and there are several tricky bits
> where the desired behaviour is not clear cut. Could we come up
> with something useful or not?
> 
> Peter
> 
> P.S. Please CC the mailing list in your replies :)


From p.j.a.cock at googlemail.com  Tue Mar  8 08:24:07 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 13:24:07 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <A14925F6-5CC2-4066-AECE-5D401DDFC3B1@imperial.ac.uk>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
	<AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>
	<BDAC0EBF-88CD-4088-A919-CE394D87B886@imperial.ac.uk>
	<AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>
	<A14925F6-5CC2-4066-AECE-5D401DDFC3B1@imperial.ac.uk>
Message-ID: <AANLkTim=QjaOTiUBxbomFei5r_34CVPy2c0Ytq8MYWGD@mail.gmail.com>

On Tue, Mar 8, 2011 at 12:12 PM, Peter Cherepanov
<p.cherepanov at imperial.ac.uk> wrote:
> Ideally, it would be an object where the last letter is hard-linked to the first. For example, we should be able to define:
>
> c = CircularSeq('ATGCGGGGA')
>
> where:
>
> c[1:9]  equals  ATGCGGGGA   (or, more awkwardly, c[0:9], if the original
> Python string numbering must be retained for some reasons)
> c[8:7]  equals  GAATGCGGG
> c[1:1] equals A  (on a python string it is c[0:1]  =  A, of course)
>
> Ideally, we would want to number such sequences from 1, after all these
> are the kind of objects we deal in biology.

Absolutely not - it would put the circular sequence completely out of
sync with the existing sequence objects in Biopython and the Python
string. Don't worry - you'll get used to zero based counting, and
the Python slicing is very beautiful once you understand it.

> And, most importantly of all, it must be possible for
> c.find('GGAATG') to return "7"
>

Well, 6 in zero based counting, but yes, that would be the expected
result for find (and similarly for rfind). We'd also need to do something
with the split and rsplit methods to include looking for matches over
the origin.
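
As a rough illustration of the find case on plain Python strings (not a
proposed Biopython API; circular_find is a hypothetical name), one simple
trick is to append the first len(sub) - 1 letters to the end before
searching, so that a match can run over the origin. This copies the
string, so it is only a sketch:

```python
def circular_find(seq, sub):
    """Zero-based start of `sub` in the circular string `seq`, or -1.

    Hypothetical helper for illustration only.
    """
    if not sub or len(sub) > len(seq):
        # Empty or over-long patterns are left to the plain string method;
        # matches spanning more than one lap are out of scope here.
        return seq.find(sub)
    # Appending the first len(sub) - 1 letters lets a match cross the origin
    # while still guaranteeing that any hit starts inside the sequence.
    extended = seq + seq[:len(sub) - 1]
    return extended.find(sub)


print(circular_find("ATGCGGGGA", "GGAATG"))
```

With the sequence from the example above, this reports 6 for 'GGAATG'.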

Peter


From Leighton.Pritchard at scri.ac.uk  Tue Mar  8 08:28:11 2011
From: Leighton.Pritchard at scri.ac.uk (Leighton Pritchard)
Date: Tue, 8 Mar 2011 13:28:11 +0000
Subject: [Biopython] define circular DNA (?)
Message-ID: <C99BDF6A.B58B%lpritc@scri.ac.uk>

I've got 2p hanging around, so...

On 08/03/2011 Tuesday, March 8, 10:48, "Peter Cock"
<p.j.a.cock at googlemail.com> wrote:

> On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
> <p.cherepanov at imperial.ac.uk> wrote:
>> I suppose if a DNA sequence is kept as a simple Python string, there is
>> no easy way to have it "circular". I am a beginner in Python (I use it only
>> occasionally, to solve very specific and simple-minded tasks, when manual
>> match/cut-and-paste operations become too much of a burden). Having
>> spent an extra hour to hack out and debug a piece of code to match/extract
>> to/from circular plasmid sequences kept as Python strings, I thought: hey,
>> wait a minute, there is such thing as BioPython, which should have made
>> this task so much easier...
>>
>> Is there a way to "enhance" the Seq object? (or may be I do not know what
>> I am talking about...).
>>
>> thanks a lot for responding!
>>
>> with best wishes,
>>
>> Peter
>
> What I had in mind was a new class, CircularSeq, which would subclass
> the current Biopython Seq object, and still use a string internally for the
> sequence.

That seems sensible.  The main issue, as I see it, is that the physical
object is naturally represented by a circularly-linked list, and we have for
circular sequences an indexing/co-ordinate system with a defined zero
start/end point (which is essentially arbitrary - though is usually the
origin of replication for bacterial chromosomes).  This leads to a conflict
between our natural expectations of Python indexing, and the meaning of the
indexing on the physical object that's being represented.

Whatever the ultimate implementation, there will either have to be a
compromise between these two representations, or one or other view will be
ignored.  There will inevitably be value judgements that someone is unhappy
with ;)

> We could then modify the slice behaviour so that, perhaps, this would
> work by wrapping the origin:
>
> c = CircularSeq('ACGTACGTACGT')
> assert len(c)==12
> print c[10:14]
>
> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
> 14 as wrapped to 2, returning the four bases GTAC.

That makes sense in Python indexing terms, but not in terms of the
co-ordinate system for navigating the circular DNA.  To be consistent with
location information from GenBank and other sources where features wrap the
origin of circular DNA, we would need c[10:2] to return the same result as
c[10:14].  That gives us potentially the same problem as c[-2:2], as it
currently returns an empty string.  We'd have to modify Python
slicing/indexing behaviour quite a bit to implement this 'naturally'.

However, I don't think we should ignore the Python indexing format here,
because we might want the ten bases after the base with co-ordinate 6 with
c[6:6+10], which would give us a physically and conceptually sensible linear
sequence that crosses the origin.

We'd probably want to do the obvious things with modular arithmetic, so that
we don't return, say, three concatenated linearised circular sequences to a
request like c[0:36] or c[6:42].

> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
> This means anyone (or more importantly, any code) expecting the
> string like behaviour will get a nasty surprise (or a bug).

I'm not sure it's wise to constrain functionality and adequate
representation of a (very important! - showing my bacterial bias) physical
structure to maintain that level of consistency with String.  For instance,
what would CircularSeq + Seq mean?  Physically, and conceptually, not a lot.
So we might want to deprecate the __add__ method for this object - not
typical String behaviour but, in my opinion, appropriate.

(You might remember that I was also generally not in favour of treating Seq
objects as idealised Strings, so there's another bias for you ;) )

> Note that due to the way Python indexing works, single letter
> access is fine for negative indices, c[-2] would give the second
> last letter, 'G', which is consistent with wrapped counting back
> from the origin. We could also make c[14] wrap round to c[2] in
> this length 12 example (although there is a small risk of breaking
> code expecting an IndexError in this case).

I wouldn't be in favour of that behaviour in a general sense, though I don't
see how to avoid it cleanly.  I think it would be best to be strict with
indexing to the co-ordinate system to avoid possible degeneracy of feature
locations.  If we had a SNP at position 2, we could equally well associate
it with any one of an infinite number of positions kl+2 where k is an
integer and l is the sequence length, without modifying the computational
result.  I'm not keen on that kind of woolliness, but I think that it could
possibly be avoided by modifying indexing to require at least one index that
lies in the range [-l,l], and using modular arithmetic for slicing so that,
for the example above, c[18:26] would not be treated as the valid slice
c[6:14], but would instead throw an IndexError.

> There would be lots of other things to implement, like "in" and the
> find methods would need to check the substring across the origin.
> Then (for nucleotides), we'd need to ensure reverse_complement
> and complement also give a CircularSeq, likewise perhaps for the
> transcribe and back_transcribe.

Not to mention the other Biopython functions/methods that expect String-like
indexing.  Maybe a cast (of sorts) between CircularSeq and Seq would be
useful for that, though I can imagine great problems, there.

> The translate method is particularly
> tricky as you can have an infinite reading frame, which might be
> represented as a circular protein sequence?

I would think that the test for that particular condition should be fairly
straightforward (is there at least one stop codon in each of the six frames,
taking into account the origin?).

> All in all, it is quite a lot of work, and there are several tricky bits
> where the desired behaviour is not clear cut. Could we come up
> with something useful or not?

I think that there's every possibility of coming up with something useful -
the question is to what degree it fits the Biopython/Python idiom, or 'looks
like' the physical object, and whether it gets included in Biopython.

L.

--
Dr Leighton Pritchard MRSC
Plant Pathology Programme, SCRI (C block)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel: No telephone during office refurbishment


From p.j.a.cock at googlemail.com  Tue Mar  8 08:58:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 13:58:03 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <C99BDF6A.B58B%lpritc@scri.ac.uk>
References: <C99BDF6A.B58B%lpritc@scri.ac.uk>
Message-ID: <AANLkTinQC3aNGy2mzUAvnpheC3QiW1KrQnYzS3jYAKsq@mail.gmail.com>

On Tue, Mar 8, 2011 at 1:28 PM, Leighton Pritchard
<Leighton.Pritchard at scri.ac.uk> wrote:
> I've got 2p hanging around, so...
>
> On 08/03/2011 Tuesday, March 8, 10:48, "Peter Cock"
> <p.j.a.cock at googlemail.com> wrote:
>>
>> What I had in mind was a new class, CircularSeq, which would subclass
>> the current Biopython Seq object, and still use a string internally for the
>> sequence.
>
> That seems sensible.  The main issue, as I see it, is that the physical
> object is naturally represented by a circularly-linked list, and we have for
> circular sequences an indexing/co-ordinate system with a defined zero
> start/end point (which is essentially arbitrary - though is usually the
> origin of replication for bacterial chromosomes).  This leads to a conflict
> between our natural expectations of Python indexing, and the meaning of the
> indexing on the physical object that's being represented.
>
> Whatever the ultimate implementation, there will either have to be a
> compromise between these two representations, or one or other view will be
> ignored.  There will inevitably be value judgements that someone is unhappy
> with ;)

Indeed.

>> We could then modify the slice behaviour so that, perhaps, this would
>> work by wrapping the origin:
>>
>> c = CircularSeq('ACGTACGTACGT')
>> assert len(c)==12
>> print c[10:14]
>>
>> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
>> 14 as wrapped to 2, returning the four bases GTAC.
>
> That makes sense in Python indexing terms, but not in terms of the
> co-ordinate system for navigating the circular DNA.  To be consistent with
> location information from GenBank and other sources where features wrap the
> origin of circular DNA, we would need c[10:2] to return the same result as
> c[10:14].  That gives us potentially the same problem as c[-2:2], as it
> currently returns an empty string.  We'd have to modify Python
> slicing/indexing behaviour quite a bit to implement this 'naturally'.
>
> However, I don't think we should ignore the Python indexing format here,
> because we might want the ten bases after the base with co-ordinate 6 with
> c[6:6+10], which would give us a physically and conceptually sensible linear
> sequence that crosses the origin.

I think we agree that c[10:14] and c[10:10+4] should give the four bases
GTAC wrapping the origin when c is circular sequence ACGTACGTACGT,
equivalently c[10:12] + c[0:2] using Python slicing.

Likewise for your example c[6:6+10] or c[6:16] this should give ten bases
wrapping the origin, equivalently c[6:12] + c[0:4] using Python slicing.
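
A minimal sketch of that slicing rule on plain strings follows, assuming
each index is simply taken modulo the length. circular_slice is a made-up
function name rather than a proposed method. It takes the "concatenate"
side of the c[0:36] question, and it still returns an empty string for
stop <= start, so the GenBank-style c[10:2] case is deliberately not
handled here.

```python
def circular_slice(seq, start, stop):
    """Letters start..stop-1 of the circular string `seq`, wrapping as needed.

    Hypothetical helper for illustration only.
    """
    n = len(seq)
    if n == 0 or stop <= start:
        # Same as a plain string: nothing to return for an empty range.
        return ""
    # Each index is reduced modulo the length, so ranges longer than one
    # lap give repeated copies of the sequence.
    return "".join(seq[i % n] for i in range(start, stop))


c = "ACGTACGTACGT"
print(circular_slice(c, 10, 14))   # same as c[10:12] + c[0:2]
print(circular_slice(c, 6, 16))    # same as c[6:12] + c[0:4]
```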

> We'd probably want to do the obvious things with modular arithmetic, so that
> we don't return, say, three concatenated linearised circular sequences to a
> request like c[0:36] or c[6:42].

I disagree, returning the three concatenated linearised circular sequences
is what I would expect. This is one of the debatable issues that will divide
people. Consider the (special and artificial) case of a circular plasmid with
an ORF wrapping round the origin (once, twice or infinitely often): the ORF
sequence is longer than the linearised plasmid, so slicing with concatenation
would be useful, e.g.

http://www.ncbi.nlm.nih.gov/pubmed/9740124
Perriman and Ares (1998), Circular mRNA can direct translation of
extremely long repeating-sequence proteins in vivo.

and:

http://dx.doi.org/10.1385/1-59259-280-5:069
Perriman (2002), Circular mRNA Encoding for Monomeric and
Polymeric Green Fluorescent Protein

(Very cool work)

>> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
>> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
>> This means anyone (or more importantly, any code) expecting the
>> string like behaviour will get a nasty surprise (or a bug).
>
> I'm not sure it's wise to constrain functionality and adequate
> representation of a (very important! - showing my bacterial bias) physical
> structure to maintain that level of consistency with String.  For instance,
> what would CircularSeq + Seq mean?  Physically, and conceptually, not a lot.
> So we might want to deprecate the __add__ method for this object - not
> typical String behaviour but, in my opinion, appropriate.

We'd probably want to make addition of CircularSeq + Seq raise a
TypeError. Or do a linearisation and simple addition with a warning?

> (You might remember that I was also generally not in favour of treating
> Seq objects as idealised Strings, so there's another bias for you ;) )

I recall :)

>> Note that due to the way Python indexing works, single letter
>> access is fine for negative indices, c[-2] would give the second
>> last letter, 'G', which is consistent with wrapped counting back
>> from the origin. We could also make c[14] wrap round to c[2] in
>> this length 12 example (although there is a small risk of breaking
>> code expecting an IndexError in this case).
>
> I wouldn't be in favour of that behaviour in a general sense, though I don't
> see how to avoid it cleanly. I think it would be best to be strict with
> indexing to the co-ordinate system to avoid possible degeneracy of feature
> locations.  If we had a SNP at position 2, we could equally well associate
> it with any one of an infinite number of positions kl+2 where k is an
> integer and l is the sequence length, without modifying the computational
> result.

Yes, I was suggesting we could make c[x+n*length] act as c[x],
i.e. for *single* indexes which return one letter, apply the modulo
arithmetic. Or, we leave this to follow the current Python string
behaviour where if the index is equal to the length or more, you
get an IndexError. That avoids the ambiguity ;)

> I'm not keen on that kind of woolliness, but I think that it could
> possibly be avoided by modifying indexing to require at least one index that
> lies in the range [-l,l], and using modular arithmetic for slicing so that,
> for the example above, c[18:26] would not be treated as the valid slice
> c[6:14], but would instead throw an IndexError.

This depends on the treatment of things like c[0:36] or c[6:42]
discussed above (return 36 bases, or just 12?).

>> There would be lots of other things to implement, like "in" and the
>> find methods would need to check the substring across the origin.
>> Then (for nucleotides), we'd need to ensure reverse_complement
>> and complement also give a CircularSeq, likewise perhaps for the
>> transcribe and back_transcribe.
>
> Not to mention the other Biopython functions/methods that expect String-like
> indexing.  Maybe a cast (of sorts) between CircularSeq and Seq would be
> useful for that, though I can imagine great problems, there.

Having a toseq method like the MutableSeq does could handle that,
returning a traditional linear Seq object. If the CircularSeq 'breaks'
too much expected string-like behaviour that would be important.

>> The translate method is particularly
>> tricky as you can have an infinite reading frame, which might be
>> represented as a circular protein sequence?
>
> I would think that the test for that particular condition should be fairly
> straightforward (is there at least one stop codon in each of the six frames,
> taking into account the origin?).

Having thought about this example at length before, it can be done
but I don't think it is all that straightforward ;)
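
To give a flavour of it, here is a sketch of the forward strand part
only, on a plain upper case DNA string with the standard stop codons
(both are assumptions, and has_endless_frame is a hypothetical name).
Reading n codons from a start position on a circle of length n always
returns to the same position in the same phase, whether or not n is a
multiple of three, so checking those n codons is enough. The reverse
strand would need the same check on the reverse complement.

```python
# Standard genetic code stop codons, DNA alphabet (assumption).
STOP_CODONS = frozenset(["TAA", "TAG", "TGA"])


def has_endless_frame(seq, start):
    """True if reading codons from `start` round the circle never hits a stop.

    Hypothetical helper for illustration; forward strand only.
    """
    n = len(seq)
    if n == 0:
        return False
    # After n codons (3 * n letters) we are back at `start` in the same
    # phase, so these n codons are all the codons the frame ever visits.
    for k in range(n):
        offset = start + 3 * k
        codon = "".join(seq[(offset + j) % n] for j in range(3))
        if codon in STOP_CODONS:
            return False
    return True


print(has_endless_frame("ATGGCC", 0))
print(has_endless_frame("ATGTAA", 0))
```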

>> All in all, it is quite a lot of work, and there are several tricky bits
>> where the desired behaviour is not clear cut. Could we come up
>> with something useful or not?
>
> I think that there's every possibility of coming up with something useful -
> the question is to what degree it fits the Biopython/Python idiom, or 'looks
> like' the physical object, and whether it gets included in Biopython.
>
> L.

Agreed.

Peter


From anaryin at gmail.com  Tue Mar  8 16:39:07 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 8 Mar 2011 22:39:07 +0100
Subject: [Biopython] PDBParser Class --> Output
In-Reply-To: <AANLkTi=og=3VA68hfhnc4jTeJamJ6me7RPKmraqFzvWH@mail.gmail.com>
References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu>
	<AANLkTi=_pCV8-ATaKcVzV40NLmfUs76UDVR4KgYMCMGi@mail.gmail.com>
	<8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu>
	<AANLkTinAFONUdnxps8X2M1toKsKPrR2BjabDSs0vS23Q@mail.gmail.com>
	<95E27938-F262-4F25-AF29-FBE387DB8782@gmail.com>
	<AANLkTimne1YbfSmyLxmXQE8pcrYM96vQPB5O2+iGxCt0@mail.gmail.com>
	<AANLkTi=og=3VA68hfhnc4jTeJamJ6me7RPKmraqFzvWH@mail.gmail.com>
Message-ID: <AANLkTikC3wYvStX+TD7b3po3sD7TQpM4qxKQa+MgA0DZ@mail.gmail.com>

Back to this question. Haven't had much time to look at it and it turned out
to be a bit more complicated than what I thought. Permissive is an attribute
of the PDBParser module and since the assignment takes place in the Atom
module I don't see a straightforward way of pulling this off.

However, and although there is the very simple solution of playing with the
warnings module, the solution I offer is to allow a second level of
"permissiveness" (PERMISSIVE=2) where all warnings are suppressed.

Cheers,

J

From laserson at mit.edu  Tue Mar  8 22:07:54 2011
From: laserson at mit.edu (Uri Laserson)
Date: Tue, 8 Mar 2011 22:07:54 -0500
Subject: [Biopython] SeqRecord subclassing or composition
Message-ID: <AANLkTik=yyZc6Y0He0mPH_1wMk7ZHrkBMWhxDAj8Bq5z@mail.gmail.com>

I am trying to implement a data type for my work.  Each object will have a
sequence (derived from a single read) and lots of annotations and features.
However, I want to implement some extra interface that is problem-specific
to make my analysis more convenient.

I am debating whether to subclass SeqRecord and simply implement the extra
interface or define a new object that wraps a SeqRecord object and pass on
the subset of native SeqRecord calls and/or simply access the underlying
SeqRecord directly.

One additional factor is that I want to be able to read/write INSDC-style
files for the data (e.g., GenBank).  Therefore, if I use the SeqIO parser,
it will return native SeqRecords.  If I go the inheritance route, how do I
cast a SeqRecord object to my new subclass?

So, I am debating between inheritance

class ImmuneChain(SeqRecord):
    def __init__(self, *args, **kw):
        SeqRecord.__init__(self,*args,**kw)
        # But how do I cast a SeqRecord to an ImmuneChain?


or composition

class ImmuneChain(object):
    def __init__(self, *args, **kw):
        if isinstance(args[0],SeqRecord):
            self._record = args[0]
        else:
            # Initialize the underlying SeqRecord manually
            self._record.seq = ...


Any thoughts?

Thanks!
Uri


...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu

From p.j.a.cock at googlemail.com  Wed Mar  9 04:04:26 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 9 Mar 2011 09:04:26 +0000
Subject: [Biopython] SeqRecord subclassing or composition
In-Reply-To: <AANLkTik=yyZc6Y0He0mPH_1wMk7ZHrkBMWhxDAj8Bq5z@mail.gmail.com>
References: <AANLkTik=yyZc6Y0He0mPH_1wMk7ZHrkBMWhxDAj8Bq5z@mail.gmail.com>
Message-ID: <AANLkTi=qPJuK+s6YmeA7pi3NLQhULs8UvA=GxNz7u755@mail.gmail.com>

On Wed, Mar 9, 2011 at 3:07 AM, Uri Laserson <laserson at mit.edu> wrote:
> I am trying to implement a data type for my work.  Each object will have a
> sequence (derived from a single read) and lots of annotations and features.
> However, I want to implement some extra interface that is problem-specific
> to make my analysis more convenient.
>
> I am debating whether to subclass SeqRecord and simply implement the extra
> interface or define a new object that wraps a SeqRecord object and pass on
> the subset of native SeqRecord calls and/or simply access the underlying
> SeqRecord directly.
>
> One additional factor is that I want to be able to read/write INSDC-style
> files for the data (e.g., GenBank).  Therefore, if I use the SeqIO parser,
> it will return native SeqRecords.  If I go the inheritance route, how do I
> cast a SeqRecord object to my new subclass?

There is (currently at least) no option in SeqIO parse/read
to override the use of the SeqRecord object. So you'd need
code to 'upgrade' a SeqRecord into your class. Probably
the simplest route would be for its __init__ method to
take a single argument (a SeqRecord). Then you could
have:

def my_parse(...):
    for seq_record in SeqIO.parse(...):
        yield MyClass(seq_record)

def my_read(...):
    return MyClass(SeqIO.read(...))

etc

> So, I am debating between inheritance
>
> class ImmuneChain(SeqRecord):
>     def __init__(self, *args, **kw):
>         SeqRecord.__init__(self,*args,**kw)
>         # But how do I cast a SeqRecord to an ImmuneChain?

Unless you modify the methods/attributes too much, an
ImmuneChain subclass of SeqRecord should be usable
as is with SeqIO.write etc. You don't need to 'cast'.

Also note the above __init__ method can be more specific,
you might have say 10 init args for ImmuneChain,  only
some of which you pass to the SeqRecord init.

You could even have a single __init__ argument of a
SeqRecord, and copy all its attributes.

> or composition
>
> class ImmuneChain(object):
>     def __init__(self, *args, **kw):
>         if isinstance(args[0],SeqRecord):
>             self._record = args[0]
>         else:
>             # Initialize the underlying SeqRecord manually
>             self._record.seq = ...

With the above approach you'd have to pass the
private record to SeqIO.write etc (anything which
needs a SeqRecord). That could be done inside
methods of the ImmuneChain object (e.g. you
could expose the format method of the SeqRecord).
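
Roughly like the sketch below. So that it runs without Biopython
installed, it uses a toy stand-in class (ToyRecord, invented here) in
place of SeqRecord; with the real thing you would wrap the SeqRecord
itself. The __getattr__ fallback is just one possible way to pass
through the remaining attributes, not something Biopython requires.

```python
class ToyRecord(object):
    """Toy stand-in for SeqRecord, for illustration only."""

    def __init__(self, seq, id="<unknown id>"):
        self.seq = seq
        self.id = id

    def format(self, fmt):
        if fmt != "fasta":
            raise ValueError("the toy stand-in only knows 'fasta'")
        return ">%s\n%s\n" % (self.id, self.seq)


class ImmuneChain(object):
    """Composition: keep the record private and delegate to it."""

    def __init__(self, record):
        self._record = record

    def format(self, fmt):
        # Explicitly exposed method, passed straight to the wrapped record.
        return self._record.format(fmt)

    def __getattr__(self, name):
        # Only called when normal lookup fails; the guard avoids endless
        # recursion if _record itself has not been set yet.
        if name == "_record":
            raise AttributeError(name)
        return getattr(self._record, name)


chain = ImmuneChain(ToyRecord("ACGT", id="read1"))
print(chain.format("fasta"))
```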

>
> Any thoughts?
>

You could alternatively go for a procedural style where
you write your code as functions taking SeqRecord
objects (perhaps expecting particular information in
the annotation).

Peter


From komalsnehal1991 at gmail.com  Wed Mar  9 05:49:23 2011
From: komalsnehal1991 at gmail.com (Komal S)
Date: Wed, 9 Mar 2011 02:49:23 -0800
Subject: [Biopython] ::Biopython Project
Message-ID: <AANLkTikDHtLdbyih7u3Jiy76Puvpr7Lh995-aBX-9jKu@mail.gmail.com>

Hi everyone,

I'm Komal, a Junior Undergraduate Student from India
studying Bioengineering. I'm a fan of Python and I love Computational
Biology and I plan to do my further studies in the same.
I went through the projects on the Biopython page. I was very
much interested in the RNA Structure project mentioned. Any contribution
which I make will help me a lot and the organisation too. In fact, I am
currently doing a project on RNA Editing. I'll be very happy to integrate
my knowledge.

In fact, I have been trying to contact people on #obf-soc IRC. I think there
is no separate IRC for Biopython.

Please help me on how I should proceed.


Komal

From laserson at mit.edu  Wed Mar  9 10:28:22 2011
From: laserson at mit.edu (Uri Laserson)
Date: Wed, 9 Mar 2011 10:28:22 -0500
Subject: [Biopython] SeqRecord subclassing or composition
In-Reply-To: <AANLkTi=qPJuK+s6YmeA7pi3NLQhULs8UvA=GxNz7u755@mail.gmail.com>
References: <AANLkTik=yyZc6Y0He0mPH_1wMk7ZHrkBMWhxDAj8Bq5z@mail.gmail.com>
	<AANLkTi=qPJuK+s6YmeA7pi3NLQhULs8UvA=GxNz7u755@mail.gmail.com>
Message-ID: <AANLkTi=J-Ok+Frzm4DkPvHJjR5rCY-U9oHR-aC4tDXmO@mail.gmail.com>

>
> Unless you modify the methods/attributes too much, an
> ImmuneChain subclass of SeqRecord should be usable
> as is with SeqIO.write etc. You don't need to 'cast'.
>

I'm more worried about parsing than writing.  As you mentioned, I will have
to upgrade my SeqRecord object to an ImmuneChain object.

So maybe the best approach is a combination of the two code snippets I
included.  It would subclass SeqRecord, and then manually check whether I am
initializing with a pre-existing SeqRecord or just data:

from Bio.SeqRecord import SeqRecord

class ImmuneChain(SeqRecord):
    def __init__(self, *args, **kw):
        if args and isinstance(args[0], SeqRecord):
            # If initializing with a SeqRecord, manually transfer the data,
            # based on the initializer for SeqRecord (http://goo.gl/X95Zf)
            record = args[0]
            SeqRecord.__init__(self, record.seq, id=record.id,
                               name=record.name,
                               description=record.description,
                               dbxrefs=record.dbxrefs,
                               features=record.features,
                               annotations=record.annotations,
                               letter_annotations=record.letter_annotations)
        else:
            # Assume I'm initializing just like a regular SeqRecord:
            SeqRecord.__init__(self, *args, **kw)

        # Finally, I perform any problem-specific additional initializations
        # here.
        pass

Does this seem like a good solution?

Also, do you think that it would make sense to make a deep copy of the
SeqRecord object before I use it to initialize the ImmuneChain?

Uri

From p.j.a.cock at googlemail.com  Wed Mar  9 10:32:50 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 9 Mar 2011 15:32:50 +0000
Subject: [Biopython] SeqRecord subclassing or composition
In-Reply-To: <AANLkTi=J-Ok+Frzm4DkPvHJjR5rCY-U9oHR-aC4tDXmO@mail.gmail.com>
References: <AANLkTik=yyZc6Y0He0mPH_1wMk7ZHrkBMWhxDAj8Bq5z@mail.gmail.com>
	<AANLkTi=qPJuK+s6YmeA7pi3NLQhULs8UvA=GxNz7u755@mail.gmail.com>
	<AANLkTi=J-Ok+Frzm4DkPvHJjR5rCY-U9oHR-aC4tDXmO@mail.gmail.com>
Message-ID: <AANLkTikfGO9s+GdDpG00CkXAg0B=Ch2EYi0oL78fGd6O@mail.gmail.com>

On Wed, Mar 9, 2011 at 3:28 PM, Uri Laserson <laserson at mit.edu> wrote:
>> Unless you modify the methods/attributes too much, an
>> ImmuneChain subclass of SeqRecord should be usable
>> as is with SeqIO.write etc. You don't need to 'cast'.
>
> I'm more worried about parsing than writing.  As you mentioned, I will have
> to upgrade my SeqRecord object to an ImmuneChain object.
> So maybe the best approach is a combination of the two code snippets I
> included.  It would subclass SeqRecord, and then manually check whether I am
> initializing with a pre-existing SeqRecord or just data:
> class ImmuneChain(SeqRecord):
>     def __init__(self, *args, **kw):
>         if args and isinstance(args[0], SeqRecord):
>             # if initializing with SeqRecord, then manually transfer the data
>             # based on the initializer for SeqRecord (http://goo.gl/X95Zf)
>             record = args[0]
>             SeqRecord.__init__(self, record.seq, id=record.id, name=record.name,
>                      description=record.description, dbxrefs=record.dbxrefs,
>                      features=record.features,
>                      annotations=record.annotations,
>                      letter_annotations=record.letter_annotations)
>         else:
>             # assume I'm initializing just like a regular SeqRecord:
>             SeqRecord.__init__(self, *args, **kw)
>
>         # Finally, I perform any problem-specific additional initializations
>         # here.
>         pass
> Does this seem like a good solution?

I think it will work.

> Also, do you think that it would make sense to make a deep copy of the
> SeqRecord object before I use it to initialize the ImmuneChain?

Assuming you will be discarding the original SeqRecord, then I see
no reason to make a deep copy. It will just slow things down.

Peter


From jvb at Cs.Nott.AC.UK  Wed Mar  9 10:33:28 2011
From: jvb at Cs.Nott.AC.UK (Jonathan Blakes)
Date: Wed, 09 Mar 2011 15:33:28 +0000
Subject: [Biopython] back-translation method for Seq object?
Message-ID: <4D779DC8.8090704@cs.nott.ac.uk>

This is a reply to an old thread (October 2008), but I thought someone 
might find it useful.

In that thread, discussing the representation of back-translations using
ambiguous bases to avoid the combinatorial explosion of an all-possibilities
back-translation, Bruce Southey gave a table similar to the one below,
but some of the ambiguous codons were incorrect, or were too ambiguous
and covered more than one amino acid. The codons for stop (*) were also
missing. Some were corrected later in the thread but not all.

Here are the correct ambiguous codons for the standard genetic code:

* = TAG, TAA, TGA                = TAR, TGA
A = GCT, GCC, GCA, GCG           = GCN
C = TGT, TGC                     = TGY
D = GAT, GAC                     = GAY
E = GAA, GAG                     = GAR
F = TTT, TTC                     = TTY
G = GGT, GGC, GGA, GGG           = GGN
H = CAT, CAC                     = CAY
I = ATT, ATC, ATA                = ATH
K = AAA, AAG                     = AAR
L = TTA, TTG, CTT, CTC, CTA, CTG = TTR, CTN
M = ATG                          = ATG
N = AAT, AAC                     = AAY
P = CCT, CCC, CCA, CCG           = CCN
Q = CAA, CAG                     = CAR
R = CGT, CGC, CGA, CGG, AGA, AGG = CGN, AGR
S = TCT, TCC, TCA, TCG, AGT, AGC = TCN, AGY
T = ACT, ACC, ACA, ACG           = ACN
V = GTT, GTC, GTA, GTG           = GTN
W = TGG                          = TGG
Y = TAT, TAC                     = TAY

Even though this is still not a one-to-one mapping in 4/21 cases, the
combinatorial explosion is significantly decreased. For example, the protein
ACDEFGHIKLMNPQRSTVWY* has 1,019,215,872 unambiguous back-translations.
Using the code above it has 16, or generally 2^(L+R+S+*), where L, R, S
and * are the numbers of those residues in the protein.
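
As a quick check of those two figures, here is a minimal Python sketch. The
table literal and the helper name count_back_translations are my own
(hypothetical), not part of Biopython; the ambiguous counts are simply read
off the table above.

```python
# Standard genetic code, one letter per codon, in NCBI order
# (bases T, C, A, G; first base varies slowest).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# Number of unambiguous codons for each amino acid ('*' is stop).
codon_counts = {}
for aa in AMINO_ACIDS:
    codon_counts[aa] = codon_counts.get(aa, 0) + 1

# Number of ambiguous codons per amino acid, as listed in the table above:
# one for most, two for L, R, S and stop.
ambiguous_counts = {aa: 1 for aa in codon_counts}
for aa in "LRS*":
    ambiguous_counts[aa] = 2


def count_back_translations(protein, counts):
    """Multiply the number of codon choices available at each residue."""
    total = 1
    for aa in protein:
        total *= counts[aa]
    return total


protein = "ACDEFGHIKLMNPQRSTVWY*"
print(count_back_translations(protein, codon_counts))      # 1019215872
print(count_back_translations(protein, ambiguous_counts))  # 16
```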

If anyone has an algorithm for determining the set of non-overlapping 
ambiguous codons from any codon table I would like to know. Thanks,
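
One possible approach, offered only as a sketch: for each amino acid, list
every IUPAC ambiguous codon whose expansion stays inside that amino acid's
codon set, then greedily pick the largest ones that do not overlap. The
function names expand and ambiguous_codons are hypothetical, and a greedy
cover is not guaranteed to give the smallest possible set for every codon
table, although it reproduces the table above for the standard code.

```python
from itertools import product

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(ambiguous_codon):
    """Return the set of plain codons matched by an ambiguous codon."""
    return {"".join(c) for c in product(*(IUPAC[x] for x in ambiguous_codon))}


def ambiguous_codons(codons):
    """Greedily cover `codons` with non-overlapping ambiguous codons."""
    remaining = set(codons)
    # Candidates: ambiguous codons whose expansion lies entirely within
    # this amino acid's codon set (so they never match another amino acid).
    candidates = []
    for letters in product(sorted(IUPAC), repeat=3):
        amb = "".join(letters)
        matched = expand(amb)
        if matched <= remaining:
            candidates.append((amb, matched))
    # Largest first; ties broken alphabetically so the result is repeatable.
    candidates.sort(key=lambda item: (-len(item[1]), item[0]))
    chosen = []
    for amb, matched in candidates:
        if matched <= remaining:  # every codon still unused, so no overlap
            chosen.append(amb)
            remaining -= matched
    return chosen


# Group the standard genetic code (NCBI order, bases T, C, A, G) by amino acid.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
groups = {}
for letters, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS):
    groups.setdefault(aa, set()).add("".join(letters))

for aa in sorted(groups):
    print(aa, ambiguous_codons(groups[aa]))
```

Plain codons are always candidates, so the loop cannot leave a codon
uncovered. The same function could be fed the codon groups of any other
table, but I have only checked it by hand against the standard one.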

Jon

-- 
Jonathan Blakes
School of Computer Science
University of Nottingham

From rasi at seas.harvard.edu  Wed Mar  9 17:57:30 2011
From: rasi at seas.harvard.edu (Arvind Subramaniam)
Date: Wed, 9 Mar 2011 17:57:30 -0500
Subject: [Biopython] .ab1 file parser in biopython?
Message-ID: <AANLkTin_q6vdKHeitaZU0RtuO3=YXuBwyQ5yptJOUErT@mail.gmail.com>

Hi
 I am new to biopython so please excuse me if this issue is obviously
simple. I am trying to parse .ab1 sequencing trace files in Biopython
and I cannot find the right module or method to do this job. Can
someone suggest how I can parse .ab1 files?
Thanks,
Arvind.

From cmckay at u.washington.edu  Wed Mar  9 20:09:55 2011
From: cmckay at u.washington.edu (Cedar McKay)
Date: Wed, 9 Mar 2011 17:09:55 -0800
Subject: [Biopython] "raw" genbank locations?
Message-ID: <A3104CC4-4890-4303-8118-EA62309D445D@u.washington.edu>

Hello all. Biopython continues to be a lifesaver.

I'm trying to get the "raw" genbank locations for a downstream application after parsing a genbank file. Is there any way to get at this (or reproduce it)? As it is, the SeqRecord feature has start and stop information for the whole feature, and a list of sub-features each with its own start and stops. I'm looking for one concise text string that describes the entire feature location, much like the original raw genbank locations do.

I searched the archives, but nothing popped into view.

Thanks for your help!

best,
Cedar


From chapmanb at 50mail.com  Wed Mar  9 21:05:45 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 9 Mar 2011 21:05:45 -0500
Subject: [Biopython] "raw" genbank locations?
In-Reply-To: <A3104CC4-4890-4303-8118-EA62309D445D@u.washington.edu>
References: <A3104CC4-4890-4303-8118-EA62309D445D@u.washington.edu>
Message-ID: <20110310020545.GA2185@kunkel>

Cedar;
Glad to hear Biopython has been helping out with your work.

> I'm trying to get the "raw" genbank locations for a downstream
> application after parsing a genbank file. Is there any way to get at
> this (or reproduce it)? As it is, the SeqRecord feature has start and
> stop information for the whole feature, and a list of sub-features
> each with its own start and stops. I'm looking for one concise text
> string that describes the entire feature location, much like the
> original raw genbank locations do.

You can do this with the GenBank RecordParser, which doesn't parse
the location strings:

>>> from Bio.GenBank import RecordParser
>>> parser = RecordParser()
>>> handle = open("NT_019265.gb")
>>> rec = parser.parse(handle)
>>> for f in rec.features:
...     print f.location
... 
1..1250660
1..3290
215902..365470
217508
join(342430..342515,363171..363300,365741..365814,376398..376499,390169..390297,391257..391379,392606..392679,398230..398419,399082..399167,399534..399650,405844..405913,406704..406761,406868..407010,407962..408091,408508..409092)

If you have SeqRecord objects from SeqIO you can do this in an ugly
way by reaching into the internals of the GenBank writer:

>>> from Bio import SeqIO
>>> from Bio.SeqIO import InsdcIO
>>> handle = open("NT_019265.gb")
>>> for rec in SeqIO.parse(handle, "genbank"):
...     for f in rec.features:
...         print InsdcIO._insdc_feature_location_string(f, len(rec.seq))
... 
1..1250660
1..3290
215902..365470
217508
join(342430..342515,363171..363300,365741..365814,376398..376499,390169..390297,391257..391379,392606..392679,398230..398419,399082..399167,399534..399650,405844..405913,406704..406761,406868..407010,407962..408091,408508..409092)

That might work for a quick hack but is not necessarily future-proof
if the internals change. Peter, do you think this would be useful to
expose as a function of a SeqFeature directly, so you could do
feature.insdc_string() or something similar?

Brad

From p.j.a.cock at googlemail.com  Thu Mar 10 03:57:20 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 08:57:20 +0000
Subject: [Biopython] "raw" genbank locations?
In-Reply-To: <20110310020545.GA2185@kunkel>
References: <A3104CC4-4890-4303-8118-EA62309D445D@u.washington.edu>
	<20110310020545.GA2185@kunkel>
Message-ID: <AANLkTik9aEnq9F-v8SGmsT7-4ND0bTHZG_xtL6uKN3d0@mail.gmail.com>

On Thu, Mar 10, 2011 at 2:05 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Cedar;
> Glad to hear Biopython has been helping out with your work.
>
>> I'm trying to get the "raw" genbank locations for a downstream
>> application after parsing a genbank file. Is there any way to get at
>> this (or reproduce it)? As it is, the SeqRecord feature has start and
>> stop information for the whole feature, and a list of sub-features
>> each with its own start and stops. I'm looking for one concise text
>> string that describes the entire feature location, much like the
>> original raw genbank locations do.
>
> You can do this with the GenBank RecordParser, which doesn't parse
> the location strings:
>
>>>> from Bio.GenBank import RecordParser
>>>> parser = RecordParser()
>>>> handle = open("NT_019265.gb")
>>>> rec = parser.parse(handle)
>>>> for f in rec.features:
> ...     print f.location
> ...
> <cut>
>
> If you have SeqRecord objects from SeqIO you can do this in an ugly
> way by reaching into the internals of the GenBank writer:
>
>>>> from Bio import SeqIO
>>>> from Bio.SeqIO import InsdcIO
>>>> handle = open("NT_019265.gb")
>>>> for rec in SeqIO.parse(handle, "genbank"):
> ...     for f in rec.features:
> ...         print InsdcIO._insdc_feature_location_string(f, len(rec.seq))
> ...
> <cut>
>
> That might work for a quick hack but is not necessarily future-proof
> if the internals change. Peter, do you think this would be useful to
> expose as a function of a SeqFeature directly, so you could do
> feature.insdc_string() or something similar?

A couple of people have asked for this, and since adding SeqIO
output in GenBank/EMBL format (the code you refer to in InsdcIO)
this would be very possible... the issue holding me back is the
annoying special case(s) requiring to know the parent sequence's
length. The problem is that currently the SeqFeature doesn't
have this information - it doesn't have any link back to a parent
SeqRecord (and indeed it doesn't even have to be created in
the context of a SeqRecord).

Perhaps we can handle the case of between features N^1 on
circular sequences of length N differently, maybe with a dedicated
SeqFeature location class which would tell us it was at the origin?
Then we'd be able to avoid the need to know the parent length.

Once that is resolved, an orphan SeqFeature could generate its
own INSDC (GenBank/EMBL) location string without needing any
extra information, and exposing this as an object method would
be fine.
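
To make the special case concrete, here is a rough standalone sketch of
formatting a simple exact location. It is not the InsdcIO code; the function
name insdc_location and its arguments are hypothetical, it takes plain
0-based, end-exclusive integers rather than SeqFeature objects, and it
ignores fuzzy positions and joins.

```python
def insdc_location(start, end, strand=1, parent_length=None):
    """Format a simple exact location as an INSDC (GenBank/EMBL) string.

    start and end are 0-based and end-exclusive (Python slice style).
    parent_length is only consulted for zero-length "between" locations.
    """
    if start == end:
        # Zero-length feature lying between two bases.
        if parent_length is not None and end == parent_length:
            # Between the last and first base of a circular sequence.
            text = "%i^1" % end
        else:
            text = "%i^%i" % (end, end + 1)
    elif start + 1 == end:
        text = "%i" % end  # single base
    else:
        text = "%i..%i" % (start + 1, end)
    if strand == -1:
        text = "complement(%s)" % text
    return text


print(insdc_location(0, 3290))
print(insdc_location(217507, 217508))
print(insdc_location(100, 100))                     # length unknown
print(insdc_location(100, 100, parent_length=100))  # at the origin
```

Only the last two calls differ, which is the point: without the parent
length (or a dedicated "at the origin" location class) the two cases cannot
be told apart.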

Peter

P.S. If we ever add a CircularSeq object - see other thread- then
SeqFeature locations spanning the origin might need reworking
too.


From p.j.a.cock at googlemail.com  Thu Mar 10 04:00:51 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 09:00:51 +0000
Subject: [Biopython] .ab1 file parser in biopython?
In-Reply-To: <AANLkTin_q6vdKHeitaZU0RtuO3=YXuBwyQ5yptJOUErT@mail.gmail.com>
References: <AANLkTin_q6vdKHeitaZU0RtuO3=YXuBwyQ5yptJOUErT@mail.gmail.com>
Message-ID: <AANLkTim3T4ET40bbfszRr7KAAFW=Ae9Y7CJWpRo2-o3H@mail.gmail.com>

On Wed, Mar 9, 2011 at 10:57 PM, Arvind Subramaniam
<rasi at seas.harvard.edu> wrote:
> Hi
>  I am new to biopython so please excuse me if this issue is obviously
> simple. I am trying to parse .ab1 sequencing trace files in Biopython
> and I cannot find the right module or method to do this job. Can
> someone suggest how I can parse .ab1 files?
> Thanks,
> Arvind.

You mean the ABI trace file format for capillary sequencing?

Personally I use EMBOSS seqret (e.g. to make FASTQ), or tracetuner
if I want to recall the bases (the ABI software doesn't always do the
best possible calling job).

http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/seqret.html
http://sourceforge.net/projects/tracetuner/

Peter


From chapmanb at 50mail.com  Thu Mar 10 06:06:48 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 10 Mar 2011 06:06:48 -0500
Subject: [Biopython] "raw" genbank locations?
In-Reply-To: <AANLkTik9aEnq9F-v8SGmsT7-4ND0bTHZG_xtL6uKN3d0@mail.gmail.com>
References: <A3104CC4-4890-4303-8118-EA62309D445D@u.washington.edu>
	<20110310020545.GA2185@kunkel>
	<AANLkTik9aEnq9F-v8SGmsT7-4ND0bTHZG_xtL6uKN3d0@mail.gmail.com>
Message-ID: <20110310110648.GA2302@kunkel>

Peter;

> > do you think this would be useful to
> > expose as a function of a SeqFeature directly, so you could do
> > feature.insdc_string() or something similar?
> 
> A couple of people have asked for this, and since adding SeqIO
> output in GenBank/EMBL format (the code you refer to in InsdcIO)
> this would be very possible... the issue holding me back is the
> annoying special case(s) requiring to know the parent sequence's
> length. The problem is that currently the SeqFeature doesn't
> have this information - it doesn't have any link back to a parent
> SeqRecord (and indeed it doesn't even have to be created in
> the context of a SeqRecord).
> 
> Perhaps we can handle the case of between features N^1 on
> circular sequences of length N differently, maybe with a dedicated
> SeqFeature location class which would tell us it was at the origin?
> Then we'd be able to avoid the need to know the parent length.

This is a great idea; makes sense to treat this as a special case
since that's what it is. Another simple way would be to put the
function on the SeqRecord class and call it with:
rec.insdc_feature_string(feature); this places the responsibility of
knowing the parent back on the library user. 

> P.S. If we ever add a CircularSeq object - see other thread- then
> SeqFeature locations spanning the origin might need reworking
> too.

Makes sense. We can get the 99% of standard cases working now and
then re-circle back on this once someone gets up the guts to tackle
CircularSeq.

Brad

From p.j.a.cock at googlemail.com  Thu Mar 10 06:52:48 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 11:52:48 +0000
Subject: [Biopython] "raw" genbank locations?
In-Reply-To: <20110310110648.GA2302@kunkel>
References: <A3104CC4-4890-4303-8118-EA62309D445D@u.washington.edu>
	<20110310020545.GA2185@kunkel>
	<AANLkTik9aEnq9F-v8SGmsT7-4ND0bTHZG_xtL6uKN3d0@mail.gmail.com>
	<20110310110648.GA2302@kunkel>
Message-ID: <AANLkTikdg=SaMguCiCbJwyuiBYbVi=CFgkdSKL+3j5ZY@mail.gmail.com>

On Thu, Mar 10, 2011 at 11:06 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Peter;
>
>> > do you think this would be useful to
>> > expose as a function of a SeqFeature directly, so you could do
>> > feature.insdc_string() or something similar?
>>
>> A couple of people have asked for this, and since adding SeqIO
>> output in GenBank/EMBL format (the code you refer to in InsdcIO)
>> this would be very possible... the issue holding me back is the
>> annoying special case(s) requiring to know the parent sequence's
>> length. The problem is that currently the SeqFeature doesn't
>> have this information - it doesn't have any link back to a parent
>> SeqRecord (and indeed it doesn't even have to be created in
>> the context of a SeqRecord).
>>
>> Perhaps we can handle the case of between features N^1 on
>> circular sequences of length N differently, maybe with a dedicated
>> SeqFeature location class which would tell us it was at the origin?
>> Then we'd be able to avoid the need to know the parent length.
>
> This is a great idea; makes sense to treat this as a special case
> since that's what it is.

It is probably the most elegant solution without a big refactor.

> Another simple way would be to put the
> function on the SeqRecord class and call it with:
> rec.insdc_feature_string(feature); this places the responsibility of
> knowing the parent back on the library user.

Yes, that would be simple. But don't we sometimes want to use
'orphan' SeqFeature objects (without a SeqRecord parent)?
I'm thinking here about GFF3 files and the like.

>> P.S. If we ever add a CircularSeq object - see other thread- then
>> SeqFeature locations spanning the origin might need reworking
>> too.
>
> Makes sense. We can get the 99% of standard cases working now and
> then re-circle back on this once someone gets up the guts to tackle
> CircularSeq.

:)

Peter

From rmb32 at cornell.edu  Thu Mar 10 12:15:41 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 10 Mar 2011 12:15:41 -0500
Subject: [Biopython] update Google Summer of Code project ideas
Message-ID: <4D79073D.3090603@cornell.edu>

Hi all,

Please make sure the Biopython information is up to date for 2011 on both
the OBF and Biopython wikis.  Eric has done some work on it, but the
current page has not been completely updated to reflect that it's 2011
and we're applying again.

OBF wiki page: http://www.open-bio.org/wiki/Google_Summer_of_Code
Biopython wiki: http://biopython.org/wiki/Google_Summer_of_Code

Rob

----
Robert Buels
(prospective) 2011 OBF GSoC Organization Admin

From anaryin at gmail.com  Thu Mar 10 12:25:04 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 10 Mar 2011 18:25:04 +0100
Subject: [Biopython] update Google Summer of Code project ideas
In-Reply-To: <4D79073D.3090603@cornell.edu>
References: <4D79073D.3090603@cornell.edu>
Message-ID: <AANLkTinMWsPFmoigY2nPxPKfcj164CsXbWs8M71+xcuz@mail.gmail.com>

I updated the date and added the project from last year to the page, to show
we got another funded project.

Cheers,

J

From p.j.a.cock at googlemail.com  Thu Mar 10 12:42:58 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 17:42:58 +0000
Subject: [Biopython] Bugzilla -> Redmine migration
Message-ID: <AANLkTi=VuX3+ymNEo34f2XY1N-OmSGJ1MPFp7TNbginn@mail.gmail.com>

Hi all,

Anyone who has tried to file a bug recently will have noticed a big
red message "Sorry, entering bugs into the product Biopython has been
disabled."

The reason for this is the OBF team are about to move us (and all the
other Bio* projects using Bugzilla) to a Redmine server instead.
See http://www.redmine.org/

I expect this to be completed in the next few days (with all the old
bugs and accounts carried across). Hopefully this will include
integration with our git repository as well.

We'll make an announcement once it is ready; in the meantime, any new
bugs can be emailed to the mailing list as a short-term measure.

Peter

From laserson at mit.edu  Thu Mar 10 13:22:42 2011
From: laserson at mit.edu (Uri Laserson)
Date: Thu, 10 Mar 2011 13:22:42 -0500
Subject: [Biopython] .ab1 file parser in biopython?
In-Reply-To: <AANLkTim3T4ET40bbfszRr7KAAFW=Ae9Y7CJWpRo2-o3H@mail.gmail.com>
References: <AANLkTin_q6vdKHeitaZU0RtuO3=YXuBwyQ5yptJOUErT@mail.gmail.com>
	<AANLkTim3T4ET40bbfszRr7KAAFW=Ae9Y7CJWpRo2-o3H@mail.gmail.com>
Message-ID: <AANLkTimgkHz2N9SbQq6fgX76hxHCA5HKE78ezTO40j=q@mail.gmail.com>

I also found the following code lying around somewhere.  I copied it into
one of my repositories:

https://github.com/laserson/pytools/blob/master/ab1.py

"Python implementation of an ABIF file reader according to Applied
Biosystems' specificatons" as specified in March 2007, it appears.

...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu


On Thu, Mar 10, 2011 at 04:00, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Wed, Mar 9, 2011 at 10:57 PM, Arvind Subramaniam
> <rasi at seas.harvard.edu> wrote:
> > Hi
> >  I am new to biopython so please excuse me if this issue is obviously
> > simple. I am trying to parse .ab1 sequencing trace files in Biopython
> > and I cannot find the right module or method to do this job. Can
> > someone suggest how I can parse .ab1 files?
> > Thanks,
> > Arvind.
>
> You mean the ABI trace file format for capillary sequencing?
>
> Personally I use EMBOSS seqret (e.g. to make FASTQ), or tracetuner
> if I want to recall the bases (the ABI software doesn't always do the
> best possible calling job).
>
> http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/seqret.html
> http://sourceforge.net/projects/tracetuner/
>
> Peter

From p.j.a.cock at googlemail.com  Thu Mar 10 13:37:04 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 18:37:04 +0000
Subject: [Biopython] .ab1 file parser in biopython?
In-Reply-To: <AANLkTimgkHz2N9SbQq6fgX76hxHCA5HKE78ezTO40j=q@mail.gmail.com>
References: <AANLkTin_q6vdKHeitaZU0RtuO3=YXuBwyQ5yptJOUErT@mail.gmail.com>
	<AANLkTim3T4ET40bbfszRr7KAAFW=Ae9Y7CJWpRo2-o3H@mail.gmail.com>
	<AANLkTimgkHz2N9SbQq6fgX76hxHCA5HKE78ezTO40j=q@mail.gmail.com>
Message-ID: <AANLkTikzMpbhgF2+t0dDW=cOTVYYsg6+rCxwUcwKSHGG@mail.gmail.com>

On Thu, Mar 10, 2011 at 6:22 PM, Uri Laserson <laserson at mit.edu> wrote:
> I also found the following code lying around somewhere.  I copied it into
> one of my repositories:
>
> https://github.com/laserson/pytools/blob/master/ab1.py
>
> "Python implementation of an ABIF file reader according to Applied
> Biosystems' specificatons" as specified in March 2007, it appears.
>

It's under the GPL license. If you contacted the named author, Francis
Wolinski, and he was willing to re-licence for Biopython to use, then we
could consider incorporating it.

Alternatively it shouldn't be too hard to reimplement it from scratch
based on the published specification (and go one step further and
consider output too).

http://www.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf

Note some care would be needed to work on Python 3, but we
can follow the example of our SFF parser here.
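
As an illustration of the kind of care meant here (binary mode and bytes
comparisons), this is a small sketch that reads only the fixed-size ABIF
header. The field layout is my reading of the published specification (a
4-byte magic, a 2-byte version, then one 28-byte big-endian directory
entry) and should be checked against the PDF; the function name
read_abif_header is hypothetical.

```python
import io
import struct

# Big-endian: magic, version, then one directory entry
# (name, number, element type, element size, number of elements,
#  data size, data offset, data handle). Layout assumed from the spec.
HEADER = struct.Struct(">4sh4sihhiiii")


def read_abif_header(handle):
    """Read the ABIF header from a file handle opened in binary mode."""
    data = handle.read(HEADER.size)
    if len(data) < HEADER.size:
        raise ValueError("File too short to be ABIF")
    (magic, version, tag_name, tag_number, elem_type, elem_size,
     num_entries, data_size, dir_offset, _data_handle) = HEADER.unpack(data)
    # Compare against bytes, not str, so this works on Python 3 as well.
    if magic != b"ABIF":
        raise ValueError("Not an ABIF file")
    return {
        "version": version,
        "tag": tag_name.decode("ascii"),
        "num_entries": num_entries,
        "dir_offset": dir_offset,
    }


# Synthetic header for demonstration only, not taken from a real trace file.
fake = HEADER.pack(b"ABIF", 101, b"tdir", 1, 1023, 28, 2, 56, 128, 0)
print(read_abif_header(io.BytesIO(fake)))
```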

Is there actually a need for this though? As I said before, for my own
needs getting the ABI file into FASTQ format (or FASTA+QUAL) has
sufficed.

Peter


From cmckay at u.washington.edu  Thu Mar 10 16:51:42 2011
From: cmckay at u.washington.edu (Cedar McKay)
Date: Thu, 10 Mar 2011 13:51:42 -0800
Subject: [Biopython] "raw" genbank locations?
Message-ID: <D045666B-2637-401E-9FE5-02EF61C7BAF6@u.washington.edu>

Great! InsdcIO._insdc_feature_location_string was just what I needed. I was actually on the right track, trying to figure out how SeqIO wrote locations in genbank format, but your email arrived soon enough that I didn't have to finish the job. I realize this is a private method, so I would like an official way to do this.

Thanks so much guys, as usual, awesome service!

Cedar


From laserson at mit.edu  Thu Mar 10 17:07:46 2011
From: laserson at mit.edu (Uri Laserson)
Date: Thu, 10 Mar 2011 17:07:46 -0500
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
Message-ID: <AANLkTik-Pa7tPs=knepZYYu92-emHbR2ptYTHAhfxWJk@mail.gmail.com>

Say I have a SeqRecord called A and a SeqRecord called B.  A has a bunch of
SeqFeatures associated with it, while B has none.  I perform a gapped
alignment between the two sequences.  Now I want to copy the SeqFeatures
from A onto B in a way that respects the coordinates of all the features.

For example (and please use a fixed-width font for this):

         0                       1
         0 1 2 3 4   5   6 7 8 9 0 1 2 3 4 5 6 7 8 9
             FEATURE_1               FEATURE_2
          X X X X X X X X X       X X X X X X X X X
A   - - - a c g g t - - a c a g a c g t g a t a c g
          | | | | |     | | |   | | |   | | | | | |
B   a a a a c g g t g g a c a t a c g - g a t a c g

   0                   1                     2
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6  7  8 9 0 1 2 3


In sequence A, the coords of Feature 1 and Feature 2 should be (0,7) and
(10,19), respectively.  Now I want to copy it to sequence B, where the
feature coords should instead be (3,12) and (15,23).

Is there an easy way to do this in biopython already?  Or are there any
ideas for an elegant solution?

Thanks!

Uri


...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu

From p.j.a.cock at googlemail.com  Thu Mar 10 17:46:32 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 22:46:32 +0000
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
In-Reply-To: <AANLkTik-Pa7tPs=knepZYYu92-emHbR2ptYTHAhfxWJk@mail.gmail.com>
References: <AANLkTik-Pa7tPs=knepZYYu92-emHbR2ptYTHAhfxWJk@mail.gmail.com>
Message-ID: <AANLkTint4fjxkkknMOjZ0WqsEyhR1OL_joShuAoOg3m8@mail.gmail.com>

On Thu, Mar 10, 2011 at 10:07 PM, Uri Laserson <laserson at mit.edu> wrote:
> Say I have a SeqRecord called A and a SeqRecord called B.  A has a bunch of
> SeqFeatures associated with it, while B has none.  I perform a gapped
> alignment between the two sequences.  Now I want to copy the SeqFeatures
> from A onto B in a way that respects the coordinates of all the features.
>
> For example (and please use a fixed-width font for this):
> <cut>

I'm not quite sure I followed that figure.

> In sequence A, the coords of Feature 1 and Feature 2 should be (0,7) and
> (10,19), respectively.  Now I want to copy it to sequence B, where the
> feature coords should instead be (3,12) and (15,23).
>
> Is there an easy way to do this in biopython already?

No, but I'm not sure how advisable it is anyway (if I have
understood you right - see below).

> Or are there any ideas for an elegant solution?

I actually wanted to do something similar to this myself.
I had a draft genome I had annotated in GenBank format.
We did some more sequencing and/or I tweaked the
assembly, and I had a new very similar sequence in a
FASTA file, and I wanted to copy the old annotation over.

What I did was look for perfect matches between the regions
spanned by the features (no introns in this case), and that
meant all I needed to do was apply a shift to the SeqFeature
location. There is a (private) method _shift which helped here
(written for use in slicing a SeqRecord).
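
Roughly, the matching step looked like the sketch below. This is a
simplified reconstruction using plain strings; the function name find_shift
is hypothetical, and it does not call the private _shift method, it only
works out the offset that would be applied.

```python
def find_shift(old_seq, new_seq, start, end):
    """Return the offset to add to start and end, or None.

    start and end are 0-based, end-exclusive coordinates on old_seq.
    None means the region is missing or not unique in new_seq, so the
    feature needs to be looked at by hand.
    """
    region = old_seq[start:end]
    first = new_seq.find(region)
    if first == -1:
        return None  # region changed in the new sequence
    if new_seq.find(region, first + 1) != -1:
        return None  # region occurs more than once, so ambiguous
    return first - start


old = "AAACCCGGGTTT"
new = "TT" + old  # two extra bases at the start
shift = find_shift(old, new, 3, 9)
print(shift)  # 2
```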

In my case, that handled most of the annotation, and I did
the nasty cases by hand (since I wanted to examine what
had happened in the new assembly - it was a small genome).

In your case the start and end co-ordinates may be shifted
by different amounts (since you are doing gapped alignments).
This worries me as the length of your features can change.
For any gene or CDS features that is a problem (frame shifts).
Have you thought about that? Perhaps you're dealing with
non-coding features only?

Peter


From p.j.a.cock at googlemail.com  Fri Mar 11 04:53:16 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 11 Mar 2011 09:53:16 +0000
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
In-Reply-To: <AANLkTikc0387NYygsmoQJNfExJHdGJhNCde1Z5_Gkwyk@mail.gmail.com>
References: <AANLkTik-Pa7tPs=knepZYYu92-emHbR2ptYTHAhfxWJk@mail.gmail.com>
	<AANLkTint4fjxkkknMOjZ0WqsEyhR1OL_joShuAoOg3m8@mail.gmail.com>
	<AANLkTikc0387NYygsmoQJNfExJHdGJhNCde1Z5_Gkwyk@mail.gmail.com>
Message-ID: <AANLkTimyzVBDe48Vm=mak0j_ftfy5SXf__rOLg26Az-K@mail.gmail.com>

On Thu, Mar 10, 2011 at 11:25 PM, Uri Laserson <laserson at mit.edu> wrote:
>> I'm not quite sure I followed that figure.
>
> I think you understood perfectly.

Good - your text was clearer for me.

>> In your case the start and end co-ordinates may be shifted
>> by different amounts (since you are doing gapped alignments).
>> This worries me as the length of your features can change.
>> For any gene or CDS features that is a problem (frame shifts).
>> Have you thought about that? Perhaps you're dealing with
>> non-coding features only?
>
> That's exactly the complication here.  I have one reference sequence that is
> highly annotated, and I have a read that I want to align to it and transfer
> over the annotations to the corresponding positions.

OK - and do you want to worry about spotting frameshifts,
and updating the translation for CDS features?

> One way I can handle this situation is that when I actually build the
> pairwise gapped alignment (which I do manually), in addition to the actual
> gapped-sequence strings, I can generate two lists that contain the ungapped
> coordinates of each sequence (in my diagram, this is the numbering above and
> below).  Figuring out the new coords from the old coordinates is then a
> matter of matching the positions in the lists.  (Though perhaps it's easier
> to implement using dictionaries, so I don't have to search the lists I
> generated.)

Yes, that kind of technique is also useful for mapping between
gapped and ungapped coordinates in assembly files.
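
The two-list idea described above can be sketched as a single dictionary
lookup. This is only a minimal illustration, assuming a pairwise alignment
given as two equal-length gapped strings with "-" as the gap character; the
function name and the example alignment are hypothetical, not part of
Biopython.

```python
def ungapped_coordinate_map(gapped_ref, gapped_read, gap="-"):
    """Map ungapped reference positions to ungapped read positions.

    Both arguments are rows of one pairwise alignment (equal length).
    Returns a dict {ref_index: read_index} (0-based) covering only the
    columns where neither sequence has a gap.
    """
    if len(gapped_ref) != len(gapped_read):
        raise ValueError("alignment rows must have the same length")
    mapping = {}
    ref_pos = read_pos = 0
    for ref_char, read_char in zip(gapped_ref, gapped_read):
        if ref_char != gap and read_char != gap:
            mapping[ref_pos] = read_pos
        # Advance each ungapped counter only when that row has a letter.
        if ref_char != gap:
            ref_pos += 1
        if read_char != gap:
            read_pos += 1
    return mapping


# Hypothetical alignment: the read lacks two reference bases and
# carries one extra base of its own.
ref_row = "ACGTAC-GT"
read_row = "ACG--CTGT"
coords = ungapped_coordinate_map(ref_row, read_row)
```

Reference positions that fall opposite a gap in the read (3 and 4 in this
example) are simply absent from the dictionary, so the caller has to decide
what to do with a feature whose start or end lands there.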

> Either way, in order to move the SeqFeature to the new sequence, should I
> make a deep copy of it and then manually modify the start and end coords?
> Uri

You could do, or create a new SeqFeature, or "steal" the old one and
modify it. The latter technique would probably be fastest since there
are no new objects to create, just a few integer attribute changes
(location positions), but is perhaps a bit risky if you don't comment
it clearly. If you do that, perhaps do this by popping the features
from the old SeqRecord's feature list, modify them, and add them
to the new SeqRecord's feature list.

If all your current annotation uses simple exact locations, life is
easier. If there are fuzzy locations, then using the location object's
private _shift method might be simplest.

Another query, are you going to look for inversions? In such
cases the strand needs flipping and the start/end interchanged.
The SeqRecord reverse complement method has to do this,
and therefore the SeqFeature and its location and position
classes all have a private _flip method.

[If you find these private methods useful, perhaps we can make
them public? Let us know]

Thanks,

Peter


From thamelry at binf.ku.dk  Fri Mar 11 08:08:55 2011
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Fri, 11 Mar 2011 14:08:55 +0100
Subject: [Biopython] update Google Summer of Code project ideas
In-Reply-To: <AANLkTinMWsPFmoigY2nPxPKfcj164CsXbWs8M71+xcuz@mail.gmail.com>
References: <4D79073D.3090603@cornell.edu>
	<AANLkTinMWsPFmoigY2nPxPKfcj164CsXbWs8M71+xcuz@mail.gmail.com>
Message-ID: <AANLkTik+Y6uDLKniKLi-AWV-it05rd9YTg0u_Tc_0jOy@mail.gmail.com>

Hi,

I've just added a proposal:

Mocapy++Biopython: from data to probabilistic models of biomolecules
<http://biopython.org/wiki/Google_Summer_of_Code#Mocapy.2B.2BBiopython:_from_data_to_probabilistic_models_of_biomolecules>

Cheers,

-- 
Thomas Hamelryck, Eng., Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://www.binf.ku.dk/research/structural_bioinformatics/

From laserson at mit.edu  Fri Mar 11 12:03:58 2011
From: laserson at mit.edu (Uri Laserson)
Date: Fri, 11 Mar 2011 12:03:58 -0500
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
In-Reply-To: <AANLkTimyzVBDe48Vm=mak0j_ftfy5SXf__rOLg26Az-K@mail.gmail.com>
References: <AANLkTik-Pa7tPs=knepZYYu92-emHbR2ptYTHAhfxWJk@mail.gmail.com>
	<AANLkTint4fjxkkknMOjZ0WqsEyhR1OL_joShuAoOg3m8@mail.gmail.com>
	<AANLkTikc0387NYygsmoQJNfExJHdGJhNCde1Z5_Gkwyk@mail.gmail.com>
	<AANLkTimyzVBDe48Vm=mak0j_ftfy5SXf__rOLg26Az-K@mail.gmail.com>
Message-ID: <AANLkTinzrbTRmsqX_z9DJH1fnMbvMCABqfJAEbeGh2Vo@mail.gmail.com>

>
> OK - and do you want to worry about spotting frameshifts,
> and updating the translation for CDS features?
>

I can retranslate the features myself, wary of any frameshifts.


> You could do, or create a new SeqFeature, or "steal" the old one and
> modify it. The latter technique would probably be fastest since there
> are no new objects to create, just a few integer attribute changes
> (location positions), but is perhaps a bit risky if you don't comment
> it clearly. If you do that, perhaps do this by popping the features
> from the old SeqRecord's feature list, modify them, and add them
> to the new SeqRecord's feature list.
>

I can't steal the features because the source of the features is a reference
sequence that I will reuse for millions of reads.  I will have to make a
copy.  You believe that building a new SeqFeature would be faster/safer than
using python's copy.deepcopy() method?


> Another query, are you going to look for inversions? In such
> cases the strand needs flipping and the start/end interchanged.
> The SeqRecord reverse complement method has to do this,
> and therefore the SeqFeature and its location and position
> classes all have a private _flip method.
>

All the reads will be reverse complemented to the coding orientation before
the transfer of the features, so I don't think this will be a problem.


> [If you find these private methods useful, perhaps we can make
> them public? Let us know]
>

It's hard to tell what the general API should be or what are the most common
use-cases.  For myself, I can get by with writing my own methods to modify
the coordinates accordingly.

Uri

From p.j.a.cock at googlemail.com  Fri Mar 11 12:15:09 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 11 Mar 2011 17:15:09 +0000
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
In-Reply-To: <AANLkTinzrbTRmsqX_z9DJH1fnMbvMCABqfJAEbeGh2Vo@mail.gmail.com>
References: <AANLkTik-Pa7tPs=knepZYYu92-emHbR2ptYTHAhfxWJk@mail.gmail.com>
	<AANLkTint4fjxkkknMOjZ0WqsEyhR1OL_joShuAoOg3m8@mail.gmail.com>
	<AANLkTikc0387NYygsmoQJNfExJHdGJhNCde1Z5_Gkwyk@mail.gmail.com>
	<AANLkTimyzVBDe48Vm=mak0j_ftfy5SXf__rOLg26Az-K@mail.gmail.com>
	<AANLkTinzrbTRmsqX_z9DJH1fnMbvMCABqfJAEbeGh2Vo@mail.gmail.com>
Message-ID: <AANLkTi=sKcpmoc+0bpfask_Q2=v2kXJq6y5Pxji4y8o0@mail.gmail.com>

On Fri, Mar 11, 2011 at 5:03 PM, Uri Laserson <laserson at mit.edu> wrote:
>> You could do, or create a new SeqFeature, or "steal" the old one and
>> modify it. The latter technique would probably be fastest since there
>> are no new objects to create, just a few integer attribute changes
>> (location positions), but is perhaps a bit risky if you don't comment
>> it clearly. If you do that, perhaps do this by popping the features
>> from the old SeqRecord's feature list, modify them, and add them
>> to the new SeqRecord's feature list.
>
> I can't steal the features because the source of the features is a reference
> sequence that I will reuse for millions of reads.  I will have to make a
> copy.  You believe that building a new SeqFeature would be faster/safer than
> using python's copy.deepcopy() method?

Yes, in this case you will have to make a copy. As to speed,
I'm not sure which would be fastest - try it and see ;)
Note as long as you are not going to *change* the information
in the qualifiers dictionary (and you may want to if you update
the translation for example), then you can have the new
SeqFeature share the old qualifiers dictionary. That is a bit
sneaky but may help with speed (if speed is an issue).
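
The difference between sharing the qualifiers dictionary and deep-copying
can be sketched with a stand-in class. The Feature class below is a
hypothetical placeholder with two coordinates and a qualifiers dict; it is
not Biopython's SeqFeature, and the helper names are made up for
illustration.

```python
import copy


class Feature(object):
    """Hypothetical stand-in for a SeqFeature: coordinates plus qualifiers."""

    def __init__(self, start, end, qualifiers):
        self.start = start
        self.end = end
        self.qualifiers = qualifiers


def transfer_shared(feature, new_start, new_end):
    """Build a new feature that reuses (shares) the old qualifiers dict."""
    return Feature(new_start, new_end, feature.qualifiers)


def transfer_deepcopy(feature, new_start, new_end):
    """Deep-copy the feature, then overwrite the coordinates."""
    new_feature = copy.deepcopy(feature)
    new_feature.start = new_start
    new_feature.end = new_end
    return new_feature


reference_feature = Feature(100, 400, {"gene": ["example"]})
shared = transfer_shared(reference_feature, 90, 390)
copied = transfer_deepcopy(reference_feature, 90, 390)

# The shared version points at the very same dictionary, so editing it
# would also change the reference feature; the deep copy is independent.
```

Neither call touches the reference feature's own coordinates, which is what
matters when the same reference is reused for many reads. Which variant is
faster on real SeqFeature objects would still need to be timed.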

>> [If you find these private methods useful, perhaps we can make
>> them public? Let us know]
>
> It's hard to tell what the general API should be or what are the most common
> use-cases.  For myself, I can get by with writing my own methods to modify
> the coordinates accordingly.

Thanks,

Peter


From reece at harts.net  Mon Mar 14 14:22:52 2011
From: reece at harts.net (Reece Hart)
Date: Mon, 14 Mar 2011 11:22:52 -0700
Subject: [Biopython] update Google Summer of Code project ideas
In-Reply-To: <AANLkTik+Y6uDLKniKLi-AWV-it05rd9YTg0u_Tc_0jOy@mail.gmail.com>
References: <4D79073D.3090603@cornell.edu>
	<AANLkTinMWsPFmoigY2nPxPKfcj164CsXbWs8M71+xcuz@mail.gmail.com>
	<AANLkTik+Y6uDLKniKLi-AWV-it05rd9YTg0u_Tc_0jOy@mail.gmail.com>
Message-ID: <AANLkTin4xadMNnjj_o-XGxGmjcX79CSfkLoKwhubjkzi@mail.gmail.com>

All-

I just added a GSoC Biopython proposal:
Variant representation, parser, generator, and coordinate
converter
<http://biopython.org/wiki/Google_Summer_of_Code#Variant_representation.2C_parser.2C_generator.2C_and_coordinate_converter>

Comments and co-mentors welcome.

-Reece

From 2huggie at gmail.com  Wed Mar 16 04:26:44 2011
From: 2huggie at gmail.com (Timothy Wu)
Date: Wed, 16 Mar 2011 16:26:44 +0800
Subject: [Biopython] [BioPython] Genbank parser
Message-ID: <AANLkTik2-6n_F3-mgFHEDOiSWWmnpXE9Xjn_CMThvfu8@mail.gmail.com>

Hi,

I'm using Biopython to parse human genome files with code like this:

        for seq_record in SeqIO.parse(fd, "genbank"):
            pass  # do something with seq_record

However something tripped on me:

Traceback (most recent call last):
  File "./buildSyn.py", line 26, in <module>
    main()
  File "./buildSyn.py", line 19, in main
    gene2SynMapping, syn2GeneMapping = mapper.getMappingDicts(files)
  File "/home/thw/MyPythonPackage/frameworks/BioProg/idmapping/idmapper/human_genome_id_mapper.py", line 29, in getMappingDicts
    self.parseAndGetMapping(fd, gene2syn)
  File "/home/thw/MyPythonPackage/frameworks/BioProg/idmapping/idmapper/human_genome_id_mapper.py", line 74, in parseAndGetMapping
    for seq_record in SeqIO.parse(fd, "genbank"):
  File "/usr/lib/pymodules/python2.6/Bio/SeqIO/__init__.py", line 525, in parse
    for r in i:
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 437, in parse_records
    record = self.parse(handle, do_features)
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 420, in parse
    if self.feed(handle, consumer, do_features):
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 392, in feed
    self._feed_feature_table(consumer, self.parse_features(skip=False))
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 344, in _feed_feature_table
    consumer.location(location_string)
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/__init__.py", line 975, in location
    raise LocationParserError(location_line)
Bio.GenBank.LocationParserError: 958574^958575..958886

The Genbank file involved has the following structure:

    CDS             958574^958575..958772
                     /gene="CSH2"
                     /gene_synonym="CS-2; CSB; hCS-B"
                     /exception="unclassified translation discrepancy"
                     /note="placental lactogen; chorionic somatomammotropin B;
                     Derived by automated computational analysis using gene
                     prediction method: Curated Genomic."
                     /codon_start=1
                     /product="chorionic somatomammotropin hormone 2 isoform 3"
                     /protein_id="NP_072171.1"
                     /db_xref="GI:12408694"
                     /db_xref="CCDS:CCDS42368.1"
                     /db_xref="GeneID:1443"
                     /db_xref="HGNC:2441"
                     /db_xref="MIM:118820"

This isn't the first occurrence in this file, however I manually deleted
what's equivalent of "^958575"
in the location and it works out OK.

Is there something I can do? Right now I edit the genbank file instead
(since I won't be needing the location information)
And I'm not sure what the caret is supposed to represent.

Thanks for your attention.

Timothy

From p.j.a.cock at googlemail.com  Wed Mar 16 07:43:28 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 16 Mar 2011 11:43:28 +0000
Subject: [Biopython] [BioPython] Genbank parser
In-Reply-To: <AANLkTik2-6n_F3-mgFHEDOiSWWmnpXE9Xjn_CMThvfu8@mail.gmail.com>
References: <AANLkTik2-6n_F3-mgFHEDOiSWWmnpXE9Xjn_CMThvfu8@mail.gmail.com>
Message-ID: <AANLkTi=O8btp9Yheqs5jx1TR+g-2MBj_XZ6E0aq3cXkf@mail.gmail.com>

On Wed, Mar 16, 2011 at 8:26 AM, Timothy Wu <2huggie at gmail.com> wrote:
> Hi,
>
> I'm using Biopython to parse human genome files with code like this:
>
>        for seq_record in SeqIO.parse(fd, "genbank"):
>            pass  # do something with seq_record
>
> However something tripped on me:
>
> Traceback (most recent call last):
> ...
>    raise LocationParserError(location_line)
> Bio.GenBank.LocationParserError: 958574^958575..958886
>
> The Genbank file involved has the following structure:
>
>    CDS             958574^958575..958772
>                     /gene="CSH2"
> ...
>
> This isn't the first occurrence in this file, however I manually deleted
> what's equivalent of "^958575" in the location and it works out OK.
>
> Is there something I can do? Right now I edit the genbank file instead
> (since I won't be needing the location information)
> And I'm not sure what the caret is supposed to represent.

Hi Timothy,

I believe this to be an invalid GenBank file, and I would like you
to contact the NCBI to check this. The caret is used for 'between'.
Here it seems to be saying this feature starts between
958574 and 958575, and runs to 958772. That would normally
be represented just as 958575..958772
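
If the location information is not needed, the manual edit described above
could be automated on the location string before it reaches the parser.
This is only a sketch of that workaround, assuming the reading given here
(N^M..K treated as M..K); the function name is hypothetical and nothing
below is part of Biopython.

```python
import re

# Matches a location fragment such as 958574^958575 only when it is
# immediately followed by ".." and an end coordinate.
CARET_START = re.compile(r"\b(\d+)\^(\d+)(?=\.\.\d+)")


def simplify_caret_start(location):
    """Rewrite 'N^M..K' as 'M..K'; leave other locations untouched."""
    return CARET_START.sub(r"\2", location)


fixed = simplify_caret_start("958574^958575..958772")
```

A plain 'between' site such as 123^124, with no range after it, is left
unchanged, since that form is an ordinary location. The rewrite should be
applied to feature location strings only, not to the whole file, so that
qualifier text is never altered.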

See also:
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
http://redmine.open-bio.org/issues/3175
(we're migrating the bug database, official announcement
due soon)

How many of these 'broken' GenBank records have you
found? I would hope it is just one or two that can be fixed by
hand. If on the other hand the NCBI say this is valid, we need
to handle this in the Biopython feature model...

Peter


From cjfields at illinois.edu  Wed Mar 16 13:58:23 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 16 Mar 2011 12:58:23 -0500
Subject: [Biopython] [ANNOUNCEMENT] Bugzilla to Redmine migration
Message-ID: <34C8C0CB-9273-468E-86D7-74B22464F181@illinois.edu>

(apologies if you receive multiple copies of this)

All,

We are currently about 95% done with a transition over to our new Redmine tracking system, to the point where we feel comfortable in going ahead with opening it to developers:

http://redmine.open-bio.org/

All edits to bugzilla reports on our old system (http://bugzilla.open-bio.org/) are now disabled and the system is now read-only.  Any new bugs and comments to old ones should be reported on the new Redmine server.

For current Bugzilla users, we have migrated login IDs to Redmine (this is normally an email address), but we have reset user passwords for security reasons.  There are two ways to access your account:

1) When logging in (http://redmine.open-bio.org/login), click on the 'Lost password' link.  You will be prompted for your email address (this should be the same as your bugzilla login).  A new email will be sent out containing directions for resetting your password and logging in.

2) It is possible the above may be automatically detected as spam.  If the above doesn't work or the reset email isn't received within a day, contact support at helpdesk.open-bio.org to receive your new password.

Also, note that Redmine has a different syntax for those who want to add links to their reports; see http://www.redmine.org/projects/redmine/wiki/RedmineTextFormatting.

Let us know if you have any questions.  

chris

Christopher Fields
IGB Postdoctoral Fellow
Genomics of Neural & Behavioral Plasticity
University of Illinois Urbana-Champaign
Institute for Genomic Biology
1206 W. Gregory Dr. , MC-195
Urbana, IL 61801

From rmb32 at cornell.edu  Fri Mar 18 15:23:37 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Fri, 18 Mar 2011 15:23:37 -0400
Subject: [Biopython] Google Summer of Code is *ON* for OBF projects!
Message-ID: <4D83B139.4010803@cornell.edu>

Hi all,

Great news: Google announced today that the Open Bioinformatics
Foundation has been accepted as a mentoring organization for this
summer's Google Summer of Code!

GSoC is a Google-sponsored student internship program for open-source
projects, open to students from around the world (not just US
residents).   Students are paid a $5000 USD stipend to work as a
developer on an open-source project for the summer. For more on GSoC,
see GSoC 2011 FAQ at http://bit.ly/hpoz8W

Student applications are due April 8, 2011 at 19:00 UTC.  Students who
are interested in participating should look at the OBF's GSoC page at
http://open-bio.org/wiki/Google_Summer_of_Code, which lists project
ideas, and whom to contact about applying.

For current developers on OBF projects, please consider volunteering to
be a mentor if you have not already, and contribute project ideas.  Just
list your name and project ideas on OBF wiki and on the relevant
project's GSoC wiki page.

Thanks to all who helped make OBF's application to GSoC a success, and
let's have a great, productive summer of code!

Rob Buels
OBF GSoC 2011 Administrator


From laserson at mit.edu  Mon Mar 21 19:38:10 2011
From: laserson at mit.edu (Uri Laserson)
Date: Mon, 21 Mar 2011 19:38:10 -0400
Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC
	formats?
Message-ID: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>

If I load a GenBank-formatted record:

    a = next(SeqIO.parse('myfile.gb', 'gb'))

then set an annotation:

    a.annotations['myannotation'] = 'saveme'

and then format the SeqRecord object as GenBank:

    a.format('gb')

then 'myannotation' is lost.

Is this expected behavior?  If so, that's a huge bummer...what is the
suggested method to store my own annotations in INSDC formats?

Uri


...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu

From p.j.a.cock at googlemail.com  Tue Mar 22 05:22:17 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Mar 2011 09:22:17 +0000
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
Message-ID: <AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>

On Mon, Mar 21, 2011 at 11:38 PM, Uri Laserson <laserson at mit.edu> wrote:
> If I load a GenBank-formatted record:
>
>    a = next(SeqIO.parse('myfile.gb', 'gb'))
>
> then set an annotation:
>
>    a.annotations['myannotation'] = 'saveme'
>
> and then format the SeqRecord object as GenBank:
>
>    a.format('gb')
>
> then 'myannotation' is lost.

It isn't 'lost' in that it is still in your SeqRecord object in
memory, but it isn't in the GenBank format output.

> Is this expected behavior?

Yes, there is no general field for record level annotation in the
GenBank or EMBL file formats. Where did you expect it to be
written? The same thing would happen with most file formats,
e.g. FASTA has no annotation support at all beyond the free
text description line.

> If so, that's a huge bummer...what is the suggested method to
> store my own annotations in INSDC formats?

You could stuff record level information into a source feature's
qualifier dictionary. It isn't elegant, but it would work. The NCBI
seems to have introduced the source feature primarily to use
this to store the taxon identifier and other little bits of information
not handled explicitly in the header lines. (Plus this can handle
chimeras which may have been a use case).

Peter


From laserson at mit.edu  Tue Mar 22 11:08:08 2011
From: laserson at mit.edu (Uri Laserson)
Date: Tue, 22 Mar 2011 11:08:08 -0400
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
	<AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
Message-ID: <AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>

>
> You could stuff record level information into a source feature's
> qualifier dictionary.


What are the allowed types for the values of the qualifiers dictionary (that
will be output correctly in INSDC)?  Is it possible to have lists of
strings?

What is the standard practice: a feature of type "source" that runs the
entire length of the sequence?  Or is it possible to have a SeqFeature with
no position annotation?  Ideally, if I slice the SeqFeature, I would like
these annotations to stay with the slice no matter what.

Uri

From p.j.a.cock at googlemail.com  Tue Mar 22 11:30:46 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Mar 2011 15:30:46 +0000
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
	<AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
	<AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>
Message-ID: <AANLkTimk+twNBrFtYLz-4SQCZEdNBHkbnmok=aPp54=+@mail.gmail.com>

On Tue, Mar 22, 2011 at 3:08 PM, Uri Laserson <laserson at mit.edu> wrote:
>> You could stuff record level information into a source feature's
>> qualifier dictionary.
>
> What are the allowed types for the values of the qualifiers dictionary
> (that will be output correctly in INSDC)?  Is it possible to have lists of
> strings?

As far as the current Biopython output goes, you can basically use any
(short) string as a qualifier key. Avoid keys with spaces in them (INSDC
use underscores) and other funny characters. For strict INSDC compliance
there is probably a white list of allowed feature types...

> What is the standard practice: a feature of type "source" that runs the
> entire length of the sequence?  Or is it possible to have a SeqFeature with
> no position annotation?  Ideally, if I slice the SeqFeature, I would like
> these annotations to stay with the slice no matter what.

If you did have a SeqFeature without a location, we couldn't write
it out in GenBank/EMBL format (the error handling here might be
improved).

If you have a SeqRecord with a (source) feature spanning the full
sequence, and you slice the SeqRecord to take a subsequence,
then that full length feature (and any other features not fully within
the subsequence) would be lost.

Using a source feature is really just a work around for the fact that
GenBank/EMBL do not support arbitrary record level annotation.
Do you have to use this as your output format? Would you not be
better off with using a database or something else instead?

Peter


From laserson at mit.edu  Tue Mar 22 11:44:02 2011
From: laserson at mit.edu (Uri Laserson)
Date: Tue, 22 Mar 2011 11:44:02 -0400
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTimk+twNBrFtYLz-4SQCZEdNBHkbnmok=aPp54=+@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
	<AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
	<AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>
	<AANLkTimk+twNBrFtYLz-4SQCZEdNBHkbnmok=aPp54=+@mail.gmail.com>
Message-ID: <AANLkTik0xCVH8NaUJEbjKa6MbTjiPbN+KeR0KsFVZBb-@mail.gmail.com>

>
> As far as the current Biopython output goes, you can basically use any
> (short) string as a qualifier key.
>

Sorry, I meant for the values, not the keys.  Can you have a list of strings
as a value?


> Using a source feature is really just a work around for the fact that
> GenBank/EMBL do not support arbitrary record level annotation.
> Do you have to use this as your output format?


Agreed.  Essentially, I have a huge pile of sequencing reads that are highly
annotated.  For any given read, there are some annotations that are
independent of the sequence itself (which is what I am trying to implement
now) and there are some annotations that are associated with subsequences
(which is why SeqFeatures are very appropriate).  Ideally, I want a file
format that will store the data, be easily parsable (and fast), and can be
readable using something like `less` (though this last feature is less
important).


> Would you not be
> better off with using a database or something else instead?
>

Well, initially I used XML to store the data, but I quickly realized I was
reinventing the wheel, especially when it came to annotating features on top
of the sequences.

Are you suggesting something like SQLite?  How would I deal with
SeqFeature-type annotations?

Uri


>  Peter
>

From p.j.a.cock at googlemail.com  Tue Mar 22 12:14:05 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Mar 2011 16:14:05 +0000
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTik0xCVH8NaUJEbjKa6MbTjiPbN+KeR0KsFVZBb-@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
	<AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
	<AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>
	<AANLkTimk+twNBrFtYLz-4SQCZEdNBHkbnmok=aPp54=+@mail.gmail.com>
	<AANLkTik0xCVH8NaUJEbjKa6MbTjiPbN+KeR0KsFVZBb-@mail.gmail.com>
Message-ID: <AANLkTikxn+T4449LLFY0+bHFawYUty1VgsqYUKkj9chz@mail.gmail.com>

On Tue, Mar 22, 2011 at 3:44 PM, Uri Laserson <laserson at mit.edu> wrote:
>> As far as the current Biopython output goes, you can basically use any
>> (short) string as a qualifier key.
>
> Sorry, I meant for the values, not the keys.  Can you have a list of strings
> as a value?

Right. Again yes, plus I think a single string as the value should work.
This is because the INSDC feature table allows multiple values for a
tag - for example you often get multiple database cross references.
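
Since a qualifier value may be a single string or a list of strings, one
option is to normalise everything to a list of strings before copying
record-level annotations into a feature's qualifiers. The helpers below are
a minimal sketch under that assumption; their names and the example keys
are hypothetical.

```python
def to_qualifier_values(value):
    """Return value as a list of strings."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    # Anything else (numbers, etc.) becomes a one-item list of its text form.
    return [str(value)]


def annotations_to_qualifiers(annotations):
    """Convert an annotations-style dict into a qualifiers-style dict."""
    return dict((key, to_qualifier_values(val))
                for key, val in annotations.items())


qualifiers = annotations_to_qualifiers(
    {"myannotation": "saveme", "barcodes": ["A1", "B2"], "read_count": 3}
)
```

Non-string values lose their type in this conversion, so anything read back
from a file written this way would need converting again by the caller.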

>> Using a source feature is really just a work around for the fact that
>> GenBank/EMBL do not support arbitrary record level annotation.
>> Do you have to use this as your output format?
>
> Agreed.  Essentially, I have a huge pile of sequencing reads that are highly
> annotated.  For any given read, there are some annotations that are
> independent of the sequence itself (which is what I am trying to implement
> now) and there are some annotations that are associated with subsequences
> (which is why SeqFeatures are very appropriate).  Ideally, I want a file
> format that will store the data, be easily parsable (and fast), and can be
> readable using something like `less` (though this last feature is less
> important).

For this the GenBank/EMBL format with the source feature trick
does sound workable. You just need to be careful how and
when you create the dummy source feature - I'd do it at the last
moment before writing out the file, and in that way you can avoid
things like slicing throwing it away.

>> Would you not be
>> better off with using a database or something else instead?
>
> Well, initially I used XML to store the data, but I quickly realized I was
> reinventing the wheel, especially when it came to annotating features
> on top of the sequences.

I wonder if one of the INSDC XML formats would work nicely here?
i.e. If they can be extended more easily. We should look at adding a
parser for them to Biopython (and write support too ideally of course).

> Are you suggesting something like SQLite?  How would I deal with
> SeqFeature-type annotations?

I was thinking you could use the BioSQL schema (run on SQLite if
you wanted to, or MySQL or PostgreSQL etc). You'd still face the
same issues if/when you wanted to dump the annotated records
to a plain text file though.

Peter


From laserson at mit.edu  Tue Mar 22 12:58:03 2011
From: laserson at mit.edu (Uri Laserson)
Date: Tue, 22 Mar 2011 12:58:03 -0400
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTikxn+T4449LLFY0+bHFawYUty1VgsqYUKkj9chz@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
	<AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
	<AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>
	<AANLkTimk+twNBrFtYLz-4SQCZEdNBHkbnmok=aPp54=+@mail.gmail.com>
	<AANLkTik0xCVH8NaUJEbjKa6MbTjiPbN+KeR0KsFVZBb-@mail.gmail.com>
	<AANLkTikxn+T4449LLFY0+bHFawYUty1VgsqYUKkj9chz@mail.gmail.com>
Message-ID: <AANLkTikh+vjPbVoUw56R5VQuN6Zgg5WVDyf_i36--+Gr@mail.gmail.com>

>
> For this the GenBank/EMBL format with the source feature trick
> does sound workable. You just need to be careful how and
> when you create the dummy source feature - I'd do it at the last
> moment before writing out the file, and in that way you can avoid
> things like slicing throwing it away.
>
>
That's a good idea.  This should be even easier since I am subclassing
SeqRecord.  I can override `format` to first take the whole annotations
dictionary and dump it into the qualifiers dictionary of a `source` feature.
I also have my own parser which wraps SeqIO; using SeqIO to parse the
'imgt' format, I can then copy the `source` qualifiers to the annotations
dictionary and delete the `source` feature entirely.  Does this sound
reasonable?


> I wonder if one of the INSDC XML formats would work nicely here?
> i.e. If they can be extended more easily. We should look at adding a
> parser for them to Biopython (and write support too ideally of course).
>

My only issue with this is that I'd rather not extend anyone's file format,
but use a standard file format that fits my purpose.  Otherwise, I might as
well just go straight for a database, as below.  (But there are some
super-fast XML parsers out there.)


> I was thinking you could use the BioSQL schema (run on SQLite if
> you wanted to, or MySQL or PostgreSQL etc). You'd still face the
> same issues if/when you wanted to dump the annotated records
> to a plain text file though.
>

I suppose plain text readability is less important to me than ease of
sharing the data.  But when I dump a SeqRecord object to a BioSQL database,
does it do it in a way that I can rebuild that object exactly with no loss
of information? (I.e., does it solve the annotation dictionary problem that
started this whole thread?)

Uri

From p.j.a.cock at googlemail.com  Tue Mar 22 13:24:46 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Mar 2011 17:24:46 +0000
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTikh+vjPbVoUw56R5VQuN6Zgg5WVDyf_i36--+Gr@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
	<AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
	<AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>
	<AANLkTimk+twNBrFtYLz-4SQCZEdNBHkbnmok=aPp54=+@mail.gmail.com>
	<AANLkTik0xCVH8NaUJEbjKa6MbTjiPbN+KeR0KsFVZBb-@mail.gmail.com>
	<AANLkTikxn+T4449LLFY0+bHFawYUty1VgsqYUKkj9chz@mail.gmail.com>
	<AANLkTikh+vjPbVoUw56R5VQuN6Zgg5WVDyf_i36--+Gr@mail.gmail.com>
Message-ID: <AANLkTi=1=-j7mgZBp1MfGguKE2F4wpfjZn5=mZx+31ML@mail.gmail.com>

On Tue, Mar 22, 2011 at 4:58 PM, Uri Laserson <laserson at mit.edu> wrote:
>> For this the GenBank/EMBL format with the source feature trick
>> does sound workable. You just need to be careful how and
>> when you create the dummy source feature - I'd do it at the last
>> moment before writing out the file, and in that way you can avoid
>> things like slicing throwing it away.
>
> That's a good idea.  This should be even easier since I am subclassing
> SeqRecord.  I can override `format` to first take the whole annotations
> dictionary and dump it into the qualifiers dictionary of a `source` feature.
> I also have my own parser which wraps SeqIO; using SeqIO to parse the
> 'imgt' format, I can then copy the `source` qualifiers to the annotations
> dictionary and delete the `source` feature entirely.  Does this sound
> reasonable?

Yes, using your own parser/writer to take care of mapping between
the SeqRecord annotations dictionary and a dummy feature sounds
sensible. Also using 'imgt' rather than GenBank or EMBL will let you
have longer feature qualifier keys - but these files are not as widely
used/supported as the GenBank and EMBL formats.

>> I wonder if one of the INSDC XML formats would work nicely here?
>> i.e. If they can be extended more easily. We should look at adding a
>> parser for them to Biopython (and write support too ideally of course).
>
> My only issue with this is that I'd rather not extend anyone's file format,
> but use a standard file format that fits my purpose.  Otherwise, I might as
> well just go straight for a database, as below.  (But there are some
> super-fast XML parsers out there.)

I haven't looked at the details to see if those XML file formats have
a nice open ended misc annotation tag you could just use.

>> I was thinking you could use the BioSQL schema (run on SQLite if
>> you wanted to, or MySQL or PostgreSQL etc). You'd still face the
>> same issues if/when you wanted to dump the annotated records
>> to a plain text file though.
>
> I suppose plain text readability is less important to me than ease of
> sharing the data. ?But when I dump a SeqRecord object to a BioSQL
> database, does it do it in a way that I can rebuild that object exactly
> with no loss of information? (I.e., does it solve the annotation dictionary
> problem that started this whole thread?)

Basically yes, subject to a few provisos, it should. Firstly note we
don't support any per-letter-annotation in BioSQL. Secondly, all
the SeqRecord annotations and SeqFeature qualifiers will end up being
stored as strings (in table bioentry_qualifier_value and table
seqfeature_qualifier_value respectively). There may also be some
fun with string values vs single entry lists containing one string.
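
One way to live with the string and single-entry-list provisos is to
compare annotations only after normalising both sides. This is just a
sketch using plain dictionaries; the function names are hypothetical and
nothing here talks to a BioSQL database.

```python
def normalise(value):
    """Return value as a list of strings, the form it may take after a round trip."""
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [str(value)]


def same_annotations(before, after):
    """True if two annotation dicts agree once every value is normalised."""
    if set(before) != set(after):
        return False
    return all(normalise(before[key]) == normalise(after[key]) for key in before)


original = {"organism": "E. coli", "length": 42}
reloaded = {"organism": ["E. coli"], "length": "42"}
unchanged = same_annotations(original, reloaded)
```

With this comparison a string and a one-element list holding the same
string count as equal, which is usually what is wanted when checking a
round trip.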

Peter


From gori at cs.ru.nl  Wed Mar 23 13:43:16 2011
From: gori at cs.ru.nl (Fabio Gori)
Date: Wed, 23 Mar 2011 18:43:16 +0100
Subject: [Biopython] From genome to lineage with Entrez
Message-ID: <201103231843.16762.gori@cs.ru.nl>

Hi all,

I have downloaded all the bacterial genomes 
(ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz) and I want to compare 
their taxonomic lineages.

I'm looking for a way to get their lineages with Entrez. From the files I can 
get the accession numbers and GIs, but I don't know how to get their taxonomic 
ids.
I know that I can step from GIs to Taxids processing the file 
gi_taxid_nucl.dmp, but I'd prefer to use Entrez. 


Thanks in advance,

Fabio

-- 

F. Gori, PhD student
Intelligent Systems
ICIS (Institute for Computing and Information Sciences)
Radboud University Nijmegen

Home Page: http://www.cs.ru.nl/~gori/

From p.j.a.cock at googlemail.com  Wed Mar 23 14:01:32 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 23 Mar 2011 18:01:32 +0000
Subject: [Biopython] From genome to lineage with Entrez
In-Reply-To: <201103231843.16762.gori@cs.ru.nl>
References: <201103231843.16762.gori@cs.ru.nl>
Message-ID: <AANLkTinwwDABAtq4bweFZVM4gkQq=hx1Q6fcJO2BSbrs@mail.gmail.com>

On Wed, Mar 23, 2011 at 5:43 PM, Fabio Gori <gori at cs.ru.nl> wrote:
> Hi all,
>
> I have downloaded all the bacterial genomes
> (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz) and I want to compare
> their taxonomic lineages.
>
> I'm looking for a way to get their lineages with Entrez. From the files I can
> get the accession numbers and GIs, but I don't know how to get their taxonomic
> ids.
> I know that I can step from GIs to Taxids processing the file
> gi_taxid_nucl.dmp, but I'd prefer to use Entrez.
>

I think you can do it with ELink, but personally I'd use the taxid dump file,
since it sounds like you'll want to process hundreds of lineages.
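
A rough sketch of the dump file route, assuming each line of
gi_taxid_nucl.dmp holds a GI and a taxid separated by a tab. The function
name is made up, and only the GIs you ask for are kept in memory.

```python
def load_gi_to_taxid(lines, wanted_gis):
    """Map each wanted GI (as a string) to its taxid, scanning the dump once."""
    wanted = {str(gi) for gi in wanted_gis}
    mapping = {}
    for line in lines:
        # Assumed layout: "<gi><TAB><taxid>" on every line.
        gi, _, taxid = line.rstrip("\n").partition("\t")
        if gi in wanted:
            mapping[gi] = taxid
            if len(mapping) == len(wanted):
                break  # found everything, no need to read the rest
    return mapping


example_lines = ["2\t9913\n", "3\t9913\n", "5\t9606\n"]
lookup = load_gi_to_taxid(example_lines, [3, 5])
```

For the real file, pass an open handle instead of the example list, e.g.
load_gi_to_taxid(open("gi_taxid_nucl.dmp"), my_gis).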

Peter

From amenity at enthought.com  Wed Mar 23 23:29:35 2011
From: amenity at enthought.com (Amenity Applewhite)
Date: Wed, 23 Mar 2011 22:29:35 -0500
Subject: [Biopython] SciPy 2011 Call for Papers
Message-ID: <AANLkTinNAz1hGt6sDC37WTJceFaiKaTitfHsmziuBRn4@mail.gmail.com>

Hello,

SciPy 2011 <http://conference.scipy.org/scipy2011/index.php>, the 10th
Python in Science conference, will be held July 11 - 16, 2011, in Austin,
TX.

At this conference, novel applications and breakthroughs made in the pursuit
of science using Python are presented. Attended by leading figures from both
academia and industry, it is an excellent opportunity to experience the
cutting edge of scientific software development.

The conference is preceded by two days of tutorials, during which community
experts provide training on several scientific Python packages.

*We'd like to invite you to consider presenting at SciPy 2011.*

The list of topics that are appropriate for the conference includes (but is
not limited to):
     * new Python libraries for science and engineering;
     * applications of Python to the solution of scientific or computational
       problems;
     * high performance, parallel and GPU computing with Python;
     * use of Python in science education.

*Specialized Tracks*
This year we also have two specialized tracks. They will be run concurrent
to the main conference.

         *Python in Data Science
         Chair: Peter Wang, Streamitive, Inc.*
   This track focuses on the advantages and challenges of applying Python in
   the emerging field of "data science".  This includes a breadth of
   technologies, from wrangling realtime data streams from the social web, to
   machine learning and semantic analysis, to workflow and repository
   management for large datasets.

         *Python and Core Technologies
         Chair: Anthony Scopatz, Enthought, Inc.*
   In an effort to broaden the scope of SciPy and to engage the larger
   community of software developers, we are pleased to introduce the _Python &
   Core Technologies_ track. Talks will cover subjects that are not directly
   related to science and engineering, yet nonetheless affect scientific
   computing. Proposals on the Python language, visualization toolkits, web
   frameworks, education, and other topics are appropriate for this session.

*Talk/Paper Submission*

   We invite you to take part by submitting a talk abstract on the conference
   website at:
   http://conference.scipy.org/scipy2011/papers.php
   Papers are included in the peer-reviewed conference proceedings, to be
   published online.

*Important dates for authors:*
   Friday, April 15: Tutorial proposals due (remember: stipends will be
   provided for Tutorial instructors)
   http://conference.scipy.org/scipy2011/tutorials.php
   Sunday, April 24: Paper abstracts due
   Sunday, May 8: Student sponsorship request due
   http://conference.scipy.org/scipy2011/student.php
   Tuesday, May 10: Accepted talks announced
   Monday, May 16: Student sponsorships announced
   Monday, May 23: Early Registration ends
   Sunday, June 20: Papers due
   Monday-Tuesday, July 11 - 12: Tutorials
   Wednesday-Thursday, July 13 - July 14: Conference
   Friday-Saturday, July 15 - July 16: Sprints


   The SciPy 2011 Team

  @SciPy2011
  http://twitter.com/SciPy2011

_________________________
Amenity Applewhite
Enthought, Inc. <http://www.enthought.com/>
Scientific Computing Solutions

From michele.silva at gmail.com  Fri Mar 25 02:11:41 2011
From: michele.silva at gmail.com (Michele)
Date: Fri, 25 Mar 2011 03:11:41 -0300
Subject: [Biopython] [GSoC] Proposal: Mocapy++Biopython
In-Reply-To: <AANLkTi=s=74jsMu4LP2RqnXeq28taun+SP1efQjnY8ts@mail.gmail.com>
References: <AANLkTi=s=74jsMu4LP2RqnXeq28taun+SP1efQjnY8ts@mail.gmail.com>
Message-ID: <AANLkTi=oHxSoNzqq1N2auZnCxoanDoXc3RO=QXwhn8vn@mail.gmail.com>

Hello everyone,

I'm Michele, a computer scientist and passionate developer who is
currently enrolled in a biomedicine course. That's why I got in touch with
the biopython project and have tried its tools for biological computation.

When I read the Mocapy++Biopython proposal I immediately fell in love with
it. Let me tell you why. I have worked since 2005 with Bayesian networks,
modelling BN for medical learning environments and also programming
algorithms for handling those nets. In the context of my masters in computer
science with the Artificial Intelligence
Group <http://www.inf.ufrgs.br/gia/>, we have published several papers
on the idea of using Bayesian networks to
model the uncertainty associated with the students' behavior in learning
environments (see, for example, Designing a Bayesian Network based Student
Model for Distance Learning
Environments <http://ieeexplore.ieee.org/Xplore/login.jsp?url=http://ieeexplore.ieee.org/iel5/4280926/4280927/04281040.pdf%3Farnumber%3D4281040&authDecision=-203>, published
at the Seventh IEEE International Conference on Advanced Learning
Technologies, 2007).

As for the C++ and Python glue, I also have enjoyed the project's proposal.
I have been programming in C++ for more than 5 years, in small and big
projects, mainly in microelectronics CAD and firmware
development. Coincidentally, last year I started working with Python in
bigger projects. I worked for ESSS, a company which develops software for
scientific computing and engineering simulation. I worked with oil reservoir
simulation, where the applications were developed in Python and the
simulation core and the computer graphics algorithms were programmed in C++.
If you want to have a feeling on what reservoir simulation and the
applications I worked in look like, have a look at the Kraken's project
website <https://www.esss.com.br/kraken/>. I worked in both Python and C++
development, as well as in the glue through the use of boost python.

Regarding the experience in biomolecular structure, I'm a beginner. I have
started studying biomedicine this year and therefore have a lot to learn. I
know a bit about the PDB format and molecular biology. I'm sure I can count
on your help to continue learning.

So that was my not-so-short presentation. I would love to get to know the
community better and work together on the GSoC. Please let me know if you
think I could write a proposal and if you can help me with that.

Cheers,

Michele Silva

http://www.linkedin.com/pub/michele-silva/6/520/5b0

From p.j.a.cock at googlemail.com  Fri Mar 25 03:37:00 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 25 Mar 2011 07:37:00 +0000
Subject: [Biopython] Public example FASTQ files (for Tutorial examples)?
Message-ID: <AANLkTi=-158GrKViXYrdUSC1dmrO2FXHCfrtDmYiKK+T@mail.gmail.com>

Hi all,

One of the volunteers proof reading the Biopython tutorial
noticed our links to specific example FASTQ files at the NCBI
SRA don't work any more. They have withdrawn them from
the FTP site, although you can still download the files in
the compressed *.sra format and in theory convert them
to FASTQ locally with the NCBI's toolkit (which is cross
platform).

Another option is to download the FASTQ files via the
NCBI's web interface. Unless there is an obvious way to
do this with a URL that I missed initially, we have a
complicated situation to describe where the user can
choose all the reads for an experiment or just the filtered
set, and also choose to have them pre-trimmed or not.
Plus for me at least, the HTTP download wasn't as
robust as the FTP one was.

I'm hoping someone could suggest a couple of other
moderately sized FASTQ files which are public, on
FTP or a static HTML server, which we can use in
the tutorial.

So, suggestions?

Thanks!

Peter

From brettpthomas at gmail.com  Tue Mar 29 10:50:38 2011
From: brettpthomas at gmail.com (Brett Thomas)
Date: Tue, 29 Mar 2011 10:50:38 -0400
Subject: [Biopython] VCF files
In-Reply-To: <AANLkTingEonqxANk_M871ig_iqHQs685hi4pMhnvjVfA@mail.gmail.com>
References: <AANLkTingEonqxANk_M871ig_iqHQs685hi4pMhnvjVfA@mail.gmail.com>
Message-ID: <AANLkTim4g0+V6apSU5w7vX=BwtGuAeKMSycMG+BKOggb@mail.gmail.com>

Hi all,

I write software for genetic research, and the predominant file format we
use is VCF, a new file format used to represent genetic variation in the
1000 genomes project.

Has there been any discussion of a biopython api for vcf files? I'd be happy
to help if anybody is working on it.

Thanks,
Brett

From jamesrwagner at gmail.com  Tue Mar 29 13:55:56 2011
From: jamesrwagner at gmail.com (James Wagner)
Date: Tue, 29 Mar 2011 13:55:56 -0400
Subject: [Biopython] getting multiple BLAST (NCBIWWW) queries to work
Message-ID: <AANLkTimVdtVpNSzdxjqJ_9fRCGSvB0Hnge_eDh=MAWwF@mail.gmail.com>

Hello:

I was trying just as a proof of concept to do an NCBI WWW BLAST query
with a FASTA file containing more than one sequence (but still a small
number of sequences).

I tried with the opuntia.fasta file from the website, and set it up as follows:

result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia.fasta","r"))
blast_records = NCBIXML.parse(result_handle)

then I try:

for record in blast_records:
      print record.alignments

and I obtain:
[]


Surely at the very least since there were 7 sequences in this file, I
should get 7 empty lists, assuming of course none of the sequences
gives a hit in nr, which I am sure is not the case either?

What is still missing? I realize I could use SeqIO.parse to obtain
each sequence from the FASTA file and do a separate qblast, but surely
doing this separately for each protein would create unnecessary
overhead with the network traffic compared to somehow sending off all
the protein queries at once?

From p.j.a.cock at googlemail.com  Tue Mar 29 14:07:47 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 29 Mar 2011 19:07:47 +0100
Subject: [Biopython] getting multiple BLAST (NCBIWWW) queries to work
In-Reply-To: <AANLkTimVdtVpNSzdxjqJ_9fRCGSvB0Hnge_eDh=MAWwF@mail.gmail.com>
References: <AANLkTimVdtVpNSzdxjqJ_9fRCGSvB0Hnge_eDh=MAWwF@mail.gmail.com>
Message-ID: <AANLkTi=jwumygS0UCB3pKSzq0x_ivhk26JRBK-4Odgcf@mail.gmail.com>

On Tue, Mar 29, 2011 at 6:55 PM, James Wagner <jamesrwagner at gmail.com> wrote:
> Hello:
>
> I was trying just as a proof of concept to do an NCBI WWW BLAST query
> with a FASTA file containing more than one sequence (but still a small
> number of sequences).
>
> I tried with the opuntia.fasta file from the website, and set it up as follows:
>
> result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia.fasta","r"))
> blast_records = NCBIXML.parse(result_handle)
>
> then I try:
>
> for record in blast_records:
>       print record.alignments
>
> and I obtain:
> []
>
>
> Surely at the very least since there were 7 sequences in this file, I
> should get 7 empty lists, assuming of course none of the sequences
> gives a hit in nr, which I am sure is not the case either?

Not necessarily, the NCBI may have fixed this but for a long time if
you had say 7 queries but only 2 gave hits, stand alone BLAST's
XML output would only contain those 2 hits. There would be nothing
at all from the 5 hitless queries. This was/is very annoying, but
right now I'm not sure if they have fixed this or not.

Try getting back the results as plain text and manually inspect them.
In the plain text output all the queries appear, and there is a clear
"no hits found" message.
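
If the XML route is kept, a small check can at least flag which queries
went missing. This sketch assumes you have the query IDs from the FASTA
file and the query title of each parsed result, and that each title
starts with the query ID; both the function name and that assumption are
mine.

```python
def queries_without_results(query_ids, result_titles):
    """Return the query IDs that never show up among the result titles."""
    # Compare on the first word only, since a title may carry a description.
    seen = {title.split(None, 1)[0] for title in result_titles if title.strip()}
    return [query_id for query_id in query_ids if query_id not in seen]


missing = queries_without_results(
    ["q1", "q2", "q3"],
    ["q1 some description", "q3"],
)
```

Any ID reported here either gave no hits or was dropped for some other
reason, so the plain text output is still the place to confirm which.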

> What is still missing? I realize I could use SeqIO.parse to obtain
> each sequence from the FASTA file and do a separate qblast, but surely
> doing this separately for each protein would create unnecessary
> overhead with the network traffic compared to somehow sending off all
> the protein queries at once?

Yes, in theory a single large query should have less overhead
than individual queries. Personally I'd just use standalone BLAST
and run it locally if I had more than a few queries.

Peter


From jamesrwagner at gmail.com  Tue Mar 29 16:43:35 2011
From: jamesrwagner at gmail.com (James Wagner)
Date: Tue, 29 Mar 2011 16:43:35 -0400
Subject: [Biopython] getting multiple BLAST (NCBIWWW) queries to work
In-Reply-To: <AANLkTi=jwumygS0UCB3pKSzq0x_ivhk26JRBK-4Odgcf@mail.gmail.com>
References: <AANLkTimVdtVpNSzdxjqJ_9fRCGSvB0Hnge_eDh=MAWwF@mail.gmail.com>
	<AANLkTi=jwumygS0UCB3pKSzq0x_ivhk26JRBK-4Odgcf@mail.gmail.com>
Message-ID: <AANLkTinqaqE8PNKuyr=LNLuRW1YtARDuXT3WJ6qW-eKE@mail.gmail.com>

OK, when I create a .fasta file with just the first sequence in
opuntia, I get no hits. However, when I copy and paste the bare
nucleotide sequence into the call, I get 50 hits!  This is consistent
with what happens when I paste the first opuntia sequence into the NCBI
BLAST web interface, though there I obtain 110 hits for intronic
sequences in Opuntia chloroplast and chloroplasts. As a secondary
point, I also find it curious that the result with NCBIWWW is limited
to 50 hits (I thought it was 500 by default). But what is more
problematic is the fact that I get no hits when using a FASTA file
with only a single sequence, when clearly there are some very high
homology hits present in nr.

This is my code from beginning to end, where the file opuntia1.fasta
is a file containing only the 1st sequence from opuntia.fasta, and
when using the line for opuntia1.fasta it resulted in no hits. I am
using BioPython 1.5.3 and Python 2.6 on Ubuntu if this has any effect
on the results. I also tried it by obtaining a single sequence from
SeqIO.parse and then obtaining the Seq of this sequence, and it also
gave 50 hits. So it's basically just with using a FASTA file handle
that I can't get it to work.

#!/usr/bin/python
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
result_handle = NCBIWWW.qblast("blastn", "nr",
"TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAATCTAAATGATATAGGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATAAAGCATGAATACAGATTCACACATAATTATCTGATATGAATCTATTCATAGAAAAAAGAAAAAAGTAAGAGCCTCCGGCCAATAAAGACTAAGAGGGTTGGCTCAAGAACAAAGTTCATTAAGAGCTCCATTGTAGAATTCAGA\CCTAATCATTAATCAAGAAGCGATGGGAACGATGTAATCCATGAATACAGAAGATTCAATTGAAAAAGATCCTATGNTCATTGGAAGGATGGCGGAACGAACCAGAGACCAATTCATCTATTCTGAAAAGTGATAAACTAATCCTATAAAACTAAAATAGATATTGAAAGAGTAAATATTCGCCCGCGAAAATTCCTTTTTTATTAAATTGCTCATATTTTCTTTTAGCAATGCAATCTAATAAAATATATCTATACAAAAAAACATAGACAAACTATATATATATATATATATAATATATTTCAAATTCCCTTATATATCCAAATATAAAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGTATTATTAAATGTATATATTAATTCAATATTATTATTCTATTCATTTTTATTCATTTTCAAATTTATAATATATTAATCTATATATTAATTTAGAATTCTATTCTAATTCGAATTCAATTTTTAAATATTCATATTCAATTAAAATTGAAATTTTTTCATTCGCGAGGAGCCGGATGAGAAGAAACTCTCATGTCCGGTTCTGTAGTAGAGATGGAATTAAGAAAAAACCATCAACTATAACCCCAAAAGAACCAGA")

#result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia1.fasta", "r"))
blast_record = NCBIXML.read(result_handle)

for description in blast_record.descriptions:
    print description

#end of code.
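
One likely explanation, not confirmed anywhere in this thread: qblast
expects the query as a string (a bare sequence, FASTA text, or an
identifier), so an open file handle would not be read for you. A sketch
of passing the file contents instead; the two helper names are made up,
and the qblast call itself needs Biopython and network access, so it is
wrapped in a function that is not run here.

```python
def read_fasta_text(path):
    """Return the whole FASTA file as one string, headers included."""
    with open(path) as handle:
        return handle.read()


def qblast_fasta_file(path, program="blastn", database="nr"):
    """Submit the file's contents, not the handle, to the NCBI.

    Needs Biopython and network access, so it is only defined here.
    """
    from Bio.Blast import NCBIWWW
    return NCBIWWW.qblast(program, database, read_fasta_text(path))
```

With this, qblast_fasta_file("opuntia1.fasta") would send the same text
that pasting the record into the web form sends.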




From rmb32 at cornell.edu  Tue Mar 29 17:20:41 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Tue, 29 Mar 2011 14:20:41 -0700
Subject: [Biopython] Announcing OBF Summer of Code - please forward!
Message-ID: <4D924D29.3020707@cornell.edu>

Hi all,

Here's an advertising-ready announcement for OBF's Summer of Code, 
thanks to Christian Zmasek and Hilmar Lapp for their excellent writing.

Student applications are due April 8!  Please spread it widely, we need 
to reach lots of students with it!

Rob Buels
OBF GSoC 2011 Admin


============================================================

*** Please disseminate widely at your local institutions ***
*** including posting to message and job boards, so that ***
*** we reach as many students as possible.               ***

============================================================


OPEN BIOINFORMATICS FOUNDATION SUMMER OF CODE 2011

Applications due 19:00 UTC, April 8, 2011.
http://www.open-bio.org/wiki/Google_Summer_of_Code

The Open Bioinformatics Foundation Summer of Code program provides a 
unique opportunity for undergraduate, masters, and PhD students to 
obtain hands-on experience writing and extending open-source software 
for bioinformatics under the mentorship of experienced developers from 
around the world. The program is the participation of the Open 
Bioinformatics Foundation (OBF) as a mentoring organization in the 
Google Summer of Code(tm) (http://code.google.com/soc/).

Students successfully completing the 3 month program receive a $5,000 
USD stipend, and may work entirely from their home or home institution. 
  Participation is open to students from any country in the world except 
countries subject to US trade restrictions.  Each student will have at 
least one dedicated mentor to show them the ropes and help them complete 
their project.

The Open Bioinformatics Foundation is particularly seeking students 
interested in both bioinformatics (computational biology) and software 
development. Some initial project ideas are listed on the website. These 
range from Galaxy phylogenetics pipeline development in Biopython to 
lightweight sequence objects and lazy parsing in BioPerl, a DAS Server 
for large files on local filesystems, and mapping Java libraries to 
Perl/Ruby/Python using Biolib+SWIG+JNI.  All project ideas are flexible 
and many can be adjusted in scope to match the skills of the student. We 
also welcome and encourage students proposing their own project ideas; 
historically some of the most successful Summer of Code projects are 
ones proposed by the students themselves.

TO APPLY: Apply online at the Google Summer of Code website 
(http://socghop.appspot.com/), where you will also find GSoC program 
rules and eligibility requirements. The 12-day application period for 
students runs from Monday, March 28 through Friday, April 8th, 2011.

INQUIRIES:

We strongly encourage all interested students to get in touch with us 
with their ideas as early on as possible.  See the OBF GSoC page for 
contact details.

2011 OBF Summer of Code:
http://www.open-bio.org/wiki/Google_Summer_of_Code

Google Summer of Code FAQ:
http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs


From albert.bogdanowicz at gmail.com  Thu Mar 31 13:01:45 2011
From: albert.bogdanowicz at gmail.com (Albert Bogdanowicz)
Date: Thu, 31 Mar 2011 19:01:45 +0200
Subject: [Biopython] Google Summer of Code idea
Message-ID: <201103311901.45372.albert.bogdanowicz@gmail.com>

Hello World,
I am a bioinformatics student and I would like to take part in Google Summer 
of Code this year.
I have an idea for a project that I could write. It would be a module for 
synthetic biology, especially the BioBrick standard used in the iGEM competition 
(http://ung.igem.org/Main_Page).
I'm a bit late, but I hope this fact won't disqualify me. I would appreciate 
any help in determining a more detailed specification for such project.
Albert Bogdanowicz

From laserson at mit.edu  Thu Mar 31 16:48:16 2011
From: laserson at mit.edu (Uri Laserson)
Date: Thu, 31 Mar 2011 16:48:16 -0400
Subject: [Biopython] Google Summer of Code idea
In-Reply-To: <201103311901.45372.albert.bogdanowicz@gmail.com>
References: <201103311901.45372.albert.bogdanowicz@gmail.com>
Message-ID: <AANLkTik=z3_pmhQLsLXEe6PD5-sGVnHpsyzSiG+=Ukhb@mail.gmail.com>

Hi Albert,

Are you thinking of something like the Clotho project?

http://www.clothocad.org/

Uri

...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu



From rmb32 at cornell.edu  Thu Mar 31 17:58:52 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 31 Mar 2011 14:58:52 -0700
Subject: [Biopython] Reminder: GSoC proposals due in 1 week
Message-ID: <4D94F91C.1080005@cornell.edu>

Hi all,

Just a reminder, Google Summer of Code student applications are due April 8!

If you're a student planning to apply to GSoC with OBF, it's very much 
in your best interest to write your proposal *early*, like now, and get 
it into the hands of the developers and mentors on your subproject 
(BioPerl/Ruby/Python/etc) so that they can give you some feedback on it.

The final proposals must, of course, still be submitted to Google 
through the GSoC web application, as described on the main GSoC site 
(http://www.google-melange.com/gsoc/homepage/google/gsoc2011).

Rob Buels
OBF GSoC 2011 Administrator


From rmb32 at cornell.edu  Thu Mar 31 18:04:49 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 31 Mar 2011 15:04:49 -0700
Subject: [Biopython] GSoC call for mentors
Message-ID: <4D94FA81.5090701@cornell.edu>

Hi all,

For current developers on OBF projects:

If you would not mind being a mentor to a Summer of Code student this 
summer, please make sure you sign up as an OBF mentor in the GSoC web 
app.  There's a link under "mentors: apply now!" midway down the page at 
http://www.google-melange.com/.  If you didn't do last year's summer of 
code, it would be a good idea to drop me an email introducing yourself, 
as well, or I won't know whether to approve your request. :-)

Being signed up as an OBF GSoC mentor will give you access to the 
student proposals, as they come in, and the ability to comment on them 
and assign scores to the ones you think show the most promise.

If you sign up as a mentor, please also add yourself to the two OBF GSoC 
mailing lists: OBF-GSoC and OBF-GSoC-mentors

OBF-GSoC list: http://lists.open-bio.org/mailman/listinfo/gsoc
OBF mentors:   http://lists.open-bio.org/mailman/listinfo/gsoc-mentors


Thanks in advance!

Rob

---
Robert Buels
OBF GSoC 2011 Administrator


From philip.machanick at gmail.com  Thu Mar 31 19:49:33 2011
From: philip.machanick at gmail.com (Philip Machanick)
Date: Fri, 1 Apr 2011 09:49:33 +1000
Subject: [Biopython] extending Motif class
Message-ID: <AANLkTi=Ku-Wem_DBFP2zZA+F28dKCPmrwrZrzRnhPn=8@mail.gmail.com>

I want to add a new scoring function to the Motif class and in true
object-oriented spirit would like to do it by deriving a new class rather
than hacking the existing code.

The general structure of my test program (all in 1 file) is:

from Bio.Motif import Motif

class ScannableMotif(Motif):
    def pwm_score_hit(self,sequence,position):
    ## stuff to compute my new score

from Bio import Motif
def main ():
    for motif in ScannableMotif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
        for i in range(3):
          print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)

The two different imports appear to be necessary. I need the first to be
able to use the base class to derive a new one, and without the second when
I use metaclass methods, I get

TypeError: Error when calling the metaclass bases
    module.__init__() takes at most 2 arguments (3 given)

The other problem: I can't directly invoke a metaclass method on a derived
instance as above. The snippet below works as expected, but looks like a
kludge to me. Is there a better way of accessing metaclass methods from a
derived class object?

    for motif in Motif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
        motif.__class__ = ScannableMotif # promote to the new class
        for i in range(3):
          print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)

I think I have the class vs. metaclass concept straight but understanding
why I need the two different flavours of import would be useful.
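
For what it is worth, the TypeError quoted above is what Python itself
raises when a module, rather than a class, is used as a base class. The
sketch below reproduces that with the standard library only: `string`
plays the part of a module and `string.Formatter` the part of a class
inside it. That the same mix-up is behind the error in this program is an
assumption, since the failing script is not shown in full.

```python
import string
from string import Formatter


def try_subclassing_module():
    """Return the error message from using a module as a base class."""
    try:
        class Broken(string):  # 'string' is a module, not a class
            pass
    except TypeError as error:
        return str(error)
    return None


class Working(Formatter):  # 'Formatter' is a class defined in the module
    pass


message = try_subclassing_module()
```

So which object a name is bound to at the point of the class statement
matters: the module gives this error, the class inside it does not.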
-- 
Philip Machanick
Rhodes University, Grahamstown 6140, South Africa
http://opinion-nation.blogspot.com/
+61-7-3871-0963 mobile +61 42 234 6909 skype philipmach

From chapmanb at 50mail.com  Thu Mar 31 20:59:52 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 31 Mar 2011 20:59:52 -0400
Subject: [Biopython] extending Motif class
In-Reply-To: <AANLkTi=Ku-Wem_DBFP2zZA+F28dKCPmrwrZrzRnhPn=8@mail.gmail.com>
References: <AANLkTi=Ku-Wem_DBFP2zZA+F28dKCPmrwrZrzRnhPn=8@mail.gmail.com>
Message-ID: <20110401005952.GA2644@kunkel>

Philip;

> I want to add a new scoring function to the Motif class and in true
> object-oriented spirit would like to do it by deriving a new class rather
> than hacking the existing code.

The approach you want to take here is to define a function
that takes a motif as an input:

def pwm_score_hit(motif, sequence, position):

instead of trying to inherit from Motif. What happens in 
your example is that you inherit from the Motif class:

> from Bio.Motif import Motif
> 
> class ScannableMotif(Motif):
>     def pwm_score_hit(self,sequence,position):
>     ## stuff to compute my new score

Then you call ScannableMotif as if it were the Motif namespace, when
it is actually a class. The parse function is defined in Bio.Motif
not the Motif class:

> from Bio import Motif
> def main ():
>     for motif in ScannableMotif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
>         for i in range(3):
>           print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)

Which is why you get an error. Your promotion trick does work but
really is too tricky and you are better off with just a separate
function that works on motif objects.
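
As a sketch of that separate-function style, here is a scoring function
that takes the matrix as an argument. It does not use Bio.Motif at all:
the list-of-dicts matrix and the scores in it are invented for
illustration, so adapt the lookup to whatever the motif object really
exposes.

```python
def pwm_score_hit(pwm, sequence, position):
    """Sum the per-column scores for the window starting at position."""
    window = sequence[position:position + len(pwm)]
    if len(window) < len(pwm):
        raise ValueError("window runs past the end of the sequence")
    # One dict per motif column, mapping each base to its score.
    return sum(column[base] for column, base in zip(pwm, window))


# Hypothetical two-column matrix favouring "AC".
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
]
best = pwm_score_hit(toy_pwm, "GACT", 1)
worst = pwm_score_hit(toy_pwm, "GACT", 0)
```

Because the function only needs the matrix, it works on motifs coming
straight out of a parser without changing their class.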

Hope this helps,
Brad

From bartek at rezolwenta.eu.org  Thu Mar 31 21:07:53 2011
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Fri, 1 Apr 2011 03:07:53 +0200
Subject: [Biopython] extending Motif class
In-Reply-To: <AANLkTi=Ku-Wem_DBFP2zZA+F28dKCPmrwrZrzRnhPn=8@mail.gmail.com>
References: <AANLkTi=Ku-Wem_DBFP2zZA+F28dKCPmrwrZrzRnhPn=8@mail.gmail.com>
Message-ID: <AANLkTikam+Mrhd--TYcy9dKQJuRONrv7szXP-T9svM-0@mail.gmail.com>

Hi,

On Fri, Apr 1, 2011 at 1:49 AM, Philip Machanick <philip.machanick at gmail.com> wrote:

> I want to add a new scoring function to the Motif class and in true
> object-oriented spirit would like to do it by deriving a new class rather
> than hacking the existing code.
>

Well, if you want to keep your code separate from Biopython and be able to
use it with newer versions then maybe yes, but if you think that your code
could be contributed to Biopython and be useful for other people, then I'd
consider just contributing via GitHub.


> The general structure of my test program (all in 1 file) is:
>
> from Bio.Motif import Motif
>
> class ScannableMotif(Motif):
>    def pwm_score_hit(self,sequence,position):
>    ## stuff to compute my new score
>

That's OK, although I suspect something fishy is hidden in your code here.
More later.


> from Bio import Motif
>

you shouldn't need to do that


> def main ():
>    for motif in ScannableMotif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
>        for i in range(3):
>          print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)
>

Is ScannableMotif now a module? Or is "parse" a class method? BTW, you can
parse MEME files with Bio.Motif.parse...


> The two different imports appear to be necessary. I need the first to be
> able to use the base class to derive a new one, and without the second when
>
Yes, you need to import the module to subclass Motif


> I use metaclass methods, I get
>
> TypeError: Error when calling the metaclass bases
>    module.__init__() takes at most 2 arguments (3 given)
>
I cannot reproduce this error, and it is highly unlikely that it is related
at all to the Biopython code, as Bio.Motif does not use any metaclasses. I
think you cause it by something in your later code:


> The other problem: I can't directly invoke a metaclass method on a derived
> instance as above. The snippet below works as expected, but looks like a
> kludge to me. Is there a better way of accessing metaclass methods from a
> derived class object?
>
>    for motif in Motif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
>        motif.__class__ = ScannableMotif # promote to the new class
>

There you go! Don't do this. It is not the way objects "get promoted" to
other classes. You seem to be playing with some Python internals here.

>        for i in range(3):
>          print
> motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)
>
> I think I have the class vs. metaclass concept straight but understanding
> why I need the two different flavours of import would be useful.
>
Don't get me wrong, but I don't think you _need_ any metaclasses here. I
think your problem is that you are trying to change the class of an existing
instance, which (while probably possible in Python) is absolutely not the
way to go. If your code is able to produce the correct output using the
complicated imports, that's interesting, but it is probably not the easiest
way to achieve it. However, it's hard to say from the code you provided
what exactly your goal is.

But, on the more constructive side of things, if you want to subclass
Bio.Motif.Motif and add a new method to it, you can just do what you did at
the beginning of your code (provided that you do not mess with m.__class__
or something similar).
Then, your problem seems to be that the MEME parser does not return your
subclass and gives you a vanilla Bio.Motif.Motif instance (or a MEMEMotif).
What you can do (if you insist on not adding the method to Bio.Motif.Motif) is
to write a constructor able to create a ScannableMotif from a "normal"
motif:

class ScannableMotif(Motif):
   def new_score_hit(self,sequence,position):
       return 1 # or something smarter...
   def __init__(self,m): #just copy it all...
       self.instances = m.instances
       self.has_instances=m.has_instances
       self.counts = m.counts
       self.has_counts=m.has_counts
       self.mask = m.mask
       self._pwm_is_current = False
       self._log_odds_is_current = False
       self.alphabet=m.alphabet
       self.length=m.length
       self.background=m.background
       self.beta=m.beta

And then you can do things like:
m=Bio.Motif.parse(f,"AlignAce")
s=ScannableMotif(m)
s.new_score_hit(seq,pos)
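
The same idea can be sketched without listing every attribute by hand. This
is only a minimal illustration: the Motif class below is a hypothetical
stand-in (not the real Bio.Motif.Motif), the from_motif name is an assumption,
and the scoring rule is a toy. It copies the instance state generically
instead of reassigning __class__.

```python
class Motif:
    """Hypothetical stand-in for a parsed motif; NOT the real Bio.Motif.Motif."""

    def __init__(self, instances=None):
        self.instances = list(instances or [])
        self.length = len(self.instances[0]) if self.instances else 0


class ScannableMotif(Motif):
    @classmethod
    def from_motif(cls, motif):
        """Build a ScannableMotif holding a shallow copy of motif's attributes."""
        new = cls.__new__(cls)                 # skip __init__; state is copied instead
        new.__dict__.update(dict(vars(motif)))  # shallow copy of the attribute dict
        return new

    def new_score_hit(self, sequence, position):
        """Toy score: number of motif instances equal to the window at position."""
        window = sequence[position:position + self.length]
        return sum(1 for inst in self.instances if inst == window)


plain = Motif(["CCTG", "CCTA"])               # what a parser might hand back
scannable = ScannableMotif.from_motif(plain)  # the original object is left untouched
score = scannable.new_score_hit("CCTGGGGTCC", 0)
print(type(plain).__name__, type(scannable).__name__, score)
```

Because the copy is shallow, list attributes are shared between the two
objects; use copy.deepcopy if they need to be independent.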

I hope this helps...


> --
> Philip Machanick
> Rhodes University, Grahamstown 6140, South Africa
> http://opinion-nation.blogspot.com/
> +61-7-3871-0963 mobile +61 42 234 6909 skype philipmach
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>


-- 
Bartek Wilczynski

From philip.machanick at gmail.com  Thu Mar 31 23:21:19 2011
From: philip.machanick at gmail.com (Philip Machanick)
Date: Fri, 1 Apr 2011 13:21:19 +1000
Subject: [Biopython] extending Motif class
In-Reply-To: <AANLkTikam+Mrhd--TYcy9dKQJuRONrv7szXP-T9svM-0@mail.gmail.com>
References: <AANLkTi=Ku-Wem_DBFP2zZA+F28dKCPmrwrZrzRnhPn=8@mail.gmail.com>
	<AANLkTikam+Mrhd--TYcy9dKQJuRONrv7szXP-T9svM-0@mail.gmail.com>
Message-ID: <AANLkTimZhDDBD-28Nw9Mu_-gzJJJ1p9tVb-Abr56kbH6@mail.gmail.com>

Thanks.

The issue is that parse is not defined in the class but in the module and if
I understand this right, this makes it a metaclass method. More below.

On Fri, Apr 1, 2011 at 11:07 AM, Bartek Wilczynski <bartek at rezolwenta.eu.org
> wrote:

> Hi,
>
> On Fri, Apr 1, 2011 at 1:49 AM, Philip Machanick <
> philip.machanick at gmail.com> wrote:
>
>> I want to add a new scoring function to the Motif class and in true
>> object-oriented spirit would like to do it by deriving a new class rather
>> than hacking the existing code.
>>
>> Well, if you want to keep your code separate from biopython and ba able to
> use it with newer versions than maybe yes, but if you think tha your code
> code be contributed to biopython and useful for other people, than I'd
> consider just contributing via github.
>
>
>> The general structure of my test program (all in 1 file) is:
>>
>> from Bio.Motif import Motif
>>
>> class ScannableMotif(Motif):
>>    def pwm_score_hit(self,sequence,position):
>>    ## stuff to compute my new score
>>
>> That's OK, although I suspect something fishy is hidden in your code here.
> more later.
>
>
>> from Bio import Motif
>>
>
> you shouldn't need to do that
>
>
>> def main ():
>>    for motif in
>> ScannableMotif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
>>        for i in range(3):
>>          print
>> motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)
>>
>> is ScannableMotif now a module? or is "parse" a class method? BTW, you can
> parse MEME files with Bio.Motif.parse...
>

At this stage for proof of concept I'm putting this all in the same file as
the main program.


>
>
>> The two different imports appear to be necessary. I need the first to be
>> able to use the base class to derive a new one, and without the second
>> when
>>
> Yes, you need to import the module to subclass Motif
>
>
>> I use metaclass methods, I get
>>
>> TypeError: Error when calling the metaclass bases
>>    module.__init__() takes at most 2 arguments (3 given)
>>
>> I cannot reproduce this error and it is highly unlikely that it is related
> at all to the Biopython code as Bio.Motif does not uses any metaclasses. I
> think you cause it by something in your later code:
>
>
This happens specifically if I use Motif.parse (defined in __init__.py in
the Motif module directory).
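
For reference, a TypeError of this general shape is what Python raises when
a class statement lists a module, rather than a class, as its base. The
sketch below reproduces that situation in isolation; the module object is a
hypothetical stand-in built with types.ModuleType, and whether this is
exactly what happened in the program discussed here is an assumption. Under
Python 3 the wording of the message differs slightly from the Python 2 text
quoted above.

```python
import types

# Hypothetical stand-in for an imported package; any module object behaves the same way.
fake_module = types.ModuleType("fake_motif_module")


def subclass_module_error():
    """Try to derive a class from a module and return the resulting TypeError."""
    try:
        class Derived(fake_module):  # a module is not a valid base class
            pass
    except TypeError as err:
        return err
    return None


err = subclass_module_error()
print(type(err).__name__, err)
```

The usual remedy is to make sure the name in the class statement refers to
the class (for example Bio.Motif.Motif) and not to the module that contains
it.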


>  The other problem: I can't directly invoke a metaclass method on a derived
>> instance as above. The snippet below works as expected, but looks like a
>> kludge to me. Is there a better way of accessing metaclass methods from a
>> derived class object?
>>
>>    for motif in Motif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
>>        motif.__class__ = ScannableMotif # promote to the new class
>>
>
> There you go! don't do this. It is not the way objects "get promoted" to
> other classes, You seem to be playing with some python internals here
>
>        for i in range(3):
>>          print
>> motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)
>>
>> I think I have the class vs. metaclass concept straight but understanding
>> why I need the two different flavours of import would be useful.
>>
> Don't get me wrong, but I don't think you _need_ any metaclasses here. I
> think your problem is that you are trying to change the class of an existing
> instance, which (while probably possible in python) is absolutely not the
> way to go. If your code is able to produce the correct output using the
> complicated imports it's interesting, but probably not the easiest way to
> achieve it. However it's hard to say, what exactly is your goal from the
> code you provided.
>
> But, on the more constructive side of things, if you want to subclass
> Bio.Motif and add a new method to it, you can just do what you did in the
> beginning of your code (provided that you do not mess with m.__class__ or
> something)
> Then, your problem seems to be that the MEME parser fails to return your
> subclass and gives you a Bio.Motif.Motif vanilla class (or MEMEMotif). What
> you can do (if you insist on not adding the method to Bio.Motif.Motif), is
> to write a constructor able to create a ScannableMotif from a "normal"
> motif:
>
> class ScannableMotif(Motif):
>    def new_score_hit(self,sequence,position):
>        return 1 # or something smarter...
>    def __init__(self,m): #just copy it all...
>        self.instances = m.instances
>        self.has_instances=m.has_instances
>        self.counts = m.counts
>        self.has_counts=m.has_counts
>        self.mask = m.mask
>        self._pwm_is_current = False
>        self._log_odds_is_current = False
>        self.alphabet=m.alphabet
>        self.length=m.length
>        self.background=m.background
>        self.beta=m.beta
>
> And then you can do things like:
> m=Bio.Motif.parse(f,"AlignAce")
> s=ScannableMotif(m)
> s.new_score_hit(seq,pos)
>

Thanks, this is more like what I was looking for.


> I hope this helps...
>
>
>
>> --
>>
>> Philip Machanick
>> Rhodes University, Grahamstown 6140, South Africa
>> http://opinion-nation.blogspot.com/
>> +61-7-3871-0963 mobile +61 42 234 6909 skype philipmach
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>>
>
>
> --
> Bartek Wilczynski
>
>
>


-- 
Philip Machanick (still in Australia for a while; note new mail address)
Rhodes University, Grahamstown 6140, South Africa
http://opinion-nation.blogspot.com/
+61-7-3871-0963 mobile +61 42 234 6909 skype philipmach

From eric.talevich at gmail.com  Thu Mar 31 23:53:51 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 31 Mar 2011 23:53:51 -0400
Subject: [Biopython] [GSoC] Proposal: Mocapy++Biopython
In-Reply-To: <AANLkTi=oHxSoNzqq1N2auZnCxoanDoXc3RO=QXwhn8vn@mail.gmail.com>
References: <AANLkTi=s=74jsMu4LP2RqnXeq28taun+SP1efQjnY8ts@mail.gmail.com>
	<AANLkTi=oHxSoNzqq1N2auZnCxoanDoXc3RO=QXwhn8vn@mail.gmail.com>
Message-ID: <AANLkTi=aYDE=UXSxAWu99VMvdFiY_RjChRpUbLikeVNz@mail.gmail.com>

Hi Michele,

On Fri, Mar 25, 2011 at 2:11 AM, Michele <michele.silva at gmail.com> wrote:

> Hello everyone,
>
> I'm Michele, a computer scientist and passionate developer who is
> currently enrolled in a biomedicine course. That's why I got in touch with
> the biopython project and have tried its tools for biological computation.
>

In case you haven't heard back from anyone about your proposal yet -- you
certainly sound qualified for this project and I encourage you to start an
application on the main GSoC site (if you haven't already):
http://www.google-melange.com/gsoc/homepage/google/gsoc2011

It's best to get the administrative bits out of the way and at least a stub
of a proposal online early, to ensure you don't get caught by the deadline
next week.

Another good initial step is to sign up on GitHub and fork the Biopython
source tree for yourself:
https://github.com/biopython/biopython
http://biopython.org/wiki/SourceCode


> Regarding the experience in biomolecular structure, I'm a beginner. I have
> started studying biomedicine this year and therefore have a lot to learn. I
> know a bit about the PDB format and molecular biology. I'm sure I can count
> on your help to continue learning.
>

Certainly. :)


> So that was my not-so-short presentation. I would love to get to know the
> community better and work together on the GSoC. Please let me know if you
> think I could write a proposal and If you can help me on that.
>

Thanks for the introduction. We're always happy to help here.

Cheers,
Eric

From mmokrejs at fold.natur.cuni.cz  Wed Mar  2 23:00:04 2011
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Thu, 03 Mar 2011 00:00:04 +0100
Subject: [Biopython] traditional NCBI blast vs. blast+
Message-ID: <4D6ECBF4.9050006@fold.natur.cuni.cz>

Hi,
  I needed to run and parse some blastn analysis. I had a look into the Tutorial
and followed the currently recommended blast+ approach. Somehow I was not
getting any results. It seems to me a formatdb-formatted database is not readable
by the blast+ tools. I had a look at what tools are installed on my Gentoo Linux
along with blastn, blastx and the other tools coming from the blast+ bundle, and
from the filenames I just could not guess what I am supposed to run over my fasta
target database to make it searchable by blastn. I would prefer it if Biopython
threw an error when there are no appropriate files (whose names could
be guessed depending on the (t)blastn/x/p, etc.).
  The tutorial mentions that I should look up an older version of the Tutorial
for examples of the old NCBI blast usage via Biopython. It took me a while but
I found some docs like that through Google. ;-)
  On Gentoo the ncbi-tools++ (aka blast+) package installs no documentation,
not a single README, HOWTO, Changes, just the binaries and libs. What is installed
on other Linux platforms? Would you mind sharing this with me? I just failed
to find via Google what tool I should use instead of formatdb. I found
some FAQ on the NCBI tools++ site but that talked just about the C++ API etc.,
nothing from the user perspective.
  On Gentoo, the {asn2asn,rpsblast,test_regexp} utilities from ncbi-tools++ are not
installed because they have the same names as utilities from the "old" ncbi-tools
(and would hence overwrite their files). The ncbi-tools++ package is not allowed to be
installed on stable "systems" (lack of testing or open bug reports) so most people
using Gentoo do NOT have ncbi-tools++ and probably won't for a while.
  I propose to keep support for the "old" blast for a long while. Luckily, the
blastall -m 7 xml output seems to be parseable with Bio.Blast.NCBIXML.

  What do you think? Is the blast+ approach faster, more stable, or just newer
so we all like to "upgrade"? Where are some docs, and what is the formatdb-like
tool in blast+? ;)
Thanks,
Martin


From nuin at genedrift.org  Wed Mar  2 23:06:17 2011
From: nuin at genedrift.org (Paulo Nuin)
Date: Wed, 2 Mar 2011 18:06:17 -0500
Subject: [Biopython] traditional NCBI blast vs. blast+
In-Reply-To: <4D6ECBF4.9050006@fold.natur.cuni.cz>
References: <4D6ECBF4.9050006@fold.natur.cuni.cz>
Message-ID: <4FC7BB7C-9E17-4699-850E-0A4F4E63521B@genedrift.org>

Hi 

Just answering your blast portion of the question:

- You have to run makeblastdb in order to create the database.
- You should be able to download the source of blast+ and compile it; it should compile just fine on your system.
- And yes, it seems to be faster and more stable than the previous version, at least in the tests I ran.

Paulo


On 2011-03-02, at 6:00 PM, Martin Mokrejs wrote:

> Hi,
>  I needed to run and parse some blastn analysis. I had a look into the Tutorial
> and followed the currently recommended blast+ approach. Somewhat I was not
> getting any results. It seems to me a formatdb-formatted database is not readable
> by the blast+ tools. I had a look what tools are installed on my Gentoo Linux
> along with blastn, blastx and the other tools coming from blast+ bundle and from
> filenames I just could not guess what am I supposed to run over my fasta
> target database to make it searchable by blastn. I would prefer if biopython
> would throw out some error if there are no appropriate files (which names could
> be guessed depending on the (t)blastn/x/p, etc.).
>  The tutorial mentions that I should lookup an older version of the Tutorial
> for examples on the old, NCBI blast usage via biopython. It took me a while but
> I found through Google some docs like that. ;-)
>  On Gentoo the ncbi-tools++ (aka blast+) package installs no documentation,
> not a single README, HOWTO, Changes, just the binaries and libs. What is installed
> on other Linux platform, would you mind sharing this with me? I just failed
> to find by Google what tools should I use instead of the formatdb. I found
> some FAQ on the NCBI tools++ site but that talked just about C++ API etc.,
> nothing from the user perspective.
>  On Gentoo, the {asn2asn,rpsblast,test_regexp} from ncbi-tools++ is not being
> installed because they have same name as the same utility from "old" ncbi-tools
> (hence overwting their files). The ncbi-tools++ package is not allowed to be
> installed on stable "systems" (lack of testing or open bug reports) so most people
> using Gentoo do NOT have ncbi-tools++ and probably won't for a while.
>  I propose to keep support for the "old" blast for a long while. Luckily, the
> blastall -m 7 xml output seems to be parseable with Bio.Blast.NCBIXML.
> 
>  What do you think? Is the blast+ approach faster, more stable, or just newer
> so we all like to "upgrade"? Where are some docs and what is the formatdb-like
> tool in blast+. ;)
> Thanks,
> Martin
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


From p.j.a.cock at googlemail.com  Thu Mar  3 10:27:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Mar 2011 10:27:54 +0000
Subject: [Biopython] traditional NCBI blast vs. blast+
In-Reply-To: <4D6ECBF4.9050006@fold.natur.cuni.cz>
References: <4D6ECBF4.9050006@fold.natur.cuni.cz>
Message-ID: <AANLkTikiaVSFELhCQd3++jPMrm9Q-Eo__E+pGahgft2c@mail.gmail.com>

On Wed, Mar 2, 2011 at 11:00 PM, Martin Mokrejs
<mmokrejs at fold.natur.cuni.cz> wrote:
> Hi,
>  I needed to run and parse some blastn analysis. I had a look into the Tutorial
> and followed the currently recommended blast+ approach. Somewhat I was not
> getting any results. It seems to me a formatdb-formatted database is not readable
> by the blast+ tools.

I think it is possible to get databases which will work with both legacy BLAST
and BLAST+ (since the NCBI only offer one set for NR etc) but I have not tried
to mix the two. As pointed out by Paulo, the successor to formatdb in BLAST+
is makeblastdb, so just use that instead.
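
As a rough sketch of that workflow, the two BLAST+ command lines can be
assembled like this. The file and database names (db.fasta, mydb,
query.fasta, result.xml) and the helper function names are hypothetical; the
options shown (-in, -dbtype, -out, -query, -db, -evalue, -outfmt 5 for XML)
are standard BLAST+ options. The code only builds the argument lists, and
running them assumes makeblastdb and blastn are on the PATH.

```python
import subprocess


def makeblastdb_cmd(fasta_path, db_name, dbtype="nucl"):
    """Arguments to format a FASTA file as a BLAST+ database (formatdb successor)."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastn_xml_cmd(query_path, db_name, xml_path, evalue=0.001):
    """Arguments for a blastn search writing XML output (-outfmt 5)."""
    return [
        "blastn", "-query", query_path, "-db", db_name,
        "-evalue", str(evalue), "-outfmt", "5", "-out", xml_path,
    ]


def run(cmd):
    """Run a command, raising CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)


format_cmd = makeblastdb_cmd("db.fasta", "mydb")
search_cmd = blastn_xml_cmd("query.fasta", "mydb", "result.xml")
print(" ".join(format_cmd))
print(" ".join(search_cmd))
```

Calling run(format_cmd) followed by run(search_cmd) would execute the two
steps; the resulting XML file is the format that Bio.Blast.NCBIXML reads.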

> I had a look what tools are installed on my Gentoo Linux
> along with blastn, blastx and the other tools coming from blast+ bundle and from
> filenames I just could not guess what am I supposed to run over my fasta
> target database to make it searchable by blastn.

This is very clear in the BLAST+ documentation from the NCBI website
(link given below), and is arguably a Gentoo packaging issue.

> I would prefer if biopython
> would throw out some error if there are no appropriate files (which names could
> be guessed depending on the (t)blastn/x/p, etc.).

BLAST+ itself generally gives useful errors.
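
For anyone who still wants an early check on the Python side, something along
these lines could be used. It is only a sketch under stated assumptions: the
function and constant names are hypothetical (nothing like this is part of
Biopython), and it only looks for the three basic index files of a
single-volume database (.nhr/.nin/.nsq for nucleotide, .phr/.pin/.psq for
protein). Multi-volume databases that are addressed through alias files would
be reported as missing by this simple test.

```python
import tempfile
from pathlib import Path

# Which kind of database each search program expects.
DB_TYPE_FOR_PROGRAM = {
    "blastn": "nucl", "tblastn": "nucl", "tblastx": "nucl",
    "blastp": "prot", "blastx": "prot",
}

# Basic index files of a single-volume database (assumption: no alias files).
INDEX_EXTENSIONS = {
    "nucl": (".nhr", ".nin", ".nsq"),
    "prot": (".phr", ".pin", ".psq"),
}


def missing_db_files(db_path, program):
    """Return the expected index files that do not exist for this program."""
    dbtype = DB_TYPE_FOR_PROGRAM[program]
    candidates = [Path(str(db_path) + ext) for ext in INDEX_EXTENSIONS[dbtype]]
    return [str(p) for p in candidates if not p.exists()]


# Small demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "mydb"
    before = missing_db_files(db, "blastn")        # nothing formatted yet
    for ext in INDEX_EXTENSIONS["nucl"]:
        Path(str(db) + ext).touch()
    after = missing_db_files(db, "blastn")         # nucleotide files now present
    protein_view = missing_db_files(db, "blastp")  # still no protein database

print(len(before), len(after), len(protein_view))
```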

>  The tutorial mentions that I should lookup an older version of the Tutorial
> for examples on the old, NCBI blast usage via biopython. It took me a while but
> I found through Google some docs like that. ;-)

You could have just downloaded one of the old Biopython releases (the zip
or tar balls) and looked in the Doc subdirectory. I'll clarify the current text
in the tutorial to point people there.

>  On Gentoo the ncbi-tools++ (aka blast+) package installs no documentation,
> not a single README, HOWTO, Changes, just the binaries and libs.

File a bug with Gentoo?

> What is installed
> on other Linux platform, would you mind sharing this with me? I just failed
> to find by Google what tools should I use instead of the formatdb. I found
> some FAQ on the NCBI tools++ site but that talked just about C++ API etc.,
> nothing from the user perspective.

You are probably looking for this, linked to from the BLAST+ download page:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/user_manual.pdf

> On Gentoo, the {asn2asn,rpsblast,test_regexp} from ncbi-tools++ is not being
> installed because they have same name as the same utility from "old" ncbi-tools
> (hence overwting their files). The ncbi-tools++ package is not allowed to be
> installed on stable "systems" (lack of testing or open bug reports) so most people
> using Gentoo do NOT have ncbi-tools++ and probably won't for a while.

I was aware of the name clash for rpsblast, and yes, this is a problem the
NCBI could have avoided.

You could just ignore the Gentoo package and get BLAST+ directly from
the NCBI.

>  I propose to keep support for the "old" blast for a long while.

We've already delayed deprecating the ``legacy'' BLAST wrappers,
but probably we should do that after releasing Biopython 1.57.

> Luckily, the
> blastall -m 7 xml output seems to be parseable with Bio.Blast.NCBIXML.

The NCBI kept the same XML output format, and in fact the plain text
output is close enough that our old text parser could be updated to cope.

>  What do you think? Is the blast+ approach faster, more stable, or just newer
> so we all like to "upgrade"?

I like BLAST+ for some new functionality (FASTA vs FASTA for example),
but since the NCBI is dropping the ``legacy'' BLAST you will have to
upgrade at some point.

> Where are some docs and what is the formatdb-like tool in blast+. ;)

I've given links to the docs above, they're linked to on the NCBI website.

Regards,

Peter


From p.j.a.cock at googlemail.com  Thu Mar  3 20:32:11 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Mar 2011 20:32:11 +0000
Subject: [Biopython] Fwd: [Bosc] Bioinformatics Open Source Conference (BOSC
 2011)--Call for Abstracts
In-Reply-To: <3922D2BE-5A99-4CDE-91AB-B311C42E10CE@lbl.gov>
References: <3922D2BE-5A99-4CDE-91AB-B311C42E10CE@lbl.gov>
Message-ID: <AANLkTimNQ=o6Unw461vjPPGUB4VKnB+Ww+qFhT-D2iXq@mail.gmail.com>

Dear Biopythoneers,

BOSC will be in Vienna, Austria this year.

Peter

---------- Forwarded message ----------
From: Nomi Harris <nlharris at lbl.gov>
Date: Thu, Mar 3, 2011 at 7:37 PM
Subject: [Bosc] Bioinformatics Open Source Conference (BOSC
2011)--Call for Abstracts
To: bosc-announce at lists.open-bio.org, members at open-bio.org, GMOD
Announcements List <gmod-announce at lists.sourceforge.net>, GMOD
Developers List <gmod-devel at lists.sourceforge.net>
Cc: Nomi Harris <nlharris at lbl.gov>


We invite you to submit an abstract to BOSC 2011!  Please forward this
message as appropriate, and forgive multiple postings.

Call for Abstracts for the 12th Annual Bioinformatics Open Source
Conference (BOSC 2011)
An ISMB 2011 Special Interest Group (SIG)

Dates: July 15-16, 2011
Location: Vienna, Austria
Web site: http://www.open-bio.org/wiki/BOSC_2011
Email: bosc at open-bio.org
BOSC announcements mailing list:
http://lists.open-bio.org/mailman/listinfo/bosc-announce

Important Dates:
April 18, 2011: Deadline for submitting abstracts to BOSC 2011
May 9, 2011: Notifications of accepted abstracts emailed to
corresponding authors
July 13-14, 2011: Codefest 2011 programming session (see
http://www.open-bio.org/wiki/Codefest_2011 for details)
July 15-16, 2011: BOSC 2011
July 17-19, 2011: ISMB 2011

The Bioinformatics Open Source Conference (BOSC) is sponsored by the
Open Bioinformatics Foundation (O|B|F), a non-profit group dedicated
to promoting the practice and philosophy of Open Source software
development within the biological research community. To be considered
for acceptance, software systems representing the central topic in a
presentation submitted to BOSC must be licensed with a recognized Open
Source License, and be freely available for download in source code
form.

We invite you to submit abstracts for talks and posters.  Sessions include:
- Approaches to parallel processing
- Cloud-based approaches to improving software and data accessibility
- The Semantic Web in open source bioinformatics
- Data visualization
- Tools for next-generation sequencing
- Other Open Source software

In addition to the above sessions, there will be a panel discussion
about "Meeting the challenges of inter-institutional collaboration".
We are also working to arrange a joint session with one of the other
ISMB SIGs.

Thanks to generous sponsorship from Eagle Genomics and an anonymous
donor, we are pleased to announce a competition for three Student
Travel Awards for BOSC 2011. Each winner will be awarded $250 to
defray the costs of travel to BOSC 2011.

For instructions on submitting your abstract, please visit
http://www.open-bio.org/wiki/BOSC_2011#Abstract_Submission_Information

BOSC 2011 Organizing Committee:
Nomi Harris and Peter Rice (co-chairs); Brad Chapman, Peter Cock,
Erwin Frise, Darin London, Ron Taylor


_______________________________________________
BOSC mailing list
BOSC at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bosc


From hlapp at drycafe.net  Fri Mar  4 23:26:25 2011
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Fri, 4 Mar 2011 18:26:25 -0500
Subject: [Biopython] Informatics job opportunity at NESCent
Message-ID: <1878F27F-000D-4C80-B9EA-A83F7887828F@drycafe.net>

(Apologies if you receive multiple copies, and also if you are not  
interested in job opportunities.  In my defense, quite a few people on  
Bio* lists might qualify for (let alone enjoy) the position. And if  
you know someone who might be interested please forward.)

===================================================
User Interface Design and Web Application Developer
===================================================

The National Evolutionary Synthesis Center (NESCent) seeks a creative  
and enthusiastic individual to design user interfaces and web  
applications for scientific applications. The incumbent will work as  
part of a small informatics team in close collaboration with domain  
scientists.

NESCent is an NSF-funded center dedicated to cross-disciplinary  
research in evolutionary science. Our informatics team works closely  
with visiting and resident scientists to support their custom software  
and database development needs. All NESCent software products are  
open-source, and the Center has a number of initiatives to actively promote  
collaborative development of community software resources  
(informatics.nescent.org). Above all, we are enthusiastic about our  
work, about the mission of the Center, and about the contribution of  
informatics to that mission.

Job description: The incumbent will design and develop user interfaces  
and web applications for databases and other software tools for  
sponsored scientists and staff. The job responsibilities include all  
stages of the software development process, including requirements  
gathering, design, implementation, release packaging and  
documentation, as part of a small team (typically 2-3 individuals)  
following project management best practices. We expect the incumbent  
to present their work at conferences and contribute to publications  
with scientific collaborators; interact regularly with visiting and  
resident scientists, other members of the informatics team and Center  
staff; and generally serve as an expert resource for Center personnel.  
The position provides opportunities for professional development. Most  
informatics staff work at our Durham NC offices, located adjacent to  
Duke University, but we do support a wide range of technologies for  
virtual communication with off-site staff and collaborators.

Required Qualifications:
* Demonstrated success collaborating with clients on custom software  
solutions
* Experience with various stages of the software development cycle
* Expertise in development and testing of user interface designs
* Excellent communication skills, both virtual and face-to-face
* A four-year college degree in Computer Science, Bioinformatics or a  
related field

Preferred Qualifications:
* M.S. or Ph.D. in Computer Science, Bioinformatics or related field  
along with demonstrated interest in science, particularly biology
* Expertise in rapid application development and respective  
programming technologies and languages (e.g., modern scripting  
languages and web-application frameworks such as Python/Django,  
Ruby/Ruby-on-Rails, and Perl/Catalyst), fluency in Java programming, and  
prior experience in relational database programming (PostgreSQL or  
MySQL)
* Expertise in dynamic and interactive web technologies (JavaScript,  
CGI), web service (SOAP, REST, XML, JSON) and semantic web technologies
* Experience with open-source, and collaborative, software  
development, software usability design and assessment
* Expertise in graphic design, data visualization and/or scientific  
data integration

How to apply: Please send cover letter, resume and contact information  
for three references to Dr. Karen Cranston, Training Coordinator and  
Bioinformatics Project Manager (karen.cranston at nescent.org). Review of  
applications will begin March 21, 2011. Informal inquiries or requests  
for additional information may be directed to Dr. Cranston by email or  
phone (+1-919-613-2275).

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From p.j.a.cock at googlemail.com  Mon Mar  7 14:19:11 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 7 Mar 2011 14:19:11 +0000
Subject: [Biopython] Tutorial proofreading?
Message-ID: <AANLkTikn8kr5PqRRdG6ZAWsO3d6hm2E7so7dRQJNXT41@mail.gmail.com>

Hi all,

We're planning to do the Biopython 1.57 release soon, and
one thing some volunteer help would be useful for is our
documentation - in particular the tutorial.

These links are for the current tutorial, which at the time of writing
means Biopython 1.56:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

These links are for the latest in-progress tutorial (automatically
updated nightly from the git repository):
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf

I would like some volunteers to proofread this please and
report any problems, suggestions or additions.

Ideally I'd like people to check the examples work (although some
will need the latest Biopython installed from the source code).

Even reporting minor typos is useful, as fixing them will
make a better impression for newcomers reading this.

Thanks,

Peter

P.S. The tutorial source file is here, if you are interested,
https://github.com/biopython/biopython/blob/master/Doc/Tutorial.tex


From anaryin at gmail.com  Mon Mar  7 14:21:19 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 7 Mar 2011 15:21:19 +0100
Subject: [Biopython] Tutorial proofreading?
In-Reply-To: <AANLkTikn8kr5PqRRdG6ZAWsO3d6hm2E7so7dRQJNXT41@mail.gmail.com>
References: <AANLkTikn8kr5PqRRdG6ZAWsO3d6hm2E7so7dRQJNXT41@mail.gmail.com>
Message-ID: <AANLkTimyQL88YRh5knJVnB4cT3oA_OeOzs0bakLASaJB@mail.gmail.com>

Will have a look at it this week, I noticed some problems in the Bio.PDB
section (outdated code).

Cheers!


From rmb32 at cornell.edu  Mon Mar  7 16:37:32 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Mon, 07 Mar 2011 11:37:32 -0500
Subject: [Biopython] Google Summer of Code project ideas
Message-ID: <4D7509CC.3040604@cornell.edu>

Hi all,

I'm going to be OBF project admin again this year for Google Summer of 
Code.  OBF's application is due later this week, and we need to update 
our project ideas on the OBF wiki page and on each project's individual 
wiki pages.

So, for each of the OBF projects that wants to do GSoC again this year, 
please:

a.) Update the list of project ideas on your project's GSoC page 
(BioPython, BioPerl, BioRuby, etc).  Add new ones, remove ones that have 
already been done or are no longer relevant, etc.

b.) Update the list of project ideas on the main OBF GSoC page 
(http://www.open-bio.org/wiki/Google_Summer_of_Code) to match.

c.) Let me know via email that you have done so and it's ready for 
Google to peruse.

Please have the updates done, if possible, by this Friday (March 11). 
The number and quality of the project ideas are part of the evaluation 
process for whether OBF is accepted as a Summer of Code organization 
again this year, so let's come up with some good ones.  :-)

Rob

----
Robert Buels
(prospective) 2011 OBF GSoC Organization Admin


From p.cherepanov at imperial.ac.uk  Tue Mar  8 02:42:26 2011
From: p.cherepanov at imperial.ac.uk (Peter Cherepanov)
Date: Tue, 8 Mar 2011 02:42:26 +0000
Subject: [Biopython] define circular DNA (?)
Message-ID: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>

is there an easy way to define a circular DNA sequence in BioPython? 

It would be useful to have something like: 

my_seq = Seq('ATGCATGC...ATGC', circular_dna)

am I missing something obvious?? 

Peter


From komalsnehal1991 at gmail.com  Tue Mar  8 07:58:11 2011
From: komalsnehal1991 at gmail.com (Komal S)
Date: Tue, 8 Mar 2011 13:28:11 +0530
Subject: [Biopython] Biopython Projects
Message-ID: <AANLkTi=s_bjATb4qWbu1kP6kpULEjABJ+7HOdbfL6_ka@mail.gmail.com>

Hi everyone,

I'm Komal, a Junior Undergraduate Student from India studying
Bioengineering. I'm a fan of Python and I love Computational Biology and I
plan to do my further studies in the same.
I went through the projects on the Biopython page. I was very much
interested in the RNA Structure project mentioned. Any contribution which I
make will help me a lot and the organisation too. In fact, I am currently
doing a project on RNA Editing. I'll be very happy to integrate my
knowledge.

Please help me on how I should proceed.

Komal


From p.j.a.cock at googlemail.com  Tue Mar  8 08:45:31 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 08:45:31 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
Message-ID: <AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>

On Tue, Mar 8, 2011 at 2:42 AM, Peter Cherepanov wrote:
> is there an easy way to define a circular DNA sequence in BioPython?
>
> It would be useful to have something like:
>
> my_seq = Seq('ATGCATGC...ATGC', circular_dna)
>
> am I missing something obvious??
>
> Peter

No, but how would you expect it to act? We've talked
about such an object before... I'd have to go through my
old emails but I recall there being some annoying corner
cases to consider with the slice method (__getitem__).

Peter


From p.j.a.cock at googlemail.com  Tue Mar  8 10:48:13 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 10:48:13 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <BDAC0EBF-88CD-4088-A919-CE394D87B886@imperial.ac.uk>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
	<AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>
	<BDAC0EBF-88CD-4088-A919-CE394D87B886@imperial.ac.uk>
Message-ID: <AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>

On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
<p.cherepanov at imperial.ac.uk> wrote:
> I suppose if a DNA sequence is kept as a simple Python string, there is
> no easy way to have it "circular". I am a beginner in Python (I use it only
> occasionally, to solve very specific and simple-minded tasks, when manual
> match/cut-and-paste operations become too much of a burden). Having
> spent an extra hour to hack out and debug a piece of code to match/extract
> to/from circular plasmid sequences kept as Python strings, I thought: hey,
> wait a minute, there is such a thing as BioPython, which should have made
> this task so much easier...
>
> Is there a way to "enhance" the Seq object? (or maybe I do not know what
> I am talking about...).
>
> thanks a lot for responding!
>
> with best wishes,
>
> Peter

What I had in mind was a new class, CircularSeq, which would subclass
the current Biopython Seq object, and still use a string internally for the
sequence.

We could then modify the slice behaviour so that, perhaps, this would
work by wrapping the origin:

c = CircularSeq('ACGTACGTACGT')
assert len(c)==12
print c[10:14]

It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
14 as wrapped to 2, returning the four bases GTAC.

Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
same as 'ACGTACGTACGT'[10:] which is the last two letters only.
This means anyone (or more importantly, any code) expecting the
string like behaviour will get a nasty surprise (or a bug).
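
To make the contrast concrete, here is a minimal sketch in plain Python 3,
using ordinary strings rather than any Biopython class. The helper name
wrap_slice is hypothetical, and it assumes 0 <= start <= len(seq) and
end <= 2 * len(seq):

```python
def wrap_slice(seq, start, end):
    """Slice a plain string, letting `end` run past the origin once.

    Hypothetical helper, not part of Biopython. Assumes
    0 <= start <= len(seq) and end <= 2 * len(seq).
    """
    n = len(seq)
    if end <= n:
        # Ordinary slice, nothing crosses the origin.
        return seq[start:end]
    # Tail of the sequence plus the wrapped-around head.
    return seq[start:] + seq[: end - n]


s = "ACGTACGTACGT"
print(s[10:14])               # plain string slicing stops at the end
print(wrap_slice(s, 10, 14))  # wrapped reading crosses the origin
```

The first call gives only the last two letters, the second the four bases
described above.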

Another example, what about c[-2:]? For a plain string you'd
get the last two letters. For a circular sequence you might think
that should represent starting two before the origin, thus giving
the last two letters plus the whole sequence? Also, c[-2:2] could
mean the last two letters plus the first two letters, but for a
plain python string that returns an empty string.
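
For reference, this is what a plain Python string does today with those two
slices, next to one possible circular reading of c[-2:2] built by hand (the
circular reading is only one of the interpretations discussed, not an
existing Biopython behaviour):

```python
s = "ACGTACGTACGT"

last_two = s[-2:]       # plain string: the last two letters
empty = s[-2:2]         # plain string: start is after end, so nothing

# One possible circular reading of c[-2:2]: last two plus first two letters.
across_origin = s[-2:] + s[:2]

print(repr(last_two), repr(empty), repr(across_origin))
```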

Note that due to the way Python indexing works, single letter
access is fine for negative indices, c[-2] would give the second
last letter, 'G', which is consistent with wrapped counting back
from the origin. We could also make c[14] wrap round to c[2] in
this length 12 example (although there is a small risk of breaking
code expecting an IndexError in this case).

There would be lots of other things to implement, like "in" and the
find methods would need to check the substring across the origin.
Then (for nucleotides), we'd need to ensure reverse_complement
and complement also give a CircularSeq, likewise perhaps for the
transcribe and back_transcribe. The translate method is particularly
tricky as you can have an infinite reading frame, which might be
represented as a circular protein sequence?

All in all, it is quite a lot of work, and there are several tricky bits
where the desired behaviour is not clear cut. Could we come up
with something useful or not?

Peter

P.S. Please CC the mailing list in your replies :)


From p.cherepanov at imperial.ac.uk  Tue Mar  8 10:30:08 2011
From: p.cherepanov at imperial.ac.uk (Peter Cherepanov)
Date: Tue, 8 Mar 2011 10:30:08 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
	<AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>
Message-ID: <503B48D3-61BA-4C77-A441-00942366FFB4@imperial.ac.uk>

I suppose if a DNA sequence is kept as a simple Python string, there is no easy way to have it "circular". I am a beginner in Python (I use it only occasionally, to solve very specific and simple-minded tasks, when manual match/cut-and-paste operations become too much of a burden). Having spent an extra hour to hack out and debug a piece of code to match/extract to/from circular plasmid sequences kept as Python strings, I thought: hey, wait a minute, there is such a thing as BioPython, which should have made this task so much easier...

Is there a way to "enhance" the Seq object? (or maybe I do not know what I am talking about...).

thanks a lot for responding!

with best wishes, 

Peter


From moritz.beber at googlemail.com  Tue Mar  8 11:32:44 2011
From: moritz.beber at googlemail.com (Moritz Beber)
Date: Tue, 08 Mar 2011 12:32:44 +0100
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>	<AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>	<BDAC0EBF-88CD-4088-A919-CE394D87B886@imperial.ac.uk>
	<AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>
Message-ID: <4D7613DC.2050506@googlemail.com>

If you just need circular behaviour in a small number of use cases, you
could consider wrapping the sequence in a cycle iterator
http://docs.python.org/release/2.6/library/itertools.html?highlight=cycle#itertools.cycle
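
A minimal sketch of that idea in Python 3, on a plain string rather than a
Seq object (itertools.cycle and itertools.islice are both in the standard
library; the variable names are just for illustration):

```python
from itertools import cycle, islice

s = "ACGTACGTACGT"

# Read four letters starting at index 10, continuing round the origin.
# islice skips the first 10 items of the endless cycle, then takes 4.
wrapped = "".join(islice(cycle(s), 10, 14))
print(wrapped)
```

Note that islice has to step through the skipped items one by one, so this
suits occasional use on short sequences best.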


From p.j.a.cock at googlemail.com  Tue Mar  8 11:40:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 11:40:08 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <4D7613DC.2050506@googlemail.com>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
	<AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>
	<BDAC0EBF-88CD-4088-A919-CE394D87B886@imperial.ac.uk>
	<AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>
	<4D7613DC.2050506@googlemail.com>
Message-ID: <AANLkTi=mbk6sk2BgQOE0=2HDAwvjPiv82KQH3emeS1FE@mail.gmail.com>

On Tue, Mar 8, 2011 at 11:32 AM, Moritz Beber
<moritz.beber at googlemail.com> wrote:
>
> If you just need circular behaviour in a small number of use cases, you
> could consider wrapping the sequence in a cycle iterator
> http://docs.python.org/release/2.6/library/itertools.html?highlight=cycle#itertools.cycle
>

That might need a lot of memory if used on a long sequence like a
bacterial genome, but an interesting idea.

Peter


From p.cherepanov at imperial.ac.uk  Tue Mar  8 12:12:26 2011
From: p.cherepanov at imperial.ac.uk (Peter Cherepanov)
Date: Tue, 8 Mar 2011 12:12:26 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
	<AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>
	<BDAC0EBF-88CD-4088-A919-CE394D87B886@imperial.ac.uk>
	<AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>
Message-ID: <A14925F6-5CC2-4066-AECE-5D401DDFC3B1@imperial.ac.uk>

ideally, it would be an object where the last letter is hard-linked to the first. For example, we should be able to define:

c = CircularSeq('ATGCGGGGA')

where:

c[1:9]  equals  ATGCGGGGA   (or, more awkwardly, c[0:9], if the original Python string numbering must be retained for some reasons)
c[8:7]  equals  GAATGCGGG
c[1:1] equals A  (on a python string it is c[0:1]  =  A, of course)

Ideally, we would want to number such sequences from 1, after all these are the kind of objects we deal with in biology.

And, most importantly of all, it must be able to:
c.find('GGAATG') to return "7"

Peter


From p.j.a.cock at googlemail.com  Tue Mar  8 13:24:07 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 13:24:07 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <A14925F6-5CC2-4066-AECE-5D401DDFC3B1@imperial.ac.uk>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
	<AANLkTimJVbUVjFBBeQbWYq6pYK56VE8z02WreeDjueX6@mail.gmail.com>
	<BDAC0EBF-88CD-4088-A919-CE394D87B886@imperial.ac.uk>
	<AANLkTimndyehkYjWwhE6uLAZ5RAPr1QoYyiWV-qBTMN3@mail.gmail.com>
	<A14925F6-5CC2-4066-AECE-5D401DDFC3B1@imperial.ac.uk>
Message-ID: <AANLkTim=QjaOTiUBxbomFei5r_34CVPy2c0Ytq8MYWGD@mail.gmail.com>

On Tue, Mar 8, 2011 at 12:12 PM, Peter Cherepanov
<p.cherepanov at imperial.ac.uk> wrote:
> ideally, it would be an object where the last letter is hard-linked to the first. For example, we should be able to define:
>
> c = CircularSeq('ATGCGGGGA')
>
> where:
>
> c[1:9]  equals  ATGCGGGGA   (or, more awkwardly, c[0:9], if the original
> Python string numbering must be retained for some reasons)
> c[8:7]  equals  GAATGCGGG
> c[1:1] equals A  (on a python string it is c[0:1]  =  A, of course)
>
> Ideally, we would want to number such sequences from 1, after all these
> are the kind of objects we deal with in biology.

Absolutely not - it would put the circular sequence completely out of
sync with the existing sequence objects in Biopython and the Python
string. Don't worry - you'll get used to zero based counting, and
the Python slicing is very beautiful once you understand it.

> And, most importantly of all, it must be able to:
> c.find('GGAATG') to return "7"
>

Well, 6 in zero based counting, but yes, that would be the expected
result for find (and similarly for rfind). We'd also need to do something
with the split and rsplit methods to include looking for matches over
the origin.
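
As a rough sketch of how find could look across the origin (plain Python 3
strings only; circular_find is a hypothetical name, not an existing
Biopython method), one option is to search the sequence extended by just
enough of its own start:

```python
def circular_find(seq, sub):
    """Zero-based start of `sub` in circular `seq`, or -1 if absent.

    Hypothetical helper, not part of Biopython. By assumption, a query
    longer than the sequence itself is treated as not found.
    """
    if len(sub) > len(seq):
        return -1
    # Appending the first len(sub) - 1 letters lets a match start near
    # the end and finish after the origin, but never start past the end.
    extended = seq + seq[: len(sub) - 1]
    return extended.find(sub)


s = "ATGCGGGGA"
print(s.find("GGAATG"))            # plain string: not found
print(circular_find(s, "GGAATG"))  # circular: found across the origin
```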

Peter


From Leighton.Pritchard at scri.ac.uk  Tue Mar  8 13:28:11 2011
From: Leighton.Pritchard at scri.ac.uk (Leighton Pritchard)
Date: Tue, 8 Mar 2011 13:28:11 +0000
Subject: [Biopython] define circular DNA (?)
Message-ID: <C99BDF6A.B58B%lpritc@scri.ac.uk>

I've got 2p hanging around, so...

On 08/03/2011 Tuesday, March 8, 10:48, "Peter Cock"
<p.j.a.cock at googlemail.com> wrote:

> On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
> <p.cherepanov at imperial.ac.uk> wrote:
>> I suppose if a DNA sequence is kept as a simple Python string, there is
>> no easy way to have it "circular". I am a beginner in Python (I use it only
>> occasionally, to solve very specific and simple-minded tasks, when manual
>> match/cut-and-paste operations become too much of a burden). Having
>> spent an extra hour to hack out and debug a piece of code to match/extract
>> to/from circular plasmid sequences kept as Python strings, I thought: hey,
>> wait a minute, there is such a thing as BioPython, which should have made
>> this task so much easier...
>>
>> Is there a way to "enhance" the Seq object? (or maybe I do not know what
>> I am talking about...).
>>
>> thanks a lot for responding!
>>
>> with best wishes,
>>
>> Peter
>
> What I had in mind was a new class, CircularSeq, which would subclass
> the current Biopython Seq object, and still use a string internally for the
> sequence.

That seems sensible.  The main issue, as I see it, is that the physical
object is naturally represented by a circularly-linked list, and we have for
circular sequences an indexing/co-ordinate system with a defined zero
start/end point (which is essentially arbitrary - though is usually the
origin of replication for bacterial chromosomes).  This leads to a conflict
between our natural expectations of Python indexing, and the meaning of the
indexing on the physical object that's being represented.

Whatever the ultimate implementation, there will either have to be a
compromise between these two representations, or one or other view will be
ignored.  There will inevitably be value judgements that someone is unhappy
with ;)

> We could then modify the slice behaviour so that, perhaps, this would
> work by wrapping the origin:
>
> c = CircularSeq('ACGTACGTACGT')
> assert len(c)==12
> print c[10:14]
>
> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
> 14 as wrapped to 2, returning the four bases GTAC.

That makes sense in Python indexing terms, but not in terms of the
co-ordinate system for navigating the circular DNA.  To be consistent with
location information from GenBank and other sources where features wrap the
origin of circular DNA, we would need c[10:2] to return the same result as
c[10:14].  That gives us potentially the same problem as c[-2:2], as it
currently returns an empty string.  We'd have to modify Python
slicing/indexing behaviour quite a bit to implement this 'naturally'.

However, I don't think we should ignore the Python indexing format here,
because we might want the ten bases after the base with co-ordinate 6 with
c[6:6+10], which would give us a physically and conceptually sensible linear
sequence that crosses the origin.

We'd probably want to do the obvious things with modular arithmetic, so that
we don't return, say, three concatenated linearised circular sequences to a
request like c[0:36] or c[6:42].

> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
> This means anyone (or more importantly, any code) expecting the
> string like behaviour will get a nasty surprise (or a bug).

I'm not sure it's wise to constrain functionality and adequate
representation of a (very important! - showing my bacterial bias) physical
structure to maintain that level of consistency with String.  For instance,
what would CircularSeq + Seq mean?  Physically, and conceptually, not a lot.
So we might want to deprecate the __add__ method for this object - not
typical String behaviour but, in my opinion, appropriate.

(You might remember that I was also generally not in favour of treating Seq
objects as idealised Strings, so there's another bias for you ;) )

> Note that due to the way Python indexing works, single letter
> access is fine for negative indices, c[-2] would give the second
> last letter, 'G', which is consistent with wrapped counting back
> from the origin. We could also make c[14] wrap round to c[2] in
> this length 12 example (although there is a small risk of breaking
> code expecting an IndexError in this case).

I wouldn't be in favour of that behaviour in a general sense, though I don't
see how to avoid it cleanly.  I think it would be best to be strict with
indexing to the co-ordinate system to avoid possible degeneracy of feature
locations.  If we had a SNP at position 2, we could equally well associate
it with any one of an infinite number of positions kl+2 where k is an
integer and l is the sequence length, without modifying the computational
result.  I'm not keen on that kind of woolliness, but I think that it could
possibly be avoided by modifying indexing to require at least one index that
lies in the range [-l,l], and using modular arithmetic for slicing so that,
for the example above, c[18:26] would not be treated as the valid slice
c[6:14], but would instead throw an IndexError.
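
A rough sketch of that index check in Python 3, covering only the validation
step and not the slicing itself (check_slice_indices is a hypothetical name,
and the closed range [-l, l] is taken literally from the description above):

```python
def check_slice_indices(length, start, end):
    """Raise IndexError unless start or end lies in [-length, length].

    Hypothetical helper illustrating the rule described above; it does
    not perform the slice.
    """
    def in_range(i):
        return -length <= i <= length

    if not (in_range(start) or in_range(end)):
        raise IndexError(
            "slice %d:%d has no index within [-%d, %d]"
            % (start, end, length, length)
        )


check_slice_indices(12, 6, 14)       # accepted: 6 is in range
try:
    check_slice_indices(12, 18, 26)  # rejected: neither index is in range
except IndexError as err:
    print("rejected:", err)
```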

> There would be lots of other things to implement, like "in" and the
> find methods would need to check the substring across the origin.
> Then (for nucleotides), we'd need to ensure reverse_complement
> and complement also give a CircularSeq, likewise perhaps for the
> transcribe and back_transcribe.

Not to mention the other Biopython functions/methods that expect String-like
indexing.  Maybe a cast (of sorts) between CircularSeq and Seq would be
useful for that, though I can imagine great problems, there.

> The translate method is particularly
> tricky as you can have an infinite reading frame, which might be
> represented as a circular protein sequence?

I would think that the test for that particular condition should be fairly
straightforward (is there at least one stop codon in each of the six frames,
taking into account the origin?).
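
Something along these lines might do for the three forward frames (a sketch
in Python 3 on plain strings; the function name is hypothetical, it assumes
an upper-case DNA alphabet and the standard stop codons, and the reverse
strand would need the same check on the reverse complement):

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def has_endless_forward_frame(seq):
    """True if some forward frame of circular `seq` never meets a stop.

    Hypothetical helper. Reading len(seq) codons from each start offset
    is enough to come back to the same position in the same frame.
    """
    n = len(seq)
    for frame in range(3):
        found_stop = False
        for k in range(n):
            pos = (frame + 3 * k) % n
            codon = "".join(seq[(pos + j) % n] for j in range(3))
            if codon in STOP_CODONS:
                found_stop = True
                break
        if not found_stop:
            return True
    return False


print(has_endless_forward_frame("ATGGCC"))  # no stop codon in any frame
print(has_endless_forward_frame("TAAC"))    # every frame reaches TAA
```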

> All in all, it is quite a lot of work, and there are several tricky bits
> where the desired behaviour is not clear cut. Could we come up
> with something useful or not?

I think that there's every possibility of coming up with something useful -
the question is to what degree it fits the Biopython/Python idiom, or 'looks
like' the physical object, and whether it gets included in Biopython.

L.

--
Dr Leighton Pritchard MRSC
Plant Pathology Programme, SCRI (C block)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel: No telephone during office refurbishment


From p.j.a.cock at googlemail.com  Tue Mar  8 13:58:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 13:58:03 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <C99BDF6A.B58B%lpritc@scri.ac.uk>
References: <C99BDF6A.B58B%lpritc@scri.ac.uk>
Message-ID: <AANLkTinQC3aNGy2mzUAvnpheC3QiW1KrQnYzS3jYAKsq@mail.gmail.com>

On Tue, Mar 8, 2011 at 1:28 PM, Leighton Pritchard
<Leighton.Pritchard at scri.ac.uk> wrote:
> I've got 2p hanging around, so...
>
> On 08/03/2011 Tuesday, March 8, 10:48, "Peter Cock"
> <p.j.a.cock at googlemail.com> wrote:
>>
>> What I had in mind was a new class, CircularSeq, which would subclass
>> the current Biopython Seq object, and still use a string internally for the
>> sequence.
>
> That seems sensible.  The main issue, as I see it, is that the physical
> object is naturally represented by a circularly-linked list, and we have for
> circular sequences an indexing/co-ordinate system with a defined zero
> start/end point (which is essentially arbitrary - though is usually the
> origin of replication for bacterial chromosomes).  This leads to a conflict
> between our natural expectations of Python indexing, and the meaning of the
> indexing on the physical object that's being represented.
>
> Whatever the ultimate implementation, there will either have to be a
> compromise between these two representations, or one or other view will be
> ignored.  There will inevitably be value judgements that someone is unhappy
> with ;)

Indeed.

>> We could then modify the slice behaviour so that, perhaps, this would
>> work by wrapping the origin:
>>
>> c = CircularSeq('ACGTACGTACGT')
>> assert len(c)==12
>> print c[10:14]
>>
>> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
>> 14 as wrapped to 2, returning the four bases GTAC.
>
> That makes sense in Python indexing terms, but not in terms of the
> co-ordinate system for navigating the circular DNA.  To be consistent with
> location information from GenBank and other sources where features wrap the
> origin of circular DNA, we would need c[10:2] to return the same result as
> c[10:14].  That gives us potentially the same problem as c[-2:2], as it
> currently returns an empty string.  We'd have to modify Python
> slicing/indexing behaviour quite a bit to implement this 'naturally'.
>
> However, I don't think we should ignore the Python indexing format here,
> because we might want the ten bases after the base with co-ordinate 6 with
> c[6:6+10], which would give us a physically and conceptually sensible linear
> sequence that crosses the origin.

I think we agree that c[10:14] and c[10:10+4] should give the four bases
GTAC wrapping the origin when c is circular sequence ACGTACGTACGT,
equivalently c[10:12] + c[0:2] using Python slicing.

Likewise for your example c[6:6+10] or c[6:16] this should give ten bases
wrapping the origin, equivalently c[6:12] + c[0:4] using Python slicing.

> We'd probably want to do the obvious things with modular arithmetic, so that
> we don't return, say, three concatenated linearised circular sequences to a
> request like c[0:36] or c[6:42].

I disagree, returning the three concatenated linearised circular sequences
is what I would expect. This is one of the debatable issues that will divide
people. Consider the (special and artificial) case of a circular plasmid with
an ORF wrapping round the origin (once, twice or infinitely often): the ORF
sequence is longer than the linearised plasmid, so slicing with concatenation
would be useful, e.g.

http://www.ncbi.nlm.nih.gov/pubmed/9740124
Perriman and Ares (1998), Circular mRNA can direct translation of
extremely long repeating-sequence proteins in vivo.

and:

http://dx.doi.org/10.1385/1-59259-280-5:069
Perriman (2002), Circular mRNA Encoding for Monomeric and
Polymeric Green Fluorescent Protein

(Very cool work)

>> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
>> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
>> This means anyone (or more importantly, any code) expecting the
>> string like behaviour will get a nasty surprise (or a bug).
>
> I'm not sure it's wise to constrain functionality and adequate
> representation of a (very important! - showing my bacterial bias) physical
> structure to maintain that level of consistency with String.  For instance,
> what would CircularSeq + Seq mean?  Physically, and conceptually, not a lot.
> So we might want to deprecate the __add__ method for this object - not
> typical String behaviour but, in my opinion, appropriate.

We'd probably want to make addition of CircularSeq + Seq raise a
TypeError. Or, do a linearisation and simple addition with a warning?
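
The TypeError option could be sketched like this in Python 3. The class
below is a bare stand-in with a hypothetical name and layout, not a Biopython
class and not a subclass of Seq:

```python
class CircularSeq:
    """Bare stand-in for the proposed class (hypothetical, not Biopython)."""

    def __init__(self, data):
        self._data = str(data)

    def __len__(self):
        return len(self._data)

    def __add__(self, other):
        # Refuse concatenation: a closed circle has no end to append to.
        raise TypeError("cannot add to a circular sequence; linearise it first")

    # Refuse linear + circular as well.
    __radd__ = __add__


c = CircularSeq("ACGTACGTACGT")
try:
    c + "ACGT"
except TypeError as err:
    print("TypeError:", err)
```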

> (You might remember that I was also generally not in favour of treating
> Seq objects as idealised Strings, so there's another bias for you ;) )

I recall :)

>> Note that due to the way Python indexing works, single letter
>> access is fine for negative indices, c[-2] would give the second
>> last letter, 'G', which is consistent with wrapped counting back
>> from the origin. We could also make c[14] wrap round to c[2] in
>> this length 12 example (although there is a small risk of breaking
>> code expecting an IndexError in this case).
>
> I wouldn't be in favour of that behaviour in a general sense, though I don't
> see how to avoid it cleanly. I think it would be best to be strict with
> indexing to the co-ordinate system to avoid possible degeneracy of feature
> locations. If we had a SNP at position 2, we could equally well associate
> it with any one of an infinite number of positions kl+2 where k is an
> integer and l is the sequence length, without modifying the computational
> result.

Yes, I was suggesting we could make c[x+n*length] act as c[x],
i.e. for *single* indexes which return one letter, apply the modulo
arithmetic. Or, we leave this to follow the current Python string
behaviour where if the index is equal to the length or more, you
get an IndexError. That avoids the ambiguity ;)
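
To make the two options concrete, here is a minimal sketch. The
CircularString class and its strict_getitem method are hypothetical names
used for illustration only (nothing like this exists in Biopython), and
slicing is deliberately left out since that is the contentious part:

```python
class CircularString(object):
    """Hypothetical stand-in for a CircularSeq, wrapping a plain str."""

    def __init__(self, data):
        self._data = str(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        if isinstance(index, slice):
            raise TypeError("slicing is left undefined in this sketch")
        if not self._data:
            raise IndexError("empty sequence")
        # Modulo arithmetic: c[x + n*length] behaves like c[x]
        return self._data[index % len(self._data)]

    def strict_getitem(self, index):
        # Plain string behaviour: IndexError once index >= length
        return self._data[index]
```

With c = CircularString("ACGTACGTACGT"), c[14] and c[2] both give "G",
while c.strict_getitem(14) raises IndexError as a plain string would.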

> I'm not keen on that kind of woolliness, but I think that it could
> possibly be avoided by modifying indexing to require at least one index that
> lies in the range [-l,l], and using modular arithmetic for slicing so that,
> for the example above, c[18:26] would not be treated as the valid slice
> c[6:14], but would instead throw an IndexError.

This depends on the treatment of things like c[0:36] or c[6:42]
discussed above (return 36 bases, or just 12?).

>> There would be lots of other things to implement, like "in" and the
>> find methods would need to check the substring across the origin.
>> Then (for nucleotides), we'd need to ensure reverse_complement
>> and complement also give a CircularSeq, likewise perhaps for the
>> transcribe and back_transcribe.
>
> Not to mention the other Biopython functions/methods that expect String-like
> indexing. Maybe a cast (of sorts) between CircularSeq and Seq would be
> useful for that, though I can imagine great problems, there.

Having a toseq method like the MutableSeq does could handle that,
returning a traditional linear Seq object. If the CircularSeq 'breaks'
too much expected string-like behaviour that would be important.

>> The translate method is particularly
>> tricky as you can have an infinite reading frame, which might be
>> represented as a circular protein sequence?
>
> I would think that the test for that particular condition should be fairly
> straightforward (is there at least one stop codon in each of the six frames,
> taking into account the origin?).

Having thought about this example at length before, it can be done
but I don't think it is all that straightforward ;)
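
Part of what makes it fiddly is that the frames only stay separate when the
sequence length is a multiple of three. A rough sketch of the stop codon
test, assuming the standard genetic code and checking the forward strand
only (has_endless_frame is a made-up name, not an existing function):

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code only


def has_endless_frame(circular_dna):
    """True if some forward reading frame never meets a stop codon."""
    seq = circular_dna.upper()
    n = len(seq)
    if n == 0:
        return False
    # Three laps always bring a frame back to its starting phase;
    # the two extra letters let the last codon of a lap run over the end.
    tripled = seq * 3 + seq[:2]
    for offset in range(3):
        codons = (tripled[i:i + 3] for i in range(offset, offset + 3 * n, 3))
        if not any(codon in STOPS for codon in codons):
            return True
    return False
```

When the length is a multiple of three there are three separate frames.
Otherwise a single frame cycles through all three phases, so one stop
codon anywhere (including across the origin) blocks it. The reverse strand
would need the same test on the reverse complement.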

>> All in all, it is quite a lot of work, and there are several tricky bits
>> where the desired behaviour is not clear cut. Could we come up
>> with something useful or not?
>
> I think that there's every possibility of coming up with something useful -
> the question is to what degree it fits the Biopython/Python idiom, or 'looks
> like' the physical object, and whether it gets included in Biopython.
>
> L.

Agreed.

Peter


From anaryin at gmail.com  Tue Mar  8 21:39:07 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 8 Mar 2011 22:39:07 +0100
Subject: [Biopython] PDBParser Class --> Output
In-Reply-To: <AANLkTi=og=3VA68hfhnc4jTeJamJ6me7RPKmraqFzvWH@mail.gmail.com>
References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu>
	<AANLkTi=_pCV8-ATaKcVzV40NLmfUs76UDVR4KgYMCMGi@mail.gmail.com>
	<8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu>
	<AANLkTinAFONUdnxps8X2M1toKsKPrR2BjabDSs0vS23Q@mail.gmail.com>
	<95E27938-F262-4F25-AF29-FBE387DB8782@gmail.com>
	<AANLkTimne1YbfSmyLxmXQE8pcrYM96vQPB5O2+iGxCt0@mail.gmail.com>
	<AANLkTi=og=3VA68hfhnc4jTeJamJ6me7RPKmraqFzvWH@mail.gmail.com>
Message-ID: <AANLkTikC3wYvStX+TD7b3po3sD7TQpM4qxKQa+MgA0DZ@mail.gmail.com>

Back to this question. Haven't had much time to look at it and it turned out
to be a bit more complicated than what I thought. Permissive is an attribute
of the PDBParser module and since the assignment takes place in the Atom
module I don't see a straightforward way of pulling this off.

However, and although there is the very simple solution of playing with the
warnings module, the solution I offer is to allow a second level of
"permissiveness" (PERMISSIVE=2) where all warnings are suppressed.

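For reference, the warnings-module route mentioned above might look roughly
like this. ConstructionWarning and noisy_parse are stand-ins so that the
snippet runs without Biopython; with the real parser the category to filter
would, I believe, be Bio.PDB.PDBExceptions.PDBConstructionWarning:

```python
import warnings


class ConstructionWarning(UserWarning):
    """Stand-in for the parser's warning class (hypothetical name)."""


def noisy_parse():
    # Pretend parser: warns about a problem, then carries on
    warnings.warn("discontinuous chain", ConstructionWarning)
    return "structure"


def quiet_parse():
    # Silence only this category, and only inside the with-block
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ConstructionWarning)
        return noisy_parse()
```

The filter is restored when the with-block exits, so warnings raised
elsewhere in the script are unaffected.
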
Cheers,

J


From laserson at mit.edu  Wed Mar  9 03:07:54 2011
From: laserson at mit.edu (Uri Laserson)
Date: Tue, 8 Mar 2011 22:07:54 -0500
Subject: [Biopython] SeqRecord subclassing or composition
Message-ID: <AANLkTik=yyZc6Y0He0mPH_1wMk7ZHrkBMWhxDAj8Bq5z@mail.gmail.com>

I am trying to implement a data type for my work.  Each object will have a
sequence (derived from a single read) and lots of annotations and features.
 However, I want to implement some extra interface that is problem-specific
to make my analysis more convenient.

I am debating whether to subclass SeqRecord and simply implement the extra
interface or define a new object that wraps a SeqRecord object and pass on
the subset of native SeqRecord calls and/or simply access the underlying
SeqRecord directly.

One additional factor is that I want to be able to read/write INSDC-style
files for the data (e.g., GenBank).  Therefore, if I use the SeqIO parser,
it will return native SeqRecords.  If I go the inheritance route, how do I
cast a SeqRecord object to my new subclass?

So, I am debating between inheritance

class ImmuneChain(SeqRecord):
    def __init__(self, *args, **kw):
        SeqRecord.__init__(self,*args,**kw)
        # But how do I cast a SeqRecord to an ImmuneChain?


or composition

class ImmuneChain(object):
    def __init__(self, *args, **kw):
        if isinstance(args[0],SeqRecord):
            self._record = args[0]
        else:
            # Initialize the underlying SeqRecord manually
            self._record.seq = ...


Any thoughts?

Thanks!
Uri


...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu


From p.j.a.cock at googlemail.com  Wed Mar  9 09:04:26 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 9 Mar 2011 09:04:26 +0000
Subject: [Biopython] SeqRecord subclassing or composition
In-Reply-To: <AANLkTik=yyZc6Y0He0mPH_1wMk7ZHrkBMWhxDAj8Bq5z@mail.gmail.com>
References: <AANLkTik=yyZc6Y0He0mPH_1wMk7ZHrkBMWhxDAj8Bq5z@mail.gmail.com>
Message-ID: <AANLkTi=qPJuK+s6YmeA7pi3NLQhULs8UvA=GxNz7u755@mail.gmail.com>

On Wed, Mar 9, 2011 at 3:07 AM, Uri Laserson <laserson at mit.edu> wrote:
> I am trying to implement a data type for my work. Each object will have a
> sequence (derived from a single read) and lots of annotations and features.
> However, I want to implement some extra interface that is problem-specific
> to make my analysis more convenient.
>
> I am debating whether to subclass SeqRecord and simply implement the extra
> interface or define a new object that wraps a SeqRecord object and pass on
> the subset of native SeqRecord calls and/or simply access the underlying
> SeqRecord directly.
>
> One additional factor is that I want to be able to read/write INSDC-style
> files for the data (e.g., GenBank). Therefore, if I use the SeqIO parser,
> it will return native SeqRecords. If I go the inheritance route, how do I
> cast a SeqRecord object to my new subclass?

There is (currently at least) no option in SeqIO parse/read
to override the use of the SeqRecord object. So you'd need
code to 'upgrade' a SeqRecord into your class. Probably
the simplest route would be for its __init__ method to
take a single argument (a SeqRecord). Then you could
have:

def my_parse(...):
    for seq_record in SeqIO.parse(...):
        yield MyClass(seq_record)

def my_read(...):
    return MyClass(SeqIO.read(...))

etc

> So, I am debating between inheritance
>
> class ImmuneChain(SeqRecord):
>     def __init__(self, *args, **kw):
>         SeqRecord.__init__(self,*args,**kw)
>         # But how do I cast a SeqRecord to an ImmuneChain?

Unless you modify the methods/attributes too much, an
ImmuneChain subclass of SeqRecord should be usable
as is with SeqIO.write etc. You don't need to 'cast'.

Also note the above __init__ method can be more specific,
you might have say 10 init args for ImmuneChain,  only
some of which you pass to the SeqRecord init.

You could even have a single __init__ argument of a
SeqRecord, and copy all its attributes.

> or composition
>
> class ImmuneChain(object):
>     def __init__(self, *args, **kw):
>         if isinstance(args[0],SeqRecord):
>             self._record = args[0]
>         else:
>             # Initialize the underlying SeqRecord manually
>             self._record.seq = ...

With the above approach you'd have to pass the
private record to SeqIO.write etc (anything which
needs a SeqRecord). That could be done inside
methods of the ImmuneChain object (e.g. you
could expose the format method of the SeqRecord).
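
One way to avoid writing a wrapper for every SeqRecord attribute is to
delegate unknown attribute lookups to the wrapped record. A sketch, with a
stand-in SimpleRecord class so that it runs without Biopython
(junction_length is an invented example of a problem-specific method):

```python
class SimpleRecord(object):
    """Stand-in for SeqRecord so the sketch runs without Biopython."""

    def __init__(self, seq, id="<unknown id>"):
        self.seq = seq
        self.id = id

    def format(self, fmt):
        if fmt != "fasta":
            raise ValueError("only FASTA in this stand-in")
        return ">%s\n%s\n" % (self.id, self.seq)


class ImmuneChain(object):
    def __init__(self, record):
        self._record = record

    def __getattr__(self, name):
        # Only called when normal lookup fails, so ImmuneChain's own
        # attributes win and everything else falls through to the record.
        if name == "_record":
            raise AttributeError(name)  # avoid recursion before __init__
        return getattr(self._record, name)

    def junction_length(self):
        # Hypothetical problem-specific method
        return len(self._record.seq)
```

Code that type-checks for a real SeqRecord (as SeqIO.write may) would still
need to be handed the wrapped record itself.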

>
> Any thoughts?
>

You could alternatively go for a procedural style where
you write your code as functions taking SeqRecord
objects (perhaps expecting particular information in
the annotation).

Peter


From komalsnehal1991 at gmail.com  Wed Mar  9 10:49:23 2011
From: komalsnehal1991 at gmail.com (Komal S)
Date: Wed, 9 Mar 2011 02:49:23 -0800
Subject: [Biopython] ::Biopython Project
Message-ID: <AANLkTikDHtLdbyih7u3Jiy76Puvpr7Lh995-aBX-9jKu@mail.gmail.com>

Hi everyone,

I'm Komal, a Junior Undergraduate Student from India
studying Bioengineering. I'm a fan of Python and I love Computational
Biology and I plan to do my further studies in the same.
I went through the projects on the Biopython page. I was very
much interested in the RNA Structure project mentioned. Any contribution
which I make will help me a lot and the organisation too. In fact, I am
currently doing a project on RNA Editing. I'll be very happy to integrate
my knowledge.

In fact, I have been trying to contact people on #obf-soc IRC. I think there
is no separate IRC for Biopython.

Please help me on how I should proceed.


Komal


From laserson at mit.edu  Wed Mar  9 15:28:22 2011
From: laserson at mit.edu (Uri Laserson)
Date: Wed, 9 Mar 2011 10:28:22 -0500
Subject: [Biopython] SeqRecord subclassing or composition
In-Reply-To: <AANLkTi=qPJuK+s6YmeA7pi3NLQhULs8UvA=GxNz7u755@mail.gmail.com>
References: <AANLkTik=yyZc6Y0He0mPH_1wMk7ZHrkBMWhxDAj8Bq5z@mail.gmail.com>
	<AANLkTi=qPJuK+s6YmeA7pi3NLQhULs8UvA=GxNz7u755@mail.gmail.com>
Message-ID: <AANLkTi=J-Ok+Frzm4DkPvHJjR5rCY-U9oHR-aC4tDXmO@mail.gmail.com>

>
> Unless you modify the methods/attributes too much, an
> ImmuneChain subclass of SeqRecord should be usable
> as is with SeqIO.write etc. You don't need to 'cast'.
>

I'm more worried about parsing than writing.  As you mentioned, I will have
to upgrade my SeqRecord object to an ImmuneChain object.

So maybe the best approach is a combination of the two code snippets I
included.  It would subclass SeqRecord, and then manually check whether I am
initializing with a pre-existing SeqRecord or just data:

class ImmuneChain(SeqRecord):
    def __init__(self, *args, **kw):
        if args and isinstance(args[0], SeqRecord):
            # If initializing with a SeqRecord, manually transfer the data
            # based on the initializer for SeqRecord (http://goo.gl/X95Zf)
            record = args[0]
            SeqRecord.__init__(self, record.seq, id=record.id, name=record.name,
                     description=record.description, dbxrefs=record.dbxrefs,
                     features=record.features,
                     annotations=record.annotations,
                     letter_annotations=record.letter_annotations)
        else:
            # Assume I'm initializing just like a regular SeqRecord:
            SeqRecord.__init__(self, *args, **kw)

        # Finally, I perform any problem-specific additional initializations
        # here.
        pass

Does this seem like a good solution?

Also, do you think that it would make sense to make a deep copy of the
SeqRecord object before I use it to initialize the ImmuneChain?

Uri


From p.j.a.cock at googlemail.com  Wed Mar  9 15:32:50 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 9 Mar 2011 15:32:50 +0000
Subject: [Biopython] SeqRecord subclassing or composition
In-Reply-To: <AANLkTi=J-Ok+Frzm4DkPvHJjR5rCY-U9oHR-aC4tDXmO@mail.gmail.com>
References: <AANLkTik=yyZc6Y0He0mPH_1wMk7ZHrkBMWhxDAj8Bq5z@mail.gmail.com>
	<AANLkTi=qPJuK+s6YmeA7pi3NLQhULs8UvA=GxNz7u755@mail.gmail.com>
	<AANLkTi=J-Ok+Frzm4DkPvHJjR5rCY-U9oHR-aC4tDXmO@mail.gmail.com>
Message-ID: <AANLkTikfGO9s+GdDpG00CkXAg0B=Ch2EYi0oL78fGd6O@mail.gmail.com>

On Wed, Mar 9, 2011 at 3:28 PM, Uri Laserson <laserson at mit.edu> wrote:
>> Unless you modify the methods/attributes too much, an
>> ImmuneChain subclass of SeqRecord should be usable
>> as is with SeqIO.write etc. You don't need to 'cast'.
>
> I'm more worried about parsing than writing. As you mentioned, I will have
> to upgrade my SeqRecord object to an ImmuneChain object.
> So maybe the best approach is a combination of the two code snippets I
> included. It would subclass SeqRecord, and then manually check whether I am
> initializing with a pre-existing SeqRecord or just data:
> class ImmuneChain(SeqRecord):
>     def __init__(self, *args, **kw):
>         if args and isinstance(args[0], SeqRecord):
>             # If initializing with a SeqRecord, manually transfer the data
>             # based on the initializer for SeqRecord (http://goo.gl/X95Zf)
>             record = args[0]
>             SeqRecord.__init__(self, record.seq, id=record.id, name=record.name,
>                      description=record.description, dbxrefs=record.dbxrefs,
>                      features=record.features,
>                      annotations=record.annotations,
>                      letter_annotations=record.letter_annotations)
>         else:
>             # Assume I'm initializing just like a regular SeqRecord:
>             SeqRecord.__init__(self, *args, **kw)
>
>         # Finally, I perform any problem-specific additional initializations
>         # here.
>         pass
> Does this seem like a good solution?

I think it will work,

> Also, do you think that it would make sense to make a deep copy of the
> SeqRecord object before I use it to initialize the ImmuneChain?

Assuming you will be discarding the original SeqRecord, then I see
no reason to make a deep copy. It will just slow things down.
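
The trade-off is aliasing: without a copy, the new object and the original
share the same mutable attributes. A small sketch, using a stand-in
SimpleRecord class rather than the real SeqRecord:

```python
import copy


class SimpleRecord(object):
    """Stand-in for SeqRecord (hypothetical, not Biopython's class)."""

    def __init__(self, seq, annotations=None):
        self.seq = seq
        self.annotations = {} if annotations is None else annotations


original = SimpleRecord("ACGT", {"source": "read 1"})

# Upgrade by reusing the same annotations dict (no copy)
shared = SimpleRecord(original.seq, original.annotations)

# Upgrade from a deep copy instead
independent = SimpleRecord(original.seq, copy.deepcopy(original.annotations))

# This change is visible through 'original' too, but not 'independent'
shared.annotations["chain"] = "heavy"
```

So the copy only matters if the original record is kept around and is
expected to stay unchanged.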

Peter


From jvb at Cs.Nott.AC.UK  Wed Mar  9 15:33:28 2011
From: jvb at Cs.Nott.AC.UK (Jonathan Blakes)
Date: Wed, 09 Mar 2011 15:33:28 +0000
Subject: [Biopython] back-translation method for Seq object?
Message-ID: <4D779DC8.8090704@cs.nott.ac.uk>

This is a reply to an old thread (October 2008), but I thought someone 
might find it useful.

In that thread, discussing the representation of back-translations using
ambiguous bases to avoid the combinatorial explosion of an all-possibilities
back-translation, Bruce Southey gave a table similar to the one below
but some of the ambiguous codons were incorrect or the ambiguous codons
were too ambiguous and covered more than one amino acid. The codons for
stop (*) were also missing. Some were corrected later in the thread but
not all.

Here are the correct ambiguous codons for the standard genetic code:

* = TAG, TAA, TGA                = TAR, TGA
A = GCT, GCC, GCA, GCG           = GCN
C = TGT, TGC                     = TGY
D = GAT, GAC                     = GAY
E = GAA, GAG                     = GAR
F = TTT, TTC                     = TTY
G = GGT, GGC, GGA, GGG           = GGN
H = CAT, CAC                     = CAY
I = ATT, ATC, ATA                = ATH
K = AAA, AAG                     = AAR
L = TTA, TTG, CTT, CTC, CTA, CTG = TTR, CTN
M = ATG                          = ATG
N = AAT, AAC                     = AAY
P = CCT, CCC, CCA, CCG           = CCN
Q = CAA, CAG                     = CAR
R = CGT, CGC, CGA, CGG, AGA, AGG = CGN, AGR
S = TCT, TCC, TCA, TCG, AGT, AGC = TCN, AGY
T = ACT, ACC, ACA, ACG           = ACN
V = GTT, GTC, GTA, GTG           = GTN
W = TGG                          = TGG
Y = TAT, TAC                     = TAY

Even though this is still not a one-to-one mapping in 4/21 cases, the
combinatorial explosion is significantly decreased. For example, the protein
ACDEFGHIKLMNPQRSTVWY* has 1,019,215,872 unambiguous back-translations.
Using the code above it has 16, or generally 2^(L+R+S+*), where each
letter stands for the number of times that residue occurs in the protein.
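
The table and both counts can be reproduced with a short script. This is
only a sketch: it hard-codes the standard genetic code rather than using
Biopython's codon tables, the function names are invented, and grouping
codons by their first two bases is a simple heuristic that happens to give
the table above, not a general minimal-cover algorithm:

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# IUPAC ambiguity letter for each set of bases
IUPAC = dict((frozenset(bases), code) for code, bases in [
    ("A", "A"), ("C", "C"), ("G", "G"), ("T", "T"),
    ("R", "AG"), ("Y", "CT"), ("S", "CG"), ("W", "AT"),
    ("K", "GT"), ("M", "AC"), ("B", "CGT"), ("D", "AGT"),
    ("H", "ACT"), ("V", "ACG"), ("N", "ACGT")])


def ambiguous_codons():
    """Merge each amino acid's codons that share their first two bases."""
    third = defaultdict(lambda: defaultdict(set))
    codons = ("".join(c) for c in product(BASES, repeat=3))
    for codon, aa in zip(codons, AMINO_ACIDS):
        third[aa][codon[:2]].add(codon[2])
    table = {}
    for aa, groups in third.items():
        table[aa] = sorted(prefix + IUPAC[frozenset(letters)]
                           for prefix, letters in groups.items())
    return table


def count_back_translations(protein, choices):
    """Multiply the number of choices available for each residue."""
    total = 1
    for aa in protein:
        total *= choices[aa]
    return total


TABLE = ambiguous_codons()
CODONS_PER_AA = dict((aa, AMINO_ACIDS.count(aa)) for aa in TABLE)
AMBIGUOUS_PER_AA = dict((aa, len(codes)) for aa, codes in TABLE.items())
```

Calling count_back_translations("ACDEFGHIKLMNPQRSTVWY*", CODONS_PER_AA)
gives 1019215872, and with AMBIGUOUS_PER_AA it gives 16.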

If anyone has an algorithm for determining the set of non-overlapping 
ambiguous codons from any codon table I would like to know. Thanks,

Jon

-- 
Jonathan Blakes
School of Computer Science
University of Nottingham


From rasi at seas.harvard.edu  Wed Mar  9 22:57:30 2011
From: rasi at seas.harvard.edu (Arvind Subramaniam)
Date: Wed, 9 Mar 2011 17:57:30 -0500
Subject: [Biopython] .ab1 file parser in biopython?
Message-ID: <AANLkTin_q6vdKHeitaZU0RtuO3=YXuBwyQ5yptJOUErT@mail.gmail.com>

Hi
 I am new to biopython so please excuse me if this issue is obviously
simple. I am trying to parse .ab1 sequencing trace files in Biopython
and I cannot find the right module or method to do this job. Can
someone suggest how I can parse .ab1 files?
Thanks,
Arvind.


From cmckay at u.washington.edu  Thu Mar 10 01:09:55 2011
From: cmckay at u.washington.edu (Cedar McKay)
Date: Wed, 9 Mar 2011 17:09:55 -0800
Subject: [Biopython] "raw" genbank locations?
Message-ID: <A3104CC4-4890-4303-8118-EA62309D445D@u.washington.edu>

Hello all. Biopython continues to be a lifesaver.

I'm trying to get the "raw" genbank locations for a downstream application after parsing a genbank file. Is there any way to get at this (or reproduce it)? As it is, the SeqRecord feature has start and stop information for the whole feature, and a list of sub-features each with its own start and stops. I'm looking for one concise text string that describes the entire feature location, much like the original raw genbank locations do.

I searched the archives, but nothing popped into view.

Thanks for your help!

best,
Cedar


From chapmanb at 50mail.com  Thu Mar 10 02:05:45 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 9 Mar 2011 21:05:45 -0500
Subject: [Biopython] "raw" genbank locations?
In-Reply-To: <A3104CC4-4890-4303-8118-EA62309D445D@u.washington.edu>
References: <A3104CC4-4890-4303-8118-EA62309D445D@u.washington.edu>
Message-ID: <20110310020545.GA2185@kunkel>

Cedar;
Glad to hear Biopython has been helping out with your work.

> I'm trying to get the "raw" genbank locations for a downstream
> application after parsing a genbank file. Is there any way to get at
> this (or reproduce it)? As it is, the SeqRecord feature has start and
> stop information for the whole feature, and a list of sub-features
> each with its own start and stops. I'm looking for one concise text
> string that describes the entire feature location, much like the
> original raw genbank locations do.

You can do this with the GenBank RecordParser, which doesn't parse
the location strings:

>>> from Bio.GenBank import RecordParser
>>> parser = RecordParser()
>>> handle = open("NT_019265.gb")
>>> rec = parser.parse(handle)
>>> for f in rec.features:
...     print f.location
... 
1..1250660
1..3290
215902..365470
217508
join(342430..342515,363171..363300,365741..365814,376398..376499,390169..390297,391257..391379,392606..392679,398230..398419,399082..399167,399534..399650,405844..405913,406704..406761,406868..407010,407962..408091,408508..409092)

If you have SeqRecord objects from SeqIO you can do this in an ugly
way by reaching into the internals of the GenBank writer:

>>> from Bio import SeqIO
>>> from Bio.SeqIO import InsdcIO
>>> handle = open("NT_019265.gb")
>>> for rec in SeqIO.parse(handle, "genbank"):
...     for f in rec.features:
...         print InsdcIO._insdc_feature_location_string(f, len(rec.seq))
... 
1..1250660
1..3290
215902..365470
217508
join(342430..342515,363171..363300,365741..365814,376398..376499,390169..390297,391257..391379,392606..392679,398230..398419,399082..399167,399534..399650,405844..405913,406704..406761,406868..407010,407962..408091,408508..409092)

That might work for a quick hack but is not necessarily future proof
if the internals change. Peter, do you think this would be useful to
expose as a function of a SeqFeature directly, so you could do
feature.insdc_string() or something similar?
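
For the simple cases, such a method would not need much. A sketch of the
idea, taking 0-based, end-exclusive (start, end) pairs instead of real
SeqFeature objects; insdc_location is a hypothetical name, and fuzzy
positions, between-base locations and mixed-strand joins are all ignored:

```python
def insdc_location(parts, strand=1):
    """Format 0-based, end-exclusive (start, end) pairs INSDC-style."""
    if not parts:
        raise ValueError("need at least one (start, end) pair")
    spans = []
    for start, end in parts:
        if end - start == 1:
            spans.append(str(end))  # single base, e.g. 217508
        else:
            spans.append("%i..%i" % (start + 1, end))
    text = spans[0] if len(spans) == 1 else "join(%s)" % ",".join(spans)
    if strand == -1:
        text = "complement(%s)" % text
    return text
```

For example, insdc_location([(342429, 342515), (363170, 363300)]) gives
"join(342430..342515,363171..363300)".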

Brad


From p.j.a.cock at googlemail.com  Thu Mar 10 08:57:20 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 08:57:20 +0000
Subject: [Biopython] "raw" genbank locations?
In-Reply-To: <20110310020545.GA2185@kunkel>
References: <A3104CC4-4890-4303-8118-EA62309D445D@u.washington.edu>
	<20110310020545.GA2185@kunkel>
Message-ID: <AANLkTik9aEnq9F-v8SGmsT7-4ND0bTHZG_xtL6uKN3d0@mail.gmail.com>

On Thu, Mar 10, 2011 at 2:05 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Cedar;
> Glad to hear Biopython has been helping out with your work.
>
>> I'm trying to get the "raw" genbank locations for a downstream
>> application after parsing a genbank file. Is there any way to get at
>> this (or reproduce it)? As it is, the SeqRecord feature has start and
>> stop information for the whole feature, and a list of sub-features
>> each with its own start and stops. I'm looking for one concise text
>> string that describes the entire feature location, much like the
>> original raw genbank locations do.
>
> You can do this with the GenBank RecordParser, which doesn't parse
> the location strings:
>
>>>> from Bio.GenBank import RecordParser
>>>> parser = RecordParser()
>>>> handle = open("NT_019265.gb")
>>>> rec = parser.parse(handle)
>>>> for f in rec.features:
> ...     print f.location
> ...
> <cut>
>
> If you have SeqRecord objects from SeqIO you can do this in an ugly
> way by reaching into the internals of the GenBank writer:
>
>>>> from Bio import SeqIO
>>>> from Bio.SeqIO import InsdcIO
>>>> handle = open("NT_019265.gb")
>>>> for rec in SeqIO.parse(handle, "genbank"):
> ...     for f in rec.features:
> ...         print InsdcIO._insdc_feature_location_string(f, len(rec.seq))
> ...
> <cut>
>
> That might work for a quick hack but is not necessarily future proof
> if the internals change. Peter, do you think this would be useful to
> expose as a function of a SeqFeature directly, so you could do
> feature.insdc_string() or something similar?

A couple of people have asked for this, and since adding SeqIO
output in GenBank/EMBL format (the code you refer to in InsdcIO)
this would be very possible... the issue holding me back is the
annoying special case(s) that require knowing the parent sequence's
length. The problem is that currently the SeqFeature doesn't
have this information - it doesn't have any link back to a parent
SeqRecord (and indeed it doesn't even have to be created in
the context of a SeqRecord).

Perhaps we can handle the case of between features N^1 on
circular sequences of length N differently, maybe with a dedicated
SeqFeature location class which would tell us it was at the origin?
Then we'd be able to avoid the need to know the parent length.
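
To illustrate why the length matters, here is a simplified sketch of the
between-base case. It assumes the zero-length location is stored as a
single 0-based slice index; between_location is a made-up helper, not the
code in InsdcIO:

```python
def between_location(position, parent_length=None):
    """Format a zero-length 'between bases' location.

    position is the 0-based slice index, so 12 means the point
    between bases 12 and 13 in INSDC (1-based) counting.
    """
    if parent_length is not None and position == parent_length:
        # Between the last base and the first: the circular origin
        return "%i^1" % parent_length
    return "%i^%i" % (position, position + 1)
```

Without the parent length, between_location(12) can only answer "12^13",
even when the sequence is 12 bases long and "12^1" is what is wanted.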

Once that is resolved, an orphan SeqFeature could generate its
own INSDC (GenBank/EMBL) location string without needing any
extra information, and exposing this as an object method would
be fine.

Peter

P.S. If we ever add a CircularSeq object - see other thread- then
SeqFeature locations spanning the origin might need reworking
too.


From p.j.a.cock at googlemail.com  Thu Mar 10 09:00:51 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 09:00:51 +0000
Subject: [Biopython] .ab1 file parser in biopython?
In-Reply-To: <AANLkTin_q6vdKHeitaZU0RtuO3=YXuBwyQ5yptJOUErT@mail.gmail.com>
References: <AANLkTin_q6vdKHeitaZU0RtuO3=YXuBwyQ5yptJOUErT@mail.gmail.com>
Message-ID: <AANLkTim3T4ET40bbfszRr7KAAFW=Ae9Y7CJWpRo2-o3H@mail.gmail.com>

On Wed, Mar 9, 2011 at 10:57 PM, Arvind Subramaniam
<rasi at seas.harvard.edu> wrote:
> Hi
> I am new to biopython so please excuse me if this issue is obviously
> simple. I am trying to parse .ab1 sequencing trace files in Biopython
> and I cannot find the right module or method to do this job. Can
> someone suggest how I can parse .ab1 files?
> Thanks,
> Arvind.

You mean the ABI trace file format for capillary sequencing?

Personally I use EMBOSS seqret (e.g. to make FASTQ), or tracetuner
if I want to recall the bases (the ABI software doesn't always do the
best possible calling job).

http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/seqret.html
http://sourceforge.net/projects/tracetuner/

Peter


From chapmanb at 50mail.com  Thu Mar 10 11:06:48 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 10 Mar 2011 06:06:48 -0500
Subject: [Biopython] "raw" genbank locations?
In-Reply-To: <AANLkTik9aEnq9F-v8SGmsT7-4ND0bTHZG_xtL6uKN3d0@mail.gmail.com>
References: <A3104CC4-4890-4303-8118-EA62309D445D@u.washington.edu>
	<20110310020545.GA2185@kunkel>
	<AANLkTik9aEnq9F-v8SGmsT7-4ND0bTHZG_xtL6uKN3d0@mail.gmail.com>
Message-ID: <20110310110648.GA2302@kunkel>

Peter;

> > do you think this would be useful to
> > expose as a function of a SeqFeature directly, so you could do
> > feature.insdc_string() or something similar?
> 
> A couple of people have asked for this, and since adding SeqIO
> output in GenBank/EMBL format (the code you refer to in InsdcIO)
> this would be very possible... the issue holding me back is the
> annoying special case(s) that require knowing the parent sequence's
> length. The problem is that currently the SeqFeature doesn't
> have this information - it doesn't have any link back to a parent
> SeqRecord (and indeed it doesn't even have to be created in
> the context of a SeqRecord).
> 
> Perhaps we can handle the case of between features N^1 on
> circular sequences of length N differently, maybe with a dedicated
> SeqFeature location class which would tell us it was at the origin?
> Then we'd be able to avoid the need to know the parent length.

This is a great idea; makes sense to treat this as a special case
since that's what it is. Another simple way would be to put the
function on the SeqRecord class and call it with:
rec.insdc_feature_string(feature); this places the responsibility of
knowing the parent back on the library user. 

> P.S. If we ever add a CircularSeq object - see other thread- then
> SeqFeature locations spanning the origin might need reworking
> too.

Makes sense. We can get the 99% of standard cases working now and
then re-circle back on this once someone gets up the guts to tackle
CircularSeq.

Brad


From p.j.a.cock at googlemail.com  Thu Mar 10 11:52:48 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 11:52:48 +0000
Subject: [Biopython] "raw" genbank locations?
In-Reply-To: <20110310110648.GA2302@kunkel>
References: <A3104CC4-4890-4303-8118-EA62309D445D@u.washington.edu>
	<20110310020545.GA2185@kunkel>
	<AANLkTik9aEnq9F-v8SGmsT7-4ND0bTHZG_xtL6uKN3d0@mail.gmail.com>
	<20110310110648.GA2302@kunkel>
Message-ID: <AANLkTikdg=SaMguCiCbJwyuiBYbVi=CFgkdSKL+3j5ZY@mail.gmail.com>

On Thu, Mar 10, 2011 at 11:06 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Peter;
>
>> > do you think this would be useful to
>> > expose as a function of a SeqFeature directly, so you could do
>> > feature.insdc_string() or something similar?
>>
>> A couple of people have asked for this, and since adding SeqIO
>> output in GenBank/EMBL format (the code you refer to in InsdcIO)
>> this would be very possible... the issue holding me back is the
>> annoying special case(s) that require knowing the parent sequence's
>> length. The problem is that currently the SeqFeature doesn't
>> have this information - it doesn't have any link back to a parent
>> SeqRecord (and indeed it doesn't even have to be created in
>> the context of a SeqRecord).
>>
>> Perhaps we can handle the case of between features N^1 on
>> circular sequences of length N differently, maybe with a dedicated
>> SeqFeature location class which would tell us it was at the origin?
>> Then we'd be able to avoid the need to know the parent length.
>
> This is a great idea; makes sense to treat this as a special case
> since that's what it is.

It is probably the most elegant solution without a big refactor.

> Another simple way would be to put the
> function on the SeqRecord class and call it with:
> rec.insdc_feature_string(feature); this places the responsibility of
> knowing the parent back on the library user.

Yes, that would be simple. But don't we sometimes want to use
'orphan' SeqFeature objects (without a SeqRecord parent)?
I'm thinking here about GFF3 files and the like.

>> P.S. If we ever add a CircularSeq object - see other thread- then
>> SeqFeature locations spanning the origin might need reworking
>> too.
>
> Makes sense. We can get the 99% of standard cases working now and
> then re-circle back on this once someone gets up the guts to tackle
> CircularSeq.

:)

Peter


From rmb32 at cornell.edu  Thu Mar 10 17:15:41 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 10 Mar 2011 12:15:41 -0500
Subject: [Biopython] update Google Summer of Code project ideas
Message-ID: <4D79073D.3090603@cornell.edu>

Hi all,

Please make sure the Biopython information is up to date for 2011 on both
the OBF and Biopython wikis.  Eric has done some work on it, but the
current page has not been completely updated to reflect that it's 2011 
and we're applying again.

OBF wiki page: http://www.open-bio.org/wiki/Google_Summer_of_Code
BioPython wiki: http://biopython.org/wiki/Google_Summer_of_Code

Rob

----
Robert Buels
(prospective) 2011 OBF GSoC Organization Admin


From anaryin at gmail.com  Thu Mar 10 17:25:04 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 10 Mar 2011 18:25:04 +0100
Subject: [Biopython] update Google Summer of Code project ideas
In-Reply-To: <4D79073D.3090603@cornell.edu>
References: <4D79073D.3090603@cornell.edu>
Message-ID: <AANLkTinMWsPFmoigY2nPxPKfcj164CsXbWs8M71+xcuz@mail.gmail.com>

I updated the date and added the project from last year to the page, to show
we got another funded project.

Cheers,

J


From p.j.a.cock at googlemail.com  Thu Mar 10 17:42:58 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 17:42:58 +0000
Subject: [Biopython] Bugzilla -> Redmine migration
Message-ID: <AANLkTi=VuX3+ymNEo34f2XY1N-OmSGJ1MPFp7TNbginn@mail.gmail.com>

Hi all,

Anyone who has tried to file a bug recently will have noticed a big
red message "Sorry, entering bugs into the product Biopython has been
disabled."

The reason for this is the OBF team are about to move us (and all the
other Bio* projects using Bugzilla) to a Redmine server instead.
See http://www.redmine.org/

I expect this to be completed in the next few days (with all the old
bugs and accounts carried across). Hopefully this will include
integration with our git repository as well.

We'll make an announcement once it is ready. In the meantime, any new
bugs can be emailed to the mailing list as a short-term measure.

Peter


From laserson at mit.edu  Thu Mar 10 18:22:42 2011
From: laserson at mit.edu (Uri Laserson)
Date: Thu, 10 Mar 2011 13:22:42 -0500
Subject: [Biopython] .ab1 file parser in biopython?
In-Reply-To: <AANLkTim3T4ET40bbfszRr7KAAFW=Ae9Y7CJWpRo2-o3H@mail.gmail.com>
References: <AANLkTin_q6vdKHeitaZU0RtuO3=YXuBwyQ5yptJOUErT@mail.gmail.com>
	<AANLkTim3T4ET40bbfszRr7KAAFW=Ae9Y7CJWpRo2-o3H@mail.gmail.com>
Message-ID: <AANLkTimgkHz2N9SbQq6fgX76hxHCA5HKE78ezTO40j=q@mail.gmail.com>

I also found the following code lying around somewhere.  I copied it into
one of my repositories:

https://github.com/laserson/pytools/blob/master/ab1.py

"Python implementation of an ABIF file reader according to Applied
Biosystems' specifications" as specified in March 2007, it appears.

...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu


On Thu, Mar 10, 2011 at 04:00, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Wed, Mar 9, 2011 at 10:57 PM, Arvind Subramaniam
> <rasi at seas.harvard.edu> wrote:
> > Hi
> >  I am new to biopython so please excuse me if this issue is obviously
> > simple. I am trying to parse .ab1 sequencing trace files in Biopython
> > and I cannot find the right module or method to do this job. Can
> > someone suggest how I can parse .ab1 files?
> > Thanks,
> > Arvind.
>
> You mean the ABI trace file format for capillary sequencing?
>
> Personally I use EMBOSS seqret (e.g. to make FASTQ), or tracetuner
> if I want to recall the bases (the ABI software doesn't always do the
> best possible base-calling job).
>
> http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/seqret.html
> http://sourceforge.net/projects/tracetuner/
>
> Peter
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From p.j.a.cock at googlemail.com  Thu Mar 10 18:37:04 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 18:37:04 +0000
Subject: [Biopython] .ab1 file parser in biopython?
In-Reply-To: <AANLkTimgkHz2N9SbQq6fgX76hxHCA5HKE78ezTO40j=q@mail.gmail.com>
References: <AANLkTin_q6vdKHeitaZU0RtuO3=YXuBwyQ5yptJOUErT@mail.gmail.com>
	<AANLkTim3T4ET40bbfszRr7KAAFW=Ae9Y7CJWpRo2-o3H@mail.gmail.com>
	<AANLkTimgkHz2N9SbQq6fgX76hxHCA5HKE78ezTO40j=q@mail.gmail.com>
Message-ID: <AANLkTikzMpbhgF2+t0dDW=cOTVYYsg6+rCxwUcwKSHGG@mail.gmail.com>

On Thu, Mar 10, 2011 at 6:22 PM, Uri Laserson <laserson at mit.edu> wrote:
> I also found the following code lying around somewhere.  I copied it into
> one of my repositories:
>
> https://github.com/laserson/pytools/blob/master/ab1.py
>
> "Python implementation of an ABIF file reader according to Applied
> Biosystems' specifications" as specified in March 2007, it appears.
>

It's under the GPL license. If you contacted the named author, Francis
Wolinski, and he was willing to re-licence for Biopython to use, then we
could consider incorporating it.

Alternatively it shouldn't be too hard to reimplement it from scratch
based on the published specification (and go one step further and
consider output too).

http://www.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf

Note some care would be needed to work on Python 3, but we
can follow the example of our SFF parser here.

Is there actually a need for this though? As I said before, for my own
needs getting the ABI file into FASTQ format (or FASTA+QUAL) has
sufficed.

Peter


From cmckay at u.washington.edu  Thu Mar 10 21:51:42 2011
From: cmckay at u.washington.edu (Cedar McKay)
Date: Thu, 10 Mar 2011 13:51:42 -0800
Subject: [Biopython] "raw" genbank locations?
Message-ID: <D045666B-2637-401E-9FE5-02EF61C7BAF6@u.washington.edu>

Great! InsdcIO._insdc_feature_location_string was just what I needed. I was actually on the right track, trying to figure out how SeqIO wrote locations in genbank format, but your email arrived soon enough that I didn't have to finish the job. I realize this is a private method, so I would like an official way to do this.

Thanks so much guys, as usual, awesome service!

Cedar


From laserson at mit.edu  Thu Mar 10 22:07:46 2011
From: laserson at mit.edu (Uri Laserson)
Date: Thu, 10 Mar 2011 17:07:46 -0500
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
Message-ID: <AANLkTik-Pa7tPs=knepZYYu92-emHbR2ptYTHAhfxWJk@mail.gmail.com>

Say I have a SeqRecord called A and a SeqRecord called B.  A has a bunch of
SeqFeatures associated with it, while B has none.  I perform a gapped
alignment between the two sequences.  Now I want to copy the SeqFeatures
from A onto B in a way that respects the coordinates of all the features.

For example (and please use a fixed-width font for this):

         0                       1
         0 1 2 3 4   5   6 7 8 9 0 1 2 3 4 5 6 7 8 9
             FEATURE_1               FEATURE_2
          X X X X X X X X X       X X X X X X X X X
A   - - - a c g g t - - a c a g a c g t g a t a c g
          | | | | |     | | |   | | |   | | | | | |
B   a a a a c g g t g g a c a t a c g - g a t a c g

   0                   1                     2
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6  7  8 9 0 1 2 3


In sequence A, the coords of Feature 1 and Feature 2 should be (0,7) and
(10,19), respectively.  Now I want to copy it to sequence B, where the
feature coords should instead be (3,12) and (15,23).

Is there an easy way to do this in biopython already?  Or are there any
ideas for an elegant solution?

Thanks!

Uri


...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu


From p.j.a.cock at googlemail.com  Thu Mar 10 22:46:32 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 22:46:32 +0000
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
In-Reply-To: <AANLkTik-Pa7tPs=knepZYYu92-emHbR2ptYTHAhfxWJk@mail.gmail.com>
References: <AANLkTik-Pa7tPs=knepZYYu92-emHbR2ptYTHAhfxWJk@mail.gmail.com>
Message-ID: <AANLkTint4fjxkkknMOjZ0WqsEyhR1OL_joShuAoOg3m8@mail.gmail.com>

On Thu, Mar 10, 2011 at 10:07 PM, Uri Laserson <laserson at mit.edu> wrote:
> Say I have a SeqRecord called A and a SeqRecord called B.  A has a bunch of
> SeqFeatures associated with it, while B has none.  I perform a gapped
> alignment between the two sequences.  Now I want to copy the SeqFeatures
> from A onto B in a way that respects the coordinates of all the features.
>
> For example (and please use a fixed-width font for this):
> <cut>

I'm not quite sure I followed that figure.

> In sequence A, the coords of Feature 1 and Feature 2 should be (0,7) and
> (10,19), respectively.  Now I want to copy it to sequence B, where the
> feature coords should instead be (3,12) and (15,23).
>
> Is there an easy way to do this in biopython already?

No, but I'm not sure how advisable it is anyway (if I have
understood you right - see below).

> Or are there any ideas for an elegant solution?

I actually wanted to do something similar to this myself.
I had a draft genome I had annotated in GenBank format.
We did some more sequencing and/or I tweaked the
assembly, and I had a new very similar sequence in a
FASTA file, and I wanted to copy the old annotation over.

What I did was look for perfect matches between the regions
spanned by the features (no introns in this case), and that
meant all I needed to do was apply a shift to the SeqFeature
location. There is a (private) method _shift which helped here
(written for use in slicing a SeqRecord).
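
A minimal sketch of that exact-match approach, using plain strings
rather than SeqRecord objects. The function name find_shift and the toy
sequences are hypothetical; the returned offset is the kind of value one
would then apply to a feature location (for example with the private
_shift method mentioned above, which is not called here).

```python
def find_shift(old_seq, new_seq, start, end):
    """Return the offset to add to (start, end), or None.

    None means the region old_seq[start:end] has no perfect match in
    new_seq, or matches more than once (e.g. a repeat).
    """
    fragment = old_seq[start:end]
    first = new_seq.find(fragment)
    if first == -1:
        return None  # no perfect match
    if new_seq.find(fragment, first + 1) != -1:
        return None  # ambiguous: more than one perfect match
    return first - start


old = "ACGTTTGACCA"
new = "GG" + old  # toy "new assembly" with two extra leading bases
shift = find_shift(old, new, 4, 9)
```

A None result marks a feature to inspect by hand.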

In my case, that handled most of the annotation, and I did
the nasty cases by hand (since I wanted to examine what
had happened in the new assembly - it was a small genome).

In your case the start and end co-ordinates may be shifted
by different amounts (since you are doing gapped alignments).
This worries me as the length of your features can change.
For any gene or CDS features that is a problem (frame shifts).
Have you thought about that? Perhaps you're dealing with
non-coding features only?

Peter


From p.j.a.cock at googlemail.com  Fri Mar 11 09:53:16 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 11 Mar 2011 09:53:16 +0000
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
In-Reply-To: <AANLkTikc0387NYygsmoQJNfExJHdGJhNCde1Z5_Gkwyk@mail.gmail.com>
References: <AANLkTik-Pa7tPs=knepZYYu92-emHbR2ptYTHAhfxWJk@mail.gmail.com>
	<AANLkTint4fjxkkknMOjZ0WqsEyhR1OL_joShuAoOg3m8@mail.gmail.com>
	<AANLkTikc0387NYygsmoQJNfExJHdGJhNCde1Z5_Gkwyk@mail.gmail.com>
Message-ID: <AANLkTimyzVBDe48Vm=mak0j_ftfy5SXf__rOLg26Az-K@mail.gmail.com>

On Thu, Mar 10, 2011 at 11:25 PM, Uri Laserson <laserson at mit.edu> wrote:
>> I'm not quite sure I followed that figure.
>
> I think you understood perfectly.

Good - your text was clearer for me.

>> In your case the start and end co-ordinates may be shifted
>> by different amounts (since you are doing gapped alignments).
>> This worries me as the length of your features can change.
>> For any gene or CDS features that is a problem (frame shifts).
>> Have you thought about that? Perhaps you're dealing with
>> non-coding features only?
>
> That's exactly the complication here.  I have one reference sequence that is
> highly annotated, and I have a read that I want to align to it and transfer
> over the annotations to the corresponding positions.

OK - and do you want to worry about spotting frameshifts,
and updating the translation for CDS features?

> One way I can handle this situation is that when I actually build the
> pairwise gapped alignment (which I do manually), in addition to the actual
> gapped-sequence strings, I can generate two lists that contain the ungapped
> coordinates of each sequence (in my diagram, this is the numbering above and
> below).  Figuring out the new coords from the old coordinates is then a
> matter of matching the positions in the lists.  (Though perhaps it's easier
> to implement using dictionaries, so I don't have to search the lists I
> generated.)

Yes, that kind of technique is also useful for mapping between
gapped and ungapped coordinates in assembly files.
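
A minimal sketch of the dictionary variant, assuming the alignment is
available as two equal-length gapped strings with '-' as the gap
character. The function name column_map is hypothetical (it is not a
Biopython API); the two strings are the ones from the figure earlier in
this thread.

```python
def column_map(gapped_a, gapped_b, gap="-"):
    """Map ungapped positions in A to ungapped positions in B.

    Positions in A that are aligned to a gap in B get no entry, so the
    caller has to decide what to do with them.
    """
    if len(gapped_a) != len(gapped_b):
        raise ValueError("aligned strings must have equal length")
    a_to_b = {}
    pos_a = pos_b = 0
    for char_a, char_b in zip(gapped_a, gapped_b):
        if char_a != gap and char_b != gap:
            a_to_b[pos_a] = pos_b
        if char_a != gap:
            pos_a += 1
        if char_b != gap:
            pos_b += 1
    return a_to_b


aligned_a = "---acggt--acagacgtgatacg"
aligned_b = "aaaacggtggacatacg-gatacg"
a_to_b = column_map(aligned_a, aligned_b)
# Feature 1 at (0, 7) in A maps to (3, 12) in B.
feature_1_in_b = (a_to_b[0], a_to_b[7])
```

A feature whose start or end has no entry in the dictionary falls on a
gap in B and needs a policy of its own (skip it, or move to the nearest
mapped position).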

> Either way, in order to move the SeqFeature to the new sequence, should I
> make a deep copy of it and then manually modify the start and end coords?
> Uri

You could do, or create a new SeqFeature, or "steal" the old one and
modify it. The latter technique would probably be fastest since there
are no new objects to create, just a few integer attribute changes
(location positions), but is perhaps a bit risky if you don't comment
it clearly. If you do that, perhaps do this by popping the features
from the old SeqRecord's feature list, modify them, and add them
to the new SeqRecord's feature list.

If all your current annotation uses simple exact locations, life is
easier. If there are fuzzy locations, then using the location object's
private _shift method might be simplest.

Another query, are you going to look for inversions? In such
cases the strand needs flipping and the start/end interchanged.
The SeqRecord reverse complement method has to do this,
and therefore the SeqFeature and its location and position
classes all have a private _flip method.
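
The coordinate arithmetic behind such a flip can be sketched without
Biopython, assuming 0-based half-open (start, end) coordinates in the
Python slicing style and strands coded as +1/-1. The functions below are
hypothetical stand-ins for illustration, not the private _flip method
itself.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip(start, end, strand, length):
    """Coordinates of the same region on the reverse-complemented parent."""
    return length - end, length - start, -strand


seq = "ACGTTGCAAG"
new_start, new_end, new_strand = flip(2, 5, +1, len(seq))
```

The region keeps its length; only its position and strand change.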

[If you find these private methods useful, perhaps we can make
them public? Let us know]

Thanks,

Peter


From thamelry at binf.ku.dk  Fri Mar 11 13:08:55 2011
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Fri, 11 Mar 2011 14:08:55 +0100
Subject: [Biopython] update Google Summer of Code project ideas
In-Reply-To: <AANLkTinMWsPFmoigY2nPxPKfcj164CsXbWs8M71+xcuz@mail.gmail.com>
References: <4D79073D.3090603@cornell.edu>
	<AANLkTinMWsPFmoigY2nPxPKfcj164CsXbWs8M71+xcuz@mail.gmail.com>
Message-ID: <AANLkTik+Y6uDLKniKLi-AWV-it05rd9YTg0u_Tc_0jOy@mail.gmail.com>

Hi,

I've just added a proposal:

Mocapy++Biopython: from data to probabilistic models of biomolecules
<http://biopython.org/wiki/Google_Summer_of_Code#Mocapy.2B.2BBiopython:_from_data_to_probabilistic_models_of_biomolecules>

Cheers,

-- 
Thomas Hamelryck, Eng., Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://www.binf.ku.dk/research/structural_bioinformatics/


From laserson at mit.edu  Fri Mar 11 17:03:58 2011
From: laserson at mit.edu (Uri Laserson)
Date: Fri, 11 Mar 2011 12:03:58 -0500
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
In-Reply-To: <AANLkTimyzVBDe48Vm=mak0j_ftfy5SXf__rOLg26Az-K@mail.gmail.com>
References: <AANLkTik-Pa7tPs=knepZYYu92-emHbR2ptYTHAhfxWJk@mail.gmail.com>
	<AANLkTint4fjxkkknMOjZ0WqsEyhR1OL_joShuAoOg3m8@mail.gmail.com>
	<AANLkTikc0387NYygsmoQJNfExJHdGJhNCde1Z5_Gkwyk@mail.gmail.com>
	<AANLkTimyzVBDe48Vm=mak0j_ftfy5SXf__rOLg26Az-K@mail.gmail.com>
Message-ID: <AANLkTinzrbTRmsqX_z9DJH1fnMbvMCABqfJAEbeGh2Vo@mail.gmail.com>

>
> OK - and do you want to worry about spotting frameshifts,
> and updating the translation for CDS features?
>

I can retranslate the features myself, wary of any frameshifts.


> You could do, or create a new SeqFeature, or "steal" the old one and
> modify it. The latter technique would probably be fastest since there
> are no new objects to create, just a few integer attribute changes
> (location positions), but is perhaps a bit risky if you don't comment
> it clearly. If you do that, perhaps do this by popping the features
> from the old SeqRecord's feature list, modify them, and add them
> to the new SeqRecord's feature list.
>

I can't steal the features because the source of the features is a reference
sequence that I will reuse for millions of reads.  I will have to make a
copy.  You believe that building a new SeqFeature would be faster/safer than
using python's copy.deepcopy() method?


> Another query, are you going to look for inversions? In such
> cases the strand needs flipping and the start/end interchanged.
> The SeqRecord reverse complement method has to do this,
> and therefore the SeqFeature and its location and position
> classes all have a private _flip method.
>

All the reads will be reverse complemented to the coding orientation before
the transfer of the features, so I don't think this will be a problem.


> [If you find these private methods useful, perhaps we can make
> them public? Let us know]
>

It's hard to tell what the general API should be or what are the most common
use-cases.  For myself, I can get by with writing my own methods to modify
the coordinates accordingly.

Uri


From p.j.a.cock at googlemail.com  Fri Mar 11 17:15:09 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 11 Mar 2011 17:15:09 +0000
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
In-Reply-To: <AANLkTinzrbTRmsqX_z9DJH1fnMbvMCABqfJAEbeGh2Vo@mail.gmail.com>
References: <AANLkTik-Pa7tPs=knepZYYu92-emHbR2ptYTHAhfxWJk@mail.gmail.com>
	<AANLkTint4fjxkkknMOjZ0WqsEyhR1OL_joShuAoOg3m8@mail.gmail.com>
	<AANLkTikc0387NYygsmoQJNfExJHdGJhNCde1Z5_Gkwyk@mail.gmail.com>
	<AANLkTimyzVBDe48Vm=mak0j_ftfy5SXf__rOLg26Az-K@mail.gmail.com>
	<AANLkTinzrbTRmsqX_z9DJH1fnMbvMCABqfJAEbeGh2Vo@mail.gmail.com>
Message-ID: <AANLkTi=sKcpmoc+0bpfask_Q2=v2kXJq6y5Pxji4y8o0@mail.gmail.com>

On Fri, Mar 11, 2011 at 5:03 PM, Uri Laserson <laserson at mit.edu> wrote:
>> You could do, or create a new SeqFeature, or "steal" the old one and
>> modify it. The latter technique would probably be fastest since there
>> are no new objects to create, just a few integer attribute changes
>> (location positions), but is perhaps a bit risky if you don't comment
>> it clearly. If you do that, perhaps do this by popping the features
>> from the old SeqRecord's feature list, modify them, and add them
>> to the new SeqRecord's feature list.
>
> I can't steal the features because the source of the features is a reference
> sequence that I will reuse for millions of reads.  I will have to make a
> copy.  You believe that building a new SeqFeature would be faster/safer than
> using python's copy.deepcopy() method?

Yes, in this case you will have to make a copy. As to speed,
I'm not sure which would be faster - try it and see ;)
Note as long as you are not going to *change* the information
in the qualifiers dictionary (and you may want to if you update
the translation for example), then you can have the new
SeqFeature share the old qualifiers dictionary. That is a bit
sneaky but may help with speed (if speed is an issue).
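
The two options can be sketched with a small stand-in class, assuming a
feature is just coordinates, a type and a qualifiers dictionary. The
class Feature and both function names are hypothetical and much simpler
than a real SeqFeature; the point is only which objects end up shared.

```python
import copy


class Feature(object):
    """Hypothetical stand-in for a SeqFeature."""

    def __init__(self, start, end, kind, qualifiers):
        self.start = start
        self.end = end
        self.kind = kind
        self.qualifiers = qualifiers


def transfer_deep(feature, new_start, new_end):
    """Fully independent copy, including every qualifier value."""
    clone = copy.deepcopy(feature)
    clone.start, clone.end = new_start, new_end
    return clone


def transfer_shared(feature, new_start, new_end):
    """New object that reuses the reference feature's qualifiers dict."""
    return Feature(new_start, new_end, feature.kind, feature.qualifiers)


reference = Feature(0, 7, "misc_feature", {"note": ["FEATURE_1"]})
deep = transfer_deep(reference, 3, 12)
shared = transfer_shared(reference, 3, 12)
```

With the shared variant, any later change to shared.qualifiers also
changes the reference feature, so it only suits read-only qualifiers.
Which variant is faster in practice would have to be measured (for
example with the timeit module).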

>> [If you find these private methods useful, perhaps we can make
>> them public? Let us know]
>
> It's hard to tell what the general API should be or what are the most common
> use-cases.  For myself, I can get by with writing my own methods to modify
> the coordinates accordingly.

Thanks,

Peter


From reece at harts.net  Mon Mar 14 18:22:52 2011
From: reece at harts.net (Reece Hart)
Date: Mon, 14 Mar 2011 11:22:52 -0700
Subject: [Biopython] update Google Summer of Code project ideas
In-Reply-To: <AANLkTik+Y6uDLKniKLi-AWV-it05rd9YTg0u_Tc_0jOy@mail.gmail.com>
References: <4D79073D.3090603@cornell.edu>
	<AANLkTinMWsPFmoigY2nPxPKfcj164CsXbWs8M71+xcuz@mail.gmail.com>
	<AANLkTik+Y6uDLKniKLi-AWV-it05rd9YTg0u_Tc_0jOy@mail.gmail.com>
Message-ID: <AANLkTin4xadMNnjj_o-XGxGmjcX79CSfkLoKwhubjkzi@mail.gmail.com>

All-

I just added a GSoC Biopython proposal:
Variant representation, parser, generator, and coordinate
converter<http://biopython.org/wiki/Google_Summer_of_Code#Variant_representation.2C_parser.2C_generator.2C_and_coordinate_converter>

Comments and co-mentors welcome.

-Reece


From 2huggie at gmail.com  Wed Mar 16 08:26:44 2011
From: 2huggie at gmail.com (Timothy Wu)
Date: Wed, 16 Mar 2011 16:26:44 +0800
Subject: [Biopython] [BioPython] Genbank parser
Message-ID: <AANLkTik2-6n_F3-mgFHEDOiSWWmnpXE9Xjn_CMThvfu8@mail.gmail.com>

Hi,

I'm using Biopython to parse human genome files with code like this:

        for seq_record in SeqIO.parse(fd, "genbank"):
            pass  # do something with seq_record

However something tripped on me:

Traceback (most recent call last):
  File "./buildSyn.py", line 26, in <module>
    main()
  File "./buildSyn.py", line 19, in main
    gene2SynMapping, syn2GeneMapping = mapper.getMappingDicts(files)
  File "/home/thw/MyPythonPackage/frameworks/BioProg/idmapping/idmapper/human_genome_id_mapper.py", line 29, in getMappingDicts
    self.parseAndGetMapping(fd, gene2syn)
  File "/home/thw/MyPythonPackage/frameworks/BioProg/idmapping/idmapper/human_genome_id_mapper.py", line 74, in parseAndGetMapping
    for seq_record in SeqIO.parse(fd, "genbank"):
  File "/usr/lib/pymodules/python2.6/Bio/SeqIO/__init__.py", line 525, in parse
    for r in i:
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 437, in parse_records
    record = self.parse(handle, do_features)
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 420, in parse
    if self.feed(handle, consumer, do_features):
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 392, in feed
    self._feed_feature_table(consumer, self.parse_features(skip=False))
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 344, in _feed_feature_table
    consumer.location(location_string)
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/__init__.py", line 975, in location
    raise LocationParserError(location_line)
Bio.GenBank.LocationParserError: 958574^958575..958886

The Genbank file involved has the following structure:

    CDS             958574^958575..958772
                     /gene="CSH2"
                     /gene_synonym="CS-2; CSB; hCS-B"
                     /exception="unclassified translation discrepancy"
                     /note="placental lactogen; chorionic somatomammotropin B;
                     Derived by automated computational analysis using gene
                     prediction method: Curated Genomic."
                     /codon_start=1
                     /product="chorionic somatomammotropin hormone 2 isoform
                     3"
                     /protein_id="NP_072171.1"
                     /db_xref="GI:12408694"
                     /db_xref="CCDS:CCDS42368.1"
                     /db_xref="GeneID:1443"
                     /db_xref="HGNC:2441"
                     /db_xref="MIM:118820"

This isn't the first occurrence in this file; however, I manually deleted
the equivalent of "^958575" in the location and it works out OK.

Is there something I can do? Right now I edit the genbank file instead
(since I won't be needing the location information)
And I'm not sure what the caret is supposed to represent.

Thanks for your attention.

Timothy


From p.j.a.cock at googlemail.com  Wed Mar 16 11:43:28 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 16 Mar 2011 11:43:28 +0000
Subject: [Biopython] [BioPython] Genbank parser
In-Reply-To: <AANLkTik2-6n_F3-mgFHEDOiSWWmnpXE9Xjn_CMThvfu8@mail.gmail.com>
References: <AANLkTik2-6n_F3-mgFHEDOiSWWmnpXE9Xjn_CMThvfu8@mail.gmail.com>
Message-ID: <AANLkTi=O8btp9Yheqs5jx1TR+g-2MBj_XZ6E0aq3cXkf@mail.gmail.com>

On Wed, Mar 16, 2011 at 8:26 AM, Timothy Wu <2huggie at gmail.com> wrote:
> Hi,
>
> I'm using Biopython to parse human genome files with code like this:
>
>         for seq_record in SeqIO.parse(fd, "genbank"):
>             pass  # do something with seq_record
>
> However something tripped on me:
>
> Traceback (most recent call last):
> ...
>     raise LocationParserError(location_line)
> Bio.GenBank.LocationParserError: 958574^958575..958886
>
> The Genbank file involved has the following structure:
>
>     CDS             958574^958575..958772
>                      /gene="CSH2"
> ...
>
> This isn't the first occurrence in this file, however I manually deleted
> the equivalent of "^958575" in the location and it works out OK.
>
> Is there something I can do? Right now I edit the genbank file instead
> (since I won't be needing the location information)
> And I'm not sure what the caret is supposed to represent.

Hi Timothy,

I believe this to be an invalid GenBank file, and I would like you
to contact the NCBI to check this. The caret is used for 'between'.
Here it seems to be saying this feature starts between
958574 and 958575, and runs to 958772. That would normally
be represented just as 958575..958772
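
If hand-editing gets tedious, the same rewrite could be scripted as a
preprocessing step before parsing. This is only a sketch of a stop-gap:
it assumes the reading above is right (a start "between N and N+1" is
equivalent to a start at N+1), it changes the input data, and the
function name is hypothetical. Plain between-base locations such as
123^124 are left alone.

```python
import re

# A 'between' start immediately followed by a range end, e.g. 10^11..50
_BETWEEN_START = re.compile(r"(\d+)\^(\d+)(\.\.\d+)")


def simplify_between_start(line):
    """Rewrite N^(N+1)..M as (N+1)..M; leave everything else untouched."""

    def replace(match):
        first, second, rest = match.groups()
        if int(second) == int(first) + 1:
            return second + rest
        return match.group(0)  # not adjacent positions: do not guess

    return _BETWEEN_START.sub(replace, line)


fixed = simplify_between_start("     CDS             958574^958575..958772")
```

The function would be applied line by line to the GenBank text, with the
result written to a new file (or an in-memory handle) that is then
passed to the parser.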

See also:
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
http://redmine.open-bio.org/issues/3175
(we're migrating the bug database, official announcement
due soon)

How many of these 'broken' GenBank records have you
found? I would hope it is just one or two that can be fixed by
hand. If on the other hand the NCBI say this is valid, we need
to handle this in the Biopython feature model...

Peter


From cjfields at illinois.edu  Wed Mar 16 17:58:23 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 16 Mar 2011 12:58:23 -0500
Subject: [Biopython] [ANNOUNCEMENT] Bugzilla to Redmine migration
Message-ID: <34C8C0CB-9273-468E-86D7-74B22464F181@illinois.edu>

(apologies if you receive multiple copies of this)

All,

We are currently about 95% done with a transition over to our new Redmine tracking system, to the point where we feel comfortable in going ahead with opening it to developers:

http://redmine.open-bio.org/

All edits to bugzilla reports on our old system (http://bugzilla.open-bio.org/) are now disabled and the system is now read-only.  Any new bugs and comments to old ones should be reported on the new Redmine server.

For current Bugzilla users, we have migrated login IDs to Redmine (this is normally an email address), but we have reset user passwords for security reasons.  There are two ways to access your account:

1) When logging in (http://redmine.open-bio.org/login), click on the 'Lost password' link.  You will be prompted for your email address (this should be the same as your bugzilla login).  A new email will be sent out containing directions for resetting your password and logging in.

2) It is possible the above may be automatically detected as spam.  If the above doesn't work or the reset email isn't received within a day, contact support at helpdesk.open-bio.org to receive your new password.

Also, note that Redmine has a different syntax for those who want to add links to their reports; see http://www.redmine.org/projects/redmine/wiki/RedmineTextFormatting.

Let us know if you have any questions.  

chris

Christopher Fields
IGB Postdoctoral Fellow
Genomics of Neural & Behavioral Plasticity
University of Illinois Urbana-Champaign
Institute for Genomic Biology
1206 W. Gregory Dr. , MC-195
Urbana, IL 61801


From rmb32 at cornell.edu  Fri Mar 18 19:23:37 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Fri, 18 Mar 2011 15:23:37 -0400
Subject: [Biopython] Google Summer of Code is *ON* for OBF projects!
Message-ID: <4D83B139.4010803@cornell.edu>

Hi all,

Great news: Google announced today that the Open Bioinformatics
Foundation has been accepted as a mentoring organization for this
summer's Google Summer of Code!

GSoC is a Google-sponsored student internship program for open-source
projects, open to students from around the world (not just US
residents).   Students are paid a $5000 USD stipend to work as a
developer on an open-source project for the summer. For more on GSoC,
see GSoC 2011 FAQ at http://bit.ly/hpoz8W

Student applications are due April 8, 2011 at 19:00 UTC.  Students who
are interested in participating should look at the OBF's GSoC page at
http://open-bio.org/wiki/Google_Summer_of_Code, which lists project
ideas, and whom to contact about applying.

For current developers on OBF projects, please consider volunteering to
be a mentor if you have not already, and contribute project ideas.  Just
list your name and project ideas on the OBF wiki and on the relevant
project's GSoC wiki page.

Thanks to all who helped make OBF's application to GSoC a success, and
let's have a great, productive summer of code!

Rob Buels
OBF GSoC 2011 Administrator


From laserson at mit.edu  Mon Mar 21 23:38:10 2011
From: laserson at mit.edu (Uri Laserson)
Date: Mon, 21 Mar 2011 19:38:10 -0400
Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC
	formats?
Message-ID: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>

If I load a GenBank-formatted record:

    a = next(SeqIO.parse('myfile.gb', 'gb'))

then set an annotation:

    a.annotations['myannotation'] = 'saveme'

and then format the SeqRecord object as GenBank:

    a.format('gb')

then 'myannotation' is lost.

Is this expected behavior?  If so, that's a huge bummer...what is the
suggested method to store my own annotations in INSDC formats?

Uri


...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu


From p.j.a.cock at googlemail.com  Tue Mar 22 09:22:17 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Mar 2011 09:22:17 +0000
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
Message-ID: <AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>

On Mon, Mar 21, 2011 at 11:38 PM, Uri Laserson <laserson at mit.edu> wrote:
> If I load a GenBank-formatted record:
>
>     a = next(SeqIO.parse('myfile.gb', 'gb'))
>
> then set an annotation:
>
>     a.annotations['myannotation'] = 'saveme'
>
> and then format the SeqRecord object as GenBank:
>
>     a.format('gb')
>
> then 'myannotation' is lost.

It isn't 'lost' in that it is still in your SeqRecord object in
memory, but it isn't in the GenBank format output.

> Is this expected behavior?

Yes, there is no general field for record level annotation in the
GenBank or EMBL file formats. Where did you expect it to be
written? The same thing would happen with most file formats,
e.g. FASTA has no annotation support at all beyond the free
text description line.

> If so, that's a huge bummer...what is the suggested method to
> store my own annotations in INSDC formats?

You could stuff record level information into a source feature's
qualifier dictionary. It isn't elegant, but it would work. The NCBI
seems to have introduced the source feature primarily to store the
taxon identifier and other little bits of information not handled
explicitly in the header lines. (Plus this can handle chimeras, which
may have been a use case).

Peter


From laserson at mit.edu  Tue Mar 22 15:08:08 2011
From: laserson at mit.edu (Uri Laserson)
Date: Tue, 22 Mar 2011 11:08:08 -0400
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
	<AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
Message-ID: <AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>

>
> You could stuff record level information into a source feature's
> qualifier dictionary.


What are the allowed types for the values of the qualifiers dictionary (that
will be output correctly in INSDC)?  Is it possible to have lists of
strings?

What is the standard practice: a feature of type "source" that runs the
entire length of the sequence?  Or is it possible to have a SeqFeature with
no position annotation?  Ideally, if I slice the SeqFeature, I would like
these annotations to stay with the slice no matter what.

Uri


From p.j.a.cock at googlemail.com  Tue Mar 22 15:30:46 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Mar 2011 15:30:46 +0000
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
	<AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
	<AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>
Message-ID: <AANLkTimk+twNBrFtYLz-4SQCZEdNBHkbnmok=aPp54=+@mail.gmail.com>

On Tue, Mar 22, 2011 at 3:08 PM, Uri Laserson <laserson at mit.edu> wrote:
>> You could stuff record level information into a source feature's
>> qualifier dictionary.
>
> What are the allowed types for the values of the qualifiers dictionary
> (that will be output correctly in INSDC)?  Is it possible to have lists of
> strings?

As far as the current Biopython output goes, you can basically use any
(short) string as a qualifier key. Avoid keys with spaces in them (INSDC
use underscores) and other funny characters. For strict INSDC compliance
there is probably a white list of allowed feature types...

> What is the standard practice: a feature of type "source" that runs the
> entire length of the sequence?  Or is it possible to have a SeqFeature with
> no position annotation?  Ideally, if I slice the SeqFeature, I would like
> these annotations to stay with the slice no matter what.

If you did have a SeqFeature without a location, we couldn't write
it out in GenBank/EMBL format (the error handling here might be
improved).

If you have a SeqRecord with a (source) feature spanning the full
sequence, and you slice the SeqRecord to take a subsequence,
then that full length feature (and any other features not fully within
the subsequence) would be lost.

Using a source feature is really just a work around for the fact that
GenBank/EMBL do not support arbitrary record level annotation.
Do you have to use this as your output format? Would you not be
better off with using a database or something else instead?

Peter
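
The slicing rule described above can be illustrated with a toy model. This is only a sketch and not Biopython code: features are plain (start, end) tuples with Python-style half-open coordinates, and `slice_features` is a hypothetical helper. It keeps a feature only when it lies fully inside the slice and shifts the surviving coordinates, which is how SeqRecord slicing is understood to behave here.

```python
def slice_features(features, start, stop):
    """Return the features that survive slicing to [start:stop].

    Toy model: each feature is a (start, end) tuple, half-open like
    Python slices. A feature is kept only if it lies fully inside the
    slice, and its coordinates are shifted to be relative to the slice.
    """
    kept = []
    for f_start, f_end in features:
        if start <= f_start and f_end <= stop:
            kept.append((f_start - start, f_end - start))
    return kept


# A 100 bp record with a full-length "source" feature and a small gene.
features = [(0, 100), (20, 40)]

# Slicing out 10..50 keeps the gene but drops the full-length feature.
print(slice_features(features, 10, 50))
```

Any record-level information parked in the full-length feature disappears with it, which is why the timing of when such a feature is created matters.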


From laserson at mit.edu  Tue Mar 22 15:44:02 2011
From: laserson at mit.edu (Uri Laserson)
Date: Tue, 22 Mar 2011 11:44:02 -0400
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTimk+twNBrFtYLz-4SQCZEdNBHkbnmok=aPp54=+@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
	<AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
	<AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>
	<AANLkTimk+twNBrFtYLz-4SQCZEdNBHkbnmok=aPp54=+@mail.gmail.com>
Message-ID: <AANLkTik0xCVH8NaUJEbjKa6MbTjiPbN+KeR0KsFVZBb-@mail.gmail.com>

>
> As far as the current Biopython output goes, you can basically use any
> (short) string as a qualifier key.
>

Sorry, I meant for the values, not the keys.  Can you have a list of strings
as a value?


> Using a source feature is really just a work around for the fact that
> GenBank/EMBL do not support arbitrary record level annotation.
> Do you have to use this as your output format?


Agreed.  Essentially, I have a huge pile of sequencing reads that are highly
annotated.  For any given read, there are some annotations that are
independent of the sequence itself (which is what I am trying to implement
now) and there are some annotations that are associated with subsequences
(which is why SeqFeatures are very appropriate).  Ideally, I want a file
format that will store the data, be easily parsable (and fast), and can be
readable using something like `less` (though this last feature is less
important).


> Would you not be
> better off with using a database or something else instead?
>

Well, initially I used XML to store the data, but I quickly realized I was
reinventing the wheel, especially when it came to annotating features on top
of the sequences.

Are you suggesting something like SQLite?  How would I deal with
SeqFeature-type annotations?

Uri


>  Peter
>


From p.j.a.cock at googlemail.com  Tue Mar 22 16:14:05 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Mar 2011 16:14:05 +0000
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTik0xCVH8NaUJEbjKa6MbTjiPbN+KeR0KsFVZBb-@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
	<AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
	<AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>
	<AANLkTimk+twNBrFtYLz-4SQCZEdNBHkbnmok=aPp54=+@mail.gmail.com>
	<AANLkTik0xCVH8NaUJEbjKa6MbTjiPbN+KeR0KsFVZBb-@mail.gmail.com>
Message-ID: <AANLkTikxn+T4449LLFY0+bHFawYUty1VgsqYUKkj9chz@mail.gmail.com>

On Tue, Mar 22, 2011 at 3:44 PM, Uri Laserson <laserson at mit.edu> wrote:
>> As far as the current Biopython output goes, you can basically use any
>> (short) string as a qualifier key.
>
> Sorry, I meant for the values, not the keys.  Can you have a list of strings
> as a value?

Right. Again yes, plus I think a single string as the value should work.
This is because the INSDC feature table allows multiple values for a
tag - for example you often get multiple database cross references.

>> Using a source feature is really just a work around for the fact that
>> GenBank/EMBL do not support arbitrary record level annotation.
>> Do you have to use this as your output format?
>
> Agreed.  Essentially, I have a huge pile of sequencing reads that are highly
> annotated.  For any given read, there are some annotations that are
> independent of the sequence itself (which is what I am trying to implement
> now) and there are some annotations that are associated with subsequences
> (which is why SeqFeatures are very appropriate).  Ideally, I want a file
> format that will store the data, be easily parsable (and fast), and can be
> readable using something like `less` (though this last feature is less
> important).

For this the GenBank/EMBL format with the source feature trick
does sound workable. You just need to be careful how and
when you create the dummy source feature - I'd do it at the last
moment before writing out the file, and in that way you can avoid
things like slicing throwing it away.

>> Would you not be
>> better off with using a database or something else instead?
>
> Well, initially I used XML to store the data, but I quickly realized I was
> reinventing the wheel, especially when it came to annotating features
> on top of the sequences.

I wonder if one of the INSDC XML formats would work nicely here?
i.e. If they can be extended more easily. We should look at adding a
parser for them to Biopython (and write support too ideally of course).

> Are you suggesting something like SQLite?  How would I deal with
> SeqFeature-type annotations?

I was thinking you could use the BioSQL schema (run on SQLite if
you wanted to, or MySQL or PostgreSQL etc). You'd still face the
same issues if/when you wanted to dump the annotated records
to a plain text file though.

Peter
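
A minimal sketch of the "create the dummy source feature at the last moment" idea follows. The helper names `annotations_to_qualifiers` and `add_dummy_source` are hypothetical, not part of Biopython. The mapping turns every value into a list of strings (qualifiers may hold several values) and replaces spaces in keys with underscores. `add_dummy_source` assumes its argument is a Biopython SeqRecord and imports Biopython only when called.

```python
def annotations_to_qualifiers(annotations):
    """Map a record-level annotations dict to feature qualifiers.

    Keys have spaces replaced by underscores (INSDC style); every value
    becomes a list of strings, since a qualifier may hold several values.
    """
    qualifiers = {}
    for key, value in annotations.items():
        clean_key = str(key).replace(" ", "_")
        if isinstance(value, (list, tuple)):
            qualifiers[clean_key] = [str(v) for v in value]
        else:
            qualifiers[clean_key] = [str(value)]
    return qualifiers


def add_dummy_source(record):
    """Prepend a full-length 'source' feature just before writing.

    Assumes `record` is a Biopython SeqRecord; the import is kept local
    so the mapping function above can be used without Biopython.
    """
    from Bio.SeqFeature import SeqFeature, FeatureLocation

    source = SeqFeature(
        FeatureLocation(0, len(record.seq)),
        type="source",
        qualifiers=annotations_to_qualifiers(record.annotations),
    )
    record.features.insert(0, source)
    return record


# Made-up annotations, purely for illustration.
print(annotations_to_qualifiers({"read group": "A1", "primers": ["fwd", "rev"]}))
```

Calling `add_dummy_source(record)` immediately before writing the record out means the dummy feature never exists while records are being sliced.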


From laserson at mit.edu  Tue Mar 22 16:58:03 2011
From: laserson at mit.edu (Uri Laserson)
Date: Tue, 22 Mar 2011 12:58:03 -0400
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTikxn+T4449LLFY0+bHFawYUty1VgsqYUKkj9chz@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
	<AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
	<AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>
	<AANLkTimk+twNBrFtYLz-4SQCZEdNBHkbnmok=aPp54=+@mail.gmail.com>
	<AANLkTik0xCVH8NaUJEbjKa6MbTjiPbN+KeR0KsFVZBb-@mail.gmail.com>
	<AANLkTikxn+T4449LLFY0+bHFawYUty1VgsqYUKkj9chz@mail.gmail.com>
Message-ID: <AANLkTikh+vjPbVoUw56R5VQuN6Zgg5WVDyf_i36--+Gr@mail.gmail.com>

>
> For this the GenBank/EMBL format with the source feature trick
> does sound workable. You just need to be careful how and
> when you create the dummy source feature - I'd do it at the last
> moment before writing out the file, and in that way you can avoid
> things like slicing throwing it away.
>
>
That's a good idea.  This should be even easier since I am subclassing
SeqRecord.  I can override `format` to first take the whole annotations
dictionary and dump it into the qualifiers dictionary of a `source` feature.
 I also have my own parser which wraps SeqIO; using SeqIO to parse the
'imgt' format, I can then copy the `source` qualifiers to the annotations
dictionary and delete `source` feature entirely.  Does this sound
reasonable?


> I wonder if one of the INSDC XML formats would work nicely here?
> i.e. If they can be extended more easily. We should look at adding a
> parser for them to Biopython (and write support too ideally of course).
>

My only issue with this is that I'd rather not extend anyone's file format,
but use a standard file format that fits my purpose.  Otherwise, I might as
well just go straight for a database, as below.  (But there are some
super-fast XML parsers out there.)


> I was thinking you could use the BioSQL schema (run on SQLite if
> you wanted to, or MySQL or PostgreSQL etc). You'd still face the
> same issues if/when you wanted to dump the annotated records
> to a plain text file though.
>

I suppose plain text readability is less important to me than ease of
sharing the data.  But when I dump a SeqRecord object to a BioSQL database,
does it do it in a way that I can rebuild that object exactly with no loss
of information? (I.e., does it solve the annotation dictionary problem that
started this whole thread?)

Uri


From p.j.a.cock at googlemail.com  Tue Mar 22 17:24:46 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Mar 2011 17:24:46 +0000
Subject: [Biopython] User-defined SeqRecord annotations are trashed in
 INSDC formats?
In-Reply-To: <AANLkTikh+vjPbVoUw56R5VQuN6Zgg5WVDyf_i36--+Gr@mail.gmail.com>
References: <AANLkTik=mrv8k5uPMtVvw7X1S=Sj_cV54rK+syu9PNcS@mail.gmail.com>
	<AANLkTim9Pt_=ovMcw056kRncGB095+Exan+ge+A3aJyw@mail.gmail.com>
	<AANLkTi=iFYteknFhwkCFRhGtuiRH0wKdUJ4GFijfpCeo@mail.gmail.com>
	<AANLkTimk+twNBrFtYLz-4SQCZEdNBHkbnmok=aPp54=+@mail.gmail.com>
	<AANLkTik0xCVH8NaUJEbjKa6MbTjiPbN+KeR0KsFVZBb-@mail.gmail.com>
	<AANLkTikxn+T4449LLFY0+bHFawYUty1VgsqYUKkj9chz@mail.gmail.com>
	<AANLkTikh+vjPbVoUw56R5VQuN6Zgg5WVDyf_i36--+Gr@mail.gmail.com>
Message-ID: <AANLkTi=1=-j7mgZBp1MfGguKE2F4wpfjZn5=mZx+31ML@mail.gmail.com>

On Tue, Mar 22, 2011 at 4:58 PM, Uri Laserson <laserson at mit.edu> wrote:
>> For this the GenBank/EMBL format with the source feature trick
>> does sound workable. You just need to be careful how and
>> when you create the dummy source feature - I'd do it at the last
>> moment before writing out the file, and in that way you can avoid
>> things like slicing throwing it away.
>
> That's a good idea.  This should be even easier since I am subclassing
> SeqRecord.  I can override `format` to first take the whole annotations
> dictionary and dump it into the qualifiers dictionary of a `source` feature.
> I also have my own parser which wraps SeqIO; using SeqIO to parse the
> 'imgt' format, I can then copy the `source` qualifiers to the annotations
> dictionary and delete `source` feature entirely.  Does this sound
> reasonable?

Yes, using your own parser/writer to take care of mapping between
the SeqRecord annotations dictionary and a dummy feature sounds
sensible. Also using 'imgt' rather than GenBank or EMBL will let you
have longer feature qualifier keys - but these files are not as widely
used/supported as the GenBank and EMBL formats.
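
The parsing direction of this mapping (moving the `source` qualifiers back into the annotations dictionary and dropping the dummy feature) might look like the sketch below. `restore_annotations` is a hypothetical helper. It only relies on the `.annotations`, `.features`, `.type` and `.qualifiers` attributes, so the demonstration uses simple stand-in objects with made-up values instead of real SeqRecord and SeqFeature instances.

```python
from types import SimpleNamespace


def restore_annotations(record):
    """Move the qualifiers of 'source' features back into annotations.

    Works on any object with `.annotations` (dict) and `.features`
    (list of objects with `.type` and `.qualifiers`), such as a
    Biopython SeqRecord. The source features are removed afterwards.
    """
    remaining = []
    for feature in record.features:
        if feature.type == "source":
            record.annotations.update(feature.qualifiers)
        else:
            remaining.append(feature)
    record.features[:] = remaining
    return record


# Stand-in objects so the sketch runs without Biopython installed.
source = SimpleNamespace(type="source", qualifiers={"read_group": ["A1"]})
gene = SimpleNamespace(type="gene", qualifiers={"gene": ["IGHV1"]})
record = SimpleNamespace(annotations={}, features=[source, gene])

restore_annotations(record)
print(record.annotations, [f.type for f in record.features])
```

Note that the restored values stay as lists of strings, so a value that started life as a plain string or a number would need converting back by the caller.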

>> I wonder if one of the INSDC XML formats would work nicely here?
>> i.e. If they can be extended more easily. We should look at adding a
>> parser for them to Biopython (and write support too ideally of course).
>
> My only issue with this is that I'd rather not extend anyone's file format,
> but use a standard file format that fits my purpose.  Otherwise, I might as
> well just go straight for a database, as below.  (But there are some
> super-fast XML parsers out there.)

I haven't looked at the details to see if those XML file formats have
a nice open ended misc annotation tag you could just use.

>> I was thinking you could use the BioSQL schema (run on SQLite if
>> you wanted to, or MySQL or PostgreSQL etc). You'd still face the
>> same issues if/when you wanted to dump the annotated records
>> to a plain text file though.
>
> I suppose plain text readability is less important to me than ease of
> sharing the data.  But when I dump a SeqRecord object to a BioSQL
> database, does it do it in a way that I can rebuild that object exactly
> with no loss of information? (I.e., does it solve the annotation dictionary
> problem that started this whole thread?)

Basically yes, subject to a few provisos, it should. Firstly note we
don't support any per-letter-annotation in BioSQL. Secondly, all
the SeqRecord annotations and SeqFeature qualifiers will end up being
stored as strings (in table bioentry_qualifier_value and table
seqfeature_qualifier_value respectively). There may also be some
fun with string values vs single entry lists containing one string.

Peter
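
To check whether a round trip through a store that keeps everything as strings has preserved the annotations, one option is to normalise both sides before comparing. The sketch below is not BioSQL code. `as_string_list` and `same_annotations` are hypothetical helpers that treat a single string and a one-entry list as equal, and a number and its string form as equal. The example dictionaries are made up.

```python
def as_string_list(value):
    """Normalise an annotation value to a list of strings."""
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    return [str(value)]


def same_annotations(before, after):
    """Compare two annotation dicts, ignoring str vs [str] and int vs str."""
    if set(before) != set(after):
        return False
    return all(
        as_string_list(before[key]) == as_string_list(after[key])
        for key in before
    )


# Made-up values: what was stored, and what might come back.
original = {"depth": 12, "primers": ["fwd", "rev"], "run": "A1"}
reloaded = {"depth": "12", "primers": ["fwd", "rev"], "run": ["A1"]}
print(same_annotations(original, reloaded))
```

This only tells you the content survived. Code that needs the original types back (integers, single strings) would still have to convert them explicitly after loading.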


From gori at cs.ru.nl  Wed Mar 23 17:43:16 2011
From: gori at cs.ru.nl (Fabio Gori)
Date: Wed, 23 Mar 2011 18:43:16 +0100
Subject: [Biopython] From genome to lineage with Entrez
Message-ID: <201103231843.16762.gori@cs.ru.nl>

Hi all,

I have downloaded all the bacterial genomes 
(ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz) and I want to compare 
their taxonomic lineages.

I'm looking for a way to get their lineages with Entrez. From the files I can 
get the accession numbers and GIs, but I don't know how to get their taxonomic 
ids.
I know that I can step from GIs to Taxids processing the file 
gi_taxid_nucl.dmp, but I'd prefer to use Entrez. 


Thanks in advance,

Fabio

-- 

F. Gori, PhD student
Intelligent Systems
ICIS (Institute for Computing and Information Sciences)
Radboud University Nijmegen

Home Page: http://www.cs.ru.nl/~gori/


From p.j.a.cock at googlemail.com  Wed Mar 23 18:01:32 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 23 Mar 2011 18:01:32 +0000
Subject: [Biopython] From genome to lineage with Entrez
In-Reply-To: <201103231843.16762.gori@cs.ru.nl>
References: <201103231843.16762.gori@cs.ru.nl>
Message-ID: <AANLkTinwwDABAtq4bweFZVM4gkQq=hx1Q6fcJO2BSbrs@mail.gmail.com>

On Wed, Mar 23, 2011 at 5:43 PM, Fabio Gori <gori at cs.ru.nl> wrote:
> Hi all,
>
> I have downloaded all the bacterial genomes
> (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz) and I want to compare
> their taxonomic lineages.
>
> I'm looking for a way to get their lineages with Entrez. From the files I can
> get the accession numbers and GIs, but I don't know how to get their taxonomic
> ids.
> I know that I can step from GIs to Taxids processing the file
> gi_taxid_nucl.dmp, but I'd prefer to use Entrez.
>

I think you can do it with ELink, but personally I'd use the taxid dump file,
since it sounds like you'll want to process hundreds of lineages.

Peter
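
A minimal sketch of the dump-file route, assuming gi_taxid_nucl.dmp holds one "GI<TAB>taxid" pair per line. `gi_to_taxid` is a hypothetical helper and the example lines are made up. It streams through the lines and keeps only the GIs of interest, so the very large file never has to be loaded whole.

```python
def gi_to_taxid(lines, wanted_gis):
    """Look up taxids for a set of GI numbers in gi_taxid dump lines.

    `lines` is any iterable of "gi<TAB>taxid" strings, for example an
    open gi_taxid_nucl.dmp file. Only the wanted GIs are kept, so the
    whole (very large) file never has to be held in memory.
    """
    wanted = set(int(gi) for gi in wanted_gis)
    found = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        gi, taxid = int(fields[0]), int(fields[1])
        if gi in wanted:
            found[gi] = taxid
            if len(found) == len(wanted):
                break  # every GI resolved, stop reading
    return found


# Made-up example lines in the two-column layout.
example = ["101\t9606\n", "102\t562\n", "103\t1423\n"]
print(gi_to_taxid(example, [102, 103]))
```

With the real file this would be called as `gi_to_taxid(open("gi_taxid_nucl.dmp"), my_gis)`, where `my_gis` stands for the GI numbers collected from the FASTA headers.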


From amenity at enthought.com  Thu Mar 24 03:29:35 2011
From: amenity at enthought.com (Amenity Applewhite)
Date: Wed, 23 Mar 2011 22:29:35 -0500
Subject: [Biopython] SciPy 2011 Call for Papers
Message-ID: <AANLkTinNAz1hGt6sDC37WTJceFaiKaTitfHsmziuBRn4@mail.gmail.com>

Hello,

SciPy 2011 <http://conference.scipy.org/scipy2011/index.php>, the 10th
Python in Science conference, will be held July 11 - 16, 2011, in Austin,
TX.

At this conference, novel applications and breakthroughs made in the pursuit
of science using Python are presented. Attended by leading figures from both
academia and industry, it is an excellent opportunity to experience the
cutting edge of scientific software development.

The conference is preceded by two days of tutorials, during which community
experts provide training on several scientific Python packages.

*We'd like to invite you to consider presenting at SciPy 2011.*

The list of topics that are appropriate for the conference includes (but is
not limited to):
     * new Python libraries for science and engineering;
     * applications of Python to the solution of scientific or computational
       problems;
     * high performance, parallel and GPU computing with Python;
     * use of Python in science education.

*Specialized Tracks*
This year we also have two specialized tracks. They will be run concurrent
to the main conference.

         *Python in Data Science
         Chair: Peter Wang, Streamitive, Inc.*
   This track focuses on the advantages and challenges of applying Python in
   the emerging field of "data science".  This includes a breadth of
   technologies, from wrangling realtime data streams from the social web,
   to machine learning and semantic analysis, to workflow and repository
   management for large datasets.

         *Python and Core Technologies
         Chair: Anthony Scopatz, Enthought, Inc.*
   In an effort to broaden the scope of SciPy and to engage the larger
   community of software developers, we are pleased to introduce the
   _Python & Core Technologies_ track. Talks will cover subjects that are
   not directly related to science and engineering, yet nonetheless affect
   scientific computing. Proposals on the Python language, visualization
   toolkits, web frameworks, education, and other topics are appropriate
   for this session.

*Talk/Paper Submission*

   We invite you to take part by submitting a talk abstract on the
   conference website at:
   http://conference.scipy.org/scipy2011/papers.php
   Papers are included in the peer-reviewed conference proceedings, to be
   published online.

*Important dates for authors:*
   Friday, April 15: Tutorial proposals due (remember: stipends will be
   provided for Tutorial instructors)
   http://conference.scipy.org/scipy2011/tutorials.php
   Sunday, April 24: Paper abstracts due
   Sunday, May 8: Student sponsorship request due
   http://conference.scipy.org/scipy2011/student.php
   Tuesday, May 10: Accepted talks announced
   Monday, May 16: Student sponsorships announced
   Monday, May 23: Early Registration ends
   Sunday, June 20: Papers due
   Monday-Tuesday, July 11 - 12: Tutorials
   Wednesday-Thursday, July 13 - July 14: Conference
   Friday-Saturday, July 15 - July 16: Sprints


   The SciPy 2011 Team

  @SciPy2011
  http://twitter.com/SciPy2011

_________________________
Amenity Applewhite
Enthought, Inc. <http://www.enthought.com/>
Scientific Computing Solutions


From michele.silva at gmail.com  Fri Mar 25 06:11:41 2011
From: michele.silva at gmail.com (Michele)
Date: Fri, 25 Mar 2011 03:11:41 -0300
Subject: [Biopython] [GSoC] Proposal: Mocapy++Biopython
In-Reply-To: <AANLkTi=s=74jsMu4LP2RqnXeq28taun+SP1efQjnY8ts@mail.gmail.com>
References: <AANLkTi=s=74jsMu4LP2RqnXeq28taun+SP1efQjnY8ts@mail.gmail.com>
Message-ID: <AANLkTi=oHxSoNzqq1N2auZnCxoanDoXc3RO=QXwhn8vn@mail.gmail.com>

Hello everyone,

I'm Michele, a computer scientist and passionate developer who is
currently enrolled in a biomedicine course. That's why I got in touch with
the biopython project and have tried its tools for biological computation.

When I read the Mocapy++Biopython proposal I immediately fell in love with
it. Let me tell you why. I have worked since 2005 with bayesian networks,
modelling BN for medical learning environments and also programming
algorithms for handling those nets. In the context of my masters in computer
science with the Artificial Intelligence Group
<http://www.inf.ufrgs.br/gia/>, we have published several papers
on the idea of using bayesian networks to
model the uncertainty associated with the students' behavior in learning
environments (see, for example, Designing a Bayesian Network based Student
Model for Distance Learning Environments
<http://ieeexplore.ieee.org/Xplore/login.jsp?url=http://ieeexplore.ieee.org/iel5/4280926/4280927/04281040.pdf%3Farnumber%3D4281040&authDecision=-203>,
published at the Seventh IEEE International Conference on Advanced Learning
Technologies, 2007).

As for the C++ and Python glue, I also have enjoyed the project's proposal.
I have been programming in C++ for more than 5 years, in small and big
projects, mainly in microelectronics CAD and firmware
development. Coincidentally, last year I started working with Python in
bigger projects. I worked for ESSS, a company which develops software for
scientific computing and engineering simulation. I worked with oil reservoir
simulation, where the applications were developed in Python and the
simulation core and the computer graphics algorithms were programmed in C++.
If you want to have a feeling on what reservoir simulation and the
applications I worked in look like, have a look at the Kraken's project
website <https://www.esss.com.br/kraken/>. I worked in both Python and C++
development, as well as in the glue through the use of boost python.

Regarding the experience in biomolecular structure, I'm a beginner. I have
started studying biomedicine this year and therefore have a lot to learn. I
know a bit about the PDB format and molecular biology. I'm sure I can count
on your help to continue learning.

So that was my not-so-short presentation. I would love to get to know the
community better and work together on the GSoC. Please let me know if you
think I could write a proposal and if you can help me with that.

Cheers,

Michele Silva

http://www.linkedin.com/pub/michele-silva/6/520/5b0


From p.j.a.cock at googlemail.com  Fri Mar 25 07:37:00 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 25 Mar 2011 07:37:00 +0000
Subject: [Biopython] Public example FASTQ files (for Tutorial examples)?
Message-ID: <AANLkTi=-158GrKViXYrdUSC1dmrO2FXHCfrtDmYiKK+T@mail.gmail.com>

Hi all,

One of the volunteers proof reading the Biopython tutorial
noticed our links to specific example FASTQ files at the NCBI
SRA don't work any more. They have withdrawn them from
the FTP site, although you can still download the files in
the compressed *.sra format and in theory convert them
to FASTQ locally with the NCBI's toolkit (which is cross
platform).

Another option is to download the FASTQ files via the
NCBI's web interface. Unless there is an obvious way to
do this with a URL that I missed initially, we have a
complicated situation to describe where the user can
choose all the reads for an experiment or just the filtered
set, and also choose to have them pre-trimmed or not.
Plus for me at least, the HTTP download wasn't as
robust as the FTP one was.

I'm hoping someone could suggest a couple of other
moderately sized FASTQ files which are public, on
FTP or a static HTML server, which we can use in
the tutorial.

So, suggestions?

Thanks!

Peter


From brettpthomas at gmail.com  Tue Mar 29 14:50:38 2011
From: brettpthomas at gmail.com (Brett Thomas)
Date: Tue, 29 Mar 2011 10:50:38 -0400
Subject: [Biopython] VCF files
In-Reply-To: <AANLkTingEonqxANk_M871ig_iqHQs685hi4pMhnvjVfA@mail.gmail.com>
References: <AANLkTingEonqxANk_M871ig_iqHQs685hi4pMhnvjVfA@mail.gmail.com>
Message-ID: <AANLkTim4g0+V6apSU5w7vX=BwtGuAeKMSycMG+BKOggb@mail.gmail.com>

Hi all,

I write software for genetic research, and the predominant file format we
use is VCF, a new file format used to represent genetic variation in the
1000 genomes project.

Has there been any discussion of a biopython api for vcf files? I'd be happy
to help if anybody is working on it.

Thanks,
Brett


From jamesrwagner at gmail.com  Tue Mar 29 17:55:56 2011
From: jamesrwagner at gmail.com (James Wagner)
Date: Tue, 29 Mar 2011 13:55:56 -0400
Subject: [Biopython] getting multiple BLAST (NCBIWWW) queries to work
Message-ID: <AANLkTimVdtVpNSzdxjqJ_9fRCGSvB0Hnge_eDh=MAWwF@mail.gmail.com>

Hello:

I was trying just as a proof of concept to do an NCBI WWW BLAST query
with a FASTA file containing more than one sequence (but still a small
number of sequences).

I tried with the opuntia.fasta file from the website, and set it up as follows:

result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia.fasta","r"))
blast_records = NCBIXML.parse(result_handle)

then I try:

for record in blast_records:
      print record.alignments

and I obtain:
[]


Surely at the very least since there were 7 sequences in this file, I
should get 7 empty lists, assuming of course none of the sequences
gives a hit in nr, which I am sure is not the case either?

What is still missing? I realize I could use SeqIO.parse to obtain
each sequence from the FASTA file and do a separate qblast, but surely
doing this separately for each protein would create unnecessary
overhead with the network traffic compared to somehow sending off all
the protein queries at once?


From p.j.a.cock at googlemail.com  Tue Mar 29 18:07:47 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 29 Mar 2011 19:07:47 +0100
Subject: [Biopython] getting multiple BLAST (NCBIWWW) queries to work
In-Reply-To: <AANLkTimVdtVpNSzdxjqJ_9fRCGSvB0Hnge_eDh=MAWwF@mail.gmail.com>
References: <AANLkTimVdtVpNSzdxjqJ_9fRCGSvB0Hnge_eDh=MAWwF@mail.gmail.com>
Message-ID: <AANLkTi=jwumygS0UCB3pKSzq0x_ivhk26JRBK-4Odgcf@mail.gmail.com>

On Tue, Mar 29, 2011 at 6:55 PM, James Wagner <jamesrwagner at gmail.com> wrote:
> Hello:
>
> I was trying just as a proof of concept to do an NCBI WWW BLAST query
> with a FASTA file containing more than one sequence (but still a small
> number of sequences).
>
> I tried with the opuntia.fasta file from the website, and set it up as follows:
>
> result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia.fasta","r"))
> blast_records = NCBIXML.parse(result_handle)
>
> then I try:
>
> for record in blast_records:
>       print record.alignments
>
> and I obtain:
> []
>
>
> Surely at the very least since there were 7 sequences in this file, I
> should get 7 empty lists, assuming of course none of the sequences
> gives a hit in nr, which I am sure is not the case either?

Not necessarily, the NCBI may have fixed this but for a long time if
you had say 7 queries but only 2 gave hits, stand alone BLAST's
XML output would only contain those 2 hits. There would be nothing
at all from the 5 hit-less queries. This was/is very annoying, but
right now I'm not sure if they have fixed this or not.

Try getting back the results as plain text and manually inspect them.
In the plain text output all the queries appear, and there is a clear
"no hits found" message.

> What is still missing? I realize I could use SeqIO.parse to obtain
> each sequence from the FASTA file and do a separate qblast, but surely
> doing this separately for each protein would create unnecessary
> overhead with the network traffic compared to somehow sending off all
> the protein queries at once?

Yes, in theory a single large query should have less overhead
than individual queries. Personally I'd just use standalone BLAST
and run it locally if I had more than a few queries.

Peter
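
Since queries without hits may be missing from the XML altogether, one defensive approach is to compare the IDs that were submitted with the IDs that actually appear in the parsed output. The helper below, `queries_without_results`, is hypothetical, and the IDs are made up. With Biopython, the submitted IDs could come from `SeqIO.parse` and the reported IDs from the `query` attribute of each parsed BLAST record, assuming its first word is the query ID.

```python
def queries_without_results(query_ids, reported_ids):
    """Return the query IDs (in input order) absent from the parsed output."""
    reported = set(reported_ids)
    return [qid for qid in query_ids if qid not in reported]


# Made-up IDs: seven queries submitted, only two appear in the XML.
submitted = ["q1", "q2", "q3", "q4", "q5", "q6", "q7"]
reported = ["q2", "q5"]
print(queries_without_results(submitted, reported))
```

Anything returned by this check can then be treated as "no hits found" explicitly, rather than assuming one output record per input sequence.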


From jamesrwagner at gmail.com  Tue Mar 29 20:43:35 2011
From: jamesrwagner at gmail.com (James Wagner)
Date: Tue, 29 Mar 2011 16:43:35 -0400
Subject: [Biopython] getting multiple BLAST (NCBIWWW) queries to work
In-Reply-To: <AANLkTi=jwumygS0UCB3pKSzq0x_ivhk26JRBK-4Odgcf@mail.gmail.com>
References: <AANLkTimVdtVpNSzdxjqJ_9fRCGSvB0Hnge_eDh=MAWwF@mail.gmail.com>
	<AANLkTi=jwumygS0UCB3pKSzq0x_ivhk26JRBK-4Odgcf@mail.gmail.com>
Message-ID: <AANLkTinqaqE8PNKuyr=LNLuRW1YtARDuXT3WJ6qW-eKE@mail.gmail.com>

OK, when I create a .fasta file with just the first sequence in
opuntia, I get no hits. However, when I copy and paste the nucleotide
sequence directly into the qblast call, I get 50 hits!  This is consistent
with what happens when pasting the first opuntia sequence into the NCBI
BLAST web interface, though there I obtain 110 hits for intronic
sequences in Opuntia chloroplast and chloroplasts. As a secondary
point I also find it curious that the result from NCBIWWW is limited
to 50 hits (I thought it was 500 by default). But what is more
problematic is the fact that I get no hits when using a FASTA file
with only a single sequence, when clearly there are some very high
homology hits present in nr.

This is my code from beginning to end, where the file opuntia1.fasta
is a file containing only the 1st sequence from opuntia.fasta, and
when using the line for opuntia1.fasta it resulted in no hits. I am
using BioPython 1.5.3 and Python 2.6 on Ubuntu if this has any effect
on the results. I also tried it by obtaining a single sequence from
SeqIO.parse and then obtaining the Seq of this sequence, and it also
gave 50 hits. So it's basically just with using a FASTA file handle
that I can't get it to work.

#!/usr/bin/python
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
result_handle = NCBIWWW.qblast("blastn", "nr",
"TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAATCTAAATGATATAGGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATAAAGCATGAATACAGATTCACACATAATTATCTGATATGAATCTATTCATAGAAAAAAGAAAAAAGTAAGAGCCTCCGGCCAATAAAGACTAAGAGGGTTGGCTCAAGAACAAAGTTCATTAAGAGCTCCATTGTAGAATTCAGA\CCTAATCATTAATCAAGAAGCGATGGGAACGATGTAATCCATGAATACAGAAGATTCAATTGAAAAAGATCCTATGNTCATTGGAAGGATGGCGGAACGAACCAGAGACCAATTCATCTATTCTGAAAAGTGATAAACTAATCCTATAAAACTAAAATAGATATTGAAAGAGTAAATATTCGCCCGCGAAAATTCCTTTTTTATTAAATTGCTCATATTTTCTTTTAGCAATGCAATCTAATAAAATATATCTATACAAAAAAACATAGACAAACTATATATATATATATATATAATATATTTCAAATTCCCTTATATATCCAAATATAAAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGTATTATTAAATGTATATATTAATTCAATATTATTATTCTATTCATTTTTATTCATTTTCAAATTTATAATATATTAATCTATATATTAATTTAGAATTCTATTCTAATTCGAATTCAATTTTTAAATATTCATATTCAATTAAAATTGAAATTTTTTCATTCGCGAGGAGCCGGATGAGAAGAAACTCTCATGTCCGGTTCTGTAGTAGAGATGGAATTAAGAAAAAACCATCAACTATAACCCCAAAAGAACCAGA")

# Alternative that gave no hits:
# result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia1.fasta", "r"))
blast_record = NCBIXML.read(result_handle)

for description in blast_record.descriptions:
    print(description)

#end of code.
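
One possible explanation for the difference, offered as an assumption and not confirmed anywhere in this thread, is that qblast expects the query as a string (a sequence, FASTA-formatted text, or an identifier) and not as an open file handle. Under that assumption, the file contents would be read first and the text passed on, as in the sketch below. `read_fasta_text`, `count_fasta_records` and `run_qblast` are hypothetical helper names. Only `run_qblast` needs Biopython and network access, and it is not called here.

```python
import os
import tempfile


def read_fasta_text(path):
    """Return the whole FASTA file as one string."""
    with open(path) as handle:
        return handle.read()


def count_fasta_records(text):
    """Count records by counting lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def run_qblast(path):
    """Submit the file's text to NCBI; needs Biopython and a network."""
    from Bio.Blast import NCBIWWW

    return NCBIWWW.qblast("blastn", "nr", read_fasta_text(path))


# Self-contained check using a small temporary FASTA file.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1\nACGT\n>seq2\nGGCC\n")
text = read_fasta_text(tmp.name)
os.remove(tmp.name)
print(count_fasta_records(text))
```

Counting the records before submitting also gives the number of result records to expect back, which helps spot queries that silently returned nothing.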


On Tue, Mar 29, 2011 at 2:07 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Mar 29, 2011 at 6:55 PM, James Wagner <jamesrwagner at gmail.com> wrote:
>> Hello:
>>
>> I was trying just as a proof of concept to do an NCBI WWW BLAST query
>> with a FASTA file containing more than one sequence (but still a small
>> number of sequences).
>>
>> I tried with the opuntia.fasta file from the website, and set it up as follows:
>>
>> result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia.fasta","r"))
>> blast_records = NCBIXML.parse(result_handle)
>>
>> then I try:
>>
>> for record in blast_records:
>>       print record.alignments
>>
>> and I obtain:
>> []
>>
>>
>> Surely at the very least since there were 7 sequences in this file, I
>> should get 7 empty lists, assuming of course none of the sequences
>> gives a hit in nr, which I am sure is not the case either?
>
> Not necessarily, the NCBI may have fixed this but for a long time if
> you had say 7 queries but only 2 gave hits, stand alone BLAST's
> XML output would only contain those 2 hits. There would be nothing
> at all from the 5 hit-less queries. This was/is very annoying, but
> right now I'm not sure if they have fixed this or not.
>
> Try getting back the results as plain text and manually inspect them.
> In the plain text output all the queries appear, and there is a clear
> "no hits found" message.
>
>> What is still missing? I realize I could use SeqIO.parse to obtain
>> each sequence from the FASTA file and do a separate qblast, but surely
>> doing this separately for each protein would create unnecessary
>> overhead with the network traffic compared to somehow sending off all
>> the protein queries at once?
>
> Yes, in theory a single large query should have less overhead
> than individual queries. Personally I'd just use standalone BLAST
> and run it locally if I had more than a few queries.
>
> Peter
>


From rmb32 at cornell.edu  Tue Mar 29 21:20:41 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Tue, 29 Mar 2011 14:20:41 -0700
Subject: [Biopython] Announcing OBF Summer of Code - please forward!
Message-ID: <4D924D29.3020707@cornell.edu>

Hi all,

Here's an advertising-ready announcement for OBF's Summer of Code, 
thanks to Christian Zmasek and Hilmar Lapp for their excellent writing.

Student applications are due April 8!  Please spread it widely, we need 
to reach lots of students with it!

Rob Buels
OBF GSoC 2011 Admin


============================================================

*** Please disseminate widely at your local institutions ***
*** including posting to message and job boards, so that ***
*** we reach as many students as possible.               ***

============================================================


OPEN BIOINFORMATICS FOUNDATION SUMMER OF CODE 2011

Applications due 19:00 UTC, April 8, 2011.
http://www.open-bio.org/wiki/Google_Summer_of_Code

The Open Bioinformatics Foundation Summer of Code program provides a 
unique opportunity for undergraduate, masters, and PhD students to 
obtain hands-on experience writing and extending open-source software 
for bioinformatics under the mentorship of experienced developers from 
around the world. Through this program, the Open Bioinformatics
Foundation (OBF) takes part as a mentoring organization in the
Google Summer of Code(tm) (http://code.google.com/soc/).

Students successfully completing the 3 month program receive a $5,000 
USD stipend, and may work entirely from their home or home institution. 
Participation is open to students from any country in the world except 
countries subject to US trade restrictions.  Each student will have at 
least one dedicated mentor to show them the ropes and help them complete 
their project.

The Open Bioinformatics Foundation is particularly seeking students 
interested in both bioinformatics (computational biology) and software 
development. Some initial project ideas are listed on the website. These 
range from Galaxy phylogenetics pipeline development in Biopython to 
lightweight sequence objects and lazy parsing in BioPerl, a DAS Server 
for large files on local filesystems, and mapping Java libraries to 
Perl/Ruby/Python using Biolib+SWIG+JNI.  All project ideas are flexible 
and many can be adjusted in scope to match the skills of the student. We 
also welcome and encourage students proposing their own project ideas; 
historically some of the most successful Summer of Code projects are 
ones proposed by the students themselves.

TO APPLY: Apply online at the Google Summer of Code website 
(http://socghop.appspot.com/), where you will also find GSoC program 
rules and eligibility requirements. The 12-day application period for 
students runs from Monday, March 28 through Friday, April 8th, 2011.

INQUIRIES:

We strongly encourage all interested students to get in touch with us 
with their ideas as early as possible.  See the OBF GSoC page for 
contact details.

2011 OBF Summer of Code:
http://www.open-bio.org/wiki/Google_Summer_of_Code

Google Summer of Code FAQ:
http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs


From albert.bogdanowicz at gmail.com  Thu Mar 31 17:01:45 2011
From: albert.bogdanowicz at gmail.com (Albert Bogdanowicz)
Date: Thu, 31 Mar 2011 19:01:45 +0200
Subject: [Biopython] Google Summer of Code idea
Message-ID: <201103311901.45372.albert.bogdanowicz@gmail.com>

Hello World,
I am a bioinformatics student and I would like to take part in Google Summer 
of Code this year.
I have an idea for a project that I could write. It would be a module for 
synthetic biology, especially the BioBrick standard used in the iGEM 
competition (http://ung.igem.org/Main_Page).
I'm a bit late, but I hope this fact won't disqualify me. I would appreciate 
any help in determining a more detailed specification for such a project.
Albert Bogdanowicz


From laserson at mit.edu  Thu Mar 31 20:48:16 2011
From: laserson at mit.edu (Uri Laserson)
Date: Thu, 31 Mar 2011 16:48:16 -0400
Subject: [Biopython] Google Summer of Code idea
In-Reply-To: <201103311901.45372.albert.bogdanowicz@gmail.com>
References: <201103311901.45372.albert.bogdanowicz@gmail.com>
Message-ID: <AANLkTik=z3_pmhQLsLXEe6PD5-sGVnHpsyzSiG+=Ukhb@mail.gmail.com>

Hi Albert,

Are you thinking of something like the Clotho project?

http://www.clothocad.org/

Uri

...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu


On Thu, Mar 31, 2011 at 13:01, Albert Bogdanowicz <
albert.bogdanowicz at gmail.com> wrote:

> Hello World,
> I am a bioinformatics student and I would like to take part in Google
> Summer of Code this year.
> I have an idea for a project that I could write. It would be a module for
> synthetic biology, especially the BioBrick standard used in the iGEM
> competition (http://ung.igem.org/Main_Page).
> I'm a bit late, but I hope this fact won't disqualify me. I would
> appreciate any help in determining a more detailed specification for
> such a project.
> Albert Bogdanowicz


From rmb32 at cornell.edu  Thu Mar 31 21:58:52 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 31 Mar 2011 14:58:52 -0700
Subject: [Biopython] Reminder: GSoC proposals due in 1 week
Message-ID: <4D94F91C.1080005@cornell.edu>

Hi all,

Just a reminder, Google Summer of Code student applications are due April 8!

If you're a student planning to apply to GSoC with OBF, it's very much 
in your best interest to write your proposal *early*, like now, and get 
it into the hands of the developers and mentors on your subproject 
(BioPerl/Ruby/Python/etc) so that they can give you some feedback on it.

The final proposals must, of course, still be submitted to Google 
through the GSoC web application, as described on the main GSoC site 
(http://www.google-melange.com/gsoc/homepage/google/gsoc2011).

Rob Buels
OBF GSoC 2011 Administrator


From rmb32 at cornell.edu  Thu Mar 31 22:04:49 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 31 Mar 2011 15:04:49 -0700
Subject: [Biopython] GSoC call for mentors
Message-ID: <4D94FA81.5090701@cornell.edu>

Hi all,

For current developers on OBF projects:

If you would not mind being a mentor to a Summer of Code student this 
summer, please make sure you sign up as an OBF mentor in the GSoC web 
app.  There's a link under "mentors: apply now!" midway down the page at 
http://www.google-melange.com/.  If you didn't do last year's summer of 
code, it would be a good idea to drop me an email introducing yourself, 
as well, or I won't know whether to approve your request. :-)

Being signed up as an OBF GSoC mentor will give you access to the 
student proposals, as they come in, and the ability to comment on them 
and assign scores to the ones you think show the most promise.

If you sign up as a mentor, please also add yourself to the two OBF GSoC 
mailing lists: OBF-GSoC and OBF-GSoC-mentors

OBF-GSoC list: http://lists.open-bio.org/mailman/listinfo/gsoc
OBF mentors:   http://lists.open-bio.org/mailman/listinfo/gsoc-mentors


Thanks in advance!

Rob

---
Robert Buels
OBF GSoC 2011 Administrator


From philip.machanick at gmail.com  Thu Mar 31 23:49:33 2011
From: philip.machanick at gmail.com (Philip Machanick)
Date: Fri, 1 Apr 2011 09:49:33 +1000
Subject: [Biopython] extending Motif class
Message-ID: <AANLkTi=Ku-Wem_DBFP2zZA+F28dKCPmrwrZrzRnhPn=8@mail.gmail.com>

I want to add a new scoring function to the Motif class and in true
object-oriented spirit would like to do it by deriving a new class rather
than hacking the existing code.

The general structure of my test program (all in 1 file) is:

from Bio.Motif import Motif

class ScannableMotif(Motif):
    def pwm_score_hit(self,sequence,position):
        ## stuff to compute my new score

from Bio import Motif
def main ():
    for motif in ScannableMotif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
        for i in range(3):
          print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)

The two different imports appear to be necessary. I need the first to be
able to use the base class to derive a new one, and without the second when
I use metaclass methods, I get

TypeError: Error when calling the metaclass bases
    module.__init__() takes at most 2 arguments (3 given)
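
A possible reading of this error, assuming Bio.Motif is a package that
exposes both a class named Motif and a module-level function named parse:
the TypeError above is what Python raises when a module object, rather than
a class, is listed as a base class. Under that assumption, the second
import rebinds the name Motif from the class to the module, and parse is an
ordinary function on the module rather than a method of the class. The
sketch below reproduces both effects with a hypothetical stand-in module
built from the standard library; none of it is Biopython's actual code.

```python
import types


class Motif(object):
    """Hypothetical stand-in for a library class; not Biopython code."""

    def __init__(self, name):
        self.name = name


def parse(names):
    """Hypothetical module-level function that yields Motif objects."""
    for name in names:
        yield Motif(name)


# Build a fake module that, like the assumed package layout, holds both
# a class called Motif and a function called parse.
fake_motif_module = types.ModuleType("fake_motif_module")
fake_motif_module.Motif = Motif
fake_motif_module.parse = parse


class ScannableMotif(fake_motif_module.Motif):
    """Subclassing the class inside the module works as usual."""


def try_to_subclass_module():
    """Return the error raised when a module is used as a base class."""
    try:
        class Broken(fake_motif_module):
            pass
    except TypeError as err:
        return err
    return None


module_base_error = try_to_subclass_module()

# parse lives on the module, so neither the class nor its subclass has it.
class_has_parse = hasattr(ScannableMotif, "parse")
module_has_parse = hasattr(fake_motif_module, "parse")

print(type(module_base_error).__name__)
print(class_has_parse, module_has_parse)
```

If the assumption holds, a single "from Bio import Motif" followed by
"class ScannableMotif(Motif.Motif)" and calls to "Motif.parse(...)" should
avoid rebinding the name, so one import would be enough.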

The other problem: I can't directly invoke a metaclass method on a derived
instance as above. The snippet below works as expected, but looks like a
kludge to me. Is there a better way of accessing metaclass methods from a
derived class object?

    for motif in Motif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
        motif.__class__ = ScannableMotif # promote to the new class
        for i in range(3):
          print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)
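
An alternative to reassigning __class__, offered as a sketch rather than
anything the library prescribes: wrap each parsed object in a small class
that adds the new method and forwards every other attribute to the wrapped
object. The Motif class and parse function below are hypothetical stand-ins
so the example runs without Biopython, and the scoring method is a
placeholder, not a PWM calculation.

```python
class Motif(object):
    """Hypothetical stand-in for the parsed motif objects."""

    def __init__(self, name, length):
        self.name = name
        self.length = length


def parse(records):
    """Hypothetical stand-in for a parser that yields Motif objects."""
    for name, length in records:
        yield Motif(name, length)


class ScannableMotif(object):
    """Adds a method to an existing motif by wrapping it."""

    def __init__(self, motif):
        self._motif = motif

    def __getattr__(self, attr):
        # Only called when normal lookup fails, so forward to the motif.
        # The guard avoids infinite recursion if _motif is not set yet.
        if attr == "_motif":
            raise AttributeError(attr)
        return getattr(self._motif, attr)

    def pwm_score_hit(self, sequence, position):
        # Placeholder: returns the window a real score would be computed on.
        return sequence[position:position + self.length]


def parse_scannable(records):
    """Yield wrapped motifs so callers never touch __class__."""
    for motif in parse(records):
        yield ScannableMotif(motif)


for motif in parse_scannable([("example", 3)]):
    for i in range(3):
        print(motif.pwm_score_hit("CCTGGGGTCC", i))
```

The trade-off is that the wrapper is not an instance of the original
class, which matters if other code checks the type with isinstance.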

I think I have the class vs. metaclass concept straight but understanding
why I need the two different flavours of import would be useful.
-- 
Philip Machanick
Rhodes University, Grahamstown 6140, South Africa
http://opinion-nation.blogspot.com/
+61-7-3871-0963 mobile +61 42 234 6909 skype philipmach