From p.j.a.cock at googlemail.com  Fri Apr  2 13:34:00 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 2 Apr 2010 18:34:00 +0100
Subject: [Biopython] Biopython 1.54 beta released
Message-ID: <t2o320fb6e01004021034p44915b44w421d6e9a152b13e4@mail.gmail.com>

Dear all,

A beta release for Biopython 1.54 is now available for download
and testing, as announced here:

http://news.open-bio.org/news/2009/06/biopython-154-beta-released/

Noted that I haven't done a fully detailed release announcement,
we'll leave that for the official release.

Source distributions and Windows installers are available from
the downloads page on the Biopython website.
http://biopython.org/wiki/Download

We are interested in getting feedback on the beta release as
a whole, but especially on the new features - including the
updated multiple sequence alignment object (which is what
you?ll now get when parsing alignments with Bio.AlignIO), the
new Bio.Phylo module, and the Bio.SeqIO support for Standard
Flowgram Format (SFF) files.

(At least) 10 people contributed to this release (so far), which
includes 4 new people:

Anne Pajon (first contribution)
Brad Chapman
Christian Zmasek
Eric Talevich
Jose Blanca (first contribution)
Kevin Jacobs (first contribution)
Leighton Pritchard
Michiel de Hoon
Peter Cock
Thomas Holder (first contribution)

On behalf of the Biopython team, thank you for any feedback,
bug reports, and contributions.

Peter

P.S.

You may wish to subscribe to our news feed.  For RSS links etc, see:
http://biopython.org/wiki/News

Biopython news is also on twitter:
http://twitter.com/biopython


From p.j.a.cock at googlemail.com  Fri Apr  2 13:39:08 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 2 Apr 2010 18:39:08 +0100
Subject: [Biopython] Biopython 1.54 beta released
In-Reply-To: <t2o320fb6e01004021034p44915b44w421d6e9a152b13e4@mail.gmail.com>
References: <t2o320fb6e01004021034p44915b44w421d6e9a152b13e4@mail.gmail.com>
Message-ID: <h2n320fb6e01004021039x302510bel7a81ff3fb833025@mail.gmail.com>

> Dear all,
>
> A beta release for Biopython 1.54 is now available for download
> and testing, as announced here:
>
> http://news.open-bio.org/news/2009/06/biopython-154-beta-released/
>
> Noted that I haven't done a fully detailed release announcement,
> we'll leave that for the official release.

That URL should have been:
http://news.open-bio.org/news/2010/04/biopython-1-54-beta-released/

Sorry for the extra email,

Peter

From cgohlke at uci.edu  Fri Apr  2 19:05:25 2010
From: cgohlke at uci.edu (Christoph Gohlke)
Date: Fri, 02 Apr 2010 16:05:25 -0700
Subject: [Biopython] Biopython 1.54b test failures
Message-ID: <4BB67835.7030303@uci.edu>

Hello,

I get two test failures (see below) when running 'setup.py test' for 
biopython 1.54b on win-amd64-py2.6 (built with msvc9). These are related 
to line ending style. Maybe it would be a good idea to use Python's 
universal newline support (available since 2.3) when opening text files 
for iteration over lines. All tests pass after the following changes:

BIO/SCOP/Raf.py

line 104:
         f = open(self.filename, 'rU')

line 121:
         f = open(self.filename, 'rU')

BIO/SCOP/Cla.py

line 103:
         f = open(self.filename, 'rU')

line 123:
         f = open(self.filename, 'rU')

line 72 (inconsistent indentation):
             h.append("=".join(map(str,ht)))


-- Christoph


======================================================================
ERROR: Test CLA file indexing
----------------------------------------------------------------------
Traceback (most recent call last):
   File "test_SCOP_Cla.py", line 74, in testIndex
     rec = index['d1hbia_']
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Cla.py", 
line 127,
  in __getitem__
     record = Record(line)
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Cla.py", 
line 45,
in __init__
     self._process(line)
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Cla.py", 
line 51,
in _process
     raise ValueError("I don't understand the format of %s" % line)
ValueError: I don't understand the format of 5

======================================================================
ERROR: testSeqMapIndex (test_SCOP_Raf.RafTests)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "test_SCOP_Raf.py", line 68, in testSeqMapIndex
     r = index.getSeqMap("103m")
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Raf.py", 
line 152,
  in getSeqMap
     sm = self[id]
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Raf.py", 
line 125,
  in __getitem__
     record = SeqMap(line)
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Raf.py", 
line 196,
  in __init__
     self._process(line)
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Raf.py", 
line 216,
  in _process
     raise ValueError("Incompatible RAF version: "+self.version)
ValueError: Incompatible RAF version: .01

----------------------------------------------------------------------
Ran 143 tests in 98.871 seconds

FAILED (failures = 2)

From biopython at maubp.freeserve.co.uk  Fri Apr  2 19:22:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 3 Apr 2010 00:22:32 +0100
Subject: [Biopython] Biopython 1.54b test failures
In-Reply-To: <4BB67835.7030303@uci.edu>
References: <4BB67835.7030303@uci.edu>
Message-ID: <u2p320fb6e01004021622i9a350a0eqfaf9663b5d6e9e62@mail.gmail.com>

On Sat, Apr 3, 2010 at 12:05 AM, Christoph Gohlke <cgohlke at uci.edu> wrote:
> Hello,
>
> I get two test failures (see below) when running 'setup.py test' for
> biopython 1.54b on win-amd64-py2.6 (built with msvc9). These are
> related to line ending style.

It is a known issue - a simple work around is just run something
like unix2dos on the SCOP test files, and then the tests pass.

> Maybe it would be a good idea to use Python's universal
> newline support (available since 2.3) when opening text
> files for iteration over lines.

I had tried that in the past without success...

> All tests pass after the following changes:
>
> BIO/SCOP/Raf.py
>
> line 104:
> ? ? ? ?f = open(self.filename, 'rU')
>
> line 121:
> ? ? ? ?f = open(self.filename, 'rU')
>
> BIO/SCOP/Cla.py
>
> line 103:
> ? ? ? ?f = open(self.filename, 'rU')
>
> line 123:
> ? ? ? ?f = open(self.filename, 'rU')
>
> line 72 (inconsistent indentation):
> ? ? ? ? ? ?h.append("=".join(map(str,ht)))
>

I recall trying the universal read lines thing before without
success in the SCOP tests - maybe it was this line 72 thing
that I missed. I'll take another look at this next week (when
I have access to a Windows machine).

Thanks,

Peter


From skhadar at gmail.com  Fri Apr  2 21:33:01 2010
From: skhadar at gmail.com (Khader Shameer)
Date: Fri, 2 Apr 2010 19:33:01 -0600
Subject: [Biopython] Biopython installation failed on Mac OSX 10.6
Message-ID: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>

Hi,

I was trying to install BioPython using fink.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin

Used the command "fink install biopython-py24"
Got the following error:
Failed: no package found for specification 'biopython-py24'!
Tried 23, 24 and 25 - it is not working.

Any idea why it is not working ?

Thanks,
Shameer

From vincent at vincentdavis.net  Fri Apr  2 23:04:17 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Fri, 2 Apr 2010 21:04:17 -0600
Subject: [Biopython] Biopython installation failed on Mac OSX 10.6
In-Reply-To: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>
References: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>
Message-ID: <q2t77e831101004022004jc3b06856g91f2cf3780650ffd@mail.gmail.com>

Installing from source, instructions here is straight forward, just did it
with the newest version, no problems
http://biopython.org/wiki/Download

*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Fri, Apr 2, 2010 at 7:33 PM, Khader Shameer <skhadar at gmail.com> wrote:

> Hi,
>
> I was trying to install BioPython using fink.
>
> Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
>
> Used the command "fink install biopython-py24"
> Got the following error:
> Failed: no package found for specification 'biopython-py24'!
> Tried 23, 24 and 25 - it is not working.
>
> Any idea why it is not working ?
>
> Thanks,
> Shameer
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From p.j.a.cock at googlemail.com  Sat Apr  3 06:33:48 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 3 Apr 2010 11:33:48 +0100
Subject: [Biopython] Biopython installation failed on Mac OSX 10.6
In-Reply-To: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>
References: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>
Message-ID: <j2i320fb6e01004030333qc2ba3820i7ff71f6b853efaf3@mail.gmail.com>

On Sat, Apr 3, 2010 at 2:33 AM, Khader Shameer <skhadar at gmail.com> wrote:
> Hi,
>
> I was trying to install BioPython using fink.
>
> Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
>
> Used the command "fink install biopython-py24"
> Got the following error:
> Failed: no package found for specification 'biopython-py24'!
> Tried 23, 24 and 25 - it is not working.
>
> Any idea why it is not working ?

Something to do with Fink? Also note we don't
support Python 2.3 anymore (and Python 2.4 is
on its last few releases as a supported version
for Biopython).

Apple provides python 2.5 (32bit) and python
2.6 (64bit) on Snow Leopard. I actually use
python 2.6 on the Mac specifically because it
is 64bit and can cope with more memory.

As Vincent and our documentation suggests,
try just installing from source. You'll need to
install Apple's XCode tools first, and it seems
to help if you tick the optional older SDKs as well.

Peter

From p.j.a.cock at googlemail.com  Sat Apr  3 09:52:11 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 3 Apr 2010 14:52:11 +0100
Subject: [Biopython] Biopython installation failed on Mac OSX 10.6
In-Reply-To: <j2i320fb6e01004030333qc2ba3820i7ff71f6b853efaf3@mail.gmail.com>
References: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>
	<j2i320fb6e01004030333qc2ba3820i7ff71f6b853efaf3@mail.gmail.com>
Message-ID: <n2u320fb6e01004030652la67e081fq24ce895e29e7c9b@mail.gmail.com>

>> Hi,
>>
>> I was trying to install BioPython using fink.
>>
>> Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
>> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
>>
>> Used the command "fink install biopython-py24"
>> Got the following error:
>> Failed: no package found for specification 'biopython-py24'!
>> Tried 23, 24 and 25 - it is not working.
>>
>> Any idea why it is not working ?
>
> Something to do with Fink? Also note we don't
> support Python 2.3 anymore (and Python 2.4 is
> on its last few releases as a supported version
> for Biopython).

If you really want to use fink, I think you'll have to
contact the fink team. Specifically it looks like
Koen van der Drift is kindly taking care of packaging
Biopython on Fink:

http://pdb.finkproject.org/pdb/package.php/biopython-py24
http://pdb.finkproject.org/pdb/package.php/biopython-py25
http://pdb.finkproject.org/pdb/package.php/biopython-py26

Peter

From skhadar at gmail.com  Sat Apr  3 13:19:49 2010
From: skhadar at gmail.com (Khader Shameer)
Date: Sat, 3 Apr 2010 11:19:49 -0600
Subject: [Biopython] Biopython installation failed on Mac OSX 10.6
In-Reply-To: <n2u320fb6e01004030652la67e081fq24ce895e29e7c9b@mail.gmail.com>
References: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>
	<j2i320fb6e01004030333qc2ba3820i7ff71f6b853efaf3@mail.gmail.com>
	<n2u320fb6e01004030652la67e081fq24ce895e29e7c9b@mail.gmail.com>
Message-ID: <k2nb6ff81951004031019vd49d921lfad25a6636781086@mail.gmail.com>

Thanks Vincent, Peter :  I have installed BioPython from source.


On Sat, Apr 3, 2010 at 7:52 AM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> >> Hi,
> >>
> >> I was trying to install BioPython using fink.
> >>
> >> Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
> >> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> >>
> >> Used the command "fink install biopython-py24"
> >> Got the following error:
> >> Failed: no package found for specification 'biopython-py24'!
> >> Tried 23, 24 and 25 - it is not working.
> >>
> >> Any idea why it is not working ?
> >
> > Something to do with Fink? Also note we don't
> > support Python 2.3 anymore (and Python 2.4 is
> > on its last few releases as a supported version
> > for Biopython).
>
> If you really want to use fink, I think you'll have to
> contact the fink team. Specifically it looks like
> Koen van der Drift is kindly taking care of packaging
> Biopython on Fink:
>
> http://pdb.finkproject.org/pdb/package.php/biopython-py24
> http://pdb.finkproject.org/pdb/package.php/biopython-py25
> http://pdb.finkproject.org/pdb/package.php/biopython-py26
>
> Peter
>

From rmb32 at cornell.edu  Sat Apr  3 16:09:27 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Sat, 03 Apr 2010 13:09:27 -0700
Subject: [Biopython] Google Summer of Code is *ON* for OBF projects!
Message-ID: <4BB7A077.4070802@cornell.edu>

Hi all,

Reminder:  GSoC student proposals must be submitted to Google by April 
9th, 19:00 UTC.  That's less than a week away.

Students: you should ALREADY be working with mentors on the project 
mailing lists, they can help you get your proposal into shape.

So far, we have 5 proposals submitted to our org in Google's web app. 
Keep them coming, and let's see some really good ones!

Rob Buels
OBF GSoC 2010 Administrator


From rmb32 at cornell.edu  Sun Apr  4 00:37:38 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Sat, 03 Apr 2010 21:37:38 -0700
Subject: [Biopython] Reminder: GSoC student applications due April 9,
	19:00 UTC
Message-ID: <4BB81792.8060001@cornell.edu>

Hi all,

Sending this again with a different subject line, just in case.

GSoC student proposals must be submitted to Google through their web 
application by *April 9th, 19:00 UTC*.  That's less than a week away.

Students: you should ALREADY be working with mentors on the project
mailing lists, they can help you get your proposal into shape.

So far, we have 6 proposals submitted to our org in Google's web app.
Keep them coming, and keep them good!

Rob Buels
OBF GSoC 2010 Administrator


From ulfada at gmail.com  Sun Apr  4 21:46:14 2010
From: ulfada at gmail.com (Sofia Lemons)
Date: Sun, 4 Apr 2010 21:46:14 -0400
Subject: [Biopython] SoC project (BioPython and PyCogent)
Message-ID: <y2r838208c01004041846q78520c57m17135145edbbbc61@mail.gmail.com>

I'm working on an application for the Summer of Code project of
integrating BioPython and PyCogent. I've looked through the list
archives and saw Brad's general advice to other potential SoC
applicants, but I thought I'd introduce myself and see if there was
any advice specific to this project. I've used BioPython in the past
and even explored the code a bit. I'm considering working on one or
more of the bugs in Bugzilla if I can find time, and will work to
familiarize myself with PyCogent. Are there any other concepts,
projects, or people I should familiarize myself with (aside from
what's listed on the ideas page, of course)? As you can see from my
GitHub and Google Code accounts, I've got some experience with open
source projects, but please do suggest any specific tools or methods
you think I should try to get up to speed on, as well. Feel free to
contact me off-list.

Thanks,
Sofia

From stran104 at chapman.edu  Mon Apr  5 06:59:28 2010
From: stran104 at chapman.edu (Matthew Strand)
Date: Mon, 5 Apr 2010 03:59:28 -0700
Subject: [Biopython] GSoC Ortholog Module Proposal
Message-ID: <g2h2a63cc351004050359kbdeb99afva24ca1e2a0ac3871@mail.gmail.com>

Dear Biopython GSoC list,

I am a student at Chapman University and over the last 18 months I have been
using biopython to produce phylogenetic trees with ClustalW, T-Coffee, and
PHYLIP. I have found the most difficult part to be identifying ortholgos for
the particular species that our lab is interested in studying. The orthology
databases provide a great deal of matches but each database requires its own
wrapper and some databases are stronger than others with particular species.


So far I have written wrappers to get ortholog IDs from InParanoid and then
fetch the sequences from either NCBI or BioMart. This provides good results
for most common species but not all. To handle rare species I have
implemented the Reverse Smallest Distance orthology algorithm to run
protein-protein searches. It is available at http://ortholog.us. I also have
automated scripts to align protein families, concatenate aligned families,
and create trees.

For GSoC I would like to write a module to abstract finding orthologs as
much as possible. This would greatly simplify creating custom evolutionary
trees for biologists. The module could fetch orthologs from TreeFam,
InParanoid, Harvard's Roundup, and Princeton's BLASTO. The module could also
provide support for producing alignments, concatenating alignments, removing
sections of gaps, and constructing trees. Ortholog identification could be
done with no dependency other than an internet connection. Alignments and
trees would require the user to have the appropriate tools installed.

The overhead of writing this type of code makes it difficult for
evolutionary biologists and bio wet labs to get a picture of evolutionary
relationships in specific groups of species. This module would aim to
simplify creating custom phylogenetic trees.

A timeline of milestones might look something like this:
Week 1-2: Stable wrappers for InParanoid
Week 3-4: Stable wrappers for Roundup
Week 5-6: Stable wrappers for Treefam
Week 6-7: Stable wrappers for BlastO
Week 8-9: Ortholog module to abstract the database wrappers
Week 10-11: Alignment and tree tools

Is there any interest in having such a project? I'd be grateful to get some
feedback either on or off list.

Best,
-Matthew Strand

From chapmanb at 50mail.com  Mon Apr  5 07:50:00 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 5 Apr 2010 07:50:00 -0400
Subject: [Biopython] SoC project (BioPython and PyCogent)
In-Reply-To: <y2r838208c01004041846q78520c57m17135145edbbbc61@mail.gmail.com>
References: <y2r838208c01004041846q78520c57m17135145edbbbc61@mail.gmail.com>
Message-ID: <20100405115000.GB62718@sobchak.mgh.harvard.edu>

Sofia;

> I'm working on an application for the Summer of Code project of
> integrating BioPython and PyCogent. 

Great -- glad to you hear you are interested in the project.

> I've looked through the list
> archives and saw Brad's general advice to other potential SoC
> applicants, but I thought I'd introduce myself and see if there was
> any advice specific to this project.

The overall goal is to provide integration between Biopython and
PyCogent so programmers can benefit from the unique features and
algorithms in each library. This has two general themes:

- Ensuring interoperability between core objects like sequences,
  alignments and phylogenetic trees.
- Using this interoperability to develop analysis workflows that
  utilize functionality from both libraries.

Within this broad scope you are free to orient your proposal to
whatever set of biological questions that interest you. We've tried
to sketch out some ideas we had on the GSoC page as a starting
point.

> I've used BioPython in the past
> and even explored the code a bit. I'm considering working on one or
> more of the bugs in Bugzilla if I can find time, and will work to
> familiarize myself with PyCogent. Are there any other concepts,
> projects, or people I should familiarize myself with (aside from
> what's listed on the ideas page, of course)? 

Proposals are due this Friday, April 9th and normally require a few
rounds of back and forth revisions to get to a competitive level. My
suggestion would be to focus on learning enough of Biopython and
PyCogent to write out a detailed project plan, with a week by week
description of activities and specific goals.

> As you can see from my
> GitHub and Google Code accounts, I've got some experience with open
> source projects, but please do suggest any specific tools or methods
> you think I should try to get up to speed on, as well. 

The open source work is great; definitely include this in your
proposal. A good outline to start with is:

- Project summary -- A short abstract describing what you hope to
  accomplish during the summer, how you plan to go about it, and
  what motivates you to work on the project.

- Personal summary -- Describe your background and how it will help
  you be successful during GSoC. Here is where you can sell yourself
  to all of the mentors ranking the project: why are you a good
  coder? Why is this project useful to use? How will working on the
  summer project encourage you to stay active in the community?

- Project plan -- The detailed week by week description of plans
  mentioned above.

Hope this helps,
Brad

From chapmanb at 50mail.com  Mon Apr  5 08:05:54 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 5 Apr 2010 08:05:54 -0400
Subject: [Biopython] GSoC Ortholog Module Proposal
In-Reply-To: <g2h2a63cc351004050359kbdeb99afva24ca1e2a0ac3871@mail.gmail.com>
References: <g2h2a63cc351004050359kbdeb99afva24ca1e2a0ac3871@mail.gmail.com>
Message-ID: <20100405120554.GC62718@sobchak.mgh.harvard.edu>

Matthew;
Thanks for the introduction and pointers to your work. Your
http://ortholog.us interface looks like a useful resource; it's
really nice to see web interfaces being developed with programmable
JSON APIs. Out of curiousity, is the code available for what you've
done so far?

> For GSoC I would like to write a module to abstract finding orthologs as
> much as possible. This would greatly simplify creating custom evolutionary
> trees for biologists. The module could fetch orthologs from TreeFam,
> InParanoid, Harvard's Roundup, and Princeton's BLASTO. The module could also
> provide support for producing alignments, concatenating alignments, removing
> sections of gaps, and constructing trees. Ortholog identification could be
> done with no dependency other than an internet connection. Alignments and
> trees would require the user to have the appropriate tools installed.
[...]
> Is there any interest in having such a project? I'd be grateful to get some
> feedback either on or off list.

This is a good project idea and nicely spec'ed out. One additional
direction that might also be worth exploring is using BioMart to
retrieve orthologs from the Ensembl Compara work. Here's a recent
thread on BioStar with the queries to use:

http://biostar.stackexchange.com/questions/569/how-do-i-match-orthologues-in-one-species-to-another-genome-scale

I don't know of Python programming interfaces to BioMart, but there
is a nice R bioconductor library that can be leveraged with Rpy2:

http://www.bioconductor.org/packages/bioc/html/biomaRt.html
http://rpy.sourceforge.net/rpy2.html

For the practical GSoC things, project proposals are due this
Friday, April 9th so time is running short. I'm unfortunately a bit 
over-committed as this point to mentor but hopefully someone will 
be available to step in that role. I'm happy to make suggestions on
the proposal as it comes together.

Thanks,
Brad

From bjorn_johansson at bio.uminho.pt  Mon Apr  5 09:50:25 2010
From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=)
Date: Mon, 5 Apr 2010 14:50:25 +0100
Subject: [Biopython] pro
Message-ID: <x2gc3ee7c591004050650k11e156bcib299eb2317e4c071@mail.gmail.com>

Hi,
I have a problem that may be related to biopython (or not).
I have written a plugin for a cross platform program (Wikidpad) that relies
on some biopython modules.
I do the development on ubuntu 9.10 and have Wikidpad installed using wine
to be able to test the functionality on windows.

Under wine I have added the following code to make biopython installed under
linux  available to the python interpreter (py2exe) under wine:

if sys.platform == 'win32':
    sys.path.append("z:\usr\local\lib\python2.6\dist-packages")
    sys.path.append("z:\usr\lib/python2.6")

line 40 in "SeqTools.py" below reads: from Bio import SeqIO
I get the error below when importing the module under wikidpad running under
wine

  File "C:\Program Files\WikidPad\user_extensions\SeqTools.py", line 40, in
<module>
  File "z:/usr/local/lib/python2.6/dist-packages\Bio\SeqIO\__init__.py",
line 303, in <module>
  File "z:/usr/local/lib/python2.6/dist-packages\Bio\SeqIO\InsdcIO.py", line
29, in <module>
  File "z:/usr/local/lib/python2.6/dist-packages\Bio\GenBank\__init__.py",
line 53, in <module>
  File
"z:/usr/local/lib/python2.6/dist-packages\Bio\GenBank\LocationParser.py",
line 319, in <module>
  File
"z:/usr/local/lib/python2.6/dist-packages\Bio\GenBank\LocationParser.py",
line 177, in __init__
  File "z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py", line
88, in __init__
  File "z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py", line
129, in collectRules
  File "z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py", line
101, in addRule
AttributeError: 'NoneType' object has no attribute 'split'

I wonder if anyone has an immediate idea of what I am doing wrong?
The python interpreter under wine seem to find the biopython modules.
I cannot understand the error that I get afterwards.....

grateful for help!
/bjorn

From eric.talevich at gmail.com  Mon Apr  5 11:48:04 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 5 Apr 2010 11:48:04 -0400
Subject: [Biopython] pro
In-Reply-To: <x2gc3ee7c591004050650k11e156bcib299eb2317e4c071@mail.gmail.com>
References: <x2gc3ee7c591004050650k11e156bcib299eb2317e4c071@mail.gmail.com>
Message-ID: <p2v3f6baf361004050848j4789ca6ez7dc4b08efa0ba005@mail.gmail.com>

2010/4/5 Bj?rn Johansson <bjorn_johansson at bio.uminho.pt>

> Hi,
> I have a problem that may be related to biopython (or not).
> I have written a plugin for a cross platform program (Wikidpad) that relies
> on some biopython modules.
> I do the development on ubuntu 9.10 and have Wikidpad installed using wine
> to be able to test the functionality on windows.
>
> Under wine I have added the following code to make biopython installed
> under
> linux  available to the python interpreter (py2exe) under wine:
> [...]
>

It looks like spark relies on the docstrings in Bio.GenBank.LocationParser.
Is there anything in py2exe that would strip the docstrings from compiled
modules? Some optimizations do this -- I think "python -O3" strips
docstrings, for instance.

-Eric


From p.j.a.cock at googlemail.com  Mon Apr  5 12:16:43 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Apr 2010 17:16:43 +0100
Subject: [Biopython] pro
In-Reply-To: <p2v3f6baf361004050848j4789ca6ez7dc4b08efa0ba005@mail.gmail.com>
References: <x2gc3ee7c591004050650k11e156bcib299eb2317e4c071@mail.gmail.com>
	<p2v3f6baf361004050848j4789ca6ez7dc4b08efa0ba005@mail.gmail.com>
Message-ID: <q2r320fb6e01004050916l78e0f3d3g3e3881402ff05600@mail.gmail.com>

2010/4/5 Eric Talevich <eric.talevich at gmail.com>
>
> It looks like spark relies on the docstrings in Bio.GenBank.LocationParser.
> Is there anything in py2exe that would strip the docstrings from compiled
> modules? Some optimizations do this -- I think "python -O3" strips
> docstrings, for instance.

You may be on to something there Eric.

Bj?rn, could compare your file:
z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py
with the version we provide:
http://github.com/biopython/biopython/blob/master/Bio/Parsers/spark.py
or:
http://biopython.org/SRC/biopython/Bio/Parsers/spark.py

In the medium term, I'd like to move the GenBank/EMBL location
parsing to something simpler and faster (using regular expressions)
and then deprecate Bio.GenBank.LocationParser and indeed the
whole of Bio.parsers (which just has a copy of spark). There is
a bug open on this with some code. But that isn't going to help
Bj?rn right now.

Peter


From stran104 at chapman.edu  Mon Apr  5 15:02:21 2010
From: stran104 at chapman.edu (Matthew Strand)
Date: Mon, 5 Apr 2010 12:02:21 -0700
Subject: [Biopython] GSoC Ortholog Module Proposal
Message-ID: <p2s2a63cc351004051202o5270b105x79266ca2f75c1ccf@mail.gmail.com>

> Thanks for the introduction and pointers to your work. Your
> http://ortholog.us interface looks like a useful resource; it's
> really nice to see web interfaces being developed with programmable
> JSON APIs. Out of curiousity, is the code available for what you've
> done so far?
>

Thanks, we have found it useful for finding unindexed orthologs. Fetching
results from the pre-compiled databases is faster but of course requires
writing wrappers that are time consuming to develop. The plan is to release
all code as an open source Django app with a paper that is in the works.
However, I'd be happy to share any code with mentors/organizers for
evaluation purposes off-list in the meantime.


>
> This is a good project idea and nicely spec'ed out. One additional
> direction that might also be worth exploring is using BioMart to
> retrieve orthologs from the Ensembl Compara work. Here's a recent
> thread on BioStar with the queries to use:
>
>
> http://biostar.stackexchange.com/questions/569/how-do-i-match-orthologues-in-one-species-to-another-genome-scale
>
> I don't know of Python programming interfaces to BioMart, but there
> is a nice R bioconductor library that can be leveraged with Rpy2:
>

I agree, this would be a good addition. I have some messy Python wrappers to
BioMart but the Rpy route would probably provide a more reliable solution
with less effort.


> http://www.bioconductor.org/packages/bioc/html/biomaRt.html
> http://rpy.sourceforge.net/rpy2.html
>
> For the practical GSoC things, project proposals are due this
> Friday, April 9th so time is running short. I'm unfortunately a bit
> over-committed as this point to mentor but hopefully someone will
> be available to step in that role. I'm happy to make suggestions on
> the proposal as it comes together.
>

Thanks, I hope so too. I will post a full proposal in the near future.
Feedback would of course be greatly appreciated. I'm a little unclear, do I
need a mentor to submit a proposal? Is writing a proposal a mute point
without a mentor?

Best,
-Matt Strand

From vincent at vincentdavis.net  Mon Apr  5 15:51:46 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Mon, 5 Apr 2010 13:51:46 -0600
Subject: [Biopython] Build CDF file
Message-ID: <j2h77e831101004051251t9650792ja5eb4eccc1d7cc33@mail.gmail.com>

The custom array for which I have data does not have a CDF file. I have been
told that others have changed the header on the CEL files to reference
different CDF file. That only kinda makes sense to me. I obviously have CEL
files. I also have the sequences that each probe matches and finally I have
genome match data. By that I mean I know which probes are a perfect match
and which are a mismatch and the location of the mismatch. Can I build a CDF
file from this? How?
Does it make sense to build a CDF for each hybrid(not sure thats the right
word) of the organism if the genome is known for each.

Not sure if this is better ask here or the BioConductor, If there is a
python solution I would try that first, I think.

I think the bioconductor package altcdfenvs
LINK<http://bioconductor.org/packages/2.5/bioc/html/altcdfenvs.html>
does
this.
I guess I should email Laurent Gautier, maybe he reads this :)


  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>

From biopython at maubp.freeserve.co.uk  Mon Apr  5 16:35:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 5 Apr 2010 21:35:20 +0100
Subject: [Biopython] Build CDF file
In-Reply-To: <j2h77e831101004051251t9650792ja5eb4eccc1d7cc33@mail.gmail.com>
References: <j2h77e831101004051251t9650792ja5eb4eccc1d7cc33@mail.gmail.com>
Message-ID: <q2v320fb6e01004051335v8e456933jd46cfd31bbb71ddf@mail.gmail.com>

On Mon, Apr 5, 2010 at 8:51 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> The custom array for which I have data does not have a CDF
> file...

Hi Vincent,

Did you mean to post this to the BioConductor mailing list?

Peter

From biopython at maubp.freeserve.co.uk  Mon Apr  5 16:53:42 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 5 Apr 2010 21:53:42 +0100
Subject: [Biopython] Build CDF file
In-Reply-To: <-3455855938884949614@unknownmsgid>
References: <j2h77e831101004051251t9650792ja5eb4eccc1d7cc33@mail.gmail.com>
	<q2v320fb6e01004051335v8e456933jd46cfd31bbb71ddf@mail.gmail.com>
	<-3455855938884949614@unknownmsgid>
Message-ID: <h2o320fb6e01004051353jb804b065m2448899c81b4d4d2@mail.gmail.com>

On Mon, Apr 5, 2010 at 9:46 PM, Vincent Davis <vincent at vincentdavis.com> wrote:
>
> No, but maybe I should. I was hopping for a python solution
>

Are these CDF files of yours NetCDF files?
http://en.wikipedia.org/wiki/NetCDF

If so, try Scientific.IO.NetCDF from Konrad Hinsen's ScientificPython
http://sourcesup.cru.fr/projects/scientific-py/

Peter

From chapmanb at 50mail.com  Tue Apr  6 08:26:27 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 6 Apr 2010 08:26:27 -0400
Subject: [Biopython] GSoC Ortholog Module Proposal
In-Reply-To: <p2s2a63cc351004051202o5270b105x79266ca2f75c1ccf@mail.gmail.com>
References: <p2s2a63cc351004051202o5270b105x79266ca2f75c1ccf@mail.gmail.com>
Message-ID: <20100406122627.GE66230@sobchak.mgh.harvard.edu>

Matthew;

> > Thanks for the introduction and pointers to your work. Your
> > http://ortholog.us interface looks like a useful resource; it's
> > really nice to see web interfaces being developed with programmable
> > JSON APIs. Out of curiousity, is the code available for what you've
> > done so far?
> 
> Thanks, we have found it useful for finding unindexed orthologs. Fetching
> results from the pre-compiled databases is faster but of course requires
> writing wrappers that are time consuming to develop. The plan is to release
> all code as an open source Django app with a paper that is in the works.
> However, I'd be happy to share any code with mentors/organizers for
> evaluation purposes off-list in the meantime.

Cool; definitely let us know on the mailing lists when the paper and
code are out. It would be fun to see.

> > For the practical GSoC things, project proposals are due this
> > Friday, April 9th so time is running short. I'm unfortunately a bit
> > over-committed as this point to mentor but hopefully someone will
> > be available to step in that role. I'm happy to make suggestions on
> > the proposal as it comes together.
> 
> Thanks, I hope so too. I will post a full proposal in the near future.
> Feedback would of course be greatly appreciated. I'm a little unclear, do I
> need a mentor to submit a proposal? Is writing a proposal a mute point
> without a mentor?

You will need a mentor and this is always the tough part of GSoC: there
are more good students and ideas than mentors and funded spots. I would
never discourage anyone from getting together a proposal; it is a good
exercise and helps you think through the work you are planning to do.
In terms of acceptance rates, it is lower when coming in later in the
process with your own ideas since mentors will have already settled on
a few ideas and begun feeling committed to students working on those.
However, nothing is locked down or decided until the deadline hits,
proposals are ranked by all of the mentors, and we see how many spots
we'll get from Google.

GSoC is kind of like interviewing job candidates without being sure how
many positions you'll have at the end. In summary, if you feel like the
proposal writing process would be interesting and useful to you, I'd
definitely encourage you to go for it and see where it takes you.

Brad

From bjorn_johansson at bio.uminho.pt  Wed Apr  7 05:33:39 2010
From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=)
Date: Wed, 7 Apr 2010 10:33:39 +0100
Subject: [Biopython] pro
In-Reply-To: <q2r320fb6e01004050916l78e0f3d3g3e3881402ff05600@mail.gmail.com>
References: <x2gc3ee7c591004050650k11e156bcib299eb2317e4c071@mail.gmail.com>
	<p2v3f6baf361004050848j4789ca6ez7dc4b08efa0ba005@mail.gmail.com>
	<q2r320fb6e01004050916l78e0f3d3g3e3881402ff05600@mail.gmail.com>
Message-ID: <p2yc3ee7c591004070233ob8752794o6c32dc942d333585@mail.gmail.com>

Hi,
thank you very much for the information, I think it has to do with the
docstrings, if I run with python -OO under linux, I get the same error msg.

as for the two spark files, they seem identical, spark.py is the one i
downloaded from
http://biopython.org/SRC/biopython/Bio/Parsers/spark.py:

diff -w spark.py /usr/local/lib/python2.6/dist-packages/Bio/Parsers/spark.py

produces no output at all.

I will try and find out if the optimization can be overridden for one file
only.

Thanks!
/bjorn


2010/4/5 Peter Cock <p.j.a.cock at googlemail.com>

> 2010/4/5 Eric Talevich <eric.talevich at gmail.com>
> >
> > It looks like spark relies on the docstrings in
> Bio.GenBank.LocationParser.
> > Is there anything in py2exe that would strip the docstrings from compiled
> > modules? Some optimizations do this -- I think "python -O3" strips
> > docstrings, for instance.
>
> You may be on to something there Eric.
>
> Bj?rn, could compare your file:
> z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py
> with the version we provide:
> http://github.com/biopython/biopython/blob/master/Bio/Parsers/spark.py
> or:
> http://biopython.org/SRC/biopython/Bio/Parsers/spark.py
>
> In the medium term, I'd like to move the GenBank/EMBL location
> parsing to something simpler and faster (using regular expressions)
> and then deprecate Bio.GenBank.LocationParser and indeed the
> whole of Bio.parsers (which just has a copy of spark). There is
> a bug open on this with some code. But that isn't going to help
> Bj?rn right now.
>
> Peter
>


-- 
______O_________oO________oO______o_______oO__
Bj?rn Johansson
Assistant Professor
Departament of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
http://www.bio.uminho.pt
http://sites.google.com/site/bjornhome
Work (direct) +351-253 601517
Private mob. +351-967 147 704
Dept of Biology (secretariate) +351-253 60 4310
Dept of Biology (fax) +351-253 678980


From p.j.a.cock at googlemail.com  Wed Apr  7 05:37:59 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 7 Apr 2010 10:37:59 +0100
Subject: [Biopython] pro
In-Reply-To: <p2yc3ee7c591004070233ob8752794o6c32dc942d333585@mail.gmail.com>
References: <x2gc3ee7c591004050650k11e156bcib299eb2317e4c071@mail.gmail.com>
	<p2v3f6baf361004050848j4789ca6ez7dc4b08efa0ba005@mail.gmail.com>
	<q2r320fb6e01004050916l78e0f3d3g3e3881402ff05600@mail.gmail.com>
	<p2yc3ee7c591004070233ob8752794o6c32dc942d333585@mail.gmail.com>
Message-ID: <s2h320fb6e01004070237v6cc2532dg70605411c7a4fc5d@mail.gmail.com>

010/4/7 Bj?rn Johansson <bjorn_johansson at bio.uminho.pt>:
> Hi,
> thank you very much for the information, I think it has to do with the
> docstrings, if I run with python -OO under linux, I get the same error msg.
>
> as for the two spark files, they seem identical, spark.py is the one i
> downloaded from
> http://biopython.org/SRC/biopython/Bio/Parsers/spark.py:
>
> diff -w spark.py /usr/local/lib/python2.6/dist-packages/Bio/Parsers/spark.py
>
> produces no output at all.

OK, thanks. I wanted to find out if py2exe was optimising the python
files by editing them to remove the docstrings. It seems not.

> I will try and find out if the optimization can be overridden for one file
> only.
>
> Thanks!
> /bjorn

Peter


From lunt at ctbp.ucsd.edu  Wed Apr  7 20:57:07 2010
From: lunt at ctbp.ucsd.edu (Bryan Lunt)
Date: Wed, 7 Apr 2010 17:57:07 -0700
Subject: [Biopython] StockholmIO replaces "." with "-", why?
Message-ID: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>

Greetings All!

It looks like line 364 of Bio.AlignIO.StockholmIO reads:

seqs[id] += seq.replace(".","-")

So when you load into memory alignments that mark gaps created to
allow alignment to inserts with ".", (such as PFam alignments or the
output of hmmer) that information is lost.

I know there must be a good reason for this, but I am finding it a
problem on my end..

-Bryan Lunt

From fuxin at umail.iu.edu  Wed Apr  7 21:40:02 2010
From: fuxin at umail.iu.edu (Fuxiao Xin)
Date: Wed, 7 Apr 2010 21:40:02 -0400
Subject: [Biopython] About Google Summer Code Project PDB-tidy
Message-ID: <o2yccb717081004071840hd84bad3dpbc7554978ee14b6c@mail.gmail.com>

Dear all,

I am a third year Phd student in Bioinformatics from Indiana University
Bloomington.  I am very in interested in the google summer code project of
biopython "PDB-Tidy: command-line tools for manipulating PDB files".

My own research needs extensive manipulation of PDB files, and I think  this
idea of adding more features to Bio.PDB and more command line options to
analyze/present PDB data is excellent. This project is of strong interest to
me since it will benefit my own research project as well.

Programming Skills: I use perl and python during my daily research. I am now
working on developing a new functional site predictor using protein
structure information. The code will be open source, but the work is under
review so the code is not released yet.

My project plan:

week1
1. Renumber residues starting from 1 (or N)
function name: renumberPDB, given a pdb file, rename the atom field
numbering of the file to remove missing amino acids
communicate with mentors to set standards of the code to follow for the rest
of the functions
create work log to keep track of process;

week2-3
2. Select a portion of the structure -- models, chains, etc. -- and write it
to a new file (PDB, FASTA, and other formats)
function name: rewritePDB, inputs will be a particular portion of a PDB file
you want to write out(support 'chain', 'model', 'atom'), a file format(PDB,
fasta), and the output name.
3. Perform some basic, well-established measures of model quality/validity
function name: PDBquality
the function will report RESOLUTION and ? of the structure
4. extract disorder region in PDB structure
function name: PDBdisorder
report missing residues in the structure atom field

week3-4
5. make a function to draw a Ramachandran plot
function name: ramaPLOT
combine the two steps(calcualting torsion angles and draw the plot) into one
function, give the option to draw the plot or not

week5
6. open PDB files in the window for visulization, visulize PDBsuperpose
results, output RMSD
function name: superposePDB
the function will look like the PDBsuperpose function in matlab; use
Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other
visulization tool to see the results
week6
7. write a function to extract all experimental conditions of a PDB file,
includes PH, temperature, and salt
function name: PDBconditon
it will be easy to get PH and temperature information, but for salt, it will
be hard to parse because there is no general rule of such information in the
PDB file; parse REMARK 200 field;

week7-8
8. extract PTM,
function name: PDBptm
difficult: the Post-translational modification annotation in PDB is not
consistant, need to make a list of PTMs to work on
parse MODRES field

week9-10
9. extract ligand binding information
function name: PDBligand
parse HETNAM field


Other obligations:  I am aware that google summer code starts from May 24th,
but I will have a review paper with my advisor due on June 1st, I hope it
will be OK for me to start after June 1st, and I will makeup the first week
in Auguest.

Best,
Fuxiao

From eric.talevich at gmail.com  Wed Apr  7 23:48:08 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 7 Apr 2010 23:48:08 -0400
Subject: [Biopython] About Google Summer Code Project PDB-tidy
In-Reply-To: <o2yccb717081004071840hd84bad3dpbc7554978ee14b6c@mail.gmail.com>
References: <o2yccb717081004071840hd84bad3dpbc7554978ee14b6c@mail.gmail.com>
Message-ID: <v2t3f6baf361004072048rc5e9a9e7kffb744d922733362@mail.gmail.com>

Hi Fuxiao,

Thanks for your interest in this project. I see you've been working on this
proposal for awhile already, so although the submission deadline is very
close, I think you'll still be OK. I've interleaved my comments with your
proposal below:

On Wed, Apr 7, 2010 at 9:40 PM, Fuxiao Xin <fuxin at umail.iu.edu> wrote:

> Dear all,
>
> I am a third year Phd student in Bioinformatics from Indiana University
> Bloomington.  I am very in interested in the google summer code project of
> biopython "PDB-Tidy: command-line tools for manipulating PDB files".
>
> My own research needs extensive manipulation of PDB files, and I think
>  this
> idea of adding more features to Bio.PDB and more command line options to
> analyze/present PDB data is excellent. This project is of strong interest
> to
> me since it will benefit my own research project as well.
>

Good to hear. Does your lab have a website? This project requires some
knowledge of structural biology, so it helps if we can see what specific
research you've already done in that area.

Programming Skills: I use perl and python during my daily research. I am now
> working on developing a new functional site predictor using protein
> structure information. The code will be open source, but the work is under
> review so the code is not released yet.
>

Is there any other programming work you've done in the past that you could
let us see? It doesn't have to be part of an existing open-source project;
even some functioning snippets posted somewhere would help us get a sense of
your coding style and abilities. Examples where you've used Biopython or
another established toolkit for working with PDB files or other scientific
data would be especially useful.

We also like to see that you're familiar with a project's build tools, which
in Biopython's case is GitHub and the standard Python mechanisms. So, if you
could upload some of your prior work to GitHub and send us the link, that
would be ideal.


My project plan:
>
> week1
> 1. Renumber residues starting from 1 (or N)
> function name: renumberPDB, given a pdb file, rename the atom field
> numbering of the file to remove missing amino acids
> communicate with mentors to set standards of the code to follow for the
> rest
> of the functions
> create work log to keep track of process;
>

Biopython's coding standards generally follow an earlier version of PEP 8;
hopefully you can pick it up quickly just by reading the source code for
Bio.PDB -- so you don't really need that item listed here.

In the past, students have maintained their weekly schedules on a wiki or
other public document, and updated them continually throughout the summer.
This functions as a work log, in a way. You would also have an e-mail record
of your work from your weekly reports to this list.

week2-3
> 2. Select a portion of the structure -- models, chains, etc. -- and write
> it
> to a new file (PDB, FASTA, and other formats)
> function name: rewritePDB, inputs will be a particular portion of a PDB
> file
> you want to write out(support 'chain', 'model', 'atom'), a file format(PDB,
> fasta), and the output name.
> 3. Perform some basic, well-established measures of model quality/validity
> function name: PDBquality
> the function will report RESOLUTION and ? of the structure
> 4. extract disorder region in PDB structure
> function name: PDBdisorder
> report missing residues in the structure atom field
>

These tasks seem reasonable. You don't need to commit to specific function
names yet; it would be more helpful to describe the overall module layout
you're planning, and list the dependencies for each (especially the
components of Bio.PDB that come into play).


> week3-4
> 5. make a function to draw a Ramachandran plot
> function name: ramaPLOT
> combine the two steps(calcualting torsion angles and draw the plot) into
> one
> function, give the option to draw the plot or not
>

This task has a number of dependencies which I think you should list and
describe here. Because of those dependencies there's a significant chance of
it taking longer than you planned -- so I'd recommend moving it to after the
midterm evaluations, wherever those fit into your schedule.

week5
> 6. open PDB files in the window for visulization, visulize PDBsuperpose
> results, output RMSD
> function name: superposePDB
> the function will look like the PDBsuperpose function in matlab; use
> Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other
> visulization tool to see the results
>

Would you build Python wrappers for interacting with the chosen
visualization tool, or just write a set of files and launch the viewer in a
script?


> week6
> 7. write a function to extract all experimental conditions of a PDB file,
> includes PH, temperature, and salt
> function name: PDBconditon
> it will be easy to get PH and temperature information, but for salt, it
> will
> be hard to parse because there is no general rule of such information in
> the
> PDB file; parse REMARK 200 field;
>

Sounds handy. Would your script write out a report combining all of this
info, or just extract requested elements?


> week7-8
> 8. extract PTM,
> function name: PDBptm
> difficult: the Post-translational modification annotation in PDB is not
> consistant, need to make a list of PTMs to work on
> parse MODRES field
>
> week9-10
> 9. extract ligand binding information
> function name: PDBligand
> parse HETNAM field
>

Good. Some of these later items sound straightforward enough that it would
be better to tackle them earlier in the summer.


> Other obligations:  I am aware that google summer code starts from May
> 24th,
> but I will have a review paper with my advisor due on June 1st, I hope it
> will be OK for me to start after June 1st, and I will makeup the first week
> in Auguest.
>

How much of the "community bonding period" will this occupy? The guideline
is that you get set up with the build system, read documentation and do
background research part-time between GSoC acceptance and May 24, and start
writing code full-time on May 24. You can make up for a gap in your project
plan by doing extra preparation before coding starts; would this be possible
for you?

Finally, the GSoC administration app (socghop.appspot.com) gets crowded as
the deadline approaches, so it's best if you register yourself there and
take care of the administrivia as soon as you can to avoid any trouble on
Friday.

Best regards,
Eric

From rozziite at gmail.com  Wed Apr  7 23:48:16 2010
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Wed, 7 Apr 2010 23:48:16 -0400
Subject: [Biopython] About Google Summer Code Project PDB-tidy
In-Reply-To: <o2yccb717081004071840hd84bad3dpbc7554978ee14b6c@mail.gmail.com>
References: <o2yccb717081004071840hd84bad3dpbc7554978ee14b6c@mail.gmail.com>
Message-ID: <q2l4057d3bf1004072048k9c3f348au586c61bf9d2902f0@mail.gmail.com>

Hi Fuxiao,

Good start on the application! Some comments below.

On Wed, Apr 7, 2010 at 9:40 PM, Fuxiao Xin <fuxin at umail.iu.edu> wrote:
> Dear all,
>
> I am a third year Phd student in Bioinformatics from Indiana University
> Bloomington. ?I am very in interested in the google summer code project of
> biopython "PDB-Tidy: command-line tools for manipulating PDB files".
>
> My own research needs extensive manipulation of PDB files, and I think ?this
> idea of adding more features to Bio.PDB and more command line options to
> analyze/present PDB data is excellent. This project is of strong interest to
> me since it will benefit my own research project as well.
>
> Programming Skills: I use perl and python during my daily research. I am now
> working on developing a new functional site predictor using protein
> structure information. The code will be open source, but the work is under
> review so the code is not released yet.
>
> My project plan:
>
> week1
> 1. Renumber residues starting from 1 (or N)
> function name: renumberPDB, given a pdb file, rename the atom field
> numbering of the file to remove missing amino acids
> communicate with mentors to set standards of the code to follow for the rest
> of the functions
> create work log to keep track of process;
>
> week2-3
> 2. Select a portion of the structure -- models, chains, etc. -- and write it
> to a new file (PDB, FASTA, and other formats)
> function name: rewritePDB, inputs will be a particular portion of a PDB file
> you want to write out(support 'chain', 'model', 'atom'), a file format(PDB,
> fasta), and the output name.
> 3. Perform some basic, well-established measures of model quality/validity
> function name: PDBquality
> the function will report RESOLUTION and ? of the structure

Maybe you can get some inspiration of measures of model
quality/validity from PDBREPORT database [0] and WHAT_IF [1] software.

[0] http://swift.cmbi.ru.nl/gv/pdbreport/
[1] http://swift.cmbi.ru.nl/whatif/


> 4. extract disorder region in PDB structure
> function name: PDBdisorder
> report missing residues in the structure atom field
>
> week3-4
> 5. make a function to draw a Ramachandran plot
> function name: ramaPLOT
> combine the two steps(calcualting torsion angles and draw the plot) into one
> function, give the option to draw the plot or not
>
> week5
> 6. open PDB files in the window for visulization, visulize PDBsuperpose
> results, output RMSD
> function name: superposePDB
> the function will look like the PDBsuperpose function in matlab; use
> Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other
> visulization tool to see the results
> week6
> 7. write a function to extract all experimental conditions of a PDB file,
> includes PH, temperature, and salt
> function name: PDBconditon
> it will be easy to get PH and temperature information, but for salt, it will
> be hard to parse because there is no general rule of such information in the
> PDB file; parse REMARK 200 field;
>
> week7-8
> 8. extract PTM,
> function name: PDBptm
> difficult: the Post-translational modification annotation in PDB is not
> consistant, need to make a list of PTMs to work on
> parse MODRES field
>
> week9-10
> 9. extract ligand binding information
> function name: PDBligand
> parse HETNAM field
>
>
> Other obligations: ?I am aware that google summer code starts from May 24th,
> but I will have a review paper with my advisor due on June 1st, I hope it
> will be OK for me to start after June 1st, and I will makeup the first week
> in Auguest.
>
> Best,
> Fuxiao
> _______________________________________________
> Biopython mailing list ?- ?Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From fuxin at indiana.edu  Thu Apr  8 03:40:36 2010
From: fuxin at indiana.edu (Fuxiao Xin)
Date: Thu, 8 Apr 2010 03:40:36 -0400
Subject: [Biopython] About Google Summer Code Project PDB-tidy
In-Reply-To: <v2t3f6baf361004072048rc5e9a9e7kffb744d922733362@mail.gmail.com>
References: <o2yccb717081004071840hd84bad3dpbc7554978ee14b6c@mail.gmail.com>
	<v2t3f6baf361004072048rc5e9a9e7kffb744d922733362@mail.gmail.com>
Message-ID: <m2xccb717081004080040u3dfac46by28bcb421d734b504@mail.gmail.com>

hi Eric and Diana,

Thanks for your quick reply.

For the quality/validation problem, thanks Diana for pointing me to the two
resources,  I am surprised that there are so many "problems" defined for PDB
files, and obviously  I underestimate this task, and I think it's a very
interesting problem to study and  I'd like to devote more time on this task,
 I am thinking to make this task the main focus of my first period
coding(before midterm check).  What do you think?

For Eric's responses, please find my reply in line.

My own research needs extensive manipulation of PDB files, and I think  this
>> idea of adding more features to Bio.PDB and more command line options to
>> analyze/present PDB data is excellent. This project is of strong interest
>> to
>> me since it will benefit my own research project as well.
>>
>
> Good to hear. Does your lab have a website? This project requires some
> knowledge of structural biology, so it helps if we can see what specific
> research you've already done in that area.
>

Our lab's website is : http://www.informatics.indiana.edu/predrag/ , and one
main focus of our lab is PTM and disorder, both need to deal with PDB files.
A poster title shows my protein structure-based kernel work:*
http://www.iscb.org/rocky09-program/rocky09-poster-presenters-abstracts,
they didn't put the abstract online. I could send you the abstract if you
are interested.  *


> Programming Skills: I use perl and python during my daily research. I am
>> now
>> working on developing a new functional site predictor using protein
>> structure information. The code will be open source, but the work is under
>> review so the code is not released yet.
>>
>
> Is there any other programming work you've done in the past that you could
> let us see? It doesn't have to be part of an existing open-source project;
> even some functioning snippets posted somewhere would help us get a sense of
> your coding style and abilities. Examples where you've used Biopython or
> another established toolkit for working with PDB files or other scientific
> data would be especially useful.
>
We also like to see that you're familiar with a project's build tools, which
> in Biopython's case is GitHub and the standard Python mechanisms. So, if you
> could upload some of your prior work to GitHub and send us the link, that
> would be ideal.
>

I put some of my python code here:
http://github.com/fuxiaoxin/my_python_code. I don't have code in python
using Bio.PDB. For parsing PDB, my code are in perl for the sake of its
regular expression, I seldomly use bioperl or biopython in the past, I write
all my own code, that's also why I think I am very clear of all kinds of
problems in PDB files. I am quite surprised to find Bio.PDB already have so
many modules for various functions. I could upload some of my perl functions
if you would like to have a look: I have functions similar to PDBparser,
NeighborSearch, DSSP, NACCESS.

I have to say I am not very familiar with the build tools of python. But I
hope to learn it during the bonding period. I just guided myself through to
upload my codes to Github, :)

My project plan:
>>
>> week1
>> 1. Renumber residues starting from 1 (or N)
>> function name: renumberPDB, given a pdb file, rename the atom field
>> numbering of the file to remove missing amino acids
>> communicate with mentors to set standards of the code to follow for the
>> rest
>> of the functions
>> create work log to keep track of process;
>>
>
> Biopython's coding standards generally follow an earlier version of PEP 8;
> hopefully you can pick it up quickly just by reading the source code for
> Bio.PDB -- so you don't really need that item listed here.
>
>
I will learn from Bio.PDB source code and remove this one.


> In the past, students have maintained their weekly schedules on a wiki or
> other public document, and updated them continually throughout the summer.
> This functions as a work log, in a way. You would also have an e-mail record
> of your work from your weekly reports to this list.
>

That's great to know.


> week2-3
>> 2. Select a portion of the structure -- models, chains, etc. -- and write
>> it
>> to a new file (PDB, FASTA, and other formats)
>> function name: rewritePDB, inputs will be a particular portion of a PDB
>> file
>> you want to write out(support 'chain', 'model', 'atom'), a file
>> format(PDB,
>> fasta), and the output name.
>> 3. Perform some basic, well-established measures of model quality/validity
>> function name: PDBquality
>> the function will report RESOLUTION and ? of the structure
>> 4. extract disorder region in PDB structure
>> function name: PDBdisorder
>> report missing residues in the structure atom field
>>
>
> These tasks seem reasonable. You don't need to commit to specific function
> names yet; it would be more helpful to describe the overall module layout
> you're planning, and list the dependencies for each (especially the
> components of Bio.PDB that come into play).
>

I will make a new  proposal with these details by tomorrow.


>
>> week3-4
>> 5. make a function to draw a Ramachandran plot
>> function name: ramaPLOT
>> combine the two steps(calcualting torsion angles and draw the plot) into
>> one
>> function, give the option to draw the plot or not
>>
>
> This task has a number of dependencies which I think you should list and
> describe here. Because of those dependencies there's a significant chance of
> it taking longer than you planned -- so I'd recommend moving it to after the
> midterm evaluations, wherever those fit into your schedule.
>

 I will add more details here.


> week5
>> 6. open PDB files in the window for visulization, visulize PDBsuperpose
>> results, output RMSD
>> function name: superposePDB
>> the function will look like the PDBsuperpose function in matlab; use
>> Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other
>> visualization tool to see the results
>>
>
> Would you build Python wrappers for interacting with the chosen
> visualization tool, or just write a set of files and launch the viewer in a
> script?
>

I am thinking of launching the script, since those PDB visualization tools
already have very nice command line options and interfaces.  But I think it
is really important to be able to visualize the structure on the
fly, especially when you are doing PDB superimpose.


>  week6
>> 7. write a function to extract all experimental conditions of a PDB file,
>> includes PH, temperature, and salt
>> function name: PDBconditon
>> it will be easy to get PH and temperature information, but for salt, it
>> will
>> be hard to parse because there is no general rule of such information in
>> the
>> PDB file; parse REMARK 200 field;
>>
>
> Sounds handy. Would your script write out a report combining all of this
> info, or just extract requested elements?
>

I am thinking to put the results into a variable instead of a report, since
it will be great for batch processing, and display the results immediately
in interactive mode.

>
> Other obligations:  I am aware that google summer code starts from May
>> 24th,
>> but I will have a review paper with my advisor due on June 1st, I hope it
>> will be OK for me to start after June 1st, and I will makeup the first
>> week
>> in Auguest.
>>
>
> How much of the "community bonding period" will this occupy? The guideline
> is that you get set up with the build system, read documentation and do
> background research part-time between GSoC acceptance and May 24, and start
> writing code full-time on May 24. You can make up for a gap in your project
> plan by doing extra preparation before coding starts; would this be possible
> for you?
>

I think the bonding period will be really important for me to get known
about the python build tools, and of course other stuff you mentors suggest
me to learn,  so I will devote my time for "bonding".  But since I will get
busy near the end of May, I plan to start early and do things more
efficiently.


>
> Finally, the GSoC administration app (socghop.appspot.com) gets crowded as
> the deadline approaches, so it's best if you register yourself there and
> take care of the administrivia as soon as you can to avoid any trouble on
> Friday.
>

Thanks for the reminding. I will incorporate you and Diana's suggestions to
make a new version of proposal, by tomorrow night.  But the idea is,  the
main project for the first period would be the quality/validation task , and
the second period will be the Ramachandran plot.  And I will fill in the
time with other small functions.


Thanks,
Fuxiao

From biopython at maubp.freeserve.co.uk  Thu Apr  8 04:04:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 Apr 2010 09:04:27 +0100
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
Message-ID: <g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>

On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
> Greetings All!
>
> It looks like line 364 of Bio.AlignIO.StockholmIO reads:
>
> seqs[id] += seq.replace(".","-")
>
> So when you load into memory alignments that mark gaps created to
> allow alignment to inserts with ".", (such as PFam alignments or the
> output of hmmer) that information is lost.
>
> I know there must be a good reason for this, but I am finding it a
> problem on my end..
>
> -Bryan Lunt

Hi Bryan,

Yes, is it done deliberately. The dot is a problem - it has a quite
specific meaning of "same as above" on other alignment file
formats, while "-" is an almost universal shorthand for gap/insertion.
Consider the use case of Stockholm to PHYLIP/FASTA/Clustal
conversion.

Have you got a sample output file we can use as a unit test or
at least discuss? As I recall, on the PFAM alignments I looked
at there was no data loss by doing the dot to dash mapping.

Peter

From sma.hmc at gmail.com  Thu Apr  8 05:41:26 2010
From: sma.hmc at gmail.com (Singer Ma)
Date: Thu, 8 Apr 2010 02:41:26 -0700
Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability
Message-ID: <n2g62ed8c081004080241x380860bcibdd4acc5b8fe3acb@mail.gmail.com>

I am a junior Computer Science major with heavy bioinformatic leanings
at Harvey Mudd College. I know that it is very late for new summer of
code applications, but I was wondering if you could have a look at my
proposed schedule to give me some pointers and answer a few questions.
I am also considering applying for the project involving adding more
ways to use R through python, but I was unsure of which project had
more users who wanted it completed.

Questions:
What does it mean by BioPython's acquired sequences? I can't seem to
find out what or where information about "acquired sequences" is.
Thus, I do not discuss anything about it in my current proposal.

For the creation of workflows, do there already exist use and test
cases for this or would I be best off looking for ones in papers and
trying to mimic them? Right now, I have an example paper where the
interoperability would have been helpful.

Any other use cases I should immediately consider in my proposal?

My current proposed schedule:

For Bio Python and PyCogent interoperability.
Week 1: Familiarization with the code and soliciting requests. While
what seems intuitive to me might not seem so to others. It would be
best to spend this time to determine a group of people who would
highly benefit from the interoperability and ask them for what they
would look for. For example, would they rather use one, save the data,
and use the other. Would they want to use them directly. Basically, I
want to get a good idea of how this code will be used before making my
own decisions on how I think people will use it. Also important here
is to create sets of data which can be used later on the process.

Week 2 and 3: Code converting PyCogent and BioPython. The core objects
in each package seem like they should not be too difficult to convert.
This step will involve looking into the documentation and coding for
PyCogent and BioPython, to determine what the core objects contain for
each. One possible problem here is if either PyCogent or BioPython
core objects use heavy subclassing, as determining subclassing in
Python has been a nightmare in the past. Testing at this point will
likely involve going through the entire round trip conversion, and
seeing if everything looks the same.

Week 4: Ensure that conversions allow the use of data from one program
to the other. The workflows of codon usage to clustering code can be
tested. One possible test set is from Sharp et. al. 1986. Here they
found different codon usage for different genes. Additionally, it
should be considered how codon usage can be used to help with making
biologically accurate clusters.

Week 5: Familiarize with phyloXML and make interoperable with
PyCogent. phyloXML has already been added with BioPython. Making
phyloXML work with PyCogent could be based on how it was adapted for
BioPython. Clear risks here include problems with making sure that the
API for phyloXML in PyCogent gives an intuitive interface to use
phyloXML.

Week 6 and 7: Adapt PyCogent to query genomics databases. Currently
there is at least some support for PyCogent to query ENSEMBL. It seems
like it would be useful to query other genomics databases such as
Entrez of NCBI. Unfortunately, it seems like NCBI only has PERL
queries into their MySQL database. Ideally, if everything previously
has been alright, the conversion of PyCogent to BioPython forms shoudl
already be accounted for.

Week 8-12: Slip days and additional features. The initial set of use
cases will surely expand and this is extra time to allow for those use
cases to be accounted for.

Thanks,
Singer Ma

From biopython at maubp.freeserve.co.uk  Thu Apr  8 06:04:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 Apr 2010 11:04:10 +0100
Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability
In-Reply-To: <n2g62ed8c081004080241x380860bcibdd4acc5b8fe3acb@mail.gmail.com>
References: <n2g62ed8c081004080241x380860bcibdd4acc5b8fe3acb@mail.gmail.com>
Message-ID: <l2l320fb6e01004080304p44ad227arbd2fa9ab9cc11f4b@mail.gmail.com>

On Thu, Apr 8, 2010 at 10:41 AM, Singer Ma <sma.hmc at gmail.com> wrote:
> I am a junior Computer Science major with heavy bioinformatic leanings
> at Harvey Mudd College. I know that it is very late for new summer of
> code applications, but I was wondering if you could have a look at my
> proposed schedule to give me some pointers and answer a few questions.
> I am also considering applying for the project involving adding more
> ways to use R through python, but I was unsure of which project had
> more users who wanted it completed.
>
> Questions:
> What does it mean by BioPython's acquired sequences? I can't seem to
> find out what or where information about "acquired sequences" is.
> Thus, I do not discuss anything about it in my current proposal.

http://www.biopython.org/wiki/Google_Summer_of_Code#Biopython_and_PyCogent_interoperability

You mean "Connecting Biopython acquired sequences to PyCogent's
alignment, phylogenetic tree preparation and tree visualization code."?

I think Brad means using Biopython to load (parse) sequence data (e.g.
with Bio.AlignIO), and then give this to PyCogent. i.e. Acquire data in
the sense of get/load data.

> Week 6 and 7: Adapt PyCogent to query genomics databases. Currently
> there is at least some support for PyCogent to query ENSEMBL. It seems
> like it would be useful to query other genomics databases such as
> Entrez of NCBI. Unfortunately, it seems like NCBI only has PERL
> queries into their MySQL database. ...

Are you are talking about the NCBI Entrez Utitlites (E-Utils)? Those are
language neutral and we have Bio.Entrez to support them in Biopython.

Peter

From biopython at maubp.freeserve.co.uk  Thu Apr  8 06:26:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 Apr 2010 11:26:10 +0100
Subject: [Biopython] Biopython 1.54b test failures
In-Reply-To: <u2p320fb6e01004021622i9a350a0eqfaf9663b5d6e9e62@mail.gmail.com>
References: <4BB67835.7030303@uci.edu>
	<u2p320fb6e01004021622i9a350a0eqfaf9663b5d6e9e62@mail.gmail.com>
Message-ID: <u2x320fb6e01004080326y796fba45lcdc476d9acc9c5c3@mail.gmail.com>

On Sat, Apr 3, 2010 at 12:22 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> I recall trying the universal read lines thing before without
> success in the SCOP tests - maybe it was this line 72 thing
> that I missed. I'll take another look at this next week (when
> I have access to a Windows machine).
>

You are right - that does make the two SCOP tests pass on
Windows without having to first convert the SCOP example
files from Unix to DOS/Windows newlines. Checked in.
Would you like to be credited for this in the NEWS and
CONTRIB files?

Thanks,

Peter

From sma.hmc at gmail.com  Thu Apr  8 06:31:10 2010
From: sma.hmc at gmail.com (Singer Ma)
Date: Thu, 8 Apr 2010 03:31:10 -0700
Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability
In-Reply-To: <l2l320fb6e01004080304p44ad227arbd2fa9ab9cc11f4b@mail.gmail.com>
References: <n2g62ed8c081004080241x380860bcibdd4acc5b8fe3acb@mail.gmail.com>
	<l2l320fb6e01004080304p44ad227arbd2fa9ab9cc11f4b@mail.gmail.com>
Message-ID: <w2m62ed8c081004080331l8522b8dm994ccaac9c57761a@mail.gmail.com>

> You mean "Connecting Biopython acquired sequences to PyCogent's
> alignment, phylogenetic tree preparation and tree visualization code."?
>
> I think Brad means using Biopython to load (parse) sequence data (e.g.
> with Bio.AlignIO), and then give this to PyCogent. i.e. Acquire data in
> the sense of get/load data.

Ah, so, its just the most straightforward use of the conversion tools
that would be made. Sorry, I thought I was missing something here.
Shouldn't be this be taken care of in the first use case of "Allow
round-trip conversion between biopython and pycogent core objects
(sequence, alignment, tree, etc.)."? Or does this require me to
determine how the interactions will be made?

>
> Are you are talking about the NCBI Entrez Utitlites (E-Utils)? Those are
> language neutral and we have Bio.Entrez to support them in Biopython.

Ah, I misread my information, so NCBI Entrez can already be queried.
What exactly do we need to get from ENSEMBL that isn't already
supported then?

Singer

From chapmanb at 50mail.com  Thu Apr  8 08:39:53 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 8 Apr 2010 08:39:53 -0400
Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability
In-Reply-To: <n2g62ed8c081004080241x380860bcibdd4acc5b8fe3acb@mail.gmail.com>
References: <n2g62ed8c081004080241x380860bcibdd4acc5b8fe3acb@mail.gmail.com>
Message-ID: <20100408123953.GG911@sobchak.mgh.harvard.edu>

Singer;
Thanks for the introduction and initial project plan. Glad that you
are interested. I'll try to tackle a few of the specific points
Peter has not already talked about, and suggest some specifics for
the application.

> Questions:
> What does it mean by BioPython's acquired sequences? I can't seem to
> find out what or where information about "acquired sequences" is.
> Thus, I do not discuss anything about it in my current proposal.

Following up on what Peter mentioned, what we're trying to say there
is to use the results from step 1 (interoperability) to create
unique workflows that use both Biopython and PyCogent. This is a
suggested workflow to utilize some of the strengths of both
packages.

> For the creation of workflows, do there already exist use and test
> cases for this or would I be best off looking for ones in papers and
> trying to mimic them? Right now, I have an example paper where the
> interoperability would have been helpful.

Yes, that is exactly the right approach. The ideas we've suggested
are just brainstorming; please select workflows that are interesting
to you.

> My current proposed schedule:
> 
> For Bio Python and PyCogent interoperability.
> Week 1: Familiarization with the code and soliciting requests. While
> what seems intuitive to me might not seem so to others. It would be
> best to spend this time to determine a group of people who would
> highly benefit from the interoperability and ask them for what they
> would look for. For example, would they rather use one, save the data,
> and use the other. Would they want to use them directly. Basically, I
> want to get a good idea of how this code will be used before making my
> own decisions on how I think people will use it. Also important here
> is to create sets of data which can be used later on the process.

All of this type of non-coding work should be done in the community
bonding period, from April 26th to the start of coding. When week 1
hits, you want to be ready to code. See the timeline for more
specific information on dates:

http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/timeline

> Week 5: Familiarize with phyloXML and make interoperable with
> PyCogent. phyloXML has already been added with BioPython. Making
> phyloXML work with PyCogent could be based on how it was adapted for
> BioPython. Clear risks here include problems with making sure that the
> API for phyloXML in PyCogent gives an intuitive interface to use
> phyloXML.

Again, all of the non-coding activities should be moved to before
the actual coding period. In your timeline you want to focus on code
deliverables for each week. Of course there will be learning and
reading during the program, but you want to be sure to have a code
centric focus.

> Week 6 and 7: Adapt PyCogent to query genomics databases. Currently
> there is at least some support for PyCogent to query ENSEMBL. It seems
> like it would be useful to query other genomics databases such as
> Entrez of NCBI. Unfortunately, it seems like NCBI only has PERL
> queries into their MySQL database. Ideally, if everything previously
> has been alright, the conversion of PyCogent to BioPython forms shoudl
> already be accounted for.

Following up on your discussion with Peter, you should think about
some workflows that use Biopython Entrez queries and PyCogent
Ensembl queries to answer interesting questions that could not be
done with either. This should help to focus your ideas on integration 
and workflows, as opposed to implementing new functionality.

> Week 8-12: Slip days and additional features. The initial set of use
> cases will surely expand and this is extra time to allow for those use
> cases to be accounted for.

You need to continue your detailed project plan for the entire
period. See the examples in the NESCent application documentation to 
get an idea of the level of detail in accepted projects from previous years:

https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#When_you_apply
http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&single=true&gid=0&output=html

Practically, applications are due tomorrow, so you should have a
submission sent in to OpenBio through the GSoC interface
(http://socghop.appspot.com).

Hope this helps,
Brad

From vincent at vincentdavis.net  Thu Apr  8 14:33:41 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Thu, 8 Apr 2010 12:33:41 -0600
Subject: [Biopython] affy CEL and CDF reader
Message-ID: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>

I ended up writing my own modules for reading both affy Cel and CDF files.
Long story as to why I did not just use what was available in biopython.
I plan on making what I have done available to the biopython and will upload
it as a fork. I will outline what ways what I have is different below.
My question is: Are there any improvements(features) others would like to
see beyond what is avalible in the current CelFile.py?
I saw some posts a month or so ago about checking for consistency in cell
file, I think it was something about making sure the stated number of probes
was consistent with the intensity measurements.

What is different,
when an file is read Affycel.read('file') many atributes are set. for
example
a = affcel()
a.read('testfile')
a.filename,
a.version,
a.header.items()  # a dictionary of all header items
a.num_intensity
a.intensity
a.num_masks
a.masks
a.num_outliers
a.outliers
a.numb_modified
a.modified

I plan to add the ability return/call intensity values with our with
outliers or mask values.
All data is currently store in numpy structured arrays,
currently a.intensity returns the structured array, but I plan on making it
an option to easily choose how this is returned.
also what to make an optional normalized intensity array so that if the data
is normalized it can be stored with the affycel instance. My use case was
that I was opening about 80 cel files and reading them in was slow. this
allowed me to read each file as an instance of affycel stored in a list that
I then pickled. It was then much faster to open them.

Are improvements to the CelFile.py are of value to biopython?

I hope to have the code pushed up to my fork on github late tonight. Just
thought I would ask if there was any suggestion before I did.

Also have an CDF file reader, but only have done some basic testing. I don't
have a lot of use for this, do other biopython users?

I am kinda working in a vacuum and am trying to get more involved in
projects to improve my skills and knowledge. Any suggestions would be
appreciated.

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>

From sdavis2 at mail.nih.gov  Thu Apr  8 14:56:12 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 8 Apr 2010 14:56:12 -0400
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
Message-ID: <j2h264855a01004081156ge3d1a9d3nd8e1e7aa156c5a5a@mail.gmail.com>

On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> I ended up writing my own modules for reading both affy Cel and CDF files.
> Long story as to why I did not just use what was available in biopython.
> I plan on making what I have done available to the biopython and will upload
> it as a fork. I will outline what ways what I have is different below.
> My question is: Are there any improvements(features) others would like to
> see beyond what is avalible in the current CelFile.py?
> I saw some posts a month or so ago about checking for consistency in cell
> file, I think it was something about making sure the stated number of probes
> was consistent with the intensity measurements.
>
> What is different,
> when an file is read Affycel.read('file') many atributes are set. for
> example
> a = affcel()
> a.read('testfile')
> a.filename,
> a.version,
> a.header.items() ?# a dictionary of all header items
> a.num_intensity
> a.intensity
> a.num_masks
> a.masks
> a.num_outliers
> a.outliers
> a.numb_modified
> a.modified
>
> I plan to add the ability return/call intensity values with our with
> outliers or mask values.
> All data is currently store in numpy structured arrays,
> currently a.intensity returns the structured array, but I plan on making it
> an option to easily choose how this is returned.
> also what to make an optional normalized intensity array so that if the data
> is normalized it can be stored with the affycel instance. My use case was
> that I was opening about 80 cel files and reading them in was slow. this
> allowed me to read each file as an instance of affycel stored in a list that
> I then pickled. It was then much faster to open them.
>
> Are improvements to the CelFile.py are of value to biopython?
>
> I hope to have the code pushed up to my fork on github late tonight. Just
> thought I would ask if there was any suggestion before I did.
>
> Also have an CDF file reader, but only have done some basic testing. I don't
> have a lot of use for this, do other biopython users?
>
> I am kinda working in a vacuum and am trying to get more involved in
> projects to improve my skills and knowledge. Any suggestions would be
> appreciated.

Just out of curiosity, is your work based on the affy sdk, or are you
parsing stuff yourself?

Sean


From vincent at vincentdavis.net  Thu Apr  8 15:03:38 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Thu, 8 Apr 2010 13:03:38 -0600
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <j2h264855a01004081156ge3d1a9d3nd8e1e7aa156c5a5a@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
	<j2h264855a01004081156ge3d1a9d3nd8e1e7aa156c5a5a@mail.gmail.com>
Message-ID: <q2r77e831101004081203i932807f2n513bf1de5708725e@mail.gmail.com>

Parsing it myself, But based directly an the affy documentation found here.
http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:

> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis <vincent at vincentdavis.net>
> wrote:
> > I ended up writing my own modules for reading both affy Cel and CDF
> files.
> > Long story as to why I did not just use what was available in biopython.
> > I plan on making what I have done available to the biopython and will
> upload
> > it as a fork. I will outline what ways what I have is different below.
> > My question is: Are there any improvements(features) others would like to
> > see beyond what is avalible in the current CelFile.py?
> > I saw some posts a month or so ago about checking for consistency in cell
> > file, I think it was something about making sure the stated number of
> probes
> > was consistent with the intensity measurements.
> >
> > What is different,
> > when an file is read Affycel.read('file') many atributes are set. for
> > example
> > a = affcel()
> > a.read('testfile')
> > a.filename,
> > a.version,
> > a.header.items()  # a dictionary of all header items
> > a.num_intensity
> > a.intensity
> > a.num_masks
> > a.masks
> > a.num_outliers
> > a.outliers
> > a.numb_modified
> > a.modified
> >
> > I plan to add the ability return/call intensity values with our with
> > outliers or mask values.
> > All data is currently store in numpy structured arrays,
> > currently a.intensity returns the structured array, but I plan on making
> it
> > an option to easily choose how this is returned.
> > also what to make an optional normalized intensity array so that if the
> data
> > is normalized it can be stored with the affycel instance. My use case was
> > that I was opening about 80 cel files and reading them in was slow. this
> > allowed me to read each file as an instance of affycel stored in a list
> that
> > I then pickled. It was then much faster to open them.
> >
> > Are improvements to the CelFile.py are of value to biopython?
> >
> > I hope to have the code pushed up to my fork on github late tonight. Just
> > thought I would ask if there was any suggestion before I did.
> >
> > Also have an CDF file reader, but only have done some basic testing. I
> don't
> > have a lot of use for this, do other biopython users?
> >
> > I am kinda working in a vacuum and am trying to get more involved in
> > projects to improve my skills and knowledge. Any suggestions would be
> > appreciated.
>
> Just out of curiosity, is your work based on the affy sdk, or are you
> parsing stuff yourself?
>
> Sean
>

From sdavis2 at mail.nih.gov  Thu Apr  8 15:40:01 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 8 Apr 2010 15:40:01 -0400
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <q2r77e831101004081203i932807f2n513bf1de5708725e@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
	<j2h264855a01004081156ge3d1a9d3nd8e1e7aa156c5a5a@mail.gmail.com>
	<q2r77e831101004081203i932807f2n513bf1de5708725e@mail.gmail.com>
Message-ID: <i2t264855a01004081240lf48e1c42md7e67849260a51cd@mail.gmail.com>

On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> Parsing it myself, But based directly an the affy documentation found here.
> http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/

So, are you covering both binary and text formats for .CEL files?  I
think that modern .CEL files (those produced by GCOS) are binary and
represent the majority of .CEL files produced today.  Some of the I/O
issues that you discuss are almost definitely dealt with by using the
binary .CEL files.

I'm certainly not an expert on Affy, so take all these
questions/comments with a grain of salt.

Sean


> On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
>> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis <vincent at vincentdavis.net>
>> wrote:
>> > I ended up writing my own modules for reading both affy Cel and CDF
>> files.
>> > Long story as to why I did not just use what was available in biopython.
>> > I plan on making what I have done available to the biopython and will
>> upload
>> > it as a fork. I will outline what ways what I have is different below.
>> > My question is: Are there any improvements(features) others would like to
>> > see beyond what is avalible in the current CelFile.py?
>> > I saw some posts a month or so ago about checking for consistency in cell
>> > file, I think it was something about making sure the stated number of
>> probes
>> > was consistent with the intensity measurements.
>> >
>> > What is different,
>> > when an file is read Affycel.read('file') many atributes are set. for
>> > example
>> > a = affcel()
>> > a.read('testfile')
>> > a.filename,
>> > a.version,
>> > a.header.items() ?# a dictionary of all header items
>> > a.num_intensity
>> > a.intensity
>> > a.num_masks
>> > a.masks
>> > a.num_outliers
>> > a.outliers
>> > a.numb_modified
>> > a.modified
>> >
>> > I plan to add the ability return/call intensity values with our with
>> > outliers or mask values.
>> > All data is currently store in numpy structured arrays,
>> > currently a.intensity returns the structured array, but I plan on making
>> it
>> > an option to easily choose how this is returned.
>> > also what to make an optional normalized intensity array so that if the
>> data
>> > is normalized it can be stored with the affycel instance. My use case was
>> > that I was opening about 80 cel files and reading them in was slow. this
>> > allowed me to read each file as an instance of affycel stored in a list
>> that
>> > I then pickled. It was then much faster to open them.
>> >
>> > Are improvements to the CelFile.py are of value to biopython?
>> >
>> > I hope to have the code pushed up to my fork on github late tonight. Just
>> > thought I would ask if there was any suggestion before I did.
>> >
>> > Also have an CDF file reader, but only have done some basic testing. I
>> don't
>> > have a lot of use for this, do other biopython users?
>> >
>> > I am kinda working in a vacuum and am trying to get more involved in
>> > projects to improve my skills and knowledge. Any suggestions would be
>> > appreciated.
>>
>> Just out of curiosity, is your work based on the affy sdk, or are you
>> parsing stuff yourself?
>>
>> Sean
>>
> _______________________________________________
> Biopython mailing list ?- ?Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From vincent at vincentdavis.net  Thu Apr  8 15:43:57 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Thu, 8 Apr 2010 13:43:57 -0600
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <i2t264855a01004081240lf48e1c42md7e67849260a51cd@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
	<j2h264855a01004081156ge3d1a9d3nd8e1e7aa156c5a5a@mail.gmail.com>
	<q2r77e831101004081203i932807f2n513bf1de5708725e@mail.gmail.com>
	<i2t264855a01004081240lf48e1c42md7e67849260a51cd@mail.gmail.com>
Message-ID: <k2m77e831101004081243v46a53399y408f1db30c3b6115@mail.gmail.com>

No I was not reading the binary files. That said I am interested in perusing
that if there is interest.
Do you have a link to the SDK?

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Thu, Apr 8, 2010 at 1:40 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:

> On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis <vincent at vincentdavis.net>
> wrote:
> > Parsing it myself, But based directly an the affy documentation found
> here.
> >
> http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/
>
> So, are you covering both binary and text formats for .CEL files?  I
> think that modern .CEL files (those produced by GCOS) are binary and
> represent the majority of .CEL files produced today.  Some of the I/O
> issues that you discuss are almost definitely dealt with by using the
> binary .CEL files.
>
> I'm certainly not an expert on Affy, so take all these
> questions/comments with a grain of salt.
>
> Sean
>
>
> > On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis <sdavis2 at mail.nih.gov>
> wrote:
> >
> >> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis <vincent at vincentdavis.net
> >
> >> wrote:
> >> > I ended up writing my own modules for reading both affy Cel and CDF
> >> files.
> >> > Long story as to why I did not just use what was available in
> biopython.
> >> > I plan on making what I have done available to the biopython and will
> >> upload
> >> > it as a fork. I will outline what ways what I have is different below.
> >> > My question is: Are there any improvements(features) others would like
> to
> >> > see beyond what is avalible in the current CelFile.py?
> >> > I saw some posts a month or so ago about checking for consistency in
> cell
> >> > file, I think it was something about making sure the stated number of
> >> probes
> >> > was consistent with the intensity measurements.
> >> >
> >> > What is different,
> >> > when an file is read Affycel.read('file') many atributes are set. for
> >> > example
> >> > a = affcel()
> >> > a.read('testfile')
> >> > a.filename,
> >> > a.version,
> >> > a.header.items()  # a dictionary of all header items
> >> > a.num_intensity
> >> > a.intensity
> >> > a.num_masks
> >> > a.masks
> >> > a.num_outliers
> >> > a.outliers
> >> > a.numb_modified
> >> > a.modified
> >> >
> >> > I plan to add the ability return/call intensity values with our with
> >> > outliers or mask values.
> >> > All data is currently store in numpy structured arrays,
> >> > currently a.intensity returns the structured array, but I plan on
> making
> >> it
> >> > an option to easily choose how this is returned.
> >> > also what to make an optional normalized intensity array so that if
> the
> >> data
> >> > is normalized it can be stored with the affycel instance. My use case
> was
> >> > that I was opening about 80 cel files and reading them in was slow.
> this
> >> > allowed me to read each file as an instance of affycel stored in a
> list
> >> that
> >> > I then pickled. It was then much faster to open them.
> >> >
> >> > Are improvements to the CelFile.py are of value to biopython?
> >> >
> >> > I hope to have the code pushed up to my fork on github late tonight.
> Just
> >> > thought I would ask if there was any suggestion before I did.
> >> >
> >> > Also have an CDF file reader, but only have done some basic testing. I
> >> don't
> >> > have a lot of use for this, do other biopython users?
> >> >
> >> > I am kinda working in a vacuum and am trying to get more involved in
> >> > projects to improve my skills and knowledge. Any suggestions would be
> >> > appreciated.
> >>
> >> Just out of curiosity, is your work based on the affy sdk, or are you
> >> parsing stuff yourself?
> >>
> >> Sean
> >>
> > _______________________________________________
> > Biopython mailing list  -  Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
>

From vincent at vincentdavis.net  Thu Apr  8 16:21:32 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Thu, 8 Apr 2010 14:21:32 -0600
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
Message-ID: <m2h77e831101004081321xc3fef5efh86e4db4e61607406@mail.gmail.com>

Maybe I should have started this discussion differently.
Is there any need for improvements to the ability to read CEL files or CDF
files and if so what are they? I am interested in  contributing.

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Thu, Apr 8, 2010 at 12:33 PM, Vincent Davis <vincent at vincentdavis.net>wrote:

> I ended up writing my own modules for reading both affy Cel and CDF files.
> Long story as to why I did not just use what was available in biopython.
> I plan on making what I have done available to the biopython and will
> upload it as a fork. I will outline what ways what I have is different
> below.
> My question is: Are there any improvements(features) others would like to
> see beyond what is avalible in the current CelFile.py?
> I saw some posts a month or so ago about checking for consistency in cell
> file, I think it was something about making sure the stated number of probes
> was consistent with the intensity measurements.
>
> What is different,
> when an file is read Affycel.read('file') many atributes are set. for
> example
> a = affcel()
> a.read('testfile')
> a.filename,
> a.version,
> a.header.items()  # a dictionary of all header items
> a.num_intensity
> a.intensity
> a.num_masks
> a.masks
> a.num_outliers
> a.outliers
>  a.numb_modified
> a.modified
>
> I plan to add the ability return/call intensity values with our with
> outliers or mask values.
> All data is currently store in numpy structured arrays,
> currently a.intensity returns the structured array, but I plan on making it
> an option to easily choose how this is returned.
> also what to make an optional normalized intensity array so that if the
> data is normalized it can be stored with the affycel instance. My use case
> was that I was opening about 80 cel files and reading them in was slow. this
> allowed me to read each file as an instance of affycel stored in a list that
> I then pickled. It was then much faster to open them.
>
> Are improvements to the CelFile.py are of value to biopython?
>
> I hope to have the code pushed up to my fork on github late tonight. Just
> thought I would ask if there was any suggestion before I did.
>
> Also have an CDF file reader, but only have done some basic testing. I
> don't have a lot of use for this, do other biopython users?
>
> I am kinda working in a vacuum and am trying to get more involved in
> projects to improve my skills and knowledge. Any suggestions would be
> appreciated.
>
>   *Vincent Davis
> 720-301-3003 *
> vincent at vincentdavis.net
>  my blog <http://vincentdavis.net> | LinkedIn<http://www.linkedin.com/in/vincentdavis>
>
>

From sdavis2 at mail.nih.gov  Thu Apr  8 18:31:43 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 8 Apr 2010 18:31:43 -0400
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <k2m77e831101004081243v46a53399y408f1db30c3b6115@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
	<j2h264855a01004081156ge3d1a9d3nd8e1e7aa156c5a5a@mail.gmail.com>
	<q2r77e831101004081203i932807f2n513bf1de5708725e@mail.gmail.com>
	<i2t264855a01004081240lf48e1c42md7e67849260a51cd@mail.gmail.com>
	<k2m77e831101004081243v46a53399y408f1db30c3b6115@mail.gmail.com>
Message-ID: <s2g264855a01004081531sc04a7630xf3fc08a0032c0d43@mail.gmail.com>

On Thu, Apr 8, 2010 at 3:43 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> No I was not reading the binary files. That said I am interested in perusing
> that if there is interest.
> Do you have a link to the SDK?

I believe this will get you close:

http://www.affymetrix.com/partners_programs/programs/developer/fusion/index.affx?terms=no

I hope my questions are not taken the wrong way, but I have learned
from the bioconductor project that dealing with vendor file formats is
often a non-trivial pursuit.  It isn't always easy to think of all the
edge cases.

Sean


> ?*Vincent Davis
> 720-301-3003 *
> vincent at vincentdavis.net
> ?my blog <http://vincentdavis.net> |
> LinkedIn<http://www.linkedin.com/in/vincentdavis>
>
>
> On Thu, Apr 8, 2010 at 1:40 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
>> On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis <vincent at vincentdavis.net>
>> wrote:
>> > Parsing it myself, But based directly an the affy documentation found
>> here.
>> >
>> http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/
>>
>> So, are you covering both binary and text formats for .CEL files? ?I
>> think that modern .CEL files (those produced by GCOS) are binary and
>> represent the majority of .CEL files produced today. ?Some of the I/O
>> issues that you discuss are almost definitely dealt with by using the
>> binary .CEL files.
>>
>> I'm certainly not an expert on Affy, so take all these
>> questions/comments with a grain of salt.
>>
>> Sean
>>
>>
>> > On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis <sdavis2 at mail.nih.gov>
>> wrote:
>> >
>> >> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis <vincent at vincentdavis.net
>> >
>> >> wrote:
>> >> > I ended up writing my own modules for reading both affy Cel and CDF
>> >> files.
>> >> > Long story as to why I did not just use what was available in
>> biopython.
>> >> > I plan on making what I have done available to the biopython and will
>> >> upload
>> >> > it as a fork. I will outline what ways what I have is different below.
>> >> > My question is: Are there any improvements(features) others would like
>> to
>> >> > see beyond what is avalible in the current CelFile.py?
>> >> > I saw some posts a month or so ago about checking for consistency in
>> cell
>> >> > file, I think it was something about making sure the stated number of
>> >> probes
>> >> > was consistent with the intensity measurements.
>> >> >
>> >> > What is different,
>> >> > when an file is read Affycel.read('file') many atributes are set. for
>> >> > example
>> >> > a = affcel()
>> >> > a.read('testfile')
>> >> > a.filename,
>> >> > a.version,
>> >> > a.header.items() ?# a dictionary of all header items
>> >> > a.num_intensity
>> >> > a.intensity
>> >> > a.num_masks
>> >> > a.masks
>> >> > a.num_outliers
>> >> > a.outliers
>> >> > a.numb_modified
>> >> > a.modified
>> >> >
>> >> > I plan to add the ability return/call intensity values with our with
>> >> > outliers or mask values.
>> >> > All data is currently store in numpy structured arrays,
>> >> > currently a.intensity returns the structured array, but I plan on
>> making
>> >> it
>> >> > an option to easily choose how this is returned.
>> >> > also what to make an optional normalized intensity array so that if
>> the
>> >> data
>> >> > is normalized it can be stored with the affycel instance. My use case
>> was
>> >> > that I was opening about 80 cel files and reading them in was slow.
>> this
>> >> > allowed me to read each file as an instance of affycel stored in a
>> list
>> >> that
>> >> > I then pickled. It was then much faster to open them.
>> >> >
>> >> > Are improvements to the CelFile.py are of value to biopython?
>> >> >
>> >> > I hope to have the code pushed up to my fork on github late tonight.
>> Just
>> >> > thought I would ask if there was any suggestion before I did.
>> >> >
>> >> > Also have an CDF file reader, but only have done some basic testing. I
>> >> don't
>> >> > have a lot of use for this, do other biopython users?
>> >> >
>> >> > I am kinda working in a vacuum and am trying to get more involved in
>> >> > projects to improve my skills and knowledge. Any suggestions would be
>> >> > appreciated.
>> >>
>> >> Just out of curiosity, is your work based on the affy sdk, or are you
>> >> parsing stuff yourself?
>> >>
>> >> Sean
>> >>
>> > _______________________________________________
>> > Biopython mailing list ?- ?Biopython at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biopython
>> >
>>
> _______________________________________________
> Biopython mailing list ?- ?Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From reece at berkeley.edu  Thu Apr  8 19:38:10 2010
From: reece at berkeley.edu (Reece Hart)
Date: Thu, 08 Apr 2010 16:38:10 -0700
Subject: [Biopython] SeqIO.parse exception on Google App Engine
Message-ID: <4BBE68E2.2030803@berkeley.edu>

Hi-

I'm trying to fetch a Genbank record and parse it in the Google App Engine
environment. A command line version works fine, but when using exactly the
same code under Google App Engine, SeqIO throws throws the following
exception:
   ...
   File "/local/home/reece/tmp/demo1/Bio/GenBank/Scanner.py", line 746, 
in parse_footer
     self.line = self.line.rstrip(os.linesep)
AttributeError: 'module' object has no attribute 'linesep'

The environment:
- Ubuntu Lucid beta1
- Python 2.6.5
- Biopython 1.53
- GAE 1.3.2

Test case:
I put together a simple test case that retrieves a raw (text) Genbank
record using Bio.Entrez (efetch); this works in both environments.
Parsing that record works on the command line, but not under GAE.

- curl http://harts.net/reece/tmp/demo1.tgz | tar -xvzf-
- cd demo1
- update symlink ./Bio to a Biopython tree
eg$ ln -s /usr/share/pyshared/Bio Bio
My intent is to prepend Bio to sys.paths much the way I would expect this
to be deployed (i.e., without updating sys.path).


Command line test:
$ ./lookup
fetch_text:LOCUS       NM_004006              13993 bp    mRNA    
linear   PRI 25-MAR-2010
fetch_parse:NM_004006.2 / NM_004006 / Homo sapiens dystrophin (DMD), 
transcript variant Dp427m,

GAE test:
In the demo1 directory:
$ dev_appserver.py .
and, in another terminal:
$ curl http://localhost:8080/
You'll see the exception in the http reply and in the appserver log


Thanks for any help/advice/pointers,
Reece

P.S. I'm learning Python and GAE at the same time, so silly errors are
possible (nay, likely).


From chapmanb at 50mail.com  Thu Apr  8 21:19:45 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 8 Apr 2010 21:19:45 -0400
Subject: [Biopython] SeqIO.parse exception on Google App Engine
In-Reply-To: <4BBE68E2.2030803@berkeley.edu>
References: <4BBE68E2.2030803@berkeley.edu>
Message-ID: <20100409011945.GE2011@kunkel>

Hi Reece;

> I'm trying to fetch a Genbank record and parse it in the Google App Engine
> environment. A command line version works fine, but when using exactly the
> same code under Google App Engine, SeqIO throws throws the following
> exception:
>   ...
>   File "/local/home/reece/tmp/demo1/Bio/GenBank/Scanner.py", line
> 746, in parse_footer
>     self.line = self.line.rstrip(os.linesep)
> AttributeError: 'module' object has no attribute 'linesep'

The python on Google App Engine is a bit crippled and lacks some of
the functionality of a full python install. It looks like one issue
must be that os.linesep is not defined on GAE. A quick fix is to
modify this to "\n", or just do:

os.linesep = "\n"

at the top of the Scanner.py file.

It would be really useful if you were able to submit a patch or list
of areas where Biopython fails on app engine and we can think about
how to suitably modify the code base to work on GAE and still be
compatible with Windows.

I did a bit of work on this using Biopython in Google App Engine
last year; code is on GitHub here:

http://github.com/chapmanb/biosqlweb

that might be helpful as a starting place for other ideas.

Good luck and let us know how your GAE experience goes,
Brad

From reece at berkeley.edu  Thu Apr  8 22:34:48 2010
From: reece at berkeley.edu (Reece Hart)
Date: Thu, 08 Apr 2010 19:34:48 -0700
Subject: [Biopython] SeqIO.parse exception on Google App Engine
In-Reply-To: <20100409011945.GE2011@kunkel>
References: <4BBE68E2.2030803@berkeley.edu> <20100409011945.GE2011@kunkel>
Message-ID: <4BBE9248.2080502@berkeley.edu>

Hi Brad. Thanks for the quick reply.

On 04/08/2010 06:19 PM, Brad Chapman wrote:
> A quick fix is to
> modify this to "\n", or just do:
>
> os.linesep = "\n"
>
> at the top of the Scanner.py file.
>    
It turns out that this fix also works within the module that does the 
parse. To wit:
from Bio import SeqIO
os.linesep = '\n'
rec = SeqIO.parse(...)

> I did a bit of work on this using Biopython in Google App Engine
> last year; code is on GitHub here:
> http://github.com/chapmanb/biosqlweb
> that might be helpful as a starting place for other ideas.
>    
Yes, thank you for this. This is precisely where I started only a few 
days ago...

Cheers,
Reece

From reece at berkeley.edu  Fri Apr  9 00:46:36 2010
From: reece at berkeley.edu (Reece Hart)
Date: Thu, 08 Apr 2010 21:46:36 -0700
Subject: [Biopython] GenBank.Scanner use of os.linesep
Message-ID: <4BBEB12C.8030907@berkeley.edu>

Hi All-

I recently discovered that the GenBank parser doesn't work on Google App 
Engine because os.linesep is undefined (GenBank/Scanner.py:746):

    745    #            if self.line[-1] == "\n" : self.line = 
self.line[:-1]
    746                self.line = self.line.rstrip(os.linesep)
    747                misc_lines.append(self.line)

Defining os.linesep is sufficient to fix the problem (thanks to Brad 
Chapman).

It seems to me that this use of os.linesep is probably mistaken here. If 
the file comes from efetch, the line separator will be \n regardless of 
platform [1] and that is what should be used in rstrip. It's possible 
that the file might come from a dog-foresaken CRLF platform and 
therefore contain that line separator.

So, I humbly propose that 746 be changed to either rstrip('\n') or, 
perhaps, rstrip('\n\r'). Although the need for the latter is probably 
rare, I don't see that it costs anything to cover that case by adding \r.

I'm new to this community, so I don't know whether we now have ferocious 
debate about the merits of line terminators or, rather, I submit a lame 
one-liner patch against the git HEAD.

Thanks for Biopython.

Cheers,
Reece


[1] For reference, here's a web request that should be equivalent to the 
efetch. On line 5, 0a is LF is \n.
apt12j$ curl -s 
'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=238018044&rettype=gb' 
| hexdump -C | head
00000000  4c 4f 43 55 53 20 20 20  20 20 20 20 4e 4d 5f 30  |LOCUS       
NM_0|
00000010  30 34 30 30 36 20 20 20  20 20 20 20 20 20 20 20  
|04006           |
00000020  20 20 20 31 33 39 39 33  20 62 70 20 20 20 20 6d  |   13993 
bp    m|
00000030  52 4e 41 20 20 20 20 6c  69 6e 65 61 72 20 20 20  |RNA    
linear   |
00000040  50 52 49 20 32 35 2d 4d  41 52 2d 32 30 31 30 0a  |PRI 
25-MAR-2010.|
00000050  44 45 46 49 4e 49 54 49  4f 4e 20 20 48 6f 6d 6f  |DEFINITION  
Homo|

-- 
Reece Hart, Ph.D.
Chief Scientist, Genome Commons                http://genomecommons.org/
Center for Computational Biology               324G Stanley Hall
UC Berkeley / QB3                              Berkeley, CA 94720


From biopython at maubp.freeserve.co.uk  Fri Apr  9 04:54:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 Apr 2010 09:54:53 +0100
Subject: [Biopython] GenBank.Scanner use of os.linesep
In-Reply-To: <4BBEB12C.8030907@berkeley.edu>
References: <4BBEB12C.8030907@berkeley.edu>
Message-ID: <u2q320fb6e01004090154v83ff8badq743a4369cdf78bee@mail.gmail.com>

On Fri, Apr 9, 2010 at 5:46 AM, Reece Hart <reece at berkeley.edu> wrote:
> Hi All-
>
> I recently discovered that the GenBank parser doesn't work on Google App
> Engine because os.linesep is undefined (GenBank/Scanner.py:746):
>
> ? 745 ? ?# ? ? ? ? ? ?if self.line[-1] == "\n" : self.line = self.line[:-1]
> ? 746 ? ? ? ? ? ? ? ?self.line = self.line.rstrip(os.linesep)
> ? 747 ? ? ? ? ? ? ? ?misc_lines.append(self.line)
>
> Defining os.linesep is sufficient to fix the problem (thanks to Brad
> Chapman).
>
> It seems to me that this use of os.linesep is probably mistaken here.

I agree.

> If the
> file comes from efetch, the line separator will be \n regardless of platform
> [1] and that is what should be used in rstrip. It's possible that the file
> might come from a dog-foresaken CRLF platform and therefore contain that
> line separator.

I think it would break in a more common setting - passing a file on
Windows with CRLF, since Python will turn that into just \n.

> So, I humbly propose that 746 be changed to either rstrip('\n') or, perhaps,
> rstrip('\n\r'). Although the need for the latter is probably rare, I don't
> see that it costs anything to cover that case by adding \r.

A plain rstrip() would also work and get rid of any trailing whitespace.
I've checked that in.

> I'm new to this community, so I don't know whether we now have ferocious
> debate about the merits of line terminators or, rather, I submit a lame
> one-liner patch against the git HEAD.

For something this trivial, your verbal patch is fine. Would you like
to be added to the NEWS and CONTRIB file?

Peter


From biopython at maubp.freeserve.co.uk  Fri Apr  9 08:08:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 Apr 2010 13:08:03 +0100
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
Message-ID: <k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>

On Thu, Apr 8, 2010 at 9:04 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
>> Greetings All!
>>
>> It looks like line 364 of Bio.AlignIO.StockholmIO reads:
>>
>> seqs[id] += seq.replace(".","-")
>>
>> So when you load into memory alignments that mark gaps created to
>> allow alignment to inserts with ".", (such as PFam alignments or the
>> output of hmmer) that information is lost.
>>
>> I know there must be a good reason for this, but I am finding it a
>> problem on my end..
>>
>> -Bryan Lunt
>
> Hi Bryan,
>
> Yes, is it done deliberately. The dot is a problem - it has a quite
> specific meaning of "same as above" on other alignment file
> formats, while "-" is an almost universal shorthand for gap/insertion.
> Consider the use case of Stockholm to PHYLIP/FASTA/Clustal
> conversion.
>
> Have you got a sample output file we can use as a unit test or
> at least discuss? As I recall, on the PFAM alignments I looked
> at there was no data loss by doing the dot to dash mapping.

According to http://sonnhammer.sbc.su.se/Stockholm.html
>> Sequence letters may include any characters except
>> whitespace. Gaps may be indicated by "." or "-".

So a Stockholm file using a mixture of "." and "-" would be
valid but a bit odd. Why would anyone do that?

Peter

From cjfields at illinois.edu  Fri Apr  9 08:51:35 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 9 Apr 2010 07:51:35 -0500
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
Message-ID: <64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu>

On Apr 9, 2010, at 7:08 AM, Peter wrote:

> On Thu, Apr 8, 2010 at 9:04 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
>>> Greetings All!
>>> 
>>> It looks like line 364 of Bio.AlignIO.StockholmIO reads:
>>> 
>>> seqs[id] += seq.replace(".","-")
>>> 
>>> So when you load into memory alignments that mark gaps created to
>>> allow alignment to inserts with ".", (such as PFam alignments or the
>>> output of hmmer) that information is lost.
>>> 
>>> I know there must be a good reason for this, but I am finding it a
>>> problem on my end..
>>> 
>>> -Bryan Lunt
>> 
>> Hi Bryan,
>> 
>> Yes, is it done deliberately. The dot is a problem - it has a quite
>> specific meaning of "same as above" on other alignment file
>> formats, while "-" is an almost universal shorthand for gap/insertion.
>> Consider the use case of Stockholm to PHYLIP/FASTA/Clustal
>> conversion.
>> 
>> Have you got a sample output file we can use as a unit test or
>> at least discuss? As I recall, on the PFAM alignments I looked
>> at there was no data loss by doing the dot to dash mapping.
> 
> According to http://sonnhammer.sbc.su.se/Stockholm.html
>>> Sequence letters may include any characters except
>>> whitespace. Gaps may be indicated by "." or "-".
> 
> So a Stockholm file using a mixture of "." and "-" would be
> valid but a bit odd. Why would anyone do that?
> 
> Peter

Just curious, b/c this is a point of contention in BioPerl.  How does BioPython internally set what symbols correspond to residues/gaps/frameshifts/other?  BioPerl retains the original sequence but uses regexes for validation and methods that return symbol-related information (e.g. gap counts).  

(BTW, the contention here isn't that we use regexes, but that we set them globally).


chris


From biopython at maubp.freeserve.co.uk  Fri Apr  9 09:21:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 Apr 2010 14:21:03 +0100
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
	<64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu>
Message-ID: <l2u320fb6e01004090621m94ab4301g1a262340eafc2fe1@mail.gmail.com>

On Fri, Apr 9, 2010 at 1:51 PM, Chris Fields <cjfields at illinois.edu> wrote:
>
>
> Just curious, b/c this is a point of contention in BioPerl. ?How does BioPython
> internally set what symbols correspond to residues/gaps/frameshifts/other?
> BioPerl retains the original sequence but uses regexes for validation and
> methods that return symbol-related information (e.g. gap counts).
>
> (BTW, the contention here isn't that we use regexes, but that we set them globally).
>
> chris

Hi Chris,

The short answer is gaps are by default "-", and stop codons are "*", but
beyond that it would be down to user code to interpret odd symbols.

Our sequences have an alphabet object which can specify the letters (as
a set of expected characters), with explicit support for a single gap
character (usually "-"), and for proteins a single stop codon symbol (usually
"*"). This could in theory be extended to define other symbols too. The gap
char does get treated specially in some of the alignment code (e.g. for
calling a consensus), but I don't think we have anything built in regarding
frameshifts.

Peter


From biopython at maubp.freeserve.co.uk  Fri Apr  9 09:30:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 Apr 2010 14:30:55 +0100
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <alpine.DEB.1.10.1004091501460.3715@gramsci.bo.biodec.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
	<alpine.DEB.1.10.1004091501460.3715@gramsci.bo.biodec.com>
Message-ID: <o2r320fb6e01004090630vfe4b740azb287ebf8b77fa702@mail.gmail.com>

On Fri, Apr 9, 2010 at 2:09 PM, Ivan Rossi <ivan at biodec.com> wrote:
>
> On Fri, 9 Apr 2010, Peter wrote:
>
>> So a Stockholm file using a mixture of "." and "-" would be
>> valid but a bit odd. Why would anyone do that?
>
> IIRC the "." are used for "gaps" at the extremes of sequences in a MSA. When
> you do local sequence alignments, like blast and most HMMs do, gaps at the
> extremes of sequences do not pay the usual penalty for gap opening. So in
> Stockholm format distinguishes between gaps for what you paid a price during
> the alignment ("-") and gaps-for-free (".") which are there just to pad each
> row to the MSA width.

So internal gaps (true gaps), versus leading or trailing padding. That makes
sense - and is certainly how PFAM does things according to their FAQ:

Quoting from http://pfam.sanger.ac.uk/help#tabview=tab3
>>> What is the difference between the - and . characters in your full alignments ?
>>>
>>> The '-' and '.' characters both represent gap characters. However they
>>> do tell you some extra information about how the HMM has generated
>>> the alignment. The '-' symbols are where the alignment of the sequence
>>> has used a delete state in the HMM to jump past a match state. This
>>> means that the sequence is missing a column that the HMM was
>>> expecting to be there. The '.' character is used to pad gaps where one
>>> sequence in the alignment has sequence from the HMMs insert state.
>>> See the alignment below where both characters are used. The HMM
>>> states emitting each column are shown. Note that residues emitted
>>> from the Insert (I) state are in lower case.

I wonder why doesn't this get mentioned anywhere on the format definitions:
http://sonnhammer.sbc.su.se/Stockholm.html
http://en.wikipedia.org/wiki/Stockholm_format

Peter

From cjfields at illinois.edu  Fri Apr  9 09:28:42 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 9 Apr 2010 08:28:42 -0500
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <l2u320fb6e01004090621m94ab4301g1a262340eafc2fe1@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
	<64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu>
	<l2u320fb6e01004090621m94ab4301g1a262340eafc2fe1@mail.gmail.com>
Message-ID: <9D6E3C31-B273-4B37-BFE8-8C951C025CBB@illinois.edu>


On Apr 9, 2010, at 8:21 AM, Peter wrote:

> On Fri, Apr 9, 2010 at 1:51 PM, Chris Fields <cjfields at illinois.edu> wrote:
>> 
>> 
>> Just curious, b/c this is a point of contention in BioPerl.  How does BioPython
>> internally set what symbols correspond to residues/gaps/frameshifts/other?
>> BioPerl retains the original sequence but uses regexes for validation and
>> methods that return symbol-related information (e.g. gap counts).
>> 
>> (BTW, the contention here isn't that we use regexes, but that we set them globally).
>> 
>> chris
> 
> Hi Chris,
> 
> The short answer is gaps are by default "-", and stop codons are "*", but
> beyond that it would be down to user code to interpret odd symbols.
> 
> Our sequences have an alphabet object which can specify the letters (as
> a set of expected characters), with explicit support for a single gap
> character (usually "-"), and for proteins a single stop codon symbol (usually
> "*"). This could in theory be extended to define other symbols too. The gap
> char does get treated specially in some of the alignment code (e.g. for
> calling a consensus), but I don't think we have anything built in regarding
> frameshifts.
> 
> Peter

Within LocatableSeq we define the following:

$GAP_SYMBOLS = '\-\.=~';
$FRAMESHIFT_SYMBOLS = '\\\/';
$OTHER_SYMBOLS = '\?';
$RESIDUE_SYMBOLS = '0-9A-Za-z\*';

Combined these can be used in a regex to validate sequence, or separately used for other purposes (counting gaps, frameshifts, etc.).  The OTHER_SYMBOLS is rally a catch-all for anything residue-like (counted in the sequence).  All of these can be redefined, but currently that's global, so it can have consequences in rare cases when mixing sequences from different formats.  We may localize them to work around that (part of GSoC project for alignment reimplementation).

We had a Symbol class at one point but I believe it was considered too 'heavy,' though this may be more a consequence of Perl's hammered-on OO.  

chris


From reece at berkeley.edu  Fri Apr  9 11:18:36 2010
From: reece at berkeley.edu (Reece Hart)
Date: Fri, 09 Apr 2010 08:18:36 -0700
Subject: [Biopython] GenBank.Scanner use of os.linesep
In-Reply-To: <u2q320fb6e01004090154v83ff8badq743a4369cdf78bee@mail.gmail.com>
References: <4BBEB12C.8030907@berkeley.edu>
	<u2q320fb6e01004090154v83ff8badq743a4369cdf78bee@mail.gmail.com>
Message-ID: <4BBF454C.4020502@berkeley.edu>

Peter-

> A plain rstrip() would also work and get rid of any trailing whitespace.
> I've checked that in.
> For something this trivial, your verbal patch is fine. Would you like
> to be added to the NEWS and CONTRIB file?
>    

Thanks for making this change so quickly. Please don't bother with the 
NEWS and CONTRIB file changes.

Cheers,
Reece


From davidpkilgore at gmail.com  Fri Apr  9 11:44:12 2010
From: davidpkilgore at gmail.com (Kizzo Kilgore)
Date: Fri, 9 Apr 2010 08:44:12 -0700
Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore
Message-ID: <v2j6e4c2e9e1004090844mdbf57d9cg57073975c6894023@mail.gmail.com>

Hello

I just wanted to introduce myself to the Biopython project/community,
and my intentions for participating as a student in this year's
Google's Summer of Code.  I have posted a rough draft of my proposal
to the GSOC applications site for mentors to see.  It is not complete
but I am currently working on it, so as to make final improvements
before the deadline.  I haven't had time (due to school/work) to fix
any of the bugs in the bug tracking system that has been pointed to
before, but please no that I am no stranger to source code, and that I
will make a great addition to the Biopython community after the
summer.  Please leave me feedback either by shooting me an email or
leaving a message in the GSOC applications site.  Also, be sure to
check out my website shown in the proposal for additional
qualifications.  Thank you.

-- 
Kizzo

From lunt at ctbp.ucsd.edu  Fri Apr  9 11:55:31 2010
From: lunt at ctbp.ucsd.edu (Bryan Lunt)
Date: Fri, 9 Apr 2010 08:55:31 -0700
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
Message-ID: <i2lb34be8bd1004090855id9b805d2iaf5d5b0da560f9fe@mail.gmail.com>

Hello Peter,
The HMMER suit of tools, and the Pfam website use "-" to indicate that
an HMM visited a deletion state, and "." to indicate that the HMM on a
different sequence visited an insertion state, and this gap is just
added to maintain alignment.


>foo
AA...BBB---CCC
>bar
AAbazBBBDDDCCC

In this example, the sequence "foo" doesn't have the DDD section of
the profile HMM,
the second sequence has not only the full model, but also contains an
insert, "baz" that is not part of the HMM, for example, an extra-long
loop.

I hope this helps...
-Bryan

On Fri, Apr 9, 2010 at 5:08 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Apr 8, 2010 at 9:04 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
>>> Greetings All!
>>>
>>> It looks like line 364 of Bio.AlignIO.StockholmIO reads:
>>>
>>> seqs[id] += seq.replace(".","-")
>>>
>>> So when you load into memory alignments that mark gaps created to
>>> allow alignment to inserts with ".", (such as PFam alignments or the
>>> output of hmmer) that information is lost.
>>>
>>> I know there must be a good reason for this, but I am finding it a
>>> problem on my end..
>>>
>>> -Bryan Lunt
>>
>> Hi Bryan,
>>
>> Yes, is it done deliberately. The dot is a problem - it has a quite
>> specific meaning of "same as above" on other alignment file
>> formats, while "-" is an almost universal shorthand for gap/insertion.
>> Consider the use case of Stockholm to PHYLIP/FASTA/Clustal
>> conversion.
>>
>> Have you got a sample output file we can use as a unit test or
>> at least discuss? As I recall, on the PFAM alignments I looked
>> at there was no data loss by doing the dot to dash mapping.
>
> According to http://sonnhammer.sbc.su.se/Stockholm.html
>>> Sequence letters may include any characters except
>>> whitespace. Gaps may be indicated by "." or "-".
>
> So a Stockholm file using a mixture of "." and "-" would be
> valid but a bit odd. Why would anyone do that?
>
> Peter
>

From biopython at maubp.freeserve.co.uk  Fri Apr  9 12:09:16 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 Apr 2010 17:09:16 +0100
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <i2lb34be8bd1004090855id9b805d2iaf5d5b0da560f9fe@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
	<i2lb34be8bd1004090855id9b805d2iaf5d5b0da560f9fe@mail.gmail.com>
Message-ID: <l2r320fb6e01004090909r2a13fd17qf81b397a9df7bef@mail.gmail.com>

Hi Bryan,

On Fri, Apr 9, 2010 at 4:55 PM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
>
> Hello Peter,
> The HMMER suit of tools, and the Pfam website use "-" to indicate that
> an HMM visited a deletion state, and "." to indicate that the HMM on a
> different sequence visited an insertion state, and this gap is just
> added to maintain alignment.
>
>>foo
> AA...BBB---CCC
>>bar
> AAbazBBBDDDCCC
>
> In this example, the sequence "foo" doesn't have the DDD section of
> the profile HMM,
> the second sequence has not only the full model, but also contains an
> insert, "baz" that is not part of the HMM, for example, an extra-long
> loop.
>
> I hope this helps...
> -Bryan

Yes, it does. I think this HMMER/PFAM convention should be noted
on the definition of the Stockholm format - that might have prevented
this problem in Biopython since none of the examples I'd looked at
when writing the parser had this behaviour. Note your example is
more subtle than the different between internal gaps and leading or
trailing padding described by Ivan earlier:
http://lists.open-bio.org/pipermail/biopython/2010-April/006396.html

Could you point out a suitable (small) example from PFAM we can
use for a unit test, or email me an example (off list)?

Now, as to how to deal with this: We could extend the Biopython
Alphabet objects to explicitly support multiple types of gaps (the
current setup only really copes with a single gap character). Using
this information we could handle some special cases like Stockholm
to PHYLIP would require merging either gap onto a dash. This
doesn't sound that straight forward though.

Or, we can avoid explicit declarations about the sequence (just
ignore the Biopython Alphabet object capabilities and use one
of the generic alphabets), and leave the problem in the hands of
the end user. This is bound to cause some unpleasant surprises
one day, but might be the best solution.

Peter

From chapmanb at 50mail.com  Fri Apr  9 16:21:32 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 9 Apr 2010 16:21:32 -0400
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <m2h77e831101004081321xc3fef5efh86e4db4e61607406@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
	<m2h77e831101004081321xc3fef5efh86e4db4e61607406@mail.gmail.com>
Message-ID: <20100409202132.GA20004@sobchak.mgh.harvard.edu>

Vincent;
Thanks for the work on the Affy Cel/CDF parsers. I don't know
anything at all about the formats so can't help much with the
technical questions, but wanted to help with a few more general 
points you raise.

> > I ended up writing my own modules for reading both affy Cel and CDF files.

This and the following discussion are a bit hard to follow. When I
read through this thread I wasn't sure exactly what improvements
you've made, how they affect back compatibility of the code, and how
they help make the parser better going forward.

A lot of this work is very specialized, so you are trying to catch the
attention of the few people who know enough to help. If you can organize
your code and e-mail in a way that makes it easy for them to comment
and contribute, you'll increase the number of valuable responses you
receive.

It's an under appreciated skill, but very valuable for grabbing busy
people's attention and getting feedback.

> > Are improvements to the CelFile.py are of value to biopython?

Absolutely.

> Is there any need for improvements to the ability to read CEL files or CDF
> files and if so what are they? I am interested in  contributing.

Yes. Make it faster, more complete, easier to use.

There are general answers you can apply across the board. We definitely
are looking for contributions and happy to have you interested.

Brad

From chapmanb at 50mail.com  Fri Apr  9 16:39:12 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 9 Apr 2010 16:39:12 -0400
Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore
In-Reply-To: <v2j6e4c2e9e1004090844mdbf57d9cg57073975c6894023@mail.gmail.com>
References: <v2j6e4c2e9e1004090844mdbf57d9cg57073975c6894023@mail.gmail.com>
Message-ID: <20100409203912.GB20004@sobchak.mgh.harvard.edu>

Kizzo;

> I just wanted to introduce myself to the Biopython project/community,
> and my intentions for participating as a student in this year's
> Google's Summer of Code.  I have posted a rough draft of my proposal
> to the GSOC applications site for mentors to see.  

Glad you are interested in this and thanks for getting together a
proposal. I wish you would have dropped us a line a bit earlier as
we would have been happy to help with getting the application
together.

> It is not complete
> but I am currently working on it, so as to make final improvements
> before the deadline.  I haven't had time (due to school/work) to fix
> any of the bugs in the bug tracking system that has been pointed to
> before, but please no that I am no stranger to source code, and that I
> will make a great addition to the Biopython community after the
> summer.  

Great. I noticed that you worked on GSoC with OpenCog last year. Is
this the most recent code base from that work?

https://code.launchpad.net/~kizzobot/opencog/python-bindings

Have you still been involved with that community after the work? Did
they decide not to do GSoC this year?

Thanks again,
Brad

From davidpkilgore at gmail.com  Fri Apr  9 16:52:57 2010
From: davidpkilgore at gmail.com (Kizzo Kilgore)
Date: Fri, 9 Apr 2010 13:52:57 -0700
Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore
In-Reply-To: <20100409203912.GB20004@sobchak.mgh.harvard.edu>
References: <v2j6e4c2e9e1004090844mdbf57d9cg57073975c6894023@mail.gmail.com>
	<20100409203912.GB20004@sobchak.mgh.harvard.edu>
Message-ID: <j2o6e4c2e9e1004091352x41b07219w565680434f7db1bd@mail.gmail.com>

On Fri, Apr 9, 2010 at 1:39 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Kizzo;
>
>> I just wanted to introduce myself to the Biopython project/community,
>> and my intentions for participating as a student in this year's
>> Google's Summer of Code. ?I have posted a rough draft of my proposal
>> to the GSOC applications site for mentors to see.
>
> Glad you are interested in this and thanks for getting together a
> proposal. I wish you would have dropped us a line a bit earlier as
> we would have been happy to help with getting the application
> together.
>
>> It is not complete
>> but I am currently working on it, so as to make final improvements
>> before the deadline. ?I haven't had time (due to school/work) to fix
>> any of the bugs in the bug tracking system that has been pointed to
>> before, but please no that I am no stranger to source code, and that I
>> will make a great addition to the Biopython community after the
>> summer.
>
> Great. I noticed that you worked on GSoC with OpenCog last year. Is
> this the most recent code base from that work?
>
> https://code.launchpad.net/~kizzobot/opencog/python-bindings
>

The core developers merged my bindings in with the main branch a long
time ago, and yes that's the most recent codebase from that work.

> Have you still been involved with that community after the work? Did
> they decide not to do GSoC this year?
>

Oh yes, I'm still a regular on their IRC channel and mailing lists.
OpenCog is closer to my passion, and I already had 2 proposals for
OpenCog this summer ready, but unfortunately the project didn't get
accepted for GSoC this year.  I plan to work more with OpenCog as a
potential PhD project, so am still am involved with OpenCog.

> Thanks again,
> Brad
>


-- 
Kizzo


From vincent at vincentdavis.net  Sat Apr 10 01:43:06 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Fri, 9 Apr 2010 23:43:06 -0600
Subject: [Biopython] Bio.Application now subprocess?
Message-ID: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>

I was considering writing a module for using the command line Affymetrix
Power Tools Software
LINK<http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx>
Mostly
to convert between CEL file types but there are lots of other features
<http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx>If
I read correctly will be replaced using subprocess. Are there any modules
currently using subprcess rather than Bio.Application?
Anything I should know but don't (as if you know what I know) or consider

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>

From biopython at maubp.freeserve.co.uk  Sat Apr 10 06:28:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 10 Apr 2010 11:28:19 +0100
Subject: [Biopython] Bio.Application now subprocess?
In-Reply-To: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>
References: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>
Message-ID: <j2n320fb6e01004100328o673cba08k69119bb8e83735c4@mail.gmail.com>

On Sat, Apr 10, 2010 at 6:43 AM, Vincent Davis <vincent at vincentdavis.net> wrote:
> I was considering writing a module for using the command line Affymetrix
> Power Tools Software
> LINK<http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx>
> Mostly
> to convert between CEL file types but there are lots of other features
> <http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx>If
> I read correctly will be replaced using subprocess. Are there any modules
> currently using subprcess rather than Bio.Application?
> Anything I should know but don't (as if you know what I know) or consider

Hi Vincent,

The idea is to use a Bio.Application based wrapper to build a command
line string, and invoke that with the subprocess module (i.e. use BOTH).
The tutorial has several examples of this (e.g. alignment tools and BLAST).

What have you been reading that makes you think Bio.Application is
being replaced with subprocess? We should probably clarify it.

Peter

From vincent at vincentdavis.net  Sat Apr 10 09:12:34 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Sat, 10 Apr 2010 07:12:34 -0600
Subject: [Biopython] Bio.Application now subprocess?
In-Reply-To: <j2n320fb6e01004100328o673cba08k69119bb8e83735c4@mail.gmail.com>
References: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>
	<j2n320fb6e01004100328o673cba08k69119bb8e83735c4@mail.gmail.com>
Message-ID: <m2w77e831101004100612mf230ae52tef9579e57b1147af@mail.gmail.com>

Let me say it was late at night when I started reading thorough this and I
am very new to it so....
The first function defines in Bio/Applications.py
def generic_run(commandline):
"""Run an application with the given commandline (DEPRECATED)......We now
recommend you invoke subprocess directly, using
str(commandline)............."""

The second
class ApplicationResult:
""""""Make results of a program available through a standard interface
(DEPRECATED)................."""

I think these should be moved tp the bottom if possible maybe below a
comment section that indicates the item below are or are going to be
deprecated.

The last line in
class AbstractCommandline(object):
""".......................
You would typically run the command line via a standard Python operating
    system call (e.g. using the subprocess module)."""

I started to read though this example but thought I would read more about
subprocess module, At this point it is not clear to me what bio/Applications
is doing for me. subprocess seems simple. But I  have a lot to learn and I
assume that if I start by getting basic functionality with subprocess then
it will make more sence

One of the parts that is not clear to me is for example in Emboss
class WaterCommandline(_EmbossCommandLine):
..........
self.parameters = \
         [_Option(["-asequence","asequence"], ["input", "file"], None,
1, "First sequence to align")

Not really sure where the parts to the _option line are documented, I assume
in the ...for p in parameters:......
Just not clear, I guess I need to study it more.


  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Sat, Apr 10, 2010 at 4:28 AM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Sat, Apr 10, 2010 at 6:43 AM, Vincent Davis <vincent at vincentdavis.net>
> wrote:
> > I was considering writing a module for using the command line Affymetrix
> > Power Tools Software
> > LINK<
> http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx
> >
> > Mostly
> > to convert between CEL file types but there are lots of other features
> > <
> http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx
> >If
> > I read correctly will be replaced using subprocess. Are there any modules
> > currently using subprcess rather than Bio.Application?
> > Anything I should know but don't (as if you know what I know) or consider
>
> Hi Vincent,
>
> The idea is to use a Bio.Application based wrapper to build a command
> line string, and invoke that with the subprocess module (i.e. use BOTH).
> The tutorial has several examples of this (e.g. alignment tools and BLAST).
>
> What have you been reading that makes you think Bio.Application is
> being replaced with subprocess? We should probably clarify it.
>
> Peter
>

From biopython at maubp.freeserve.co.uk  Sat Apr 10 09:58:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 10 Apr 2010 14:58:28 +0100
Subject: [Biopython] Bio.Application now subprocess?
In-Reply-To: <m2w77e831101004100612mf230ae52tef9579e57b1147af@mail.gmail.com>
References: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>
	<j2n320fb6e01004100328o673cba08k69119bb8e83735c4@mail.gmail.com>
	<m2w77e831101004100612mf230ae52tef9579e57b1147af@mail.gmail.com>
Message-ID: <l2x320fb6e01004100658r648bb34ax26daf2a6758632f8@mail.gmail.com>

On Sat, Apr 10, 2010 at 2:12 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> Let me say it was late at night when I started reading thorough this and I
> am very new to it so....
> The first function defines in Bio/Applications.py
> def generic_run(commandline):

OK, so you are looking at the API docs and/or the code.
Bits of Bio/Applications.py are deprecated, and I think
you are right - we can try and make the status clearer.

Peter

From rodrigo_faccioli at uol.com.br  Sat Apr 10 13:23:19 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Sat, 10 Apr 2010 14:23:19 -0300
Subject: [Biopython] Bio.Application now subprocess?
Message-ID: <q2t3715adb71004101023g8dd9102ha64748311c15d9f1@mail.gmail.com>

I've developed a class for this proposed. It might help you. Please, see the
link below.

http://github.com/rodrigofaccioli/PythonStudies/blob/master/src/FcfrpExecuteProgram.py

Thanks,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218

From vincent at vincentdavis.net  Sat Apr 10 13:30:05 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Sat, 10 Apr 2010 11:30:05 -0600
Subject: [Biopython] Bio.Application now subprocess?
In-Reply-To: <q2t3715adb71004101023g8dd9102ha64748311c15d9f1@mail.gmail.com>
References: <q2t3715adb71004101023g8dd9102ha64748311c15d9f1@mail.gmail.com>
Message-ID: <z2z77e831101004101030p2680e3f7md91c83528f2b3f57@mail.gmail.com>

>
> On Sat, Apr 10, 2010 at 11:23 AM, Rodrigo Faccioli <
> rodrigo_faccioli at uol.com.br> wrote:
>
>> I've developed a class for this proposed. It might help you. Please, see
>> the
>> link below.
>
>
> http://github.com/rodrigofaccioli/PythonStudies/blob/master/src/FcfrpExecuteProgram.py
>>
>>
>
> Thanks, This might be a good place for me to start. Nit sure how this is
different than Bio/Applications.py other than it is much simpler from a
quick look.


  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Sat, Apr 10, 2010 at 11:23 AM, Rodrigo Faccioli <
rodrigo_faccioli at uol.com.br> wrote:

> I've developed a class for this proposed. It might help you. Please, see
> the
> link below.
>
>
> http://github.com/rodrigofaccioli/PythonStudies/blob/master/src/FcfrpExecuteProgram.py
>
> Thanks,
>
> --
> Rodrigo Antonio Faccioli
> Ph.D Student in Electrical Engineering
> University of Sao Paulo - USP
> Engineering School of Sao Carlos - EESC
> Department of Electrical Engineering - SEL
> Intelligent System in Structure Bioinformatics
> http://laips.sel.eesc.usp.br
> Phone: 55 (16) 3373-9366 Ext 229
> Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From biopython at maubp.freeserve.co.uk  Sat Apr 10 15:02:08 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 10 Apr 2010 20:02:08 +0100
Subject: [Biopython] Bio.Application now subprocess?
In-Reply-To: <l2x320fb6e01004100658r648bb34ax26daf2a6758632f8@mail.gmail.com>
References: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>
	<j2n320fb6e01004100328o673cba08k69119bb8e83735c4@mail.gmail.com>
	<m2w77e831101004100612mf230ae52tef9579e57b1147af@mail.gmail.com>
	<l2x320fb6e01004100658r648bb34ax26daf2a6758632f8@mail.gmail.com>
Message-ID: <m2t320fb6e01004101202yf8946bb3k609e0814b96b3cb8@mail.gmail.com>

On Sat, Apr 10, 2010 at 2:58 PM, Peter wrote:
>
> OK, so you are looking at the API docs and/or the code.
> Bits of Bio/Applications.py are deprecated, and I think
> you are right - we can try and make the status clearer.
>

Hi Vincent,

I updated that a bit, hopefully it is clearer that a typical
user doesn't need to look at Bio.Applications at all.
Rather you might use the alignment tool wrappers in
Bio.Align.Applications, or the EMBOSS wrappers in
Bio.Emboss.Applications (etc) which internally use
the classes defined in Bio.Applications.

The *only* reason you'd use Bio.Applications directly
now is to write a new command line tool wrapper.

[Historically you might have used the old generic_run
function in Bio.Applications, but that is deprecated now]

Peter

From biopython at maubp.freeserve.co.uk  Sat Apr 10 16:33:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 10 Apr 2010 21:33:57 +0100
Subject: [Biopython] Bio.Application now subprocess?
In-Reply-To: <1101855478758905131@unknownmsgid>
References: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>
	<j2n320fb6e01004100328o673cba08k69119bb8e83735c4@mail.gmail.com>
	<m2w77e831101004100612mf230ae52tef9579e57b1147af@mail.gmail.com>
	<l2x320fb6e01004100658r648bb34ax26daf2a6758632f8@mail.gmail.com>
	<m2t320fb6e01004101202yf8946bb3k609e0814b96b3cb8@mail.gmail.com>
	<1101855478758905131@unknownmsgid>
Message-ID: <t2i320fb6e01004101333zfb1b64c2i7bacb5b407fadaa@mail.gmail.com>

On Sat, Apr 10, 2010 at 8:27 PM, Vincent Davis wrote:
>
> So that was/is my plan to use it to writes command lone tools for the
> affymetrix apt dev commandline app. unless this is redundant in a way
> I am not aware of.
> Thanks

Ah - right, now this makes sense. Are you on the dev mailing
list (CC'd)? That would be a better place to ask. I'd start by
looking at Bio.Align.Applications (less subclasses there) as a
model.

Peter

From chapmanb at 50mail.com  Mon Apr 12 08:37:31 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 12 Apr 2010 08:37:31 -0400
Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore
In-Reply-To: <j2o6e4c2e9e1004091352x41b07219w565680434f7db1bd@mail.gmail.com>
References: <v2j6e4c2e9e1004090844mdbf57d9cg57073975c6894023@mail.gmail.com>
	<20100409203912.GB20004@sobchak.mgh.harvard.edu>
	<j2o6e4c2e9e1004091352x41b07219w565680434f7db1bd@mail.gmail.com>
Message-ID: <20100412123731.GJ20004@sobchak.mgh.harvard.edu>

Kizzo;

> > Have you still been involved with that community after the work? Did
> > they decide not to do GSoC this year?
> 
> Oh yes, I'm still a regular on their IRC channel and mailing lists.
> OpenCog is closer to my passion, and I already had 2 proposals for
> OpenCog this summer ready, but unfortunately the project didn't get
> accepted for GSoC this year.  I plan to work more with OpenCog as a
> potential PhD project, so am still am involved with OpenCog.

That's great to hear. One of the most important parts of GSoC for
myself and many mentors is the chance to get additional folks
involved in open source.

Reviews of the applications have started, and the main aspect which
would improve your proposal is to develop a specific project plan 
with detailed descriptions of week to week goals. For each week you 
should have:

- Description of the specific weekly goal. 
- Details on the PyCogent and Biopython code you expect to be working with
- Possible issues or areas of expansion you expect might impact the
  timeline
- Expected work on documentation and testing. You want to have this
  integrated throughout the proposal.

See the examples in the NESCent application documentation to get an
idea of the level of detail in accepted projects from previous years:

https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#When_you_apply
http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&single=true&gid=0&output=html

The content we'd like to see in the proposal is interconversion of
core object (Sequence, Alignment, Phylogeny) in the first half of
the summer, and applications of this interconversion to developing
biological workflows in the second half of the summer. Feel free to
be creative and pick work that is of interest to your studies.

Since you can't edit the proposal currently, please prepare this in
a publicly accessible Google Doc and provide a link from the public
comments so other mentors can view it.

Thanks,
Brad

From biopython at maubp.freeserve.co.uk  Mon Apr 12 09:35:44 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 12 Apr 2010 14:35:44 +0100
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <y2ob34be8bd1004091150j154f6444s3cf1a2731f190fe3@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
	<i2lb34be8bd1004090855id9b805d2iaf5d5b0da560f9fe@mail.gmail.com>
	<l2r320fb6e01004090909r2a13fd17qf81b397a9df7bef@mail.gmail.com>
	<y2ob34be8bd1004091150j154f6444s3cf1a2731f190fe3@mail.gmail.com>
Message-ID: <w2w320fb6e01004120635jad7fd9c3r9088db0f1da8138e@mail.gmail.com>

On Fri, Apr 9, 2010 at 7:50 PM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
> Hello Peter,
>
> Thanks for your help recently on this!
> I have here two files that I like to use as examples, because they are
> fairly small, (203 sequences)
>
> The Pfam page summarizing this family is :
> http://pfam.sanger.ac.uk/family/PF07750
>
> Cheers!
> -Bryan Lunt

I see what you mean - using that webpage to get the full alignment
(in any of the supported file formats) using the mixed gap option
(dot or dash) does show both symbols in a meaningful way.

Peter

From tiagoantao at gmail.com  Mon Apr 12 19:39:29 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 13 Apr 2010 00:39:29 +0100
Subject: [Biopython] ASN.1 and Entrez SNP
Message-ID: <m2o6d941f121004121639h33efb822ha9079afcad374615@mail.gmail.com>

Hi,

Just a simple question:
Entrez SNP seems to return ASN.1 format only.
Is there any way to parse this in biopython? I've looked at SeqIO and
found nothing...
I can think of tools to process this outside, but I am just curious if
this is processed natively with Biopython (being an exposed NCBI
format...)

Many thanks,
Tiago
PS - You can easily try this with:
hdl = Entrez.efetch(db="snp", id="3739022")
print hdl.read()

-- 
"If you want to get laid, go to college.  If you want an education, go
to the library." - Frank Zappa

From biopython at maubp.freeserve.co.uk  Tue Apr 13 04:22:42 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 13 Apr 2010 09:22:42 +0100
Subject: [Biopython] ASN.1 and Entrez SNP
In-Reply-To: <m2o6d941f121004121639h33efb822ha9079afcad374615@mail.gmail.com>
References: <m2o6d941f121004121639h33efb822ha9079afcad374615@mail.gmail.com>
Message-ID: <v2u320fb6e01004130122xd06aecccmec11066651ad7607@mail.gmail.com>

2010/4/13 Tiago Ant?o <tiagoantao at gmail.com>:
> Hi,
>
> Just a simple question:
> Entrez SNP seems to return ASN.1 format only.
> Is there any way to parse this in biopython? I've looked at SeqIO and
> found nothing...
> I can think of tools to process this outside, but I am just curious if
> this is processed natively with Biopython (being an exposed NCBI
> format...)
>
> Many thanks,
> Tiago
> PS - You can easily try this with:
> hdl = Entrez.efetch(db="snp", id="3739022")
> print hdl.read()

Hi Tiago,

No, we don't support ASN.1, and I don't see any good reason to - I
think it would only be NCBI ASN.1 we'd we interested in, and I think
that all their resources are available in other easier to use formats
like XML these days.

See also http://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One

Instead ask Entrez to give you the SNP data as XML:

Entrez.efetch(db="snp", id="3739022", retmode="xml")

Hopefully the SNP XML file has everything in it.

You have a choice of Python XML parsers to use. However, the
Bio.Entrez parser doesn't like this XML. This appears to be related
(or caused by) a known NCBI bug. See
http://bugzilla.open-bio.org/show_bug.cgi?id=2771

Peter


From bala.biophysics at gmail.com  Tue Apr 13 10:49:03 2010
From: bala.biophysics at gmail.com (Bala subramanian)
Date: Tue, 13 Apr 2010 16:49:03 +0200
Subject: [Biopython] removing redundant sequence
Message-ID: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>

Friends,
Sorry if this question was asked before. Is there any function in Biopython
that can remove redundant sequence records from a fasta file.

Thanks,
Bala

From biopython at maubp.freeserve.co.uk  Tue Apr 13 11:02:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 13 Apr 2010 16:02:52 +0100
Subject: [Biopython] removing redundant sequence
In-Reply-To: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
References: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
Message-ID: <t2j320fb6e01004130802u78271ebp8b48b32b488c6e2b@mail.gmail.com>

On Tue, Apr 13, 2010 at 3:49 PM, Bala subramanian
<bala.biophysics at gmail.com> wrote:
> Friends,
> Sorry if this question was asked before. Is there any function in Biopython
> that can remove redundant sequence records from a fasta file.
>
> Thanks,
> Bala

No, but you should be able to do this with Biopython - depending on
what exactly you are asking for.

When you say "redundant" do you mean 100% perfect identify?

How big is your FASTA file - are you working with next-gen sequencing
data and millions of reads?. If it is small enough you can keep all
the data in memory to compare sequences to each other. Otherwise
you might try using a checksum (e.g. SEGUID) to spot duplicates.

Peter

From schafer at rostlab.org  Tue Apr 13 11:08:31 2010
From: schafer at rostlab.org (=?ISO-8859-1?Q?Christian_Sch=E4fer?=)
Date: Tue, 13 Apr 2010 17:08:31 +0200
Subject: [Biopython] removing redundant sequence
In-Reply-To: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
References: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
Message-ID: <4BC488EF.3000505@rostlab.org>

Hey,

I think not. But you can use an external tool like cd-hit or uniqueprot 
and implement a wrapper function for that in your code.

Chris

On 04/13/2010 04:49 PM, Bala subramanian wrote:
> Friends,
> Sorry if this question was asked before. Is there any function in Biopython
> that can remove redundant sequence records from a fasta file.
>
> Thanks,
> Bala
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com  Thu Apr 15 11:03:02 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 15 Apr 2010 16:03:02 +0100
Subject: [Biopython] Draft abstract for BOSC 2010 Biopython Project Update
Message-ID: <i2g320fb6e01004150803s58e14fa6nef112d3b522e5dfe@mail.gmail.com>

Hi all,

I should have circulated this earlier, but here is a draft abstract
for a "Biopython Project Update" talk at BOSC 2010, to be submitted
*today*.
http://www.open-bio.org/wiki/BOSC_2010

I'm hoping to attend BOSC again this year and give the talk, but
haven't sorted out the finances - Brad has offered to present if I
can't go, hence the talk author list. If anyone else wants to help
with slides etc (or as a standby speaker) please let me know.

This is based on the abstract from last year, included in this PDF:
http://www.open-bio.org/w/images/c/c7/BOSC2009_program_20090601.pdf

In the PDF version of the abstract I've made the logo smaller this time ;)

Comments welcome,

Thanks,

Peter

--

Biopython Project Update
Peter Cock, Brad Chapman

In this talk we present the current status of the Biopython project
(www.biopython.org), described in a application note published last
year (Cock et al., 2009). Biopython celebrated its 10th Birthday last
year, and has now been cited or referred to in over 150 scientific
publications (a list is included on our website).

At the end of 2009, following an extended evaluation period, Biopython
successfully migrated from using CVS for source code control to using
git, hosted on github.com. This has helped our existing developers to
work and test new features on publicly viewable branches before being
merged, and has also encouraged new contributors to work on additions
or improvements. Currently about fifty people have their own Biopython
repository on GitHub.

In summer 2009 we had two Google Summer of Code (GSoC) project
students working on phylogenetic code for Biopython in conjunction
with the National Evolutionary Synthesis Center (NESCent). Eric
Talevich?s work on phylogenetic trees including phyloXML support (Han
and Zamesk, 2009) was merged and included with Biopython 1.54, and he
continues to be actively involved with Biopython. We hope to include
Nick Matzke?s module for biogeographical data from the Global
Biodiversity Information Facility (GBIF) later this year. For summer
2010 we have Biopython related GSoC projects submitted via both
NESCent and the Open Bioinformatics Foundation (OBF), and hope to have
students working on Biopython once again.

Since BOSC 2009, Biopython has seen four releases. Biopython 1.51
(August 2009) was an important milestone in dropping support for
Python 2.3 and our legacy parsing infra-structure (Martel/Mindy), but
was most noteworthy for FASTQ support (Cock et al., 2010). Biopython
1.52 (September 2009) introduced indexing of most sequence file
formats for random access, and made interconverting sequence and
alignment files easier. Biopython 1.53 (December 2009) included
wrappers for the new NCBI BLAST+ command line tools, and much improved
support for running under Jython. Our latest release is Biopython 1.54
(April/May 2010), new features include Bio.Phylo for phylogenetic
trees (GSoC project), and support for Standard Flowgram Format (SFF)
files used for 454 Life Sciences (Roche) sequencing.

Biopython is free open source software available from
www.biopython.org under the Biopython License Agreement (an MIT style
license, http://www.biopython.org/DIST/LICENSE).

References

Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke,
A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., de Hoon,
M.J. (2009) Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
doi:10.1093/bioinformatics/btp163

Han, M.V. and Zmasek, C.M. (2009) phyloXML: XML for evolutionary
biology and comparative genomics. BMC Bioinformatics 10:356.
doi:10.1186/1471-2105-10-356

Cock, P.J.A., Fields, C.J., Goto N., Heuer, M.L., and Rice, P.M.
(2010) The Sanger FASTQ file format for sequences with quality scores,
and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38(6)
1767-71. doi:10.1093/nar/gkp1137


From mok at bioxray.dk  Thu Apr 15 11:15:01 2010
From: mok at bioxray.dk (Morten Kjeldgaard)
Date: Thu, 15 Apr 2010 17:15:01 +0200
Subject: [Biopython] Entrez.efetch bug?
Message-ID: <4BC72D75.1040505@bioxray.dk>

Hi,

I am getting an error with Entrez.efetch() with Biopython version 1.51. This
is my handle:

handle = Entrez.efetch(db='protein', id='114391',rettype='gp')

When I subsequently do this:

 record = Entrez.read(handle)

I get a syntax error from Expat:

ExpatError: syntax error: line 1, column 0

However, if I do the following, it works:

record = handle.read()

but then I need to parse the resulting record using the Genbank parser,
which is a nuisance since I normally should get this for free from the
Entrez module.

Comments, anyone?


-- Morten

From biopython at maubp.freeserve.co.uk  Thu Apr 15 11:31:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Apr 2010 16:31:28 +0100
Subject: [Biopython] Entrez.efetch bug?
In-Reply-To: <4BC72D75.1040505@bioxray.dk>
References: <4BC72D75.1040505@bioxray.dk>
Message-ID: <x2p320fb6e01004150831g77f069e8kf46a84175c35c641@mail.gmail.com>

On Thu, Apr 15, 2010 at 4:15 PM, Morten Kjeldgaard <mok at bioxray.dk> wrote:
> Hi,
>
> I am getting an error with Entrez.efetch() with Biopython version 1.51. This
> is my handle:
>
> handle = Entrez.efetch(db='protein', id='114391',rettype='gp')
>

In the above, you've asked Entrez to give you a plain text GenPept file
(a protein GenBank file).

> When I subsequently do this:
>
> ?record = Entrez.read(handle)
>
> I get a syntax error from Expat:
>
> ExpatError: syntax error: line 1, column 0
>

The Bio.Entrez.read() and Bio.Entrez.parse() functions expect XML.

> However, if I do the following, it works:
>
> record = handle.read()

Well, yes, you get a big string stored as the variable record.

> but then I need to parse the resulting record using the Genbank parser,
> which is a nuisance since I normally should get this for free from the
> Entrez module.
>
> Comments, anyone?

Try this:

from Bio import Entrez
from Bio import SeqIO
handle = Entrez.efetch(db='protein', id='114391',rettype='gp')
record = SeqIO.read(handle, 'genbank')

Peter


From mok at bioxray.dk  Thu Apr 15 17:28:24 2010
From: mok at bioxray.dk (Morten Kjeldgaard)
Date: Thu, 15 Apr 2010 23:28:24 +0200
Subject: [Biopython] Entrez.efetch bug?
In-Reply-To: <x2p320fb6e01004150831g77f069e8kf46a84175c35c641@mail.gmail.com>
References: <4BC72D75.1040505@bioxray.dk>
	<x2p320fb6e01004150831g77f069e8kf46a84175c35c641@mail.gmail.com>
Message-ID: <26E933F7-D7D2-48EC-82B4-4B654403F177@bioxray.dk>


On 15/04/2010, at 17.31, Peter wrote:

> record = SeqIO.read(handle, 'genbank')

d'Oh!! :-) Thanks, just the hint I needed.

Cheers,
Morten


From davidpkilgore at gmail.com  Mon Apr 19 02:54:55 2010
From: davidpkilgore at gmail.com (Kizzo Kilgore)
Date: Sun, 18 Apr 2010 23:54:55 -0700
Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore
In-Reply-To: <20100412123731.GJ20004@sobchak.mgh.harvard.edu>
References: <v2j6e4c2e9e1004090844mdbf57d9cg57073975c6894023@mail.gmail.com>
	<20100409203912.GB20004@sobchak.mgh.harvard.edu>
	<j2o6e4c2e9e1004091352x41b07219w565680434f7db1bd@mail.gmail.com>
	<20100412123731.GJ20004@sobchak.mgh.harvard.edu>
Message-ID: <x2u6e4c2e9e1004182354i60852680m315e6ba9b6813119@mail.gmail.com>

I have taken the time to carefully look over the links and examples
you suggested, and came up with my own draft week by week plan for the
summer.  It is not perfect, or even complete, as I am in the closing
weeks of school and things are getting really busy, but I managed to
pull this together.  You can visit the following public Google Docs
link to get the Gnumeric spreadsheet of my timeline.  If you would
like me to, I will also convert it to some other format if you like
(and if I can), or I can attach a copy of the file itself (or post it
on my website) if for some reason the link does not work.  Thank you.

https://docs.google.com/leaf?id=0B4KRpw_6YxAjMzU3NDgxMWYtZGIxZi00YmY3LTk5MGQtNDlmMjYyYTRhN2M0&hl=en

On Mon, Apr 12, 2010 at 5:37 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Kizzo;
>
>> > Have you still been involved with that community after the work? Did
>> > they decide not to do GSoC this year?
>>
>> Oh yes, I'm still a regular on their IRC channel and mailing lists.
>> OpenCog is closer to my passion, and I already had 2 proposals for
>> OpenCog this summer ready, but unfortunately the project didn't get
>> accepted for GSoC this year. ?I plan to work more with OpenCog as a
>> potential PhD project, so am still am involved with OpenCog.
>
> That's great to hear. One of the most important parts of GSoC for
> myself and many mentors is the chance to get additional folks
> involved in open source.
>
> Reviews of the applications have started, and the main aspect which
> would improve your proposal is to develop a specific project plan
> with detailed descriptions of week to week goals. For each week you
> should have:
>
> - Description of the specific weekly goal.
> - Details on the PyCogent and Biopython code you expect to be working with
> - Possible issues or areas of expansion you expect might impact the
> ?timeline
> - Expected work on documentation and testing. You want to have this
> ?integrated throughout the proposal.
>
> See the examples in the NESCent application documentation to get an
> idea of the level of detail in accepted projects from previous years:
>
> https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#When_you_apply
> http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&single=true&gid=0&output=html
>
> The content we'd like to see in the proposal is interconversion of
> core object (Sequence, Alignment, Phylogeny) in the first half of
> the summer, and applications of this interconversion to developing
> biological workflows in the second half of the summer. Feel free to
> be creative and pick work that is of interest to your studies.
>
> Since you can't edit the proposal currently, please prepare this in
> a publicly accessible Google Doc and provide a link from the public
> comments so other mentors can view it.
>
> Thanks,
> Brad
>


-- 
Kizzo


From mjldehoon at yahoo.com  Mon Apr 19 03:08:04 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 19 Apr 2010 00:08:04 -0700 (PDT)
Subject: [Biopython] Fw: Entrez.efetch
In-Reply-To: <910794.43889.qm@web56207.mail.re3.yahoo.com>
Message-ID: <870000.56671.qm@web62402.mail.re1.yahoo.com>

> I sent the mail to the biopython at biopython.org
> but it was not delivered.

It will be delivered if you subscribe to the mailing list.

--- On Mon, 4/19/10, olumide olufuwa <ludax5 at yahoo.com> wrote:

> From: olumide olufuwa <ludax5 at yahoo.com>
> Subject: Fw: [Biopython]Entrez.efetch
> To: biopython-owner at lists.open-bio.org
> Cc: "Biopython mailing list" <biopython at biopython.org>
> Date: Monday, April 19, 2010, 2:50 AM
> 
> 
> Hello Michel,
> I sent the mail to the biopython at biopython.org
> but it was not delivered. I have edited the message. 
> 
> 
> The code that 
> accepts UNIPROT ID, retrieves the record using
> Entrez.efetch and then it
>  parsed to obtain the Pubmed ID which i use to search
> Medline for the 
> Title, Abstract and other information about the entry. 
> The code:
> 
> query_id=str(raw_input("please
> 
> 
>  enter your UNIPROT_ID: ")) #Request UNIPROT ID from user
> Entrez.email="ludax5 at yahoo.com"
> prothandle=Entrez.efetch(db="protein",
> 
> 
>  id=query_id, rettype="gb" #queries Protein DB with the
> given ID
> #The
>  program returns an error here if a wrong ID is given.
> Details of the 
> error is given below
> seq_record=SeqIO.read(prothandle, "gb")
> for
> 
>  record in seq_record.annotations['references']: # To
> obtain Pubmed id 
> from the seqrecord
> ?? key_word=record.pubmed_id
> ?? if key_word:
> ????
>  handle=Entrez.efetch(db="pubmed",
> 
>  id=key_word, rettype="medline")
> ???? 
> medRecords=Medline.parse(handle)
> ???? for rec in medRecords: #prints 
> title and Abstract
> ???????? if rec.has_key('AB') and 
> rec.has_key('TI'):
> ?????????? print "TITLE: ",rec['TI']
> ??????????
>  print "ABSTRACT: ",rec['AB']
> ?????????? print ' '
> 
> 
> THE 
> PROBLEM: The program gives an error if a wrong ID is
> entered or an ID 
> other than UNIPROT ID e.g PDB ID, GSS ID etc. 
> 
> 
> 
> An Example Run:
> 
> 
> please enter your UNIPROT_ID:
>  1wio #A PDB ID is given instead
> 
> 
> Traceback (most recent call last):
> ? File "file.py", line 11, in 
> <module>
> ??? seq_record=SeqIO.read(prothandle, "gb")
> ? File 
> "/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.py",
> line 522, in 
> read
> ??? raise ValueError("No records found in handle")
> ValueError:
> 
>  No records found in handle
> 
> I want to avoid this error, thus i 
> want the program to print "INCORRECT ID GIVEN"? when a
> wrong or an 
> incorrect ID is given.
> 
> 
> Thanks a lot.
> lummy
> 
> 
> 
> 


From olumideolufuwa at yahoo.com  Mon Apr 19 03:30:24 2010
From: olumideolufuwa at yahoo.com (Olumide Olufuwa)
Date: Mon, 19 Apr 2010 00:30:24 -0700 (PDT)
Subject: [Biopython] Entrez.efetch
In-Reply-To: <mailman.0.1271661752.7278.biopython@lists.open-bio.org>
Message-ID: <221701.32474.qm@web45106.mail.sp1.yahoo.com>


Hello there,
?
I wrote a program, I am not awesome in biopython but this is what it does: The program code that 
accepts user defined UNIPROT ID, retrieves the record using Entrez.efetch and then it
 is parsed to obtain the Pubmed ID which i use to search Medline for Title, Abstract and other information about the entry. 
The code is simply:

query_id=str(raw_input("please


 enter your UNIPROT_ID: ")) #Request UNIPROT ID from user
Entrez.email="ludax5 at yahoo.com"
prothandle=Entrez.efetch(db="protein",


 id=query_id, rettype="gb" #queries Protein DB with the given ID
#The
 program returns an error here if a wrong ID is given. Details of the 
error is given below
seq_record=SeqIO.read(prothandle, "gb")
for

 record in seq_record.annotations['references']: # To obtain Pubmed id 
from the seqrecord
?? key_word=record.pubmed_id
?? if key_word:
????

 handle=Entrez.efetch(db="pubmed",

 id=key_word, rettype="medline")
???? 
medRecords=Medline.parse(handle)
???? for rec in medRecords: #prints 
title and Abstract
???????? if rec.has_key('AB') and 
rec.has_key('TI'):
?????????? print "TITLE: ",rec['TI']
??????????
 print "ABSTRACT: ",rec['AB']
?????????? print ' '


THE 
PROBLEM: The program gives an error if a wrong ID is entered or an ID 
other than UNIPROT ID e.g PDB ID, GSS ID etc. 


An Example Run with a wrong ID is shown below:


please enter your UNIPROT_ID:
 1wio #A PDB ID is given instead


Traceback (most recent call last):
? File "file.py", line 11, in 
<module>
??? seq_record=SeqIO.read(prothandle, "gb")
? File 
"/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.py", line 522, in 
read
??? raise ValueError("No records found in handle")
ValueError:


 No records found in handle

I want to avoid this error, thus i 
want the program to print "INCORRECT ID GIVEN"? when a wrong or an 
incorrect ID is given.


Thanks a lot.
lummy


From mjldehoon at yahoo.com  Mon Apr 19 03:45:59 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 19 Apr 2010 00:45:59 -0700 (PDT)
Subject: [Biopython] Entrez.efetch
In-Reply-To: <221701.32474.qm@web45106.mail.sp1.yahoo.com>
Message-ID: <902706.80063.qm@web62402.mail.re1.yahoo.com>

Put a try:/except: block around the call to SeqIO.read, as in:

try:
    seq_record=SeqIO.read(prothandle, "gb")
except ValueError:
    print "INCORRECT ID GIVEN"


--Michiel

--- On Mon, 4/19/10, Olumide Olufuwa <olumideolufuwa at yahoo.com> wrote:

> From: Olumide Olufuwa <olumideolufuwa at yahoo.com>
> Subject: [Biopython] Entrez.efetch
> To: biopython at lists.open-bio.org
> Date: Monday, April 19, 2010, 3:30 AM
> 
> Hello there,
> ?
> I wrote a program, I am not awesome in biopython but this
> is what it does: The program code that 
> accepts user defined UNIPROT ID, retrieves the record using
> Entrez.efetch and then it
>  is parsed to obtain the Pubmed ID which i use to search
> Medline for Title, Abstract and other information about the
> entry. 
> The code is simply:
> 
> query_id=str(raw_input("please
> 
> 
> 
>  enter your UNIPROT_ID: ")) #Request UNIPROT ID from user
> Entrez.email="ludax5 at yahoo.com"
> prothandle=Entrez.efetch(db="protein",
> 
> 
> 
>  id=query_id, rettype="gb" #queries Protein DB with the
> given ID
> #The
>  program returns an error here if a wrong ID is given.
> Details of the 
> error is given below
> seq_record=SeqIO.read(prothandle, "gb")
> for
> 
>  record in seq_record.annotations['references']: # To
> obtain Pubmed id 
> from the seqrecord
> ?? key_word=record.pubmed_id
> ?? if key_word:
> ????
> 
>  handle=Entrez.efetch(db="pubmed",
> 
>  id=key_word, rettype="medline")
> ???? 
> medRecords=Medline.parse(handle)
> ???? for rec in medRecords: #prints 
> title and Abstract
> ???????? if rec.has_key('AB') and 
> rec.has_key('TI'):
> ?????????? print "TITLE: ",rec['TI']
> ??????????
>  print "ABSTRACT: ",rec['AB']
> ?????????? print ' '
> 
> 
> THE 
> PROBLEM: The program gives an error if a wrong ID is
> entered or an ID 
> other than UNIPROT ID e.g PDB ID, GSS ID etc. 
> 
> 
> 
> An Example Run with a wrong ID is shown below:
> 
> 
> please enter your UNIPROT_ID:
>  1wio #A PDB ID is given instead
> 
> 
> Traceback (most recent call last):
> ? File "file.py", line 11, in 
> <module>
> ??? seq_record=SeqIO.read(prothandle, "gb")
> ? File 
> "/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.py",
> line 522, in 
> read
> ??? raise ValueError("No records found in handle")
> ValueError:
> 
> 
>  No records found in handle
> 
> I want to avoid this error, thus i 
> want the program to print "INCORRECT ID GIVEN"? when a
> wrong or an 
> incorrect ID is given.
> 
> 
> Thanks a lot.
> lummy
> 
> 
> 
> 
> ? ? ? 
> _______________________________________________
> Biopython mailing list? -? Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
> 


From fkauff at biologie.uni-kl.de  Tue Apr 20 10:27:30 2010
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Tue, 20 Apr 2010 16:27:30 +0200
Subject: [Biopython] Code for protein alpha helix prediction
Message-ID: <4BCDB9D2.4050207@biologie.uni-kl.de>

Hi all,

I've recently been asked to help with screening protein sequences for 
certain features, something I don't really know much about... Yet!

My questions: Is there some code in Biopython that allows for a quick 
check whether an amino acid sequece is likely to be a alpha helix? 
Couldn't find any. Or is there an algorithm that could be 
straightforwardly implemented in python, or a commandline tool that 
could be called from within a python script?

Thanks in advance,
Frank

From rodrigo_faccioli at uol.com.br  Tue Apr 20 11:34:47 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Tue, 20 Apr 2010 12:34:47 -0300
Subject: [Biopython] Code for protein alpha helix prediction
In-Reply-To: <4BCDB9D2.4050207@biologie.uni-kl.de>
References: <4BCDB9D2.4050207@biologie.uni-kl.de>
Message-ID: <j2q3715adb71004200834l263718c1u3a3590a4750198a7@mail.gmail.com>

Hi Frank,

I'm not sure if I understood your question. I'm computer scientist and I'm
researching globular protein structure prediction. In fact, I've studied the
application of Evolutionary Algorithms for it. Therefore, our goals are
different.

if I understood your question, you have a Fasta file of your protein.  So,
you need to communicate with databases such as NCBI, scop and CATH. In this
way, I recommend you use Entrez BioPython module. Other suggestion is the
use of BioPython Blast module.

Sorry if my answer is not what you is looking for.

Thanks,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218


On Tue, Apr 20, 2010 at 11:27 AM, Frank Kauff <fkauff at biologie.uni-kl.de>wrote:

> Hi all,
>
> I've recently been asked to help with screening protein sequences for
> certain features, something I don't really know much about... Yet!
>
> My questions: Is there some code in Biopython that allows for a quick check
> whether an amino acid sequece is likely to be a alpha helix? Couldn't find
> any. Or is there an algorithm that could be straightforwardly implemented in
> python, or a commandline tool that could be called from within a python
> script?
>
> Thanks in advance,
> Frank
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From biopython at maubp.freeserve.co.uk  Tue Apr 20 11:43:02 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Apr 2010 16:43:02 +0100
Subject: [Biopython] Code for protein alpha helix prediction
In-Reply-To: <4BCDB9D2.4050207@biologie.uni-kl.de>
References: <4BCDB9D2.4050207@biologie.uni-kl.de>
Message-ID: <g2t320fb6e01004200843h7940bdb6w30b7ebe392759ba9@mail.gmail.com>

On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff <fkauff at biologie.uni-kl.de> wrote:
> Hi all,
>
> I've recently been asked to help with screening protein sequences for
> certain features, something I don't really know much about... Yet!
>
> My questions: Is there some code in Biopython that allows for a quick check
> whether an amino acid sequece is likely to be a alpha helix? Couldn't find
> any. Or is there an algorithm that could be straightforwardly implemented in
> python, or a commandline tool that could be called from within a python
> script?

Hi Frank,

There are lots of tools for predicting secondary structure (alpha helices,
beta sheets etc) both de novo, and guided by reference sequences with
known structures. Some of these are online web services.

I'm pretty sure there is nothing for this built into Biopython, so for scripting
this for a large number of sequences then (as you have also suggested),
my first approach would be to look for command line tools which you could
call from Python. I've never needed to do this myself, and have no specific
recommendations regarding which tools to try first.

If you do find some useful algorithms which could easily be implemented
in Python, they could be worth including - maybe under Bio.SeqUtils?

Peter

From darnells at dnastar.com  Tue Apr 20 14:16:22 2010
From: darnells at dnastar.com (Steve Darnell)
Date: Tue, 20 Apr 2010 13:16:22 -0500
Subject: [Biopython] Code for protein alpha helix prediction
In-Reply-To: <g2t320fb6e01004200843h7940bdb6w30b7ebe392759ba9@mail.gmail.com>
References: <4BCDB9D2.4050207@biologie.uni-kl.de>
	<g2t320fb6e01004200843h7940bdb6w30b7ebe392759ba9@mail.gmail.com>
Message-ID: <A4009967D1886D4286A9B7931FD58610021BE668@FS1.dnastar.com>

Frank,

One of the most accurate (and popular) algorithms is PSIPRED.  A
stand-alone command line version is available:
http://bioinfadmin.cs.ucl.ac.uk/downloads/psipred/

If memory serves, it requires a local installation of blast and the nr
database.  A position weight matrix generated from PSI-BLAST acts as
input to a neural network, which makes the secondary structure
predictions.

The Rosetta Design group had a poll last year of people's favorite
tools.  There are plenty of others to try if PSIPRED doesn't meet your
needs.

http://rosettadesigngroup.com/blog/456/fairest-secondary-structure-predi
ction-algorithm/

I am not a PSIPRED developer, just a satisfied user.

Regards,
Steve

-----Original Message-----
From: biopython-bounces at lists.open-bio.org
[mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter
Sent: Tuesday, April 20, 2010 10:43 AM
To: Frank Kauff
Cc: BioPython Mailing List
Subject: Re: [Biopython] Code for protein alpha helix prediction

On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff <fkauff at biologie.uni-kl.de>
wrote:
> Hi all,
>
> I've recently been asked to help with screening protein sequences for
> certain features, something I don't really know much about... Yet!
>
> My questions: Is there some code in Biopython that allows for a quick
check
> whether an amino acid sequece is likely to be a alpha helix? Couldn't
find
> any. Or is there an algorithm that could be straightforwardly
implemented in
> python, or a commandline tool that could be called from within a
python
> script?

Hi Frank,

There are lots of tools for predicting secondary structure (alpha
helices,
beta sheets etc) both de novo, and guided by reference sequences with
known structures. Some of these are online web services.

I'm pretty sure there is nothing for this built into Biopython, so for
scripting
this for a large number of sequences then (as you have also suggested),
my first approach would be to look for command line tools which you
could
call from Python. I've never needed to do this myself, and have no
specific
recommendations regarding which tools to try first.

If you do find some useful algorithms which could easily be implemented
in Python, they could be worth including - maybe under Bio.SeqUtils?

Peter
_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython


From fkauff at biologie.uni-kl.de  Wed Apr 21 07:50:30 2010
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Wed, 21 Apr 2010 13:50:30 +0200
Subject: [Biopython] Code for protein alpha helix prediction
In-Reply-To: <A4009967D1886D4286A9B7931FD58610021BE668@FS1.dnastar.com>
References: <4BCDB9D2.4050207@biologie.uni-kl.de>
	<g2t320fb6e01004200843h7940bdb6w30b7ebe392759ba9@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610021BE668@FS1.dnastar.com>
Message-ID: <4BCEE686.3080803@biologie.uni-kl.de>

Thanks everybody!

Now I have plenty of tools to look at - the standalone version of 
psipred certainly fulfills the easy-to-use and quick-to-try-out 
requirements.

Frank

On 04/20/2010 08:16 PM, Steve Darnell wrote:
> Frank,
>
> One of the most accurate (and popular) algorithms is PSIPRED.  A
> stand-alone command line version is available:
> http://bioinfadmin.cs.ucl.ac.uk/downloads/psipred/
>
> If memory serves, it requires a local installation of blast and the nr
> database.  A position weight matrix generated from PSI-BLAST acts as
> input to a neural network, which makes the secondary structure
> predictions.
>
> The Rosetta Design group had a poll last year of people's favorite
> tools.  There are plenty of others to try if PSIPRED doesn't meet your
> needs.
>
> http://rosettadesigngroup.com/blog/456/fairest-secondary-structure-predi
> ction-algorithm/
>
> I am not a PSIPRED developer, just a satisfied user.
>
> Regards,
> Steve
>
> -----Original Message-----
> From: biopython-bounces at lists.open-bio.org
> [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter
> Sent: Tuesday, April 20, 2010 10:43 AM
> To: Frank Kauff
> Cc: BioPython Mailing List
> Subject: Re: [Biopython] Code for protein alpha helix prediction
>
> On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff<fkauff at biologie.uni-kl.de>
> wrote:
>    
>> Hi all,
>>
>> I've recently been asked to help with screening protein sequences for
>> certain features, something I don't really know much about... Yet!
>>
>> My questions: Is there some code in Biopython that allows for a quick
>>      
> check
>    
>> whether an amino acid sequece is likely to be a alpha helix? Couldn't
>>      
> find
>    
>> any. Or is there an algorithm that could be straightforwardly
>>      
> implemented in
>    
>> python, or a commandline tool that could be called from within a
>>      
> python
>    
>> script?
>>      
> Hi Frank,
>
> There are lots of tools for predicting secondary structure (alpha
> helices,
> beta sheets etc) both de novo, and guided by reference sequences with
> known structures. Some of these are online web services.
>
> I'm pretty sure there is nothing for this built into Biopython, so for
> scripting
> this for a large number of sequences then (as you have also suggested),
> my first approach would be to look for command line tools which you
> could
> call from Python. I've never needed to do this myself, and have no
> specific
> recommendations regarding which tools to try first.
>
> If you do find some useful algorithms which could easily be implemented
> in Python, they could be worth including - maybe under Bio.SeqUtils?
>
> Peter
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>    

From fkauff at biologie.uni-kl.de  Wed Apr 21 07:59:31 2010
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Wed, 21 Apr 2010 13:59:31 +0200
Subject: [Biopython] Code for protein alpha helix prediction
In-Reply-To: <g2t320fb6e01004200843h7940bdb6w30b7ebe392759ba9@mail.gmail.com>
References: <4BCDB9D2.4050207@biologie.uni-kl.de>
	<g2t320fb6e01004200843h7940bdb6w30b7ebe392759ba9@mail.gmail.com>
Message-ID: <4BCEE8A3.3010008@biologie.uni-kl.de>

Hi Peter,

for the start, it seems psipred is the easiest one to use and to 
implement. I'll start with that, and once the parser for the output goes 
beyond the quick-and-dirty level, we can think about including it.

Frank

On 04/20/2010 05:43 PM, Peter wrote:
> On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff<fkauff at biologie.uni-kl.de>  wrote:
>    
>> Hi all,
>>
>> I've recently been asked to help with screening protein sequences for
>> certain features, something I don't really know much about... Yet!
>>
>> My questions: Is there some code in Biopython that allows for a quick check
>> whether an amino acid sequece is likely to be a alpha helix? Couldn't find
>> any. Or is there an algorithm that could be straightforwardly implemented in
>> python, or a commandline tool that could be called from within a python
>> script?
>>      
> Hi Frank,
>
> There are lots of tools for predicting secondary structure (alpha helices,
> beta sheets etc) both de novo, and guided by reference sequences with
> known structures. Some of these are online web services.
>
> I'm pretty sure there is nothing for this built into Biopython, so for scripting
> this for a large number of sequences then (as you have also suggested),
> my first approach would be to look for command line tools which you could
> call from Python. I've never needed to do this myself, and have no specific
> recommendations regarding which tools to try first.
>
> If you do find some useful algorithms which could easily be implemented
> in Python, they could be worth including - maybe under Bio.SeqUtils?
>
> Peter
>    


From bala.biophysics at gmail.com  Wed Apr 21 10:25:35 2010
From: bala.biophysics at gmail.com (Bala subramanian)
Date: Wed, 21 Apr 2010 16:25:35 +0200
Subject: [Biopython] removing redundant sequence
In-Reply-To: <t2j320fb6e01004130802u78271ebp8b48b32b488c6e2b@mail.gmail.com>
References: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
	<t2j320fb6e01004130802u78271ebp8b48b32b488c6e2b@mail.gmail.com>
Message-ID: <r2o288df32a1004210725yeb09e79fs57d7c3352da025cb@mail.gmail.com>

Peter,
Sorry for the delayed reply. Yes i want to remove those sequences that are
100% identical but they have different identifier. I created a sample fasta
file with two redundant sequences. But when i use checksums seguid to spot
the redundancies, it spots only the first one.

In [36]: for record in SeqIO.parse(open('t'),'fasta'):
   ....:     print record.id, seguid(record.seq)
   ....:
   ....:
A04321 44lpJ2F4Eb74aKigVa5Sut/J0M8
*AF02161a asaPdDgrYXwwJItOY/wlQFGTmGw
AF02161b asaPdDgrYXwwJItOY/wlQFGTmGw*
AF021618 JvRNzgmeXDBbA9SL5+OQaH2V/zA
AF021622 JvRNzgmeXDBbA9SL5+OQaH2V/zA
AF021627 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
AF021628 2GT4z2fXZdv9f51ng74C8o0rQXM
AF021629 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
*AF02163a fOKCIiGvk6NaPDYY6oKx74tvcxY
AF02163b fOKCIiGvk6NaPDYY6oKx74tvcxY
*
In [37]: hivdict=SeqIO.to_dict(SeqIO.parse(open('t'),'fasta'),lambda
rec:seguid(rec.seq))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/home/cbala/test/<ipython console> in <module>()

/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.pyc in
to_dict(sequences, key_function)
    585         key = key_function(record)
    586         if key in d :
--> 587             raise ValueError("Duplicate key '%s'" % key)
    588         d[key] = record
    589     return d

ValueError: Duplicate key 'asaPdDgrYXwwJItOY/wlQFGTmGw'


On Tue, Apr 13, 2010 at 5:02 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Tue, Apr 13, 2010 at 3:49 PM, Bala subramanian
> <bala.biophysics at gmail.com> wrote:
> > Friends,
> > Sorry if this question was asked before. Is there any function in
> Biopython
> > that can remove redundant sequence records from a fasta file.
> >
> > Thanks,
> > Bala
>
> No, but you should be able to do this with Biopython - depending on
> what exactly you are asking for.
>
> When you say "redundant" do you mean 100% perfect identify?
>
> How big is your FASTA file - are you working with next-gen sequencing
> data and millions of reads?. If it is small enough you can keep all
> the data in memory to compare sequences to each other. Otherwise
> you might try using a checksum (e.g. SEGUID) to spot duplicates.
>
> Peter
>

From biopython at maubp.freeserve.co.uk  Wed Apr 21 11:10:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Apr 2010 16:10:45 +0100
Subject: [Biopython] removing redundant sequence
In-Reply-To: <r2o288df32a1004210725yeb09e79fs57d7c3352da025cb@mail.gmail.com>
References: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
	<t2j320fb6e01004130802u78271ebp8b48b32b488c6e2b@mail.gmail.com>
	<r2o288df32a1004210725yeb09e79fs57d7c3352da025cb@mail.gmail.com>
Message-ID: <u2n320fb6e01004210810ve1c9a2f8qa09d0970c23e2062@mail.gmail.com>

On Wed, Apr 21, 2010 at 3:25 PM, Bala subramanian
<bala.biophysics at gmail.com> wrote:
> Peter,
> Sorry for the delayed reply. Yes i want to remove those sequences that are
> 100% identical but they have different identifier. I created a sample fasta
> file with two redundant sequences. But when i use checksums seguid to spot
> the redundancies, it spots only the first one.
>
> In [36]: for record in SeqIO.parse(open('t'),'fasta'):
> ? ....: ? ? print record.id, seguid(record.seq)
> ? ....:
> ? ....:
> A04321 44lpJ2F4Eb74aKigVa5Sut/J0M8
> *AF02161a asaPdDgrYXwwJItOY/wlQFGTmGw
> AF02161b asaPdDgrYXwwJItOY/wlQFGTmGw*
> AF021618 JvRNzgmeXDBbA9SL5+OQaH2V/zA
> AF021622 JvRNzgmeXDBbA9SL5+OQaH2V/zA
> AF021627 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
> AF021628 2GT4z2fXZdv9f51ng74C8o0rQXM
> AF021629 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
> *AF02163a fOKCIiGvk6NaPDYY6oKx74tvcxY
> AF02163b fOKCIiGvk6NaPDYY6oKx74tvcxY
> *
> In [37]: hivdict=SeqIO.to_dict(SeqIO.parse(open('t'),'fasta'),lambda
> rec:seguid(rec.seq))
> ---------------------------------------------------------------------------
> ValueError ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Traceback (most recent call last)
>
> /home/cbala/test/<ipython console> in <module>()
>
> /usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.pyc in
> to_dict(sequences, key_function)
> ? ?585 ? ? ? ? key = key_function(record)
> ? ?586 ? ? ? ? if key in d :
> --> 587 ? ? ? ? ? ? raise ValueError("Duplicate key '%s'" % key)
> ? ?588 ? ? ? ? d[key] = record
> ? ?589 ? ? return d
>
> ValueError: Duplicate key 'asaPdDgrYXwwJItOY/wlQFGTmGw'

Hi Bala,

You know there are duplicate sequences in your file, so if you try
to use the SEGUID as a key, there will be duplicate keys. Thus
you get this error message. If you want to use Bio.SeqIO.to_dict
you have to have unique keys.

What you should do is loop over the records and keep a record
of the checksums you have saved, and use that to ignore duplicates.
I would use a python set rather than a python list for speed.

You could do this with a for loop. However, I would probably use an
iterator based approach with a generator function - I think it is more
elegant but perhaps not so easy for a beginner:

from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

def remove_dup_seqs(records):
    """"SeqRecord iterator to removing duplicate sequences."""
    checksums = set()
    for record in records:
        checksum = seguid(record.seq)
        if checksum in checksums:
            print "Ignoring %s" % record.id
            continue
        checksums.add(checksum)
        yield record

records = remove_dup_seqs(SeqIO.parse("with_dups.fasta", "fasta"))
count = SeqIO.write(records, "no_dups.fasta", "fasta")
print "Saved %i records" % count

Note I've used filename with Bio.SeqIO which requires Biopython 1.54b
or later - for older versions use handles. See also:

http://news.open-bio.org/news/2010/04/biopython-seqio-and-alignio-easier/

Peter


From silvio.tschapke at googlemail.com  Wed Apr 21 14:34:54 2010
From: silvio.tschapke at googlemail.com (Silvio Tschapke)
Date: Wed, 21 Apr 2010 20:34:54 +0200
Subject: [Biopython] Entrez.efetch rettype retmode
Message-ID: <l2sd3ddc94e1004211134pe3590354kf65fdd8f76aa5cad@mail.gmail.com>

Hello.

I am new to Biopython and I tried to download a whole record with efetch.
The problem is that I get an error message in the output:
""Report 'full' not found in 'pmc' presentation""
Maybe I haven't understood the whole principle.

But isn't it the goal of pmc to provide full text? I have read the help-page
of efetch but it doesn't help me a lot.


----
handle = Entrez.efetch(db="pmc", id="2531137", rettype="full",
retmode="text")
string = str(handle.read())


f = open('./output.txt', 'w')
f.write(string)
----

Thanks for your help!

From robert.campbell at queensu.ca  Wed Apr 21 16:14:10 2010
From: robert.campbell at queensu.ca (Robert Campbell)
Date: Wed, 21 Apr 2010 16:14:10 -0400
Subject: [Biopython] Entrez.efetch rettype retmode
In-Reply-To: <l2sd3ddc94e1004211134pe3590354kf65fdd8f76aa5cad@mail.gmail.com>
References: <l2sd3ddc94e1004211134pe3590354kf65fdd8f76aa5cad@mail.gmail.com>
Message-ID: <20100421161410.4fd950ec@adelie.biochem.queensu.ca>

Hello Silvio,

On Wed, 21 Apr 2010 20:34:54 +0200 Silvio Tschapke
<silvio.tschapke at googlemail.com> wrote:

> Hello.
> 
> I am new to Biopython and I tried to download a whole record with efetch.
> The problem is that I get an error message in the output:
> ""Report 'full' not found in 'pmc' presentation""
> Maybe I haven't understood the whole principle.
> 
> But isn't it the goal of pmc to provide full text? I have read the help-page
> of efetch but it doesn't help me a lot.
> 
> 
> ----
> handle = Entrez.efetch(db="pmc", id="2531137", rettype="full",
> retmode="text")
> string = str(handle.read())

The documentation on efetch
(http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html)
specifies that:

  pmc - PubMed Central contains a number of articles classified as "open
  access" for which you may download the full text as XML. For the remaining
  articles in PMC you may download only the abstracts as XML.
 
So you just need to change your retmode='text' to retmode='xml' and omit the
rettype option altogether.  You will find that not all articles are free to
download this way though.  I tried a random one and got an error message that
the particular journal didn't allow download of full text as XML.

Cheers,
Rob
-- 
Robert L. Campbell, Ph.D.
Senior Research Associate/Adjunct Assistant Professor 
Botterell Hall Rm 644
Department of Biochemistry, Queen's University, 
Kingston, ON K7L 3N6  Canada
Tel: 613-533-6821            Fax: 613-533-2497
<robert.campbell at queensu.ca>    http://pldserver1.biochem.queensu.ca/~rlc

From laserson at mit.edu  Wed Apr 21 21:07:19 2010
From: laserson at mit.edu (Uri Laserson)
Date: Wed, 21 Apr 2010 21:07:19 -0400
Subject: [Biopython] Bug in GenBank/EMBL parser?
Message-ID: <l2x165c1bda1004211807o1cdcea19w46608a52fdf2a679@mail.gmail.com>

Hi,

I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which
supposedly conforms to the EMBL standard).

The short story is that whenever there is a feature, the parser checks
whether there are qualifiers in the feature with an assert statement, and
does not allow features with no qualifiers.  However, the IMGT flatfile is
full of entries that have features with no qualifiers (only coordinates).

Who is wrong here?  Does the EMBL specification require that a feature have
qualifiers?  Or is this a bug to be fixed in the parser.

To be more concrete, the parser broke on the following record:

ID   A03907 IMGT/LIGM annotation : keyword level; unassigned DNA; HUM; 412
BP.
XX
AC   A03907;
XX
DT   11-MAR-1998 (Rel. 8, arrived in LIGM-DB )
DT   10-JUN-2008 (Rel. 200824-2, Last updated, Version 3)
XX
DE   H.sapiens antibody D1.3 variable region protein  ;
DE   unassigned DNA; rearranged configuration; Ig-Heavy; regular; group
IGHV.
XX
KW   antigen receptor; Immunoglobulin superfamily (IgSF);
KW   Immunoglobulin (IG); IG-Heavy; variable; diversity; joining;
KW   rearranged.
XX
OS   Homo sapiens (human)
OC   cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa;
Eumetazoa;
OC   Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata;
OC   Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda;
OC   Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates;
OC   Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae;
OC   Homo/Pan/Gorilla group; Homo.
XX
RN   [1]
RP   1-412
RA   ;
RT   "Recombinant antibodies and methods for their production.";
RL   Patent number EP0239400-A/10, 30-SEP-1987.
RL   MEDICAL RESEARCH COUNCIL.
XX
DR   EMBL; A03907.
XX
FH   Key             Location/Qualifiers (from EMBL)
FH
FT   source          1..412
FT                   /organism="Homo sapiens"
FT                   /mol_type="unassigned DNA"
FT                   /db_xref="taxon:9606"
FT   V_region        8..>412
FT                   /note="antibody D1.3 V region"
FT   sig_peptide     8..64
FT   CDS             8..>412
FT                   /product="antibody D1.3 V region (VDJ)"
FT                   /protein_id="CAA00308.1"
FT
/translation="MAVLALLFCLVTFPSCILSQVQLKESGPGLVAPSQSLSITCTVSG
FT
FSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSL
FT                   HTDDTARYYCARERDYRLDYWGQGTTLTVSS"
FT   D_segment       356..371
FT   J_segment       372..>412
FT                   /note="J(H)2 region"
XX
SQ   Sequence 412 BP; 105 A; 109 C; 104 G; 94 T; 0 other;
     tcagagcatg gctgtcctgg cattactctt ctgcctggta acattcccaa gctgtatcct
 60
     ttcccaggtg cagctgaagg agtcaggacc tggcctggtg gcgccctcac agagcctgtc
120
     catcacatgc accgtctcag ggttctcatt aaccggctat ggtgtaaact gggttcgcca
180
     gcctccagga aagggtctgg agtggctggg aatgatttgg ggtgatggaa acacagacta
240
     taattcagct ctcaaatcca gactgagcat cagcaaggac aactccaaga gccaagtttt
300
     cttaaaaatg aacagtctgc acactgatga cacagccagg tactactgtg ccagagagag
360
     agattatagg cttgactact ggggccaagg caccactctc acagtctcct ca
412
//

And the traceback was:

ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (311, 0))

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)

/Volumes/External/home/laserson/research/church/vdj-ome/ref-data/IMGT/<ipython
console> in <module>()

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc
in parse_records(self, handle, do_features)
    418         #This is a generator function
    419         while True :
--> 420             record = self.parse(handle, do_features)
    421             if record is None : break
    422             assert record.id is not None

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc
in parse(self, handle, do_features)
    401                     feature_cleaner = FeatureValueCleaner())
    402
--> 403         if self.feed(handle, consumer, do_features) :
    404             return consumer.data
    405         else :

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc
in feed(self, handle, consumer, do_features)
    373         #Features (common to both EMBL and GenBank):
    374         if do_features :
--> 375             self._feed_feature_table(consumer,
self.parse_features(skip=False))
    376         else :
    377             self.parse_features(skip=True) # ignore the data

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc
in parse_features(self, skip)
    170
feature_lines.append(line[self.FEATURE_QUALIFIER_INDENT:].rstrip())
    171                     line = self.handle.readline()
--> 172                 features.append(self.parse_feature(feature_key,
feature_lines))
    173         self.line = line
    174         return features

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc
in parse_feature(self, feature_key, lines)
    267                 else :
    268                     #Unquoted continuation
--> 269                     assert len(qualifiers) > 0
    270                     assert key==qualifiers[-1][0]
    271                     #if debug : print "Unquoted Cont %s:%s" % (key,
line)

AssertionError:

Which is tracked to an assert statement in Scanner.py at line 269.  It
appears that the assumption in the code is that there is an unquoted
continuation of a feature qualifier.

Finally, I am using biopython 1.51 that I built from source using python 2.5
(from an EPD install 4.3.0).  I am on a Mac running OS X 10.5.8 (Leopard)

Thanks!
Uri

From biopython at maubp.freeserve.co.uk  Thu Apr 22 04:56:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 Apr 2010 09:56:52 +0100
Subject: [Biopython] Bug in GenBank/EMBL parser?
In-Reply-To: <l2x165c1bda1004211807o1cdcea19w46608a52fdf2a679@mail.gmail.com>
References: <l2x165c1bda1004211807o1cdcea19w46608a52fdf2a679@mail.gmail.com>
Message-ID: <h2p320fb6e01004220156t7add106dn6e49af03a2b04c6c@mail.gmail.com>

On Thu, Apr 22, 2010 at 2:07 AM, Uri Laserson <laserson at mit.edu> wrote:
> Hi,
>
> I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which
> supposedly conforms to the EMBL standard).
>
> The short story is that whenever there is a feature, the parser checks
> whether there are qualifiers in the feature with an assert statement, and
> does not allow features with no qualifiers. ?However, the IMGT flatfile is
> full of entries that have features with no qualifiers (only coordinates).
>
> Who is wrong here? ?Does the EMBL specification require that a feature have
> qualifiers? ?Or is this a bug to be fixed in the parser.

Hi Uri,

Thank you for your detailed report,

Since you have raised this, I went back over the EMBL documentation.
All their example features qualifiers (and from personal experience all
EMBL files from the EMBL and GenBank files from the NCBI) do have
qualifiers. However, in Section 7.2 they are called "Optional qualifiers".
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#7.2

So it does look like an unwarranted assumption in the Biopython
parser (even though it has been a safe assumption on "official" EMBL
and GenBank files thus far), which we should fix.

Could you file a bug please?
http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython

This also affect Biopython 1.54b (the latest release) and the current
code in the repository. I would hope we can solve this before
Biopython 1.54 proper is released.

Regards,

Peter


From chapmanb at 50mail.com  Thu Apr 22 08:18:10 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 22 Apr 2010 08:18:10 -0400
Subject: [Biopython] removing redundant sequence
In-Reply-To: <u2n320fb6e01004210810ve1c9a2f8qa09d0970c23e2062@mail.gmail.com>
References: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
	<t2j320fb6e01004130802u78271ebp8b48b32b488c6e2b@mail.gmail.com>
	<r2o288df32a1004210725yeb09e79fs57d7c3352da025cb@mail.gmail.com>
	<u2n320fb6e01004210810ve1c9a2f8qa09d0970c23e2062@mail.gmail.com>
Message-ID: <20100422121810.GV29724@sobchak.mgh.harvard.edu>

Bala;

> > I created a sample fasta
> > file with two redundant sequences. But when i use checksums seguid to spot
> > the redundancies, it spots only the first one.

> What you should do is loop over the records and keep a record
> of the checksums you have saved, and use that to ignore duplicates.
> I would use a python set rather than a python list for speed.
> 
> You could do this with a for loop. However, I would probably use an
> iterator based approach with a generator function - I think it is more
> elegant but perhaps not so easy for a beginner:
[... Nice code example from Peter ..]

This is a nice problem example and discussion. Bala, it sounds like
Peter provided some useful example code to solve this. Once you use
this to get together a program that solves your problem, it would be
very helpful if you could write it up as a Cookbook entry:

http://biopython.org/wiki/Category:Cookbook

That would help others in the future who will be tackling similar
issues. Thanks much,
Brad

From cloudycrimson at gmail.com  Fri Apr 23 03:56:45 2010
From: cloudycrimson at gmail.com (Karthik Raja)
Date: Fri, 23 Apr 2010 13:26:45 +0530
Subject: [Biopython] Qblast : no hits
Message-ID: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>

Hello freinds,


I have a  problem with qblast. I have sequences from the mass

spectromerty equipment that needs to be BLASTed to find the protein it
belongs

to. When I blast these sequences in the NCBI website it takes some time

(longer than usual ) but does gives me hits. When i blast them using the

following code in biopython they dont give me any hits.


CODE:

****************************************************************************

>>> from Bio.Blast import NCBIWWW

>>> result_handle = NCBIWWW.qblast("blastp", "nr", "AFAQVRCSGLARGGGYVLR")

>>> blast_results = result_handle.read()

>>> save_file = open( "testseq.xml", "w")

>>> save_file.write(blast_results)

>>> save_file.close()

****************************************************************************


OUTPUT:

****************************************************************************

<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI

BlastOutput/EN" "NCBI_BlastOutput.dtd">

<BlastOutput>

  <BlastOutput_program>blastp</BlastOutput_program>

  <BlastOutput_version>BLASTP 2.2.23+</BlastOutput_version>

  <BlastOutput_reference>Alejandro A. Sch&auml;ffer, L. Aravind, Thomas L.

Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and


Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein

database searches with composition-based statistics and other

refinements", Nucleic Acids Res. 29:2994-3005.</BlastOutput_reference>

  <BlastOutput_db>nr</BlastOutput_db>

  <BlastOutput_query-ID>12361</BlastOutput_query-ID>

  <BlastOutput_query-def>unnamed protein product</BlastOutput_query-def>

  <BlastOutput_query-len>19</BlastOutput_query-len>

  <BlastOutput_param>

    <Parameters>

      <Parameters_matrix>BLOSUM62</Parameters_matrix>

      <Parameters_expect>10</Parameters_expect>

      <Parameters_gap-open>11</Parameters_gap-open>

      <Parameters_gap-extend>1</Parameters_gap-extend>

      <Parameters_filter>F</Parameters_filter>

    </Parameters>

  </BlastOutput_param>

<BlastOutput_iterations>

<Iteration>

  <Iteration_iter-num>1</Iteration_iter-num>

  <Iteration_query-ID>12361</Iteration_query-ID>

  <Iteration_query-def>unnamed protein product</Iteration_query-def>

  <Iteration_query-len>19</Iteration_query-len>

<Iteration_hits>

</Iteration_hits>

  <Iteration_stat>

    <Statistics>

      <Statistics_db-num>10888645</Statistics_db-num>

      <Statistics_db-len>-585703444</Statistics_db-len>

      <Statistics_hsp-len>0</Statistics_hsp-len>

      <Statistics_eff-space>0</Statistics_eff-space>

      <Statistics_kappa>0.041</Statistics_kappa>

      <Statistics_lambda>0.267</Statistics_lambda>

      <Statistics_entropy>0.14</Statistics_entropy>

    </Statistics>

  </Iteration_stat>

</Iteration>

</BlastOutput_iterations>

</BlastOutput>

*****************************************************************************


Is this because a normal blast code doesn wait long till the results are

given? I mean the RTOE error. if yes, how to control the "time of
execution"?

Or else what is the problem with my code?

If you guys know anything on this issue, please give me your ideas.

Thanking you in advance.

Sincerely,

Karthik

From biopython at maubp.freeserve.co.uk  Fri Apr 23 05:49:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Apr 2010 10:49:55 +0100
Subject: [Biopython] Qblast : no hits
In-Reply-To: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
Message-ID: <s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>

Hello Karthik

On Fri, Apr 23, 2010 at 8:56 AM, Karthik Raja <cloudycrimson at gmail.com> wrote:
> Hello freinds,
>
> I have a ?problem with qblast. I have sequences from the mass
> spectromerty equipment that needs to be BLASTed to find the protein it
> belongs to. When I blast these sequences in the NCBI website it takes
> some time (longer than usual ) but does gives me hits. When i blast
> them using the following code in biopython they dont give me any hits.
>
> CODE:
>
> ****************************************************************************
>
>>>> from Bio.Blast import NCBIWWW
>>>> result_handle = NCBIWWW.qblast("blastp", "nr", "AFAQVRCSGLARGGGYVLR")
>>>> blast_results = result_handle.read()
>>>> save_file = open( "testseq.xml", "w")
>>>> save_file.write(blast_results)
>>>> save_file.close()
>
> ****************************************************************************
>
> Is this because a normal blast code doesn wait long till the results are
> given? I mean the RTOE error. if yes, how to control the "time of
> execution"?

What error? It looks like your example ran fine.

> Or else what is the problem with my code?
>
> If you guys know anything on this issue, please give me your ideas.

Differences between a manual BLAST search on the NCBI website
and a script search via QBLAST are almost always down to different
parameter settings. The NCBI have often adjusted the defaults on
the website, and they no longer match the defaults on QBLAST.
You should check things like the expectation cut off, the matrix,
gap penalties etc. The simplest option would be just to copy the
current defaults from the website into your python code.

We probably need to put this into the Biopython FAQ ...

Regards,

Peter


From cjfields at illinois.edu  Fri Apr 23 08:00:07 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 23 Apr 2010 07:00:07 -0500
Subject: [Biopython] Qblast : no hits
In-Reply-To: <s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
Message-ID: <C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>

On Apr 23, 2010, at 4:49 AM, Peter wrote:

>> ...
> 
> Differences between a manual BLAST search on the NCBI website
> and a script search via QBLAST are almost always down to different
> parameter settings. The NCBI have often adjusted the defaults on
> the website, and they no longer match the defaults on QBLAST.
> You should check things like the expectation cut off, the matrix,
> gap penalties etc. The simplest option would be just to copy the
> current defaults from the website into your python code.
> 
> We probably need to put this into the Biopython FAQ ...
> 
> Regards,
> 
> Peter

Same for BioPerl.

chris

From cloudycrimson at gmail.com  Fri Apr 23 23:27:10 2010
From: cloudycrimson at gmail.com (Karthik Raja)
Date: Sat, 24 Apr 2010 08:57:10 +0530
Subject: [Biopython] Qblast : no hits
In-Reply-To: <C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
Message-ID: <l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>

Hello Peter,

I did try changing the paramters according to the WWW BLAST and its gives an
error saying "no RID or no RTOE found". Its the same error i was trying to
tell you in the 1st post. Its the "request time of execution". Is there any
way to change this RTOE i.e. to increase it? Any idea?

On Fri, Apr 23, 2010 at 5:30 PM, Chris Fields <cjfields at illinois.edu> wrote:

> On Apr 23, 2010, at 4:49 AM, Peter wrote:
>
> >> ...
> >
> > Differences between a manual BLAST search on the NCBI website
> > and a script search via QBLAST are almost always down to different
> > parameter settings. The NCBI have often adjusted the defaults on
> > the website, and they no longer match the defaults on QBLAST.
> > You should check things like the expectation cut off, the matrix,
> > gap penalties etc. The simplest option would be just to copy the
> > current defaults from the website into your python code.
> >
> > We probably need to put this into the Biopython FAQ ...
> >
> > Regards,
> >
> > Peter
>
> Same for BioPerl.
>
> chris
>

From p.j.a.cock at googlemail.com  Sat Apr 24 07:40:27 2010
From: p.j.a.cock at googlemail.com (Peter)
Date: Sat, 24 Apr 2010 12:40:27 +0100
Subject: [Biopython] Qblast : no hits
In-Reply-To: <l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
	<l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
Message-ID: <6540A260-554B-488A-AED7-B0559883F7F7@googlemail.com>


On 24 Apr 2010, at 04:27, Karthik Raja <cloudycrimson at gmail.com> wrote:

> Hello Peter,
>
> I did try changing the paramters according to the WWW BLAST and its  
> gives an
> error saying "no RID or no RTOE found". Its the same error i was  
> trying to
> tell you in the 1st post. Its the "request time of execution". Is  
> there any
> way to change this RTOE i.e. to increase it? Any idea?
>
> On Fri, Apr 23, 2010 at 5:30 PM, Chris Fields  
> <cjfields at illinois.edu> wrote:
>
>> On Apr 23, 2010, at 4:49 AM, Peter wrote:
>>
>>>> ...
>>>
>>> Differences between a manual BLAST search on the NCBI website
>>> and a script search via QBLAST are almost always down to different
>>> parameter settings. The NCBI have often adjusted the defaults on
>>> the website, and they no longer match the defaults on QBLAST.
>>> You should check things like the expectation cut off, the matrix,
>>> gap penalties etc. The simplest option would be just to copy the
>>> current defaults from the website into your python code.
>>>
>>> We probably need to put this into the Biopython FAQ ...
>>>
>>> Regards,
>>>
>>> Peter
>>
>> Same for BioPerl.
>>
>> chris
>>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From biopython at maubp.freeserve.co.uk  Sat Apr 24 07:49:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 24 Apr 2010 12:49:55 +0100
Subject: [Biopython] Qblast : no hits
In-Reply-To: <l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
	<l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
Message-ID: <m2r320fb6e01004240449qcb2a7cc8lc2798fba8d596036@mail.gmail.com>

Hi all,

Sorry for the blank email just now.

On Sat, Apr 24, 2010 at 4:27 AM, Karthik Raja wrote:
> Hello Peter,
>
> I did try changing the paramters according to the WWW BLAST
> and its gives an error saying "no RID or no RTOE found". Its the
> same error i was trying to tell you in the 1st post. Its the "request
> time of execution". Is there any way to change this RTOE i.e. to
> increase it? Any idea?

Please show us an example with this problem (i.e. the python
code and the traceback).

What is meant to happen is we send the query to the NCBI, and
they reply with reference details (RID and RTOE) which are
used to fetch the results after BLAST has finished running.

My guess for what is happening is your parameters are for
some reason invalid, and the NCBI is giving an error page
(so no RID and no RTOE). Biopython tries to spot any error
message in this situation, but in your case could not.

Peter

From cloudycrimson at gmail.com  Sat Apr 24 23:24:59 2010
From: cloudycrimson at gmail.com (Karthik Raja)
Date: Sun, 25 Apr 2010 08:54:59 +0530
Subject: [Biopython] Fwd:  Qblast : no hits
In-Reply-To: <s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
	<l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
	<m2r320fb6e01004240449qcb2a7cc8lc2798fba8d596036@mail.gmail.com>
	<s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
Message-ID: <y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>

 Hello Peter,

As said i did try changing the parameters of qblast according to the set in
the web blast.
The parameters that I changed are
1. Martrix
2. Word size
3. Expect

There is a check box option in the web page that allows us to check it if we
want the web blast to adjust according short sequences. I am not sure how to
bring that option into the qblast.

*Below given are the code and the traceback. *
 >>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast ("blastp", "nr", "SSRVQDGMGLYTARRVR",
auto_format=None, composition_based_statistics=None, db_genetic_code=None,
endpoints=None, entrez_query='(none)', expect=200000, filter=None,
gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None,
layout=None, lcase_mask=None, matrix_name= 'PAM30', nucl_penalty=None,
nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None,
query_file=None, query_believe_defline=None, query_from=None, query_to=None,
searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None,
word_size=2, alignments=500, alignment_view=None, descriptions=500,
entrez_links_new_window=None, expect_low=None, expect_high=None,
format_entrez_query=None, format_object=None, format_type='XML',
ncbi_gi=None, results_file=None, show_overview=None)

*Traceback (most recent call last):
*  File "<pyshell#2>", line 1, in <module>
    result_handle = NCBIWWW.qblast *("blastp", "nr",
"SSRVQDGMGLYTARRVR",*auto_format=None,
composition_based_statistics=None, db_genetic_code=None,
endpoints=None, entrez_query='(none)', *expect=200000*, filter=None,
gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None,
layout=None, lcase_mask=None, *matrix_name= 'PAM30'*, nucl_penalty=None,
nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None,
query_file=None, query_believe_defline=None, query_from=None, query_to=None,
searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, *
word_size=2*, alignments=500, alignment_view=None, descriptions=500,
entrez_links_new_window=None, expect_low=None, expect_high=None,
format_entrez_query=None, format_object=None, format_type='XML',
ncbi_gi=None, results_file=None, show_overview=None)
  File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 117, in
qblast
    rid, rtoe = _parse_qblast_ref_page(handle)
  File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 203, in
_parse_qblast_ref_page
    raise ValueError("No RID and no RTOE found in the 'please wait' page."
ValueError: No RID and no RTOE found in the 'please wait' page. (there was
probably a problem with your request)

Here are a few examples of my MS sequences.


   1. *IMYTALPVIGKRHFRPSFTR *
   2. *RSSRGRGR *
   3. *AGPGPRRAKAAPYR *
   4. *ASRSYSSERRAR *
   5. *AASAAPPRAGRPDRGPLALAGR *
   6. *GSDGKSRGR *
   7. *TYGWRAEPR *
   8. *PPEPAREPRLSPRR *
   9. *GVLTALRR *
   10. *AGMRLPSRRQSFPAPVSR *

*Sincerely, *
*Karthikraja*

On Sat, Apr 24, 2010 at 5:19 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> Hi all,
>
> Sorry for the blank email just now.
>
> On Sat, Apr 24, 2010 at 4:27 AM, Karthik Raja wrote:
> > Hello Peter,
> >
> > I did try changing the paramters according to the WWW BLAST
> > and its gives an error saying "no RID or no RTOE found". Its the
> > same error i was trying to tell you in the 1st post. Its the "request
> > time of execution". Is there any way to change this RTOE i.e. to
> > increase it? Any idea?
>
> Please show us an example with this problem (i.e. the python
> code and the traceback).
>
> What is meant to happen is we send the query to the NCBI, and
> they reply with reference details (RID and RTOE) which are
> used to fetch the results after BLAST has finished running.
>
> My guess for what is happening is your parameters are for
> some reason invalid, and the NCBI is giving an error page
> (so no RID and no RTOE). Biopython tries to spot any error
> message in this situation, but in your case could not.
>
> Peter
>

From biopython at maubp.freeserve.co.uk  Sun Apr 25 08:45:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 25 Apr 2010 13:45:05 +0100
Subject: [Biopython] Fwd: Qblast : no hits
In-Reply-To: <y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
	<l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
	<m2r320fb6e01004240449qcb2a7cc8lc2798fba8d596036@mail.gmail.com>
	<s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
	<y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
Message-ID: <t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>

On Sun, Apr 25, 2010 at 4:24 AM, Karthik Raja wrote:

> *Below given are the code and the traceback. *

Great - I can run that and get the same traceback.

Here is a shorter version which does the same thing - removing all the
parameters you don't actually set:

from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR",
entrez_query='(none)', expect=200000, hitlist_size=50,
matrix_name='PAM30', word_size=2, alignments=500, descriptions=500,
format_type='XML')

Getting shorter still:

result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR",
matrix_name='PAM30')

The problem is the matrix name - remove that and the error goes away.
So progress :)

Doing a little digging, this is the error message from the NCBI is:

Message ID#35 Error: Cannot validate the Blast options:  Gap existence
and extension values of 11 and 1 not supported for PAM30
supported values are:
32767, 32767
7, 2
6, 2
5, 2
10, 1
9, 1
8, 1

As I guessed earlier, Biopython needed a little update to recognise
this error message and pass it to the user. I've done that.

In your case, you need to pick gap parameters appropriate for PAM30.

Peter

From cloudycrimson at gmail.com  Mon Apr 26 04:38:59 2010
From: cloudycrimson at gmail.com (Karthik Raja)
Date: Mon, 26 Apr 2010 14:08:59 +0530
Subject: [Biopython] Fwd: Qblast : no hits
In-Reply-To: <t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
	<l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
	<m2r320fb6e01004240449qcb2a7cc8lc2798fba8d596036@mail.gmail.com>
	<s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
	<y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
	<t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>
Message-ID: <s2yddd5adac1004260138t175044w2ec39fe395d02c4a@mail.gmail.com>

Hello Peter,

I tried out what you suggested and it works perfectly. I checked the result
XML file and there was no problem at all.
But I still have one more small issue that I am sure you can help me with.
The main reason i wanted to use python was that I could put all the query
sequences in a file and blast it. So when I tried the above code to blast a
sequence that I have put in a fasta file, it gives an error. Same kinda
error. Below are the code and traceback.

>>> fasta_string = open("test.fasta").read()
>>> result_handle = NCBIWWW.qblast("blastp", "nr",
fasta_string,entrez_query='(none)', expect=200000, hitlist_size=50,
word_size=2, alignments=500, descriptions=500,format_type='XML')

*Traceback (most recent call last):
*  File "<pyshell#28>", line 2, in <module>
    word_size=2, alignments=500, descriptions=500,format_type='XML')
  File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 117, in
qblast
    rid, rtoe = _parse_qblast_ref_page(handle)
  File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 203, in
_parse_qblast_ref_page
    raise ValueError("No RID and no RTOE found in the 'please wait' page."
ValueError: No RID and no RTOE found in the 'please wait' page. (there was
probably a problem with your request)

Please let me know if you could sense in the problem with the code.

Sincerely,
Karthik

On Sun, Apr 25, 2010 at 6:15 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Sun, Apr 25, 2010 at 4:24 AM, Karthik Raja wrote:
>
> > *Below given are the code and the traceback. *
>
> Great - I can run that and get the same traceback.
>
> Here is a shorter version which does the same thing - removing all the
> parameters you don't actually set:
>
> from Bio.Blast import NCBIWWW
> result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR",
> entrez_query='(none)', expect=200000, hitlist_size=50,
> matrix_name='PAM30', word_size=2, alignments=500, descriptions=500,
> format_type='XML')
>
> Getting shorter still:
>
> result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR",
> matrix_name='PAM30')
>
> The problem is the matrix name - remove that and the error goes away.
> So progress :)
>
> Doing a little digging, this is the error message from the NCBI is:
>
> Message ID#35 Error: Cannot validate the Blast options:  Gap existence
> and extension values of 11 and 1 not supported for PAM30
> supported values are:
> 32767, 32767
> 7, 2
> 6, 2
> 5, 2
> 10, 1
> 9, 1
> 8, 1
>
> As I guessed earlier, Biopython needed a little update to recognise
> this error message and pass it to the user. I've done that.
>
> In your case, you need to pick gap parameters appropriate for PAM30.
>
> Peter
>

From biopython at maubp.freeserve.co.uk  Mon Apr 26 06:02:24 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 11:02:24 +0100
Subject: [Biopython] Fwd: Qblast : no hits
In-Reply-To: <s2yddd5adac1004260138t175044w2ec39fe395d02c4a@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
	<l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
	<m2r320fb6e01004240449qcb2a7cc8lc2798fba8d596036@mail.gmail.com>
	<s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
	<y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
	<t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>
	<s2yddd5adac1004260138t175044w2ec39fe395d02c4a@mail.gmail.com>
Message-ID: <h2h320fb6e01004260302j54a5e67dra7fa3feaa10a49bf@mail.gmail.com>

Hi Karthik,

On Mon, Apr 26, 2010 at 9:38 AM, Karthik Raja <cloudycrimson at gmail.com> wrote:
> Hello Peter,
>
> I tried out what you suggested and it works perfectly. I checked the result
> XML file and there was no problem at all.

That's good :)

> But I still have one more small issue that I am sure you can help me with.
> The main reason i wanted to use python was that I could put all the query
> sequences in a file and blast it.

I wouldn't recommend that approach.

For a modest number of queries, I would suggest doing one online BLAST
query at a time. This will spread out the load on the NCBI, and means each
time your XML results won't be too big. Trying to do too many queries at
risks hitting an NCBI CPU limit, or having problems downloading a very
large XML result file.

For a large number of queries, I would suggest using standalone BLAST
(installed and run locally) - especially if you want to use very lenient
parameters giving lots of results (meaning large output files).

> So when I tried the above code to blast a
> sequence that I have put in a fasta file, it gives an error. Same kinda
> error. Below are the code and traceback.
>
>>>> fasta_string = open("test.fasta").read()
>>>> result_handle = NCBIWWW.qblast("blastp", "nr",
> fasta_string,entrez_query='(none)', expect=200000, hitlist_size=50,
> word_size=2, alignments=500, descriptions=500,format_type='XML')
>
> *Traceback (most recent call last):
> ...
> ValueError: No RID and no RTOE found in the 'please wait' page. (there was
> probably a problem with your request)
>
> Please let me know if you could sense in the problem with the code.
>
> Sincerely,
> Karthik

The code works fine - I just tried it using a FASTA file with four proteins.
I would guess there is a problem with your FASTA file - perhaps there
is a bad sequence in it, or too many sequences. Since you don't have
the latest code we can't see the NCBI error message in the traceback,
which would help a lot.

I see you are running on Windows, so the easiest way to try this is
to backup C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py
and replace it with the new version from our repository:
http://biopython.open-bio.org/SRC/biopython/Bio/Blast/NCBIWWW.py
or:
http://github.com/biopython/biopython/raw/master/Bio/Blast/NCBIWWW.py

Or, could you send me the FASTA file to try it here (please send it to me
directly, not the mailing list).

Regards,

Peter

From nick_leake77 at hotmail.com  Mon Apr 26 11:36:28 2010
From: nick_leake77 at hotmail.com (Nick Leake)
Date: Mon, 26 Apr 2010 11:36:28 -0400
Subject: [Biopython] parsing a fasta with multiple entries
Message-ID: <SNT113-W361DAA45FF077EF8E9A663F9040@phx.gbl>


Hello,
 
I'm having trouble parsing a fasta file with multiple sequences - it is a fasta that has most of the transposable elements in fruit flies found at http://www.fruitfly.org/p_disrupt/TE.html#NAT right side, third box down.  I want to be able to access the DNA sequences for manipulation and later removal from a chromosomal region.  I originally thought that I could follow the same fasta format example shown in the biopython tutorial.  However, that failed to work.  I think it might be because there are multiple entries.  
 
Basically, I just want parse the information and have dictionaries hold the transposon elements name and sequence for later use.  Can I do that with biopython or should I make my own parser? Any help would be greatly appreciated.  I'm still very much a python novice and get frustrated by not knowing how to ask my questions appropriately. 		 	   		  
_________________________________________________________________
The New Busy think 9 to 5 is a cute idea. Combine multiple calendars with Hotmail. 
http://www.windowslive.com/campaign/thenewbusy?tile=multicalendar&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_5

From biopython at maubp.freeserve.co.uk  Mon Apr 26 11:52:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 16:52:28 +0100
Subject: [Biopython] help with parsing EMBL
In-Reply-To: <w2n320fb6e01004260837z8a682697w1f2c5e10ab8bfa63@mail.gmail.com>
References: <w2n320fb6e01004260837z8a682697w1f2c5e10ab8bfa63@mail.gmail.com>
Message-ID: <g2m320fb6e01004260852t85f84edam5a2bbd9529ddb3ef@mail.gmail.com>

Hi Nick,

On Mon, Apr 26, 2010 Nick Leake wrote:
> Hello,
>
> I'm having trouble parsing an embl file (attached) with multiple
> sequences. ?I want to be able to access the DNA sequences for
> manipulation and removal from a chromosomal region. ?I originally
> thought that I could follow the same fasta format example shown in the
> biopython tutorial. ?However, that failed to work. ?Next, I tried to
> convert the file to a fastq or a fasta to just follow the examples -
> again, failed. ?So, I looked around and found some embl parsing code:
>
> from Bio import SeqIO
>
> p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl")
> p.next()
> record=p.next()
>
> print record
>
> This kinda works, but fails to read all entries.

Well, yes:

from Bio import SeqIO
#that imports the library

p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl")
#that sets up the EMBL parser (although EMBL files are text so it is a bit
#odd to open it in binary read mode)

p.next() #reads the first record and discards it

record=p.next() #reads the second record and stores as variable record

You only ever try and look at the second record. See below...

> ... ?In addition, I don't know what code I need to 'grab' the DNA
> information for manipulations and remove these sequences from
> a given DNA segment. ? ?Can I get a little guidance to
> what I need to do or where I can look to help solve my problem?

What you probably want to start with is a simple for loop,

from Bio import SeqIO
for record in SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41"),"embl"):
    print record.id, record.seq

However, this runs into a problem:

Traceback (most recent call last):
  ...
ValueError: Expected sequence length 2, found 2483.

Looking at your file (which was too big to send to the list), your EMBL
file is invalid. Specifically this is failing on the record which starts:

ID   FROGGER    standard; DNA; INV; 2 BP.

That ID line says the sequence is just 2 base pairs, but in fact the
seems to be 2483bp. The ID line should probably be edited like this:

ID   FROGGER    standard; DNA; INV; 2483 BP.

Fixing that shows up another similar problem,

ID   TV1    standard; DNA; INV; 1728 BP.

should probably be:

ID   TV1    standard; DNA; INV; 1730 BP.

Then there is this record:

ID   DDBARI1    standard; DNA; INV; 1676 BP.

Several parts of the record suggest it should be 1676bp (not just the ID
line, but also for example the SQ line), but there is actually 1677bp of
sequence present.

After making those three edits by hand, Biopython should parse it.
I suspect your EMBL file has been manually edited. Where did it
come from?

Peter


From biopython at maubp.freeserve.co.uk  Mon Apr 26 11:54:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 16:54:54 +0100
Subject: [Biopython] Fwd: help with parsing EMBL
In-Reply-To: <w2n320fb6e01004260837z8a682697w1f2c5e10ab8bfa63@mail.gmail.com>
References: <w2n320fb6e01004260837z8a682697w1f2c5e10ab8bfa63@mail.gmail.com>
Message-ID: <l2w320fb6e01004260854td3e000cbkb51c19792f024328@mail.gmail.com>

Hi all,

I'm forwarding this email from Nick Leake about parsing EMBL files,
but without his 1.3MB attachment. I'll reply to his questions in a
follow up email...

Peter

---------- Forwarded message ----------
From:?Nick Leake
To:?<biopython at lists.open-bio.org>
Date:?Mon, 26 Apr 2010 09:35:45 -0400
Subject:?help with parsing

Hello,


I'm having trouble parsing an embl file (attached) with multiple
sequences. ?I want to be able to access the DNA sequences for
manipulation and removal from a chromosomal region. ?I originally
thought that I could follow the same fasta format example shown in the
biopython tutorial. ?However, that failed to work. ?Next, I tried to
convert the file to a fastq or a fasta to just follow the examples -
again, failed. ?So, I looked around and found some embl parsing code:


from Bio import SeqIO

p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl")
p.next()
record=p.next()

print record


This kinda works, but fails to read all entries. ?Also, there is no
'record' argument for output. ?In addition, I don't know what code I
need to 'grab' the DNA information for manipulations and remove these
sequences from a given DNA segment. ? ?Can I get a little guidance to
what I need to do or where I can look to help solve my problem?

Any help would be greatly appreciated. ?I'm still very much a python
novice and get frustrated by not knowing how to ask my questions
appropriately.

_________________________________________________________________
The New Busy is not the too busy. Combine all your e-mail accounts with Hotmail.
http://www.windowslive.com/campaign/thenewbusy?tile=multiaccount&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_4

---------- Forwarded message ----------
From:?biopython-request at lists.open-bio.org
To:
Date:?Mon, 26 Apr 2010 09:44:02 -0400
Subject:?confirm 29081d7dc4252dd9c96c13f5018658d3414acbdc
If you reply to this message, keeping the Subject: header intact,
Mailman will discard the held message. ?Do this if the message is
spam. ?If you reply to this message and include an Approved: header
with the list password in it, the message will be approved for posting
to the list. ?The Approved: header can also appear in the first line
of the body of the reply.


From biopython at maubp.freeserve.co.uk  Mon Apr 26 11:59:02 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 16:59:02 +0100
Subject: [Biopython] parsing a fasta with multiple entries
In-Reply-To: <SNT113-W361DAA45FF077EF8E9A663F9040@phx.gbl>
References: <SNT113-W361DAA45FF077EF8E9A663F9040@phx.gbl>
Message-ID: <h2s320fb6e01004260859m218775ffj9dbebe1220358480@mail.gmail.com>

On Mon, Apr 26, 2010 at 4:36 PM, Nick Leake <nick_leake77 at hotmail.com> wrote:
>
> Hello,
>
> I'm having trouble parsing a fasta file with multiple sequences - it is a fasta
> that has most of the transposable elements in fruit flies found at
> http://www.fruitfly.org/p_disrupt/TE.html#NAT right side, third box down.

Hi Nick,

You mean this file?
http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.fasta

> I want to be able to access the DNA sequences for manipulation and later
> removal from a chromosomal region. ?I originally thought that I could follow
> the same fasta format example shown in the biopython tutorial. ?However,
> that failed to work. ?I think it might be because there are multiple entries.

The Bio.SeqIO.read() function is for when there is a single record. The
Bio.SeqIO.parse() function is for when you have multiple records. Could
you clarify which bit of the tutorial was confusing? We'd like to make it
better.

> Basically, I just want parse the information and have dictionaries hold the
> transposon elements name and sequence for later use. ?Can I do that with
> biopython or should I make my own parser? Any help would be greatly
> appreciated. ?I'm still very much a python novice and get frustrated by not
> knowing how to ask my questions appropriately.

You should be able to use the Bio.SeqIO.index() function for this.

>>> from Bio import SeqIO
>>> data = SeqIO.index("D_mel_transposon_sequence_set.fasta", "fasta")
>>> data.keys()[:10]
['gb|U14101|TART-B', 'gb|AF162798|Dbuz\\BuT1',
'gb|U26847|Dvir\\Helena', 'gb|X67681|Bari1', 'gb|M69216|hobo',
'gb|U29466|Dkoe\\Gandalf', 'gb|Z27119|flea',
'gb|AB022762|aurora-element', 'gb|nnnnnnnn|Stalker3T',
'gb|AF518730|Dwil\\Vege']
>>> data["gb|nnnnnnnn|Stalker3T"]
SeqRecord(seq=Seq('TGTAGTGTATCTACCCTCAATATGTArAGTAGAGTTAATATGTAAGTAAGTAAT...ACA',
SingleLetterAlphabet()), id='gb|nnnnnnnn|Stalker3T',
name='gb|nnnnnnnn|Stalker3T', description='gb|nnnnnnnn|Stalker3T
STALKER3 372bp', dbxrefs=[])
>>> print data["gb|nnnnnnnn|Stalker3T"].seq
TGTAGTGTATCTACCCTCAATATGTArAGTAGAGTTAATATGTAAGTAAGTAATATGTAAAGTAGAGTTAATATGTAAGTAAGCAAAAGACCACCAACACTTACATGAACACTCCAGCTCTTGAAATACGATCGAGCGCTTAAACATAAGCCGATCGCGGAGCGTGAGAGTGCCGAGCATACACCTAGCAGCTCAAGTGATTAAGATAAGATAAGATAAGATAACAAACACGTAGTCTTAAGCGCGTCATGTGCGGGTGGCTGTACCCAAGAACAGCAAAGTGAATTCATTCGAATAAACCGCTTCAAGCAGAGCAGAGCCAAGTCTATTATATCAACTTCAAAAATACCGTATAACCTTGAACCTATTACA

Peter


From biopython at maubp.freeserve.co.uk  Mon Apr 26 12:02:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 17:02:18 +0100
Subject: [Biopython] help with parsing EMBL
In-Reply-To: <g2m320fb6e01004260852t85f84edam5a2bbd9529ddb3ef@mail.gmail.com>
References: <w2n320fb6e01004260837z8a682697w1f2c5e10ab8bfa63@mail.gmail.com>
	<g2m320fb6e01004260852t85f84edam5a2bbd9529ddb3ef@mail.gmail.com>
Message-ID: <r2z320fb6e01004260902j46228ecl3b938d0853b4cc56@mail.gmail.com>

On Mon, Apr 26, 2010 at 4:52 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi Nick,
>
> On Mon, Apr 26, 2010 Nick Leake wrote:
>> Hello,
>>
>> I'm having trouble parsing an embl file (attached) with multiple
>> sequences. ...
>
> After making those three edits by hand, Biopython should parse it.
> I suspect your EMBL file has been manually edited. Where did it
> come from?

>From Nick's other email about the FASTA file,
http://lists.open-bio.org/pipermail/biopython/2010-April/006451.html
I can can see that the funny EMBL file came from the Berkeley Drosophil
 Genome Project (BDGP)'s Natural Transposable Element Project:
http://www.fruitfly.org/p_disrupt/TE.html

Specifically this file:
http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.embl

I'll email them to alert them about the three obvious errors I discussed.

Peter

From biopython at maubp.freeserve.co.uk  Mon Apr 26 12:28:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 17:28:31 +0100
Subject: [Biopython] help with parsing EMBL
In-Reply-To: <r2z320fb6e01004260902j46228ecl3b938d0853b4cc56@mail.gmail.com>
References: <w2n320fb6e01004260837z8a682697w1f2c5e10ab8bfa63@mail.gmail.com>
	<g2m320fb6e01004260852t85f84edam5a2bbd9529ddb3ef@mail.gmail.com>
	<r2z320fb6e01004260902j46228ecl3b938d0853b4cc56@mail.gmail.com>
Message-ID: <h2j320fb6e01004260928v7fd51eb0j599a506b9cbe3186@mail.gmail.com>

On Mon, Apr 26, 2010 at 5:02 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> From Nick's other email about the FASTA file,
> http://lists.open-bio.org/pipermail/biopython/2010-April/006451.html
> I can can see that the funny EMBL file came from the Berkeley Drosophil
> ?Genome Project (BDGP)'s Natural Transposable Element Project:
> http://www.fruitfly.org/p_disrupt/TE.html
>
> Specifically this file:
> http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.embl
>
> I'll email them to alert them about the three obvious errors I discussed.

There is also something odd going on with the features, which the
Biopython parser seems to be ignoring...

Peter


From biopython at maubp.freeserve.co.uk  Mon Apr 26 18:04:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 23:04:15 +0100
Subject: [Biopython] parsing a fasta with multiple entries
In-Reply-To: <SNT113-W64C744E14961EDDDAF60CDF9040@phx.gbl>
References: <SNT113-W361DAA45FF077EF8E9A663F9040@phx.gbl>
	<h2s320fb6e01004260859m218775ffj9dbebe1220358480@mail.gmail.com>
	<SNT113-W64C744E14961EDDDAF60CDF9040@phx.gbl>
Message-ID: <l2m320fb6e01004261504ye631c395if33c0a6493c7393d@mail.gmail.com>

On Mon, Apr 26, 2010 at 8:05 PM, Nick Leake wrote:
> Thanks Peter,
>
> All of the information is?very helpful.? I apologize for sending?second
> email.? I was thinking that?the first email was going to be discarded for
> having the attachment - which in hindsight is an obvious fact.? At that
> time, I had only seen the initial email for rejecting the first.

I managed to reply before sending the original email (without
attachment) to the list - so partly my fault.

>>> I want to be able to access the DNA sequences for manipulation and
>>> later removal from a chromosomal region. ?I originally thought that I
>>> could follow the same fasta format example shown in the biopython
>>> tutorial. ?However, that failed to work. ?I think it might be because
>>> there are multiple entries.
>>
>> The Bio.SeqIO.read() function is for when there is a single record. The
>> Bio.SeqIO.parse() function is for when you have multiple records. Could
>> you clarify which bit of the tutorial was confusing? We'd like to make it
>> better.
>
> The tutorial I used was from
> http://www.biopython.org/DIST/docs/tutorial/Tutorial.html

OK, good - that is the current version.

> I will admit I didn't really know the difference from the Bio.SeqIO.read()
> verse the Bio.SeqIO.parse() functions even though they should be
> intuitive.? Still, the mentioned tutorial doen't seem to have a multiple
> entry parsed example.?This is where my naivet??and confusion on
> the matter probably started.

It does (the file ls_orchid.fasta used in several examples has 94
entries), but I guess there is a lot of information in there and it can be
overwhelming.

Your problems with the funny EMBL file probably didn't help :(

Peter


From p.j.a.cock at googlemail.com  Mon Apr 26 18:30:54 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 26 Apr 2010 23:30:54 +0100
Subject: [Biopython] Google Summer of Code - accepted students
In-Reply-To: <4BD60D63.1040400@cornell.edu>
References: <4BD60D63.1040400@cornell.edu>
Message-ID: <m2l320fb6e01004261530zd9f1a723ge8958362426bb7be@mail.gmail.com>

---------- Forwarded message ----------
From: Robert Buels <rmb32 at cornell.edu>
Date: Mon, Apr 26, 2010 at 11:02 PM
Subject: Google Summer of Code - accepted students
To: rmb32 at cornell.edu


Hi all,

I'm pleased to announce the acceptance of OBF's 2010 Google Summer of
Code students, listed in alphabetical order with their project titles
and primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including
Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification,
Classification, and Visualization of Posttranslational Modification of
Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation &
Duplication Inference Algorithm for Binary and Non-binary Species Tree

Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending
Bio.PDB: broadening the usefulness of BioPython's Structural Biology
module

Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring

Congratulations to our accepted students!

All told, we had 52 applications submitted for the 6 slots (5
originally assigned, plus 1 extra) allotted to us by Google.
Proposals were extremely competitive: 6 out of 52 translates to an
11.5% acceptance rate. ?We received a lot of really excellent
proposals, the decisions were not easy.

Thanks very much to all the students who applied, we very much
appreciate your hard work.

Here's to a great 2010 Summer of Code, I'm sure these students will do
some wonderful work.

Rob Buels
OBF GSoC 2010 Administrator


From rmb32 at cornell.edu  Mon Apr 26 18:02:11 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Mon, 26 Apr 2010 15:02:11 -0700
Subject: [Biopython] Google Summer of Code - accepted students
Message-ID: <4BD60D63.1040400@cornell.edu>

Hi all,

I'm pleased to announce the acceptance of OBF's 2010 Google Summer of 
Code students, listed in alphabetical order with their project titles 
and primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including 
Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, 
Classification, and Visualization of Posttranslational Modification of 
Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & 
Duplication Inference Algorithm for Binary and Non-binary Species Tree

Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending 
Bio.PDB: broadening the usefulness of BioPython's Structural Biology module

Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring

Congratulations to our accepted students!

All told, we had 52 applications submitted for the 6 slots (5 originally 
assigned, plus 1 extra) allotted to us by Google.  Proposals were 
extremely competitive: 6 out of 52 translates to an 11.5% acceptance 
rate.  We received a lot of really excellent proposals, the decisions 
were not easy.

Thanks very much to all the students who applied, we very much 
appreciate your hard work.

Here's to a great 2010 Summer of Code, I'm sure these students will do 
some wonderful work.

Rob Buels
OBF GSoC 2010 Administrator


From anaryin at gmail.com  Tue Apr 27 00:29:36 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 27 Apr 2010 12:29:36 +0800
Subject: [Biopython] Google Summer of Code - accepted students
In-Reply-To: <m2l320fb6e01004261530zd9f1a723ge8958362426bb7be@mail.gmail.com>
References: <4BD60D63.1040400@cornell.edu>
	<m2l320fb6e01004261530zd9f1a723ge8958362426bb7be@mail.gmail.com>
Message-ID: <g2lb537e3711004262129tdda1424eke077314b53a90a2a@mail.gmail.com>

Hello all!

Thanks for the confidence! I'm sure it's going to work alright! If
anyone has any comments to add to my application feel free either to
email me!

Regards!

Jo?o [...] Rodrigues

On Monday, April 26, 2010, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> ---------- Forwarded message ----------
> From: Robert Buels <rmb32 at cornell.edu>
> Date: Mon, Apr 26, 2010 at 11:02 PM
> Subject: Google Summer of Code - accepted students
> To: rmb32 at cornell.edu
>
>
> Hi all,
>
> I'm pleased to announce the acceptance of OBF's 2010 Google Summer of
> Code students, listed in alphabetical order with their project titles
> and primary mentors:
>
> Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including
> Implementation of Multiple Sequence Alignment Algorithms
>
> Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification,
> Classification, and Visualization of Posttranslational Modification of
> Proteins
>
> Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby
>
> Sara Rayburn (PM Christian Zmasek) - Implementing Speciation &
> Duplication Inference Algorithm for Binary and Non-binary Species Tree
>
> Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending
> Bio.PDB: broadening the usefulness of BioPython's Structural Biology
> module
>
> Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring
>
> Congratulations to our accepted students!
>
> All told, we had 52 applications submitted for the 6 slots (5
> originally assigned, plus 1 extra) allotted to us by Google.
> Proposals were extremely competitive: 6 out of 52 translates to an
> 11.5% acceptance rate. ?We received a lot of really excellent
> proposals, the decisions were not easy.
>
> Thanks very much to all the students who applied, we very much
> appreciate your hard work.
>
> Here's to a great 2010 Summer of Code, I'm sure these students will do
> some wonderful work.
>
> Rob Buels
> OBF GSoC 2010 Administrator
>
> _______________________________________________
> Biopython mailing list ?- ?Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

-- 
Jo?o [...] Rodrigues
@ http://stanford.edu/~joaor/


From rmb32 at cornell.edu  Tue Apr 27 01:52:57 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Mon, 26 Apr 2010 22:52:57 -0700
Subject: [Biopython] Google Summer of Code - accepted students
Message-ID: <4BD67BB9.3000804@cornell.edu>

Hi all,

I'm pleased to announce the acceptance of OBF's 2010 Google Summer of
Code students, listed in alphabetical order with their project titles
and primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including
Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification,
Classification, and Visualization of Posttranslational Modification of
Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation &
Duplication Inference Algorithm for Binary and Non-binary Species Tree

Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending
Bio.PDB: broadening the usefulness of BioPython's Structural Biology module

Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring

Congratulations to our accepted students!

All told, we had 52 applications submitted for the 6 slots (5 originally
assigned, plus 1 extra) allotted to us by Google.  Proposals were
extremely competitive: 6 out of 52 translates to an 11.5% acceptance
rate.  We received a lot of really excellent proposals, the decisions
were not easy.

Thanks very much to all the students who applied, we very much
appreciate your hard work.

Here's to a great 2010 Summer of Code, I'm sure these students will do
some wonderful work.

Rob Buels
OBF GSoC 2010 Administrator


From biopython at maubp.freeserve.co.uk  Tue Apr 27 05:45:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Apr 2010 10:45:20 +0100
Subject: [Biopython] Bug in GenBank/EMBL parser?
In-Reply-To: <h2p320fb6e01004220156t7add106dn6e49af03a2b04c6c@mail.gmail.com>
References: <l2x165c1bda1004211807o1cdcea19w46608a52fdf2a679@mail.gmail.com>
	<h2p320fb6e01004220156t7add106dn6e49af03a2b04c6c@mail.gmail.com>
Message-ID: <z2v320fb6e01004270245s6d83b011jd4e135e8444beee8@mail.gmail.com>

On Thu, Apr 22, 2010 at 9:56 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Apr 22, 2010 at 2:07 AM, Uri Laserson <laserson at mit.edu> wrote:
>> Hi,
>>
>> I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which
>> supposedly conforms to the EMBL standard).
>>
>> The short story is that whenever there is a feature, the parser checks
>> whether there are qualifiers in the feature with an assert statement, and
>> does not allow features with no qualifiers. ?However, the IMGT flatfile is
>> full of entries that have features with no qualifiers (only coordinates).
>>
>> Who is wrong here? ?Does the EMBL specification require that a feature have
>> qualifiers? ?Or is this a bug to be fixed in the parser.
>
> Hi Uri,
>
> Thank you for your detailed report,
>
> Since you have raised this, I went back over the EMBL documentation.
> All their example features qualifiers (and from personal experience all
> EMBL files from the EMBL and GenBank files from the NCBI) do have
> qualifiers. However, in Section 7.2 they are called "Optional qualifiers".
> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#7.2
>
> So it does look like an unwarranted assumption in the Biopython
> parser (even though it has been a safe assumption on "official" EMBL
> and GenBank files thus far), which we should fix.

Bug filed and now fixed,
http://bugzilla.open-bio.org/show_bug.cgi?id=3062

It turned out to be an invalid EMBL file where the features were over-
indented. Biopython was quite happy to parse valid EMBL or GenBank
files with features without qualifiers (although I don't recall seeing any
examples from EMBL or the NCBI like this).

Peter


From silvio.tschapke at googlemail.com  Wed Apr 28 05:24:25 2010
From: silvio.tschapke at googlemail.com (Silvio Tschapke)
Date: Wed, 28 Apr 2010 11:24:25 +0200
Subject: [Biopython] save efetch results in different files
Message-ID: <t2jd3ddc94e1004280224me86a2493t372d45b28954b69c@mail.gmail.com>

Hi all,

I'd like to download hundreds of pubmed entries in one turn, but save every
entry in a single file for further processing with e.g. NLTK.
Is this possible? Or what is the common way to do this? Or do I have to call
efetch for every single pmid? I dont know how.
Could you also explain me what handle.read() does? Entrez.read(handle) I
understand, because it is documented, but handle.read() not. What kind of
type is a handle?


search_results = Entrez.read(Entrez.esearch(db="pubmed",
                                            term="Biopython",
                                            usehistory="y"))

batch_size = 10


for start in range(0,count,batch_size):
    end = min(count, start+batch_size)
    print "Going to download record %i to %i" % (start+1, end)
    fetch_handle = Entrez.efetch(db="pubmed", rettype="xml",
                                 retstart=start, retmax=batch_size,
                                 webenv=search_results["WebEnv"],
                                 query_key=search_results["QueryKey"])


for pmid in search_results["IdList"]:
    out_handle = open(pmid+".txt", "w")
    HERE I HAVE TO ACCESS THE ENTRY FROM THE fetch_handle FOR THE
CORRESPONDING pmid

    #data = Entrez.read(fetch_handle)
    #data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
    out_handle.close()


Cheers,
Silvio

From biopython at maubp.freeserve.co.uk  Wed Apr 28 05:57:48 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Apr 2010 10:57:48 +0100
Subject: [Biopython] save efetch results in different files
In-Reply-To: <t2jd3ddc94e1004280224me86a2493t372d45b28954b69c@mail.gmail.com>
References: <t2jd3ddc94e1004280224me86a2493t372d45b28954b69c@mail.gmail.com>
Message-ID: <g2w320fb6e01004280257jbf297b99qac07bc2b7f9519ae@mail.gmail.com>

On Wed, Apr 28, 2010 at 10:24 AM, Silvio Tschapke
<silvio.tschapke at googlemail.com> wrote:
> Hi all,
>
> I'd like to download hundreds of pubmed entries in one turn, but save every
> entry in a single file for further processing with e.g. NLTK.
> Is this possible? Or what is the common way to do this? Or do I have to call
> efetch for every single pmid? I dont know how.

Personally I would probably save each pubmed result to a separate file
named using the pmid - a Unix filesystem should cope fine with a few
thousand files in a single directory. This is simple and lets you add more
entries at a later date, and you have simple access to any record.

The other approach of combining separate entries into multiple files sounds
overly complicated (although possible), while another approach would be a
single large file containing all the records in one. These would require a
index if you needed random access to the entries by pmid.

> Could you also explain me what handle.read() does? Entrez.read(handle) I
> understand, because it is documented, but handle.read() not. What kind of
> type is a handle?

It is *like* a standard handle that you'd get in python from open(filename).
This is an object supporting read() giving all the remaining data as a string,
readline() giving the next line etc.

Peter

From laserson at mit.edu  Wed Apr 28 14:49:40 2010
From: laserson at mit.edu (Uri Laserson)
Date: Wed, 28 Apr 2010 14:49:40 -0400
Subject: [Biopython] SPARK error messages to be sent to stderr?
Message-ID: <r2y165c1bda1004281149jbd33bb38hff43801adb023676@mail.gmail.com>

The spark error messages when there is a parsing problem are currently
getting sent to stdout:

(line 181 in Bio/Parsers/spark.py)
print "Syntax error at or near `%s' token" % token

Can this be changed to:
print >>sys.stderr, "Syntax error at or near `%s' token" % token

This way the error messages can be handled separately.

Thanks!
Uri

From laserson at mit.edu  Wed Apr 28 15:12:28 2010
From: laserson at mit.edu (Uri Laserson)
Date: Wed, 28 Apr 2010 15:12:28 -0400
Subject: [Biopython] Can the GenBank/EMBL parser recover from errors?
Message-ID: <t2o165c1bda1004281212waa7dbf7ar5d1b4ca3d4b7664e@mail.gmail.com>

Hi,

I am trying to parse a large file of EMBL records that I know has some
errors in it.  However, rather than having the parser break when it gets to
the error, I'd rather it just skip that record, and move on to the next one.
 I was wondering if this functionality is already built in somewhere.  One
way I can do this is like this:

iterator = SeqIO.parse(ip,'embl').__iter__()
while True:
    try:
        record = iterator.next()
    # Now I specify all the parsing errors I want to catch:
    except LocationParserError:
        # Reinitialize iterator at current file position. The iterator
        # then skips to the beginning of the next record and continues.
        iterator = SeqIO.parse(ip,'embl').__iter__()
    except StopIteration:
        break

This way, whenever there is a parsing error, I just reinitialize the
iterator at the current file position, and it seeks to the beginning of the
next record.  However, this requires me to write out the for loop manually
(using StopIteration).  Does anyone know of a cleaner/more elegant way of
doing this?

Thanks!
Uri

-- 
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu

From laserson at mit.edu  Wed Apr 28 17:38:52 2010
From: laserson at mit.edu (Uri Laserson)
Date: Wed, 28 Apr 2010 17:38:52 -0400
Subject: [Biopython] Bug in GenBank/EMBL parser?
In-Reply-To: <z2v320fb6e01004270245s6d83b011jd4e135e8444beee8@mail.gmail.com>
References: <l2x165c1bda1004211807o1cdcea19w46608a52fdf2a679@mail.gmail.com> 
	<h2p320fb6e01004220156t7add106dn6e49af03a2b04c6c@mail.gmail.com> 
	<z2v320fb6e01004270245s6d83b011jd4e135e8444beee8@mail.gmail.com>
Message-ID: <y2h165c1bda1004281438kb5e5049n913fa4af53e184a@mail.gmail.com>

This fixed the main problem with parsing IMGT files that have increased
indentation.  I also filed an additional bug/enhancement with a proposed
patch, which should make biopython compatible with IMGT and still conform to
the INSDC format: http://bugzilla.open-bio.org/show_bug.cgi?id=3069

<http://bugzilla.open-bio.org/show_bug.cgi?id=3069>Uri

On Tue, Apr 27, 2010 at 05:45, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Thu, Apr 22, 2010 at 9:56 AM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
> > On Thu, Apr 22, 2010 at 2:07 AM, Uri Laserson <laserson at mit.edu> wrote:
> >> Hi,
> >>
> >> I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which
> >> supposedly conforms to the EMBL standard).
> >>
> >> The short story is that whenever there is a feature, the parser checks
> >> whether there are qualifiers in the feature with an assert statement,
> and
> >> does not allow features with no qualifiers.  However, the IMGT flatfile
> is
> >> full of entries that have features with no qualifiers (only
> coordinates).
> >>
> >> Who is wrong here?  Does the EMBL specification require that a feature
> have
> >> qualifiers?  Or is this a bug to be fixed in the parser.
> >
> > Hi Uri,
> >
> > Thank you for your detailed report,
> >
> > Since you have raised this, I went back over the EMBL documentation.
> > All their example features qualifiers (and from personal experience all
> > EMBL files from the EMBL and GenBank files from the NCBI) do have
> > qualifiers. However, in Section 7.2 they are called "Optional
> qualifiers".
> >
> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#7.2
> >
> > So it does look like an unwarranted assumption in the Biopython
> > parser (even though it has been a safe assumption on "official" EMBL
> > and GenBank files thus far), which we should fix.
>
> Bug filed and now fixed,
> http://bugzilla.open-bio.org/show_bug.cgi?id=3062
>
> It turned out to be an invalid EMBL file where the features were over-
> indented. Biopython was quite happy to parse valid EMBL or GenBank
> files with features without qualifiers (although I don't recall seeing any
> examples from EMBL or the NCBI like this).
>
> Peter
>


-- 
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu

From p.j.a.cock at googlemail.com  Wed Apr 28 18:11:43 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 28 Apr 2010 23:11:43 +0100
Subject: [Biopython] Can the GenBank/EMBL parser recover from errors?
In-Reply-To: <t2o165c1bda1004281212waa7dbf7ar5d1b4ca3d4b7664e@mail.gmail.com>
References: <t2o165c1bda1004281212waa7dbf7ar5d1b4ca3d4b7664e@mail.gmail.com>
Message-ID: <r2k320fb6e01004281511x897d58f7gca81d2b8f91e303@mail.gmail.com>

On Wednesday, April 28, 2010, Uri Laserson <laserson at mit.edu> wrote:
> Hi,
>
> I am trying to parse a large file of EMBL records that I know has some
> errors in it. ?However, rather than having the parser break when it gets to
> the error, I'd rather it just skip that record, and move on to the next one.
> ?I was wondering if this functionality is already built in somewhere. ?One
> way I can do this is like this:
>
> iterator = SeqIO.parse(ip,'embl').__iter__()
> while True:
>  ? ?try:
>  ? ? ? ?record = iterator.next()
>  ? ?# Now I specify all the parsing errors I want to catch:
>  ? ?except LocationParserError:
>  ? ? ? ?# Reinitialize iterator at current file position. The iterator
>  ? ? ? ?# then skips to the beginning of the next record and continues.
>  ? ? ? ?iterator = SeqIO.parse(ip,'embl').__iter__()
>  ? ?except StopIteration:
>  ? ? ? ?break
>
> This way, whenever there is a parsing error, I just reinitialize the
> iterator at the current file position, and it seeks to the beginning of the
> next record. ?However, this requires me to write out the for loop manually
> (using StopIteration). ?Does anyone know of a cleaner/more elegant way of
> doing this?
>
> Thanks!

Hi Uri,

There is no obvious way to handle this within the Bio.SeqIO.parse framework.

I'd suggest you use Bio.SeqIO.index instead (assuming the file isn't
so corrupt that it can't be scanned to identify each record). Just
wrap each record access in an error handler.

Peter


From cloudycrimson at gmail.com  Thu Apr 29 02:58:26 2010
From: cloudycrimson at gmail.com (Karthik Raja)
Date: Thu, 29 Apr 2010 12:28:26 +0530
Subject: [Biopython] Fwd: Qblast : no hits
In-Reply-To: <p2s320fb6e01004260457kb6f53bd7pd48cdcf63ddeb043@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<m2r320fb6e01004240449qcb2a7cc8lc2798fba8d596036@mail.gmail.com>
	<s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
	<y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
	<t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>
	<s2yddd5adac1004260138t175044w2ec39fe395d02c4a@mail.gmail.com>
	<h2h320fb6e01004260302j54a5e67dra7fa3feaa10a49bf@mail.gmail.com>
	<q2gddd5adac1004260425q3db04bd9rfdc08b80997ffa63@mail.gmail.com>
	<l2o320fb6e01004260452oe343dda3tfeca71b3d72b7403@mail.gmail.com>
	<p2s320fb6e01004260457kb6f53bd7pd48cdcf63ddeb043@mail.gmail.com>
Message-ID: <y2jddd5adac1004282358pfa8fac50v963dcafa32416320@mail.gmail.com>

hello Peter,

Sorry for the late reply. I am writing to thank you. The suggestions you
gave were of massive work in our research by reducing the BLASTing time.
Thank you for taking interest,

Sincerely,
Karthikaja
On Mon, Apr 26, 2010 at 5:27 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

>  On Mon, Apr 26, 2010 at 12:52 PM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
> > On Mon, Apr 26, 2010 at 12:25 PM, Karthik Raja <cloudycrimson at gmail.com>
> wrote:
> >> Hi Peter,
> >>
> >> I will seriously consider using the stand alone blast option. And thank
> you
> >> so much for the links. :)  I have replaced the repository.
> >>
> >> You suspected a problem with the sequences but they work very well when
> >> given directly in the code. I have attached my fasta file. Please tell
> me
> >> how it works with you.
> >>
> >> Karthikraja.
> >
> > You seem to have made a mistake with the FASTA file, there should be
> > a read name on the ">" lines with the sequence on the subsequence lines.
> > E.g. More like this:
> >
> >>Seq1
> > IMYTALPVIGKRHFRPSFTR
> >>Seq2
> > RSSRGRGR
> > (etc)
> >
> > As is, your file is valid but describes seven records each with no
> sequence
> > (instead their names are IMYTALPVIGKRHFRPSFTR, RSSRGRGR, etc).
>
> P.S. The updated Biopython should have given you this error message:
>
> ValueError: Error message from NCBI: Message ID#32 Error: Query
> contains no data: Query contains no sequence data
>
> Peter
>

From biopython at maubp.freeserve.co.uk  Thu Apr 29 05:08:00 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Apr 2010 10:08:00 +0100
Subject: [Biopython] save efetch results in different files
In-Reply-To: <s2ld3ddc94e1004280956w8f442c60u988a324394a1cc48@mail.gmail.com>
References: <t2jd3ddc94e1004280224me86a2493t372d45b28954b69c@mail.gmail.com>
	<g2w320fb6e01004280257jbf297b99qac07bc2b7f9519ae@mail.gmail.com>
	<s2ld3ddc94e1004280956w8f442c60u988a324394a1cc48@mail.gmail.com>
Message-ID: <p2h320fb6e01004290208wb59d11fax1e4dd751b1ae67cf@mail.gmail.com>

On Wed, Apr 28, 2010 at 5:56 PM, Silvio Tschapke wrote:
>
> On Wed, Apr 28, 2010 at 11:57 AM, Peter wrote:
>>
>> On Wed, Apr 28, 2010 at 10:24 AM, Silvio Tschapke wrote:
>> > Hi all,
>> >
>> > I'd like to download hundreds of pubmed entries in one turn, but save
>> > every entry in a single file for further processing with e.g. NLTK.
>> > Is this possible? Or what is the common way to do this? Or do I have to
>> > call efetch for every single pmid? I dont know how.
>>
>> Personally I would probably save each pubmed result to a separate file
>> named using the pmid - a Unix filesystem should cope fine with a few
>> thousand files in a single directory. This is simple and lets you add more
>> entries at a later date, and you have simple access to any record.
>
> This is what I thought..to save each pubmed result to a separate file named
> using the pmid, as you can see in the code snippet.
> But it isn't working so far. Could you help me with the efetch_handle? I
> have called efetch one time with all pmids. So the efetch_handle contains
> all results. But now I need to pull out every single result from this handle
> to save it in a separate file with its pmid. And I don't know how to do it.
> Or isn't there another way..do I have to call efetch for every pmid and than
> save it into a file inside the loop?
> Because Biopython recommends to not do many queries per second I
> thought it would be better to only call efetch one time for all pmids.

The simplest answer is to make one efetch call per PMID, giving a single
record at a time which you can save to individual files. You can still do
this with the esearch+efetch history support. This does mean making
many small queries to the NCBI, rather than batching them together -
but the NCBI do not have any explicit guidelines on batch sizes.

Note - you would be making over 100 queries, so make sure you don't
run this during USA office hours!

The more complex approach (which the NCBI might prefer) is to
download batches of records together (e.g. 50 PMID results at once).
If you wanted to save these to separate files, you would have to divide
the text up yourself. I think you just need to look for lines starting
"PMID-" so this shouldn't be too hard.

Peter

From cloudycrimson at gmail.com  Fri Apr 30 06:50:08 2010
From: cloudycrimson at gmail.com (Karthik Raja)
Date: Fri, 30 Apr 2010 16:20:08 +0530
Subject: [Biopython] Fwd: Qblast : no hits
In-Reply-To: <y2jddd5adac1004282358pfa8fac50v963dcafa32416320@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
	<y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
	<t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>
	<s2yddd5adac1004260138t175044w2ec39fe395d02c4a@mail.gmail.com>
	<h2h320fb6e01004260302j54a5e67dra7fa3feaa10a49bf@mail.gmail.com>
	<q2gddd5adac1004260425q3db04bd9rfdc08b80997ffa63@mail.gmail.com>
	<l2o320fb6e01004260452oe343dda3tfeca71b3d72b7403@mail.gmail.com>
	<p2s320fb6e01004260457kb6f53bd7pd48cdcf63ddeb043@mail.gmail.com>
	<y2jddd5adac1004282358pfa8fac50v963dcafa32416320@mail.gmail.com>
Message-ID: <l2vddd5adac1004300350q7fd71962ucfebf885901923fa@mail.gmail.com>

hello Peter,

I have done blast for 25 sequences and have got 10 hits for each sequence. I
have stored the results in an XML file. Now i need to *parse* it and the
information in the cookbook isn helping me.

>>> from Bio.Blast import NCBIWWW
>>> result_handle = open("finaltest3.xml")
>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
>>> for blast_record in blast_records:

I am using the above code. Please tell me how to proceed to get information
namely "sequence, seq id, e value and alignment".

And I also have another doubt. While using q blast, is it possible to
restrict the results to only human and mouse hits? If yes, it will be great
if you could give me an example code or link.

Sincerely,
Karthik.


On Thu, Apr 29, 2010 at 12:28 PM, Karthik Raja <cloudycrimson at gmail.com>wrote:

>
> hello Peter,
>
> Sorry for the late reply. I am writing to thank you. The suggestions you
> gave were of massive work in our research by reducing the BLASTing time.
> Thank you for taking interest,
>
> Sincerely,
> Karthikaja
>   On Mon, Apr 26, 2010 at 5:27 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:
>
>>  On Mon, Apr 26, 2010 at 12:52 PM, Peter <biopython at maubp.freeserve.co.uk>
>> wrote:
>> > On Mon, Apr 26, 2010 at 12:25 PM, Karthik Raja <cloudycrimson at gmail.com>
>> wrote:
>> >> Hi Peter,
>> >>
>> >> I will seriously consider using the stand alone blast option. And thank
>> you
>> >> so much for the links. :)  I have replaced the repository.
>> >>
>> >> You suspected a problem with the sequences but they work very well when
>> >> given directly in the code. I have attached my fasta file. Please tell
>> me
>> >> how it works with you.
>> >>
>> >> Karthikraja.
>> >
>> > You seem to have made a mistake with the FASTA file, there should be
>> > a read name on the ">" lines with the sequence on the subsequence lines.
>> > E.g. More like this:
>> >
>> >>Seq1
>> > IMYTALPVIGKRHFRPSFTR
>> >>Seq2
>> > RSSRGRGR
>> > (etc)
>> >
>> > As is, your file is valid but describes seven records each with no
>> sequence
>> > (instead their names are IMYTALPVIGKRHFRPSFTR, RSSRGRGR, etc).
>>
>> P.S. The updated Biopython should have given you this error message:
>>
>> ValueError: Error message from NCBI: Message ID#32 Error: Query
>> contains no data: Query contains no sequence data
>>
>> Peter
>>
>
>

From biopython at maubp.freeserve.co.uk  Fri Apr 30 07:15:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 30 Apr 2010 12:15:05 +0100
Subject: [Biopython] Fwd: Qblast : no hits
In-Reply-To: <l2vddd5adac1004300350q7fd71962ucfebf885901923fa@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
	<t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>
	<s2yddd5adac1004260138t175044w2ec39fe395d02c4a@mail.gmail.com>
	<h2h320fb6e01004260302j54a5e67dra7fa3feaa10a49bf@mail.gmail.com>
	<q2gddd5adac1004260425q3db04bd9rfdc08b80997ffa63@mail.gmail.com>
	<l2o320fb6e01004260452oe343dda3tfeca71b3d72b7403@mail.gmail.com>
	<p2s320fb6e01004260457kb6f53bd7pd48cdcf63ddeb043@mail.gmail.com>
	<y2jddd5adac1004282358pfa8fac50v963dcafa32416320@mail.gmail.com>
	<l2vddd5adac1004300350q7fd71962ucfebf885901923fa@mail.gmail.com>
Message-ID: <o2n320fb6e01004300415y236acb44z562fb6f11435c3f4@mail.gmail.com>

On Fri, Apr 30, 2010 at 11:50 AM, Karthik Raja <cloudycrimson at gmail.com> wrote:
> hello Peter,
>
> I have done blast for 25 sequences and have got 10 hits for each sequence. I
> have stored the results in an XML file. Now i need to *parse* it and the
> information in the cookbook isn helping me.
>
>>>> from Bio.Blast import NCBIWWW
>>>> result_handle = open("finaltest3.xml")
>>>> from Bio.Blast import NCBIXML
>>>> blast_records = NCBIXML.parse(result_handle)
>>>> for blast_record in blast_records:
>
> I am using the above code. Please tell me how to proceed to get information
> namely "sequence, seq id, e value and alignment".

That should be fairly clear from the tutorial, look at the section titled
"The BLAST record class".

> And I also have another doubt. While using q blast, is it possible to
> restrict the results to only human and mouse hits? If yes, it will be great
> if you could give me an example code or link.

You can ask the NCBI to filter the BLAST results for you with an
Entrez query, one of the optional arguments to the Biopython
qblast function. Something like "mouse[ORGN] OR human[ORGN]"
should work. You can try out the Entrez query on the website to
make sure you have the right syntax and terms.

Peter

From p.j.a.cock at googlemail.com  Fri Apr  2 17:34:00 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 2 Apr 2010 18:34:00 +0100
Subject: [Biopython] Biopython 1.54 beta released
Message-ID: <t2o320fb6e01004021034p44915b44w421d6e9a152b13e4@mail.gmail.com>

Dear all,

A beta release for Biopython 1.54 is now available for download
and testing, as announced here:

http://news.open-bio.org/news/2009/06/biopython-154-beta-released/

Noted that I haven't done a fully detailed release announcement,
we'll leave that for the official release.

Source distributions and Windows installers are available from
the downloads page on the Biopython website.
http://biopython.org/wiki/Download

We are interested in getting feedback on the beta release as
a whole, but especially on the new features - including the
updated multiple sequence alignment object (which is what
you?ll now get when parsing alignments with Bio.AlignIO), the
new Bio.Phylo module, and the Bio.SeqIO support for Standard
Flowgram Format (SFF) files.

(At least) 10 people contributed to this release (so far), which
includes 4 new people:

Anne Pajon (first contribution)
Brad Chapman
Christian Zmasek
Eric Talevich
Jose Blanca (first contribution)
Kevin Jacobs (first contribution)
Leighton Pritchard
Michiel de Hoon
Peter Cock
Thomas Holder (first contribution)

On behalf of the Biopython team, thank you for any feedback,
bug reports, and contributions.

Peter

P.S.

You may wish to subscribe to our news feed.  For RSS links etc, see:
http://biopython.org/wiki/News

Biopython news is also on twitter:
http://twitter.com/biopython


From p.j.a.cock at googlemail.com  Fri Apr  2 17:39:08 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 2 Apr 2010 18:39:08 +0100
Subject: [Biopython] Biopython 1.54 beta released
In-Reply-To: <t2o320fb6e01004021034p44915b44w421d6e9a152b13e4@mail.gmail.com>
References: <t2o320fb6e01004021034p44915b44w421d6e9a152b13e4@mail.gmail.com>
Message-ID: <h2n320fb6e01004021039x302510bel7a81ff3fb833025@mail.gmail.com>

> Dear all,
>
> A beta release for Biopython 1.54 is now available for download
> and testing, as announced here:
>
> http://news.open-bio.org/news/2009/06/biopython-154-beta-released/
>
> Noted that I haven't done a fully detailed release announcement,
> we'll leave that for the official release.

That URL should have been:
http://news.open-bio.org/news/2010/04/biopython-1-54-beta-released/

Sorry for the extra email,

Peter


From cgohlke at uci.edu  Fri Apr  2 23:05:25 2010
From: cgohlke at uci.edu (Christoph Gohlke)
Date: Fri, 02 Apr 2010 16:05:25 -0700
Subject: [Biopython] Biopython 1.54b test failures
Message-ID: <4BB67835.7030303@uci.edu>

Hello,

I get two test failures (see below) when running 'setup.py test' for 
biopython 1.54b on win-amd64-py2.6 (built with msvc9). These are related 
to line ending style. Maybe it would be a good idea to use Python's 
universal newline support (available since 2.3) when opening text files 
for iteration over lines. All tests pass after the following changes:

BIO/SCOP/Raf.py

line 104:
         f = open(self.filename, 'rU')

line 121:
         f = open(self.filename, 'rU')

BIO/SCOP/Cla.py

line 103:
         f = open(self.filename, 'rU')

line 123:
         f = open(self.filename, 'rU')

line 72 (inconsistent indentation):
             h.append("=".join(map(str,ht)))


-- Christoph


======================================================================
ERROR: Test CLA file indexing
----------------------------------------------------------------------
Traceback (most recent call last):
   File "test_SCOP_Cla.py", line 74, in testIndex
     rec = index['d1hbia_']
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Cla.py", 
line 127,
  in __getitem__
     record = Record(line)
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Cla.py", 
line 45,
in __init__
     self._process(line)
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Cla.py", 
line 51,
in _process
     raise ValueError("I don't understand the format of %s" % line)
ValueError: I don't understand the format of 5

======================================================================
ERROR: testSeqMapIndex (test_SCOP_Raf.RafTests)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "test_SCOP_Raf.py", line 68, in testSeqMapIndex
     r = index.getSeqMap("103m")
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Raf.py", 
line 152,
  in getSeqMap
     sm = self[id]
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Raf.py", 
line 125,
  in __getitem__
     record = SeqMap(line)
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Raf.py", 
line 196,
  in __init__
     self._process(line)
   File 
"D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Raf.py", 
line 216,
  in _process
     raise ValueError("Incompatible RAF version: "+self.version)
ValueError: Incompatible RAF version: .01

----------------------------------------------------------------------
Ran 143 tests in 98.871 seconds

FAILED (failures = 2)


From biopython at maubp.freeserve.co.uk  Fri Apr  2 23:22:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 3 Apr 2010 00:22:32 +0100
Subject: [Biopython] Biopython 1.54b test failures
In-Reply-To: <4BB67835.7030303@uci.edu>
References: <4BB67835.7030303@uci.edu>
Message-ID: <u2p320fb6e01004021622i9a350a0eqfaf9663b5d6e9e62@mail.gmail.com>

On Sat, Apr 3, 2010 at 12:05 AM, Christoph Gohlke <cgohlke at uci.edu> wrote:
> Hello,
>
> I get two test failures (see below) when running 'setup.py test' for
> biopython 1.54b on win-amd64-py2.6 (built with msvc9). These are
> related to line ending style.

It is a known issue - a simple work around is just run something
like unix2dos on the SCOP test files, and then the tests pass.

> Maybe it would be a good idea to use Python's universal
> newline support (available since 2.3) when opening text
> files for iteration over lines.

I had tried that in the past without success...

> All tests pass after the following changes:
>
> BIO/SCOP/Raf.py
>
> line 104:
> ? ? ? ?f = open(self.filename, 'rU')
>
> line 121:
> ? ? ? ?f = open(self.filename, 'rU')
>
> BIO/SCOP/Cla.py
>
> line 103:
> ? ? ? ?f = open(self.filename, 'rU')
>
> line 123:
> ? ? ? ?f = open(self.filename, 'rU')
>
> line 72 (inconsistent indentation):
> ? ? ? ? ? ?h.append("=".join(map(str,ht)))
>

I recall trying the universal read lines thing before without
success in the SCOP tests - maybe it was this line 72 thing
that I missed. I'll take another look at this next week (when
I have access to a Windows machine).

Thanks,

Peter


From skhadar at gmail.com  Sat Apr  3 01:33:01 2010
From: skhadar at gmail.com (Khader Shameer)
Date: Fri, 2 Apr 2010 19:33:01 -0600
Subject: [Biopython] Biopython installation failed on Mac OSX 10.6
Message-ID: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>

Hi,

I was trying to install BioPython using fink.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin

Used the command "fink install biopython-py24"
Got the following error:
Failed: no package found for specification 'biopython-py24'!
Tried 23, 24 and 25 - it is not working.

Any idea why it is not working ?

Thanks,
Shameer


From vincent at vincentdavis.net  Sat Apr  3 03:04:17 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Fri, 2 Apr 2010 21:04:17 -0600
Subject: [Biopython] Biopython installation failed on Mac OSX 10.6
In-Reply-To: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>
References: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>
Message-ID: <q2t77e831101004022004jc3b06856g91f2cf3780650ffd@mail.gmail.com>

Installing from source, instructions here is straight forward, just did it
with the newest version, no problems
http://biopython.org/wiki/Download

*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Fri, Apr 2, 2010 at 7:33 PM, Khader Shameer <skhadar at gmail.com> wrote:

> Hi,
>
> I was trying to install BioPython using fink.
>
> Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
>
> Used the command "fink install biopython-py24"
> Got the following error:
> Failed: no package found for specification 'biopython-py24'!
> Tried 23, 24 and 25 - it is not working.
>
> Any idea why it is not working ?
>
> Thanks,
> Shameer
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From p.j.a.cock at googlemail.com  Sat Apr  3 10:33:48 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 3 Apr 2010 11:33:48 +0100
Subject: [Biopython] Biopython installation failed on Mac OSX 10.6
In-Reply-To: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>
References: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>
Message-ID: <j2i320fb6e01004030333qc2ba3820i7ff71f6b853efaf3@mail.gmail.com>

On Sat, Apr 3, 2010 at 2:33 AM, Khader Shameer <skhadar at gmail.com> wrote:
> Hi,
>
> I was trying to install BioPython using fink.
>
> Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
>
> Used the command "fink install biopython-py24"
> Got the following error:
> Failed: no package found for specification 'biopython-py24'!
> Tried 23, 24 and 25 - it is not working.
>
> Any idea why it is not working ?

Something to do with Fink? Also note we don't
support Python 2.3 anymore (and Python 2.4 is
on its last few releases as a supported version
for Biopython).

Apple provides python 2.5 (32bit) and python
2.6 (64bit) on Snow Leopard. I actually use
python 2.6 on the Mac specifically because it
is 64bit and can cope with more memory.

As Vincent and our documentation suggests,
try just installing from source. You'll need to
install Apple's XCode tools first, and it seems
to help if you tick the optional older SDKs as well.

Peter


From p.j.a.cock at googlemail.com  Sat Apr  3 13:52:11 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 3 Apr 2010 14:52:11 +0100
Subject: [Biopython] Biopython installation failed on Mac OSX 10.6
In-Reply-To: <j2i320fb6e01004030333qc2ba3820i7ff71f6b853efaf3@mail.gmail.com>
References: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>
	<j2i320fb6e01004030333qc2ba3820i7ff71f6b853efaf3@mail.gmail.com>
Message-ID: <n2u320fb6e01004030652la67e081fq24ce895e29e7c9b@mail.gmail.com>

>> Hi,
>>
>> I was trying to install BioPython using fink.
>>
>> Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
>> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
>>
>> Used the command "fink install biopython-py24"
>> Got the following error:
>> Failed: no package found for specification 'biopython-py24'!
>> Tried 23, 24 and 25 - it is not working.
>>
>> Any idea why it is not working ?
>
> Something to do with Fink? Also note we don't
> support Python 2.3 anymore (and Python 2.4 is
> on its last few releases as a supported version
> for Biopython).

If you really want to use fink, I think you'll have to
contact the fink team. Specifically it looks like
Koen van der Drift is kindly taking care of packaging
Biopython on Fink:

http://pdb.finkproject.org/pdb/package.php/biopython-py24
http://pdb.finkproject.org/pdb/package.php/biopython-py25
http://pdb.finkproject.org/pdb/package.php/biopython-py26

Peter


From skhadar at gmail.com  Sat Apr  3 17:19:49 2010
From: skhadar at gmail.com (Khader Shameer)
Date: Sat, 3 Apr 2010 11:19:49 -0600
Subject: [Biopython] Biopython installation failed on Mac OSX 10.6
In-Reply-To: <n2u320fb6e01004030652la67e081fq24ce895e29e7c9b@mail.gmail.com>
References: <w2jb6ff81951004021833k16f90b28x2d17f1b2aee14c4f@mail.gmail.com>
	<j2i320fb6e01004030333qc2ba3820i7ff71f6b853efaf3@mail.gmail.com>
	<n2u320fb6e01004030652la67e081fq24ce895e29e7c9b@mail.gmail.com>
Message-ID: <k2nb6ff81951004031019vd49d921lfad25a6636781086@mail.gmail.com>

Thanks Vincent, Peter :  I have installed BioPython from source.


On Sat, Apr 3, 2010 at 7:52 AM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> >> Hi,
> >>
> >> I was trying to install BioPython using fink.
> >>
> >> Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
> >> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> >>
> >> Used the command "fink install biopython-py24"
> >> Got the following error:
> >> Failed: no package found for specification 'biopython-py24'!
> >> Tried 23, 24 and 25 - it is not working.
> >>
> >> Any idea why it is not working ?
> >
> > Something to do with Fink? Also note we don't
> > support Python 2.3 anymore (and Python 2.4 is
> > on its last few releases as a supported version
> > for Biopython).
>
> If you really want to use fink, I think you'll have to
> contact the fink team. Specifically it looks like
> Koen van der Drift is kindly taking care of packaging
> Biopython on Fink:
>
> http://pdb.finkproject.org/pdb/package.php/biopython-py24
> http://pdb.finkproject.org/pdb/package.php/biopython-py25
> http://pdb.finkproject.org/pdb/package.php/biopython-py26
>
> Peter
>


From rmb32 at cornell.edu  Sat Apr  3 20:09:27 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Sat, 03 Apr 2010 13:09:27 -0700
Subject: [Biopython] Google Summer of Code is *ON* for OBF projects!
Message-ID: <4BB7A077.4070802@cornell.edu>

Hi all,

Reminder:  GSoC student proposals must be submitted to Google by April 
9th, 19:00 UTC.  That's less than a week away.

Students: you should ALREADY be working with mentors on the project 
mailing lists, they can help you get your proposal into shape.

So far, we have 5 proposals submitted to our org in Google's web app. 
Keep them coming, and let's see some really good ones!

Rob Buels
OBF GSoC 2010 Administrator


From rmb32 at cornell.edu  Sun Apr  4 04:37:38 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Sat, 03 Apr 2010 21:37:38 -0700
Subject: [Biopython] Reminder: GSoC student applications due April 9,
	19:00 UTC
Message-ID: <4BB81792.8060001@cornell.edu>

Hi all,

Sending this again with a different subject line, just in case.

GSoC student proposals must be submitted to Google through their web 
application by *April 9th, 19:00 UTC*.  That's less than a week away.

Students: you should ALREADY be working with mentors on the project
mailing lists, they can help you get your proposal into shape.

So far, we have 6 proposals submitted to our org in Google's web app.
Keep them coming, and keep them good!

Rob Buels
OBF GSoC 2010 Administrator


From ulfada at gmail.com  Mon Apr  5 01:46:14 2010
From: ulfada at gmail.com (Sofia Lemons)
Date: Sun, 4 Apr 2010 21:46:14 -0400
Subject: [Biopython] SoC project (BioPython and PyCogent)
Message-ID: <y2r838208c01004041846q78520c57m17135145edbbbc61@mail.gmail.com>

I'm working on an application for the Summer of Code project of
integrating BioPython and PyCogent. I've looked through the list
archives and saw Brad's general advice to other potential SoC
applicants, but I thought I'd introduce myself and see if there was
any advice specific to this project. I've used BioPython in the past
and even explored the code a bit. I'm considering working on one or
more of the bugs in Bugzilla if I can find time, and will work to
familiarize myself with PyCogent. Are there any other concepts,
projects, or people I should familiarize myself with (aside from
what's listed on the ideas page, of course)? As you can see from my
GitHub and Google Code accounts, I've got some experience with open
source projects, but please do suggest any specific tools or methods
you think I should try to get up to speed on, as well. Feel free to
contact me off-list.

Thanks,
Sofia


From stran104 at chapman.edu  Mon Apr  5 10:59:28 2010
From: stran104 at chapman.edu (Matthew Strand)
Date: Mon, 5 Apr 2010 03:59:28 -0700
Subject: [Biopython] GSoC Ortholog Module Proposal
Message-ID: <g2h2a63cc351004050359kbdeb99afva24ca1e2a0ac3871@mail.gmail.com>

Dear Biopython GSoC list,

I am a student at Chapman University and over the last 18 months I have been
using biopython to produce phylogenetic trees with ClustalW, T-Coffee, and
PHYLIP. I have found the most difficult part to be identifying ortholgos for
the particular species that our lab is interested in studying. The orthology
databases provide a great deal of matches but each database requires its own
wrapper and some databases are stronger than others with particular species.


So far I have written wrappers to get ortholog IDs from InParanoid and then
fetch the sequences from either NCBI or BioMart. This provides good results
for most common species but not all. To handle rare species I have
implemented the Reverse Smallest Distance orthology algorithm to run
protein-protein searches. It is available at http://ortholog.us. I also have
automated scripts to align protein families, concatenate aligned families,
and create trees.

For GSoC I would like to write a module to abstract finding orthologs as
much as possible. This would greatly simplify creating custom evolutionary
trees for biologists. The module could fetch orthologs from TreeFam,
InParanoid, Harvard's Roundup, and Princeton's BLASTO. The module could also
provide support for producing alignments, concatenating alignments, removing
sections of gaps, and constructing trees. Ortholog identification could be
done with no dependency other than an internet connection. Alignments and
trees would require the user to have the appropriate tools installed.

The overhead of writing this type of code makes it difficult for
evolutionary biologists and bio wet labs to get a picture of evolutionary
relationships in specific groups of species. This module would aim to
simplify creating custom phylogenetic trees.

A timeline of milestones might look something like this:
Week 1-2: Stable wrappers for InParanoid
Week 3-4: Stable wrappers for Roundup
Week 5-6: Stable wrappers for Treefam
Week 6-7: Stable wrappers for BlastO
Week 8-9: Ortholog module to abstract the database wrappers
Week 10-11: Alignment and tree tools

Is there any interest in having such a project? I'd be grateful to get some
feedback either on or off list.

Best,
-Matthew Strand


From chapmanb at 50mail.com  Mon Apr  5 11:50:00 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 5 Apr 2010 07:50:00 -0400
Subject: [Biopython] SoC project (BioPython and PyCogent)
In-Reply-To: <y2r838208c01004041846q78520c57m17135145edbbbc61@mail.gmail.com>
References: <y2r838208c01004041846q78520c57m17135145edbbbc61@mail.gmail.com>
Message-ID: <20100405115000.GB62718@sobchak.mgh.harvard.edu>

Sofia;

> I'm working on an application for the Summer of Code project of
> integrating BioPython and PyCogent. 

Great -- glad to you hear you are interested in the project.

> I've looked through the list
> archives and saw Brad's general advice to other potential SoC
> applicants, but I thought I'd introduce myself and see if there was
> any advice specific to this project.

The overall goal is to provide integration between Biopython and
PyCogent so programmers can benefit from the unique features and
algorithms in each library. This has two general themes:

- Ensuring interoperability between core objects like sequences,
  alignments and phylogenetic trees.
- Using this interoperability to develop analysis workflows that
  utilize functionality from both libraries.

Within this broad scope you are free to orient your proposal to
whatever set of biological questions that interest you. We've tried
to sketch out some ideas we had on the GSoC page as a starting
point.

> I've used BioPython in the past
> and even explored the code a bit. I'm considering working on one or
> more of the bugs in Bugzilla if I can find time, and will work to
> familiarize myself with PyCogent. Are there any other concepts,
> projects, or people I should familiarize myself with (aside from
> what's listed on the ideas page, of course)? 

Proposals are due this Friday, April 9th and normally require a few
rounds of back and forth revisions to get to a competitive level. My
suggestion would be to focus on learning enough of Biopython and
PyCogent to write out a detailed project plan, with a week by week
description of activities and specific goals.

> As you can see from my
> GitHub and Google Code accounts, I've got some experience with open
> source projects, but please do suggest any specific tools or methods
> you think I should try to get up to speed on, as well. 

The open source work is great; definitely include this in your
proposal. A good outline to start with is:

- Project summary -- A short abstract describing what you hope to
  accomplish during the summer, how you plan to go about it, and
  what motivates you to work on the project.

- Personal summary -- Describe your background and how it will help
  you be successful during GSoC. Here is where you can sell yourself
  to all of the mentors ranking the project: why are you a good
  coder? Why is this project useful to use? How will working on the
  summer project encourage you to stay active in the community?

- Project plan -- The detailed week by week description of plans
  mentioned above.

Hope this helps,
Brad


From chapmanb at 50mail.com  Mon Apr  5 12:05:54 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 5 Apr 2010 08:05:54 -0400
Subject: [Biopython] GSoC Ortholog Module Proposal
In-Reply-To: <g2h2a63cc351004050359kbdeb99afva24ca1e2a0ac3871@mail.gmail.com>
References: <g2h2a63cc351004050359kbdeb99afva24ca1e2a0ac3871@mail.gmail.com>
Message-ID: <20100405120554.GC62718@sobchak.mgh.harvard.edu>

Matthew;
Thanks for the introduction and pointers to your work. Your
http://ortholog.us interface looks like a useful resource; it's
really nice to see web interfaces being developed with programmable
JSON APIs. Out of curiousity, is the code available for what you've
done so far?

> For GSoC I would like to write a module to abstract finding orthologs as
> much as possible. This would greatly simplify creating custom evolutionary
> trees for biologists. The module could fetch orthologs from TreeFam,
> InParanoid, Harvard's Roundup, and Princeton's BLASTO. The module could also
> provide support for producing alignments, concatenating alignments, removing
> sections of gaps, and constructing trees. Ortholog identification could be
> done with no dependency other than an internet connection. Alignments and
> trees would require the user to have the appropriate tools installed.
[...]
> Is there any interest in having such a project? I'd be grateful to get some
> feedback either on or off list.

This is a good project idea and nicely spec'ed out. One additional
direction that might also be worth exploring is using BioMart to
retrieve orthologs from the Ensembl Compara work. Here's a recent
thread on BioStar with the queries to use:

http://biostar.stackexchange.com/questions/569/how-do-i-match-orthologues-in-one-species-to-another-genome-scale

I don't know of Python programming interfaces to BioMart, but there
is a nice R bioconductor library that can be leveraged with Rpy2:

http://www.bioconductor.org/packages/bioc/html/biomaRt.html
http://rpy.sourceforge.net/rpy2.html

For the practical GSoC things, project proposals are due this
Friday, April 9th so time is running short. I'm unfortunately a bit 
over-committed as this point to mentor but hopefully someone will 
be available to step in that role. I'm happy to make suggestions on
the proposal as it comes together.

Thanks,
Brad


From bjorn_johansson at bio.uminho.pt  Mon Apr  5 13:50:25 2010
From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=)
Date: Mon, 5 Apr 2010 14:50:25 +0100
Subject: [Biopython] pro
Message-ID: <x2gc3ee7c591004050650k11e156bcib299eb2317e4c071@mail.gmail.com>

Hi,
I have a problem that may be related to biopython (or not).
I have written a plugin for a cross platform program (Wikidpad) that relies
on some biopython modules.
I do the development on ubuntu 9.10 and have Wikidpad installed using wine
to be able to test the functionality on windows.

Under wine I have added the following code to make biopython installed under
linux  available to the python interpreter (py2exe) under wine:

if sys.platform == 'win32':
    sys.path.append("z:\usr\local\lib\python2.6\dist-packages")
    sys.path.append("z:\usr\lib/python2.6")

line 40 in "SeqTools.py" below reads: from Bio import SeqIO
I get the error below when importing the module under wikidpad running under
wine

  File "C:\Program Files\WikidPad\user_extensions\SeqTools.py", line 40, in
<module>
  File "z:/usr/local/lib/python2.6/dist-packages\Bio\SeqIO\__init__.py",
line 303, in <module>
  File "z:/usr/local/lib/python2.6/dist-packages\Bio\SeqIO\InsdcIO.py", line
29, in <module>
  File "z:/usr/local/lib/python2.6/dist-packages\Bio\GenBank\__init__.py",
line 53, in <module>
  File
"z:/usr/local/lib/python2.6/dist-packages\Bio\GenBank\LocationParser.py",
line 319, in <module>
  File
"z:/usr/local/lib/python2.6/dist-packages\Bio\GenBank\LocationParser.py",
line 177, in __init__
  File "z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py", line
88, in __init__
  File "z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py", line
129, in collectRules
  File "z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py", line
101, in addRule
AttributeError: 'NoneType' object has no attribute 'split'

I wonder if anyone has an immediate idea of what I am doing wrong?
The python interpreter under wine seem to find the biopython modules.
I cannot understand the error that I get afterwards.....

grateful for help!
/bjorn


From eric.talevich at gmail.com  Mon Apr  5 15:48:04 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 5 Apr 2010 11:48:04 -0400
Subject: [Biopython] pro
In-Reply-To: <x2gc3ee7c591004050650k11e156bcib299eb2317e4c071@mail.gmail.com>
References: <x2gc3ee7c591004050650k11e156bcib299eb2317e4c071@mail.gmail.com>
Message-ID: <p2v3f6baf361004050848j4789ca6ez7dc4b08efa0ba005@mail.gmail.com>

2010/4/5 Bj?rn Johansson <bjorn_johansson at bio.uminho.pt>

> Hi,
> I have a problem that may be related to biopython (or not).
> I have written a plugin for a cross platform program (Wikidpad) that relies
> on some biopython modules.
> I do the development on ubuntu 9.10 and have Wikidpad installed using wine
> to be able to test the functionality on windows.
>
> Under wine I have added the following code to make biopython installed
> under
> linux  available to the python interpreter (py2exe) under wine:
> [...]
>

It looks like spark relies on the docstrings in Bio.GenBank.LocationParser.
Is there anything in py2exe that would strip the docstrings from compiled
modules? Some optimizations do this -- I think "python -O3" strips
docstrings, for instance.

-Eric


From p.j.a.cock at googlemail.com  Mon Apr  5 16:16:43 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Apr 2010 17:16:43 +0100
Subject: [Biopython] pro
In-Reply-To: <p2v3f6baf361004050848j4789ca6ez7dc4b08efa0ba005@mail.gmail.com>
References: <x2gc3ee7c591004050650k11e156bcib299eb2317e4c071@mail.gmail.com>
	<p2v3f6baf361004050848j4789ca6ez7dc4b08efa0ba005@mail.gmail.com>
Message-ID: <q2r320fb6e01004050916l78e0f3d3g3e3881402ff05600@mail.gmail.com>

2010/4/5 Eric Talevich <eric.talevich at gmail.com>
>
> It looks like spark relies on the docstrings in Bio.GenBank.LocationParser.
> Is there anything in py2exe that would strip the docstrings from compiled
> modules? Some optimizations do this -- I think "python -O3" strips
> docstrings, for instance.

You may be on to something there Eric.

Bj?rn, could compare your file:
z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py
with the version we provide:
http://github.com/biopython/biopython/blob/master/Bio/Parsers/spark.py
or:
http://biopython.org/SRC/biopython/Bio/Parsers/spark.py

In the medium term, I'd like to move the GenBank/EMBL location
parsing to something simpler and faster (using regular expressions)
and then deprecate Bio.GenBank.LocationParser and indeed the
whole of Bio.parsers (which just has a copy of spark). There is
a bug open on this with some code. But that isn't going to help
Bj?rn right now.

Peter


From stran104 at chapman.edu  Mon Apr  5 19:02:21 2010
From: stran104 at chapman.edu (Matthew Strand)
Date: Mon, 5 Apr 2010 12:02:21 -0700
Subject: [Biopython] GSoC Ortholog Module Proposal
Message-ID: <p2s2a63cc351004051202o5270b105x79266ca2f75c1ccf@mail.gmail.com>

> Thanks for the introduction and pointers to your work. Your
> http://ortholog.us interface looks like a useful resource; it's
> really nice to see web interfaces being developed with programmable
> JSON APIs. Out of curiousity, is the code available for what you've
> done so far?
>

Thanks, we have found it useful for finding unindexed orthologs. Fetching
results from the pre-compiled databases is faster but of course requires
writing wrappers that are time consuming to develop. The plan is to release
all code as an open source Django app with a paper that is in the works.
However, I'd be happy to share any code with mentors/organizers for
evaluation purposes off-list in the meantime.


>
> This is a good project idea and nicely spec'ed out. One additional
> direction that might also be worth exploring is using BioMart to
> retrieve orthologs from the Ensembl Compara work. Here's a recent
> thread on BioStar with the queries to use:
>
>
> http://biostar.stackexchange.com/questions/569/how-do-i-match-orthologues-in-one-species-to-another-genome-scale
>
> I don't know of Python programming interfaces to BioMart, but there
> is a nice R bioconductor library that can be leveraged with Rpy2:
>

I agree, this would be a good addition. I have some messy Python wrappers to
BioMart but the Rpy route would probably provide a more reliable solution
with less effort.


> http://www.bioconductor.org/packages/bioc/html/biomaRt.html
> http://rpy.sourceforge.net/rpy2.html
>
> For the practical GSoC things, project proposals are due this
> Friday, April 9th so time is running short. I'm unfortunately a bit
> over-committed as this point to mentor but hopefully someone will
> be available to step in that role. I'm happy to make suggestions on
> the proposal as it comes together.
>

Thanks, I hope so too. I will post a full proposal in the near future.
Feedback would of course be greatly appreciated. I'm a little unclear, do I
need a mentor to submit a proposal? Is writing a proposal a mute point
without a mentor?

Best,
-Matt Strand


From vincent at vincentdavis.net  Mon Apr  5 19:51:46 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Mon, 5 Apr 2010 13:51:46 -0600
Subject: [Biopython] Build CDF file
Message-ID: <j2h77e831101004051251t9650792ja5eb4eccc1d7cc33@mail.gmail.com>

The custom array for which I have data does not have a CDF file. I have been
told that others have changed the header on the CEL files to reference
different CDF file. That only kinda makes sense to me. I obviously have CEL
files. I also have the sequences that each probe matches and finally I have
genome match data. By that I mean I know which probes are a perfect match
and which are a mismatch and the location of the mismatch. Can I build a CDF
file from this? How?
Does it make sense to build a CDF for each hybrid(not sure thats the right
word) of the organism if the genome is known for each.

Not sure if this is better ask here or the BioConductor, If there is a
python solution I would try that first, I think.

I think the bioconductor package altcdfenvs
LINK<http://bioconductor.org/packages/2.5/bioc/html/altcdfenvs.html>
does
this.
I guess I should email Laurent Gautier, maybe he reads this :)


  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


From biopython at maubp.freeserve.co.uk  Mon Apr  5 20:35:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 5 Apr 2010 21:35:20 +0100
Subject: [Biopython] Build CDF file
In-Reply-To: <j2h77e831101004051251t9650792ja5eb4eccc1d7cc33@mail.gmail.com>
References: <j2h77e831101004051251t9650792ja5eb4eccc1d7cc33@mail.gmail.com>
Message-ID: <q2v320fb6e01004051335v8e456933jd46cfd31bbb71ddf@mail.gmail.com>

On Mon, Apr 5, 2010 at 8:51 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> The custom array for which I have data does not have a CDF
> file...

Hi Vincent,

Did you mean to post this to the BioConductor mailing list?

Peter


From biopython at maubp.freeserve.co.uk  Mon Apr  5 20:53:42 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 5 Apr 2010 21:53:42 +0100
Subject: [Biopython] Build CDF file
In-Reply-To: <-3455855938884949614@unknownmsgid>
References: <j2h77e831101004051251t9650792ja5eb4eccc1d7cc33@mail.gmail.com>
	<q2v320fb6e01004051335v8e456933jd46cfd31bbb71ddf@mail.gmail.com>
	<-3455855938884949614@unknownmsgid>
Message-ID: <h2o320fb6e01004051353jb804b065m2448899c81b4d4d2@mail.gmail.com>

On Mon, Apr 5, 2010 at 9:46 PM, Vincent Davis <vincent at vincentdavis.com> wrote:
>
> No, but maybe I should. I was hopping for a python solution
>

Are these CDF files of yours NetCDF files?
http://en.wikipedia.org/wiki/NetCDF

If so, try Scientific.IO.NetCDF from Konrad Hinsen's ScientificPython
http://sourcesup.cru.fr/projects/scientific-py/

Peter


From chapmanb at 50mail.com  Tue Apr  6 12:26:27 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 6 Apr 2010 08:26:27 -0400
Subject: [Biopython] GSoC Ortholog Module Proposal
In-Reply-To: <p2s2a63cc351004051202o5270b105x79266ca2f75c1ccf@mail.gmail.com>
References: <p2s2a63cc351004051202o5270b105x79266ca2f75c1ccf@mail.gmail.com>
Message-ID: <20100406122627.GE66230@sobchak.mgh.harvard.edu>

Matthew;

> > Thanks for the introduction and pointers to your work. Your
> > http://ortholog.us interface looks like a useful resource; it's
> > really nice to see web interfaces being developed with programmable
> > JSON APIs. Out of curiousity, is the code available for what you've
> > done so far?
> 
> Thanks, we have found it useful for finding unindexed orthologs. Fetching
> results from the pre-compiled databases is faster but of course requires
> writing wrappers that are time consuming to develop. The plan is to release
> all code as an open source Django app with a paper that is in the works.
> However, I'd be happy to share any code with mentors/organizers for
> evaluation purposes off-list in the meantime.

Cool; definitely let us know on the mailing lists when the paper and
code are out. It would be fun to see.

> > For the practical GSoC things, project proposals are due this
> > Friday, April 9th so time is running short. I'm unfortunately a bit
> > over-committed as this point to mentor but hopefully someone will
> > be available to step in that role. I'm happy to make suggestions on
> > the proposal as it comes together.
> 
> Thanks, I hope so too. I will post a full proposal in the near future.
> Feedback would of course be greatly appreciated. I'm a little unclear, do I
> need a mentor to submit a proposal? Is writing a proposal a mute point
> without a mentor?

You will need a mentor and this is always the tough part of GSoC: there
are more good students and ideas than mentors and funded spots. I would
never discourage anyone from getting together a proposal; it is a good
exercise and helps you think through the work you are planning to do.
In terms of acceptance rates, it is lower when coming in later in the
process with your own ideas since mentors will have already settled on
a few ideas and begun feeling committed to students working on those.
However, nothing is locked down or decided until the deadline hits,
proposals are ranked by all of the mentors, and we see how many spots
we'll get from Google.

GSoC is kind of like interviewing job candidates without being sure how
many positions you'll have at the end. In summary, if you feel like the
proposal writing process would be interesting and useful to you, I'd
definitely encourage you to go for it and see where it takes you.

Brad


From bjorn_johansson at bio.uminho.pt  Wed Apr  7 09:33:39 2010
From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=)
Date: Wed, 7 Apr 2010 10:33:39 +0100
Subject: [Biopython] pro
In-Reply-To: <q2r320fb6e01004050916l78e0f3d3g3e3881402ff05600@mail.gmail.com>
References: <x2gc3ee7c591004050650k11e156bcib299eb2317e4c071@mail.gmail.com>
	<p2v3f6baf361004050848j4789ca6ez7dc4b08efa0ba005@mail.gmail.com>
	<q2r320fb6e01004050916l78e0f3d3g3e3881402ff05600@mail.gmail.com>
Message-ID: <p2yc3ee7c591004070233ob8752794o6c32dc942d333585@mail.gmail.com>

Hi,
thank you very much for the information, I think it has to do with the
docstrings, if I run with python -OO under linux, I get the same error msg.

as for the two spark files, they seem identical, spark.py is the one i
downloaded from
http://biopython.org/SRC/biopython/Bio/Parsers/spark.py:

diff -w spark.py /usr/local/lib/python2.6/dist-packages/Bio/Parsers/spark.py

produces no output at all.

I will try and find out if the optimization can be overridden for one file
only.

Thanks!
/bjorn


2010/4/5 Peter Cock <p.j.a.cock at googlemail.com>

> 2010/4/5 Eric Talevich <eric.talevich at gmail.com>
> >
> > It looks like spark relies on the docstrings in
> Bio.GenBank.LocationParser.
> > Is there anything in py2exe that would strip the docstrings from compiled
> > modules? Some optimizations do this -- I think "python -O3" strips
> > docstrings, for instance.
>
> You may be on to something there Eric.
>
> Bj?rn, could compare your file:
> z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py
> with the version we provide:
> http://github.com/biopython/biopython/blob/master/Bio/Parsers/spark.py
> or:
> http://biopython.org/SRC/biopython/Bio/Parsers/spark.py
>
> In the medium term, I'd like to move the GenBank/EMBL location
> parsing to something simpler and faster (using regular expressions)
> and then deprecate Bio.GenBank.LocationParser and indeed the
> whole of Bio.parsers (which just has a copy of spark). There is
> a bug open on this with some code. But that isn't going to help
> Bj?rn right now.
>
> Peter
>


-- 
______O_________oO________oO______o_______oO__
Bj?rn Johansson
Assistant Professor
Departament of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
http://www.bio.uminho.pt
http://sites.google.com/site/bjornhome
Work (direct) +351-253 601517
Private mob. +351-967 147 704
Dept of Biology (secretariate) +351-253 60 4310
Dept of Biology (fax) +351-253 678980


From p.j.a.cock at googlemail.com  Wed Apr  7 09:37:59 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 7 Apr 2010 10:37:59 +0100
Subject: [Biopython] pro
In-Reply-To: <p2yc3ee7c591004070233ob8752794o6c32dc942d333585@mail.gmail.com>
References: <x2gc3ee7c591004050650k11e156bcib299eb2317e4c071@mail.gmail.com>
	<p2v3f6baf361004050848j4789ca6ez7dc4b08efa0ba005@mail.gmail.com>
	<q2r320fb6e01004050916l78e0f3d3g3e3881402ff05600@mail.gmail.com>
	<p2yc3ee7c591004070233ob8752794o6c32dc942d333585@mail.gmail.com>
Message-ID: <s2h320fb6e01004070237v6cc2532dg70605411c7a4fc5d@mail.gmail.com>

010/4/7 Bj?rn Johansson <bjorn_johansson at bio.uminho.pt>:
> Hi,
> thank you very much for the information, I think it has to do with the
> docstrings, if I run with python -OO under linux, I get the same error msg.
>
> as for the two spark files, they seem identical, spark.py is the one i
> downloaded from
> http://biopython.org/SRC/biopython/Bio/Parsers/spark.py:
>
> diff -w spark.py /usr/local/lib/python2.6/dist-packages/Bio/Parsers/spark.py
>
> produces no output at all.

OK, thanks. I wanted to find out if py2exe was optimising the python
files by editing them to remove the docstrings. It seems not.

> I will try and find out if the optimization can be overridden for one file
> only.
>
> Thanks!
> /bjorn

Peter


From lunt at ctbp.ucsd.edu  Thu Apr  8 00:57:07 2010
From: lunt at ctbp.ucsd.edu (Bryan Lunt)
Date: Wed, 7 Apr 2010 17:57:07 -0700
Subject: [Biopython] StockholmIO replaces "." with "-", why?
Message-ID: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>

Greetings All!

It looks like line 364 of Bio.AlignIO.StockholmIO reads:

seqs[id] += seq.replace(".","-")

So when you load into memory alignments that mark gaps created to
allow alignment to inserts with ".", (such as PFam alignments or the
output of hmmer) that information is lost.

I know there must be a good reason for this, but I am finding it a
problem on my end..

-Bryan Lunt


From fuxin at umail.iu.edu  Thu Apr  8 01:40:02 2010
From: fuxin at umail.iu.edu (Fuxiao Xin)
Date: Wed, 7 Apr 2010 21:40:02 -0400
Subject: [Biopython] About Google Summer Code Project PDB-tidy
Message-ID: <o2yccb717081004071840hd84bad3dpbc7554978ee14b6c@mail.gmail.com>

Dear all,

I am a third year Phd student in Bioinformatics from Indiana University
Bloomington.  I am very in interested in the google summer code project of
biopython "PDB-Tidy: command-line tools for manipulating PDB files".

My own research needs extensive manipulation of PDB files, and I think  this
idea of adding more features to Bio.PDB and more command line options to
analyze/present PDB data is excellent. This project is of strong interest to
me since it will benefit my own research project as well.

Programming Skills: I use perl and python during my daily research. I am now
working on developing a new functional site predictor using protein
structure information. The code will be open source, but the work is under
review so the code is not released yet.

My project plan:

week1
1. Renumber residues starting from 1 (or N)
function name: renumberPDB, given a pdb file, rename the atom field
numbering of the file to remove missing amino acids
communicate with mentors to set standards of the code to follow for the rest
of the functions
create work log to keep track of process;

week2-3
2. Select a portion of the structure -- models, chains, etc. -- and write it
to a new file (PDB, FASTA, and other formats)
function name: rewritePDB, inputs will be a particular portion of a PDB file
you want to write out(support 'chain', 'model', 'atom'), a file format(PDB,
fasta), and the output name.
3. Perform some basic, well-established measures of model quality/validity
function name: PDBquality
the function will report RESOLUTION and ? of the structure
4. extract disorder region in PDB structure
function name: PDBdisorder
report missing residues in the structure atom field

week3-4
5. make a function to draw a Ramachandran plot
function name: ramaPLOT
combine the two steps(calcualting torsion angles and draw the plot) into one
function, give the option to draw the plot or not

week5
6. open PDB files in the window for visulization, visulize PDBsuperpose
results, output RMSD
function name: superposePDB
the function will look like the PDBsuperpose function in matlab; use
Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other
visulization tool to see the results
week6
7. write a function to extract all experimental conditions of a PDB file,
includes PH, temperature, and salt
function name: PDBconditon
it will be easy to get PH and temperature information, but for salt, it will
be hard to parse because there is no general rule of such information in the
PDB file; parse REMARK 200 field;

week7-8
8. extract PTM,
function name: PDBptm
difficult: the Post-translational modification annotation in PDB is not
consistant, need to make a list of PTMs to work on
parse MODRES field

week9-10
9. extract ligand binding information
function name: PDBligand
parse HETNAM field


Other obligations:  I am aware that google summer code starts from May 24th,
but I will have a review paper with my advisor due on June 1st, I hope it
will be OK for me to start after June 1st, and I will makeup the first week
in Auguest.

Best,
Fuxiao


From eric.talevich at gmail.com  Thu Apr  8 03:48:08 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 7 Apr 2010 23:48:08 -0400
Subject: [Biopython] About Google Summer Code Project PDB-tidy
In-Reply-To: <o2yccb717081004071840hd84bad3dpbc7554978ee14b6c@mail.gmail.com>
References: <o2yccb717081004071840hd84bad3dpbc7554978ee14b6c@mail.gmail.com>
Message-ID: <v2t3f6baf361004072048rc5e9a9e7kffb744d922733362@mail.gmail.com>

Hi Fuxiao,

Thanks for your interest in this project. I see you've been working on this
proposal for awhile already, so although the submission deadline is very
close, I think you'll still be OK. I've interleaved my comments with your
proposal below:

On Wed, Apr 7, 2010 at 9:40 PM, Fuxiao Xin <fuxin at umail.iu.edu> wrote:

> Dear all,
>
> I am a third year Phd student in Bioinformatics from Indiana University
> Bloomington.  I am very in interested in the google summer code project of
> biopython "PDB-Tidy: command-line tools for manipulating PDB files".
>
> My own research needs extensive manipulation of PDB files, and I think
>  this
> idea of adding more features to Bio.PDB and more command line options to
> analyze/present PDB data is excellent. This project is of strong interest
> to
> me since it will benefit my own research project as well.
>

Good to hear. Does your lab have a website? This project requires some
knowledge of structural biology, so it helps if we can see what specific
research you've already done in that area.

Programming Skills: I use perl and python during my daily research. I am now
> working on developing a new functional site predictor using protein
> structure information. The code will be open source, but the work is under
> review so the code is not released yet.
>

Is there any other programming work you've done in the past that you could
let us see? It doesn't have to be part of an existing open-source project;
even some functioning snippets posted somewhere would help us get a sense of
your coding style and abilities. Examples where you've used Biopython or
another established toolkit for working with PDB files or other scientific
data would be especially useful.

We also like to see that you're familiar with a project's build tools, which
in Biopython's case is GitHub and the standard Python mechanisms. So, if you
could upload some of your prior work to GitHub and send us the link, that
would be ideal.


My project plan:
>
> week1
> 1. Renumber residues starting from 1 (or N)
> function name: renumberPDB, given a pdb file, rename the atom field
> numbering of the file to remove missing amino acids
> communicate with mentors to set standards of the code to follow for the
> rest
> of the functions
> create work log to keep track of process;
>

Biopython's coding standards generally follow an earlier version of PEP 8;
hopefully you can pick it up quickly just by reading the source code for
Bio.PDB -- so you don't really need that item listed here.

In the past, students have maintained their weekly schedules on a wiki or
other public document, and updated them continually throughout the summer.
This functions as a work log, in a way. You would also have an e-mail record
of your work from your weekly reports to this list.

week2-3
> 2. Select a portion of the structure -- models, chains, etc. -- and write
> it
> to a new file (PDB, FASTA, and other formats)
> function name: rewritePDB, inputs will be a particular portion of a PDB
> file
> you want to write out(support 'chain', 'model', 'atom'), a file format(PDB,
> fasta), and the output name.
> 3. Perform some basic, well-established measures of model quality/validity
> function name: PDBquality
> the function will report RESOLUTION and ? of the structure
> 4. extract disorder region in PDB structure
> function name: PDBdisorder
> report missing residues in the structure atom field
>

These tasks seem reasonable. You don't need to commit to specific function
names yet; it would be more helpful to describe the overall module layout
you're planning, and list the dependencies for each (especially the
components of Bio.PDB that come into play).


> week3-4
> 5. make a function to draw a Ramachandran plot
> function name: ramaPLOT
> combine the two steps(calcualting torsion angles and draw the plot) into
> one
> function, give the option to draw the plot or not
>

This task has a number of dependencies which I think you should list and
describe here. Because of those dependencies there's a significant chance of
it taking longer than you planned -- so I'd recommend moving it to after the
midterm evaluations, wherever those fit into your schedule.

week5
> 6. open PDB files in the window for visulization, visulize PDBsuperpose
> results, output RMSD
> function name: superposePDB
> the function will look like the PDBsuperpose function in matlab; use
> Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other
> visulization tool to see the results
>

Would you build Python wrappers for interacting with the chosen
visualization tool, or just write a set of files and launch the viewer in a
script?


> week6
> 7. write a function to extract all experimental conditions of a PDB file,
> includes PH, temperature, and salt
> function name: PDBconditon
> it will be easy to get PH and temperature information, but for salt, it
> will
> be hard to parse because there is no general rule of such information in
> the
> PDB file; parse REMARK 200 field;
>

Sounds handy. Would your script write out a report combining all of this
info, or just extract requested elements?


> week7-8
> 8. extract PTM,
> function name: PDBptm
> difficult: the Post-translational modification annotation in PDB is not
> consistant, need to make a list of PTMs to work on
> parse MODRES field
>
> week9-10
> 9. extract ligand binding information
> function name: PDBligand
> parse HETNAM field
>

Good. Some of these later items sound straightforward enough that it would
be better to tackle them earlier in the summer.


> Other obligations:  I am aware that google summer code starts from May
> 24th,
> but I will have a review paper with my advisor due on June 1st, I hope it
> will be OK for me to start after June 1st, and I will makeup the first week
> in Auguest.
>

How much of the "community bonding period" will this occupy? The guideline
is that you get set up with the build system, read documentation and do
background research part-time between GSoC acceptance and May 24, and start
writing code full-time on May 24. You can make up for a gap in your project
plan by doing extra preparation before coding starts; would this be possible
for you?

Finally, the GSoC administration app (socghop.appspot.com) gets crowded as
the deadline approaches, so it's best if you register yourself there and
take care of the administrivia as soon as you can to avoid any trouble on
Friday.

Best regards,
Eric


From rozziite at gmail.com  Thu Apr  8 03:48:16 2010
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Wed, 7 Apr 2010 23:48:16 -0400
Subject: [Biopython] About Google Summer Code Project PDB-tidy
In-Reply-To: <o2yccb717081004071840hd84bad3dpbc7554978ee14b6c@mail.gmail.com>
References: <o2yccb717081004071840hd84bad3dpbc7554978ee14b6c@mail.gmail.com>
Message-ID: <q2l4057d3bf1004072048k9c3f348au586c61bf9d2902f0@mail.gmail.com>

Hi Fuxiao,

Good start on the application! Some comments below.

On Wed, Apr 7, 2010 at 9:40 PM, Fuxiao Xin <fuxin at umail.iu.edu> wrote:
> Dear all,
>
> I am a third year Phd student in Bioinformatics from Indiana University
> Bloomington. ?I am very in interested in the google summer code project of
> biopython "PDB-Tidy: command-line tools for manipulating PDB files".
>
> My own research needs extensive manipulation of PDB files, and I think ?this
> idea of adding more features to Bio.PDB and more command line options to
> analyze/present PDB data is excellent. This project is of strong interest to
> me since it will benefit my own research project as well.
>
> Programming Skills: I use perl and python during my daily research. I am now
> working on developing a new functional site predictor using protein
> structure information. The code will be open source, but the work is under
> review so the code is not released yet.
>
> My project plan:
>
> week1
> 1. Renumber residues starting from 1 (or N)
> function name: renumberPDB, given a pdb file, rename the atom field
> numbering of the file to remove missing amino acids
> communicate with mentors to set standards of the code to follow for the rest
> of the functions
> create work log to keep track of process;
>
> week2-3
> 2. Select a portion of the structure -- models, chains, etc. -- and write it
> to a new file (PDB, FASTA, and other formats)
> function name: rewritePDB, inputs will be a particular portion of a PDB file
> you want to write out(support 'chain', 'model', 'atom'), a file format(PDB,
> fasta), and the output name.
> 3. Perform some basic, well-established measures of model quality/validity
> function name: PDBquality
> the function will report RESOLUTION and ? of the structure

Maybe you can get some inspiration of measures of model
quality/validity from PDBREPORT database [0] and WHAT_IF [1] software.

[0] http://swift.cmbi.ru.nl/gv/pdbreport/
[1] http://swift.cmbi.ru.nl/whatif/


> 4. extract disorder region in PDB structure
> function name: PDBdisorder
> report missing residues in the structure atom field
>
> week3-4
> 5. make a function to draw a Ramachandran plot
> function name: ramaPLOT
> combine the two steps(calcualting torsion angles and draw the plot) into one
> function, give the option to draw the plot or not
>
> week5
> 6. open PDB files in the window for visulization, visulize PDBsuperpose
> results, output RMSD
> function name: superposePDB
> the function will look like the PDBsuperpose function in matlab; use
> Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other
> visulization tool to see the results
> week6
> 7. write a function to extract all experimental conditions of a PDB file,
> includes PH, temperature, and salt
> function name: PDBconditon
> it will be easy to get PH and temperature information, but for salt, it will
> be hard to parse because there is no general rule of such information in the
> PDB file; parse REMARK 200 field;
>
> week7-8
> 8. extract PTM,
> function name: PDBptm
> difficult: the Post-translational modification annotation in PDB is not
> consistant, need to make a list of PTMs to work on
> parse MODRES field
>
> week9-10
> 9. extract ligand binding information
> function name: PDBligand
> parse HETNAM field
>
>
> Other obligations: ?I am aware that google summer code starts from May 24th,
> but I will have a review paper with my advisor due on June 1st, I hope it
> will be OK for me to start after June 1st, and I will makeup the first week
> in Auguest.
>
> Best,
> Fuxiao
> _______________________________________________
> Biopython mailing list ?- ?Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From fuxin at indiana.edu  Thu Apr  8 07:40:36 2010
From: fuxin at indiana.edu (Fuxiao Xin)
Date: Thu, 8 Apr 2010 03:40:36 -0400
Subject: [Biopython] About Google Summer Code Project PDB-tidy
In-Reply-To: <v2t3f6baf361004072048rc5e9a9e7kffb744d922733362@mail.gmail.com>
References: <o2yccb717081004071840hd84bad3dpbc7554978ee14b6c@mail.gmail.com>
	<v2t3f6baf361004072048rc5e9a9e7kffb744d922733362@mail.gmail.com>
Message-ID: <m2xccb717081004080040u3dfac46by28bcb421d734b504@mail.gmail.com>

hi Eric and Diana,

Thanks for your quick reply.

For the quality/validation problem, thanks Diana for pointing me to the two
resources,  I am surprised that there are so many "problems" defined for PDB
files, and obviously  I underestimate this task, and I think it's a very
interesting problem to study and  I'd like to devote more time on this task,
 I am thinking to make this task the main focus of my first period
coding(before midterm check).  What do you think?

For Eric's responses, please find my reply in line.

My own research needs extensive manipulation of PDB files, and I think  this
>> idea of adding more features to Bio.PDB and more command line options to
>> analyze/present PDB data is excellent. This project is of strong interest
>> to
>> me since it will benefit my own research project as well.
>>
>
> Good to hear. Does your lab have a website? This project requires some
> knowledge of structural biology, so it helps if we can see what specific
> research you've already done in that area.
>

Our lab's website is : http://www.informatics.indiana.edu/predrag/ , and one
main focus of our lab is PTM and disorder, both need to deal with PDB files.
A poster title shows my protein structure-based kernel work:*
http://www.iscb.org/rocky09-program/rocky09-poster-presenters-abstracts,
they didn't put the abstract online. I could send you the abstract if you
are interested.  *


> Programming Skills: I use perl and python during my daily research. I am
>> now
>> working on developing a new functional site predictor using protein
>> structure information. The code will be open source, but the work is under
>> review so the code is not released yet.
>>
>
> Is there any other programming work you've done in the past that you could
> let us see? It doesn't have to be part of an existing open-source project;
> even some functioning snippets posted somewhere would help us get a sense of
> your coding style and abilities. Examples where you've used Biopython or
> another established toolkit for working with PDB files or other scientific
> data would be especially useful.
>
We also like to see that you're familiar with a project's build tools, which
> in Biopython's case is GitHub and the standard Python mechanisms. So, if you
> could upload some of your prior work to GitHub and send us the link, that
> would be ideal.
>

I put some of my python code here:
http://github.com/fuxiaoxin/my_python_code. I don't have code in python
using Bio.PDB. For parsing PDB, my code are in perl for the sake of its
regular expression, I seldomly use bioperl or biopython in the past, I write
all my own code, that's also why I think I am very clear of all kinds of
problems in PDB files. I am quite surprised to find Bio.PDB already have so
many modules for various functions. I could upload some of my perl functions
if you would like to have a look: I have functions similar to PDBparser,
NeighborSearch, DSSP, NACCESS.

I have to say I am not very familiar with the build tools of python. But I
hope to learn it during the bonding period. I just guided myself through to
upload my codes to Github, :)

My project plan:
>>
>> week1
>> 1. Renumber residues starting from 1 (or N)
>> function name: renumberPDB, given a pdb file, rename the atom field
>> numbering of the file to remove missing amino acids
>> communicate with mentors to set standards of the code to follow for the
>> rest
>> of the functions
>> create work log to keep track of process;
>>
>
> Biopython's coding standards generally follow an earlier version of PEP 8;
> hopefully you can pick it up quickly just by reading the source code for
> Bio.PDB -- so you don't really need that item listed here.
>
>
I will learn from Bio.PDB source code and remove this one.


> In the past, students have maintained their weekly schedules on a wiki or
> other public document, and updated them continually throughout the summer.
> This functions as a work log, in a way. You would also have an e-mail record
> of your work from your weekly reports to this list.
>

That's great to know.


> week2-3
>> 2. Select a portion of the structure -- models, chains, etc. -- and write
>> it
>> to a new file (PDB, FASTA, and other formats)
>> function name: rewritePDB, inputs will be a particular portion of a PDB
>> file
>> you want to write out(support 'chain', 'model', 'atom'), a file
>> format(PDB,
>> fasta), and the output name.
>> 3. Perform some basic, well-established measures of model quality/validity
>> function name: PDBquality
>> the function will report RESOLUTION and ? of the structure
>> 4. extract disorder region in PDB structure
>> function name: PDBdisorder
>> report missing residues in the structure atom field
>>
>
> These tasks seem reasonable. You don't need to commit to specific function
> names yet; it would be more helpful to describe the overall module layout
> you're planning, and list the dependencies for each (especially the
> components of Bio.PDB that come into play).
>

I will make a new  proposal with these details by tomorrow.


>
>> week3-4
>> 5. make a function to draw a Ramachandran plot
>> function name: ramaPLOT
>> combine the two steps(calcualting torsion angles and draw the plot) into
>> one
>> function, give the option to draw the plot or not
>>
>
> This task has a number of dependencies which I think you should list and
> describe here. Because of those dependencies there's a significant chance of
> it taking longer than you planned -- so I'd recommend moving it to after the
> midterm evaluations, wherever those fit into your schedule.
>

 I will add more details here.


> week5
>> 6. open PDB files in the window for visulization, visulize PDBsuperpose
>> results, output RMSD
>> function name: superposePDB
>> the function will look like the PDBsuperpose function in matlab; use
>> Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other
>> visualization tool to see the results
>>
>
> Would you build Python wrappers for interacting with the chosen
> visualization tool, or just write a set of files and launch the viewer in a
> script?
>

I am thinking of launching the script, since those PDB visualization tools
already have very nice command line options and interfaces.  But I think it
is really important to be able to visualize the structure on the
fly, especially when you are doing PDB superimpose.


>  week6
>> 7. write a function to extract all experimental conditions of a PDB file,
>> includes PH, temperature, and salt
>> function name: PDBconditon
>> it will be easy to get PH and temperature information, but for salt, it
>> will
>> be hard to parse because there is no general rule of such information in
>> the
>> PDB file; parse REMARK 200 field;
>>
>
> Sounds handy. Would your script write out a report combining all of this
> info, or just extract requested elements?
>

I am thinking to put the results into a variable instead of a report, since
it will be great for batch processing, and display the results immediately
in interactive mode.

>
> Other obligations:  I am aware that google summer code starts from May
>> 24th,
>> but I will have a review paper with my advisor due on June 1st, I hope it
>> will be OK for me to start after June 1st, and I will makeup the first
>> week
>> in Auguest.
>>
>
> How much of the "community bonding period" will this occupy? The guideline
> is that you get set up with the build system, read documentation and do
> background research part-time between GSoC acceptance and May 24, and start
> writing code full-time on May 24. You can make up for a gap in your project
> plan by doing extra preparation before coding starts; would this be possible
> for you?
>

I think the bonding period will be really important for me to get known
about the python build tools, and of course other stuff you mentors suggest
me to learn,  so I will devote my time for "bonding".  But since I will get
busy near the end of May, I plan to start early and do things more
efficiently.


>
> Finally, the GSoC administration app (socghop.appspot.com) gets crowded as
> the deadline approaches, so it's best if you register yourself there and
> take care of the administrivia as soon as you can to avoid any trouble on
> Friday.
>

Thanks for the reminding. I will incorporate you and Diana's suggestions to
make a new version of proposal, by tomorrow night.  But the idea is,  the
main project for the first period would be the quality/validation task , and
the second period will be the Ramachandran plot.  And I will fill in the
time with other small functions.


Thanks,
Fuxiao


From biopython at maubp.freeserve.co.uk  Thu Apr  8 08:04:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 Apr 2010 09:04:27 +0100
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
Message-ID: <g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>

On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
> Greetings All!
>
> It looks like line 364 of Bio.AlignIO.StockholmIO reads:
>
> seqs[id] += seq.replace(".","-")
>
> So when you load into memory alignments that mark gaps created to
> allow alignment to inserts with ".", (such as PFam alignments or the
> output of hmmer) that information is lost.
>
> I know there must be a good reason for this, but I am finding it a
> problem on my end..
>
> -Bryan Lunt

Hi Bryan,

Yes, is it done deliberately. The dot is a problem - it has a quite
specific meaning of "same as above" on other alignment file
formats, while "-" is an almost universal shorthand for gap/insertion.
Consider the use case of Stockholm to PHYLIP/FASTA/Clustal
conversion.

Have you got a sample output file we can use as a unit test or
at least discuss? As I recall, on the PFAM alignments I looked
at there was no data loss by doing the dot to dash mapping.

Peter


From sma.hmc at gmail.com  Thu Apr  8 09:41:26 2010
From: sma.hmc at gmail.com (Singer Ma)
Date: Thu, 8 Apr 2010 02:41:26 -0700
Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability
Message-ID: <n2g62ed8c081004080241x380860bcibdd4acc5b8fe3acb@mail.gmail.com>

I am a junior Computer Science major with heavy bioinformatic leanings
at Harvey Mudd College. I know that it is very late for new summer of
code applications, but I was wondering if you could have a look at my
proposed schedule to give me some pointers and answer a few questions.
I am also considering applying for the project involving adding more
ways to use R through python, but I was unsure of which project had
more users who wanted it completed.

Questions:
What does it mean by BioPython's acquired sequences? I can't seem to
find out what or where information about "acquired sequences" is.
Thus, I do not discuss anything about it in my current proposal.

For the creation of workflows, do there already exist use and test
cases for this or would I be best off looking for ones in papers and
trying to mimic them? Right now, I have an example paper where the
interoperability would have been helpful.

Any other use cases I should immediately consider in my proposal?

My current proposed schedule:

For Bio Python and PyCogent interoperability.
Week 1: Familiarization with the code and soliciting requests. While
what seems intuitive to me might not seem so to others. It would be
best to spend this time to determine a group of people who would
highly benefit from the interoperability and ask them for what they
would look for. For example, would they rather use one, save the data,
and use the other. Would they want to use them directly. Basically, I
want to get a good idea of how this code will be used before making my
own decisions on how I think people will use it. Also important here
is to create sets of data which can be used later on the process.

Week 2 and 3: Code converting PyCogent and BioPython. The core objects
in each package seem like they should not be too difficult to convert.
This step will involve looking into the documentation and coding for
PyCogent and BioPython, to determine what the core objects contain for
each. One possible problem here is if either PyCogent or BioPython
core objects use heavy subclassing, as determining subclassing in
Python has been a nightmare in the past. Testing at this point will
likely involve going through the entire round trip conversion, and
seeing if everything looks the same.

Week 4: Ensure that conversions allow the use of data from one program
to the other. The workflows of codon usage to clustering code can be
tested. One possible test set is from Sharp et. al. 1986. Here they
found different codon usage for different genes. Additionally, it
should be considered how codon usage can be used to help with making
biologically accurate clusters.

Week 5: Familiarize with phyloXML and make interoperable with
PyCogent. phyloXML has already been added with BioPython. Making
phyloXML work with PyCogent could be based on how it was adapted for
BioPython. Clear risks here include problems with making sure that the
API for phyloXML in PyCogent gives an intuitive interface to use
phyloXML.

Week 6 and 7: Adapt PyCogent to query genomics databases. Currently
there is at least some support for PyCogent to query ENSEMBL. It seems
like it would be useful to query other genomics databases such as
Entrez of NCBI. Unfortunately, it seems like NCBI only has PERL
queries into their MySQL database. Ideally, if everything previously
has been alright, the conversion of PyCogent to BioPython forms shoudl
already be accounted for.

Week 8-12: Slip days and additional features. The initial set of use
cases will surely expand and this is extra time to allow for those use
cases to be accounted for.

Thanks,
Singer Ma


From biopython at maubp.freeserve.co.uk  Thu Apr  8 10:04:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 Apr 2010 11:04:10 +0100
Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability
In-Reply-To: <n2g62ed8c081004080241x380860bcibdd4acc5b8fe3acb@mail.gmail.com>
References: <n2g62ed8c081004080241x380860bcibdd4acc5b8fe3acb@mail.gmail.com>
Message-ID: <l2l320fb6e01004080304p44ad227arbd2fa9ab9cc11f4b@mail.gmail.com>

On Thu, Apr 8, 2010 at 10:41 AM, Singer Ma <sma.hmc at gmail.com> wrote:
> I am a junior Computer Science major with heavy bioinformatic leanings
> at Harvey Mudd College. I know that it is very late for new summer of
> code applications, but I was wondering if you could have a look at my
> proposed schedule to give me some pointers and answer a few questions.
> I am also considering applying for the project involving adding more
> ways to use R through python, but I was unsure of which project had
> more users who wanted it completed.
>
> Questions:
> What does it mean by BioPython's acquired sequences? I can't seem to
> find out what or where information about "acquired sequences" is.
> Thus, I do not discuss anything about it in my current proposal.

http://www.biopython.org/wiki/Google_Summer_of_Code#Biopython_and_PyCogent_interoperability

You mean "Connecting Biopython acquired sequences to PyCogent's
alignment, phylogenetic tree preparation and tree visualization code."?

I think Brad means using Biopython to load (parse) sequence data (e.g.
with Bio.AlignIO), and then give this to PyCogent. i.e. Acquire data in
the sense of get/load data.

> Week 6 and 7: Adapt PyCogent to query genomics databases. Currently
> there is at least some support for PyCogent to query ENSEMBL. It seems
> like it would be useful to query other genomics databases such as
> Entrez of NCBI. Unfortunately, it seems like NCBI only has PERL
> queries into their MySQL database. ...

Are you are talking about the NCBI Entrez Utitlites (E-Utils)? Those are
language neutral and we have Bio.Entrez to support them in Biopython.

Peter


From biopython at maubp.freeserve.co.uk  Thu Apr  8 10:26:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 Apr 2010 11:26:10 +0100
Subject: [Biopython] Biopython 1.54b test failures
In-Reply-To: <u2p320fb6e01004021622i9a350a0eqfaf9663b5d6e9e62@mail.gmail.com>
References: <4BB67835.7030303@uci.edu>
	<u2p320fb6e01004021622i9a350a0eqfaf9663b5d6e9e62@mail.gmail.com>
Message-ID: <u2x320fb6e01004080326y796fba45lcdc476d9acc9c5c3@mail.gmail.com>

On Sat, Apr 3, 2010 at 12:22 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> I recall trying the universal read lines thing before without
> success in the SCOP tests - maybe it was this line 72 thing
> that I missed. I'll take another look at this next week (when
> I have access to a Windows machine).
>

You are right - that does make the two SCOP tests pass on
Windows without having to first convert the SCOP example
files from Unix to DOS/Windows newlines. Checked in.
Would you like to be credited for this in the NEWS and
CONTRIB files?

Thanks,

Peter


From sma.hmc at gmail.com  Thu Apr  8 10:31:10 2010
From: sma.hmc at gmail.com (Singer Ma)
Date: Thu, 8 Apr 2010 03:31:10 -0700
Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability
In-Reply-To: <l2l320fb6e01004080304p44ad227arbd2fa9ab9cc11f4b@mail.gmail.com>
References: <n2g62ed8c081004080241x380860bcibdd4acc5b8fe3acb@mail.gmail.com>
	<l2l320fb6e01004080304p44ad227arbd2fa9ab9cc11f4b@mail.gmail.com>
Message-ID: <w2m62ed8c081004080331l8522b8dm994ccaac9c57761a@mail.gmail.com>

> You mean "Connecting Biopython acquired sequences to PyCogent's
> alignment, phylogenetic tree preparation and tree visualization code."?
>
> I think Brad means using Biopython to load (parse) sequence data (e.g.
> with Bio.AlignIO), and then give this to PyCogent. i.e. Acquire data in
> the sense of get/load data.

Ah, so, its just the most straightforward use of the conversion tools
that would be made. Sorry, I thought I was missing something here.
Shouldn't be this be taken care of in the first use case of "Allow
round-trip conversion between biopython and pycogent core objects
(sequence, alignment, tree, etc.)."? Or does this require me to
determine how the interactions will be made?

>
> Are you are talking about the NCBI Entrez Utitlites (E-Utils)? Those are
> language neutral and we have Bio.Entrez to support them in Biopython.

Ah, I misread my information, so NCBI Entrez can already be queried.
What exactly do we need to get from ENSEMBL that isn't already
supported then?

Singer


From chapmanb at 50mail.com  Thu Apr  8 12:39:53 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 8 Apr 2010 08:39:53 -0400
Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability
In-Reply-To: <n2g62ed8c081004080241x380860bcibdd4acc5b8fe3acb@mail.gmail.com>
References: <n2g62ed8c081004080241x380860bcibdd4acc5b8fe3acb@mail.gmail.com>
Message-ID: <20100408123953.GG911@sobchak.mgh.harvard.edu>

Singer;
Thanks for the introduction and initial project plan. Glad that you
are interested. I'll try to tackle a few of the specific points
Peter has not already talked about, and suggest some specifics for
the application.

> Questions:
> What does it mean by BioPython's acquired sequences? I can't seem to
> find out what or where information about "acquired sequences" is.
> Thus, I do not discuss anything about it in my current proposal.

Following up on what Peter mentioned, what we're trying to say there
is to use the results from step 1 (interoperability) to create
unique workflows that use both Biopython and PyCogent. This is a
suggested workflow to utilize some of the strengths of both
packages.

> For the creation of workflows, do there already exist use and test
> cases for this or would I be best off looking for ones in papers and
> trying to mimic them? Right now, I have an example paper where the
> interoperability would have been helpful.

Yes, that is exactly the right approach. The ideas we've suggested
are just brainstorming; please select workflows that are interesting
to you.

> My current proposed schedule:
> 
> For Bio Python and PyCogent interoperability.
> Week 1: Familiarization with the code and soliciting requests. While
> what seems intuitive to me might not seem so to others. It would be
> best to spend this time to determine a group of people who would
> highly benefit from the interoperability and ask them for what they
> would look for. For example, would they rather use one, save the data,
> and use the other. Would they want to use them directly. Basically, I
> want to get a good idea of how this code will be used before making my
> own decisions on how I think people will use it. Also important here
> is to create sets of data which can be used later on the process.

All of this type of non-coding work should be done in the community
bonding period, from April 26th to the start of coding. When week 1
hits, you want to be ready to code. See the timeline for more
specific information on dates:

http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/timeline

> Week 5: Familiarize with phyloXML and make interoperable with
> PyCogent. phyloXML has already been added with BioPython. Making
> phyloXML work with PyCogent could be based on how it was adapted for
> BioPython. Clear risks here include problems with making sure that the
> API for phyloXML in PyCogent gives an intuitive interface to use
> phyloXML.

Again, all of the non-coding activities should be moved to before
the actual coding period. In your timeline you want to focus on code
deliverables for each week. Of course there will be learning and
reading during the program, but you want to be sure to have a code
centric focus.

> Week 6 and 7: Adapt PyCogent to query genomics databases. Currently
> there is at least some support for PyCogent to query ENSEMBL. It seems
> like it would be useful to query other genomics databases such as
> Entrez of NCBI. Unfortunately, it seems like NCBI only has PERL
> queries into their MySQL database. Ideally, if everything previously
> has been alright, the conversion of PyCogent to BioPython forms shoudl
> already be accounted for.

Following up on your discussion with Peter, you should think about
some workflows that use Biopython Entrez queries and PyCogent
Ensembl queries to answer interesting questions that could not be
done with either. This should help to focus your ideas on integration 
and workflows, as opposed to implementing new functionality.

> Week 8-12: Slip days and additional features. The initial set of use
> cases will surely expand and this is extra time to allow for those use
> cases to be accounted for.

You need to continue your detailed project plan for the entire
period. See the examples in the NESCent application documentation to 
get an idea of the level of detail in accepted projects from previous years:

https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#When_you_apply
http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&single=true&gid=0&output=html

Practically, applications are due tomorrow, so you should have a
submission sent in to OpenBio through the GSoC interface
(http://socghop.appspot.com).

Hope this helps,
Brad


From vincent at vincentdavis.net  Thu Apr  8 18:33:41 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Thu, 8 Apr 2010 12:33:41 -0600
Subject: [Biopython] affy CEL and CDF reader
Message-ID: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>

I ended up writing my own modules for reading both affy Cel and CDF files.
Long story as to why I did not just use what was available in biopython.
I plan on making what I have done available to the biopython and will upload
it as a fork. I will outline what ways what I have is different below.
My question is: Are there any improvements(features) others would like to
see beyond what is avalible in the current CelFile.py?
I saw some posts a month or so ago about checking for consistency in cell
file, I think it was something about making sure the stated number of probes
was consistent with the intensity measurements.

What is different,
when an file is read Affycel.read('file') many atributes are set. for
example
a = affcel()
a.read('testfile')
a.filename,
a.version,
a.header.items()  # a dictionary of all header items
a.num_intensity
a.intensity
a.num_masks
a.masks
a.num_outliers
a.outliers
a.numb_modified
a.modified

I plan to add the ability return/call intensity values with our with
outliers or mask values.
All data is currently store in numpy structured arrays,
currently a.intensity returns the structured array, but I plan on making it
an option to easily choose how this is returned.
also what to make an optional normalized intensity array so that if the data
is normalized it can be stored with the affycel instance. My use case was
that I was opening about 80 cel files and reading them in was slow. this
allowed me to read each file as an instance of affycel stored in a list that
I then pickled. It was then much faster to open them.

Are improvements to the CelFile.py are of value to biopython?

I hope to have the code pushed up to my fork on github late tonight. Just
thought I would ask if there was any suggestion before I did.

Also have an CDF file reader, but only have done some basic testing. I don't
have a lot of use for this, do other biopython users?

I am kinda working in a vacuum and am trying to get more involved in
projects to improve my skills and knowledge. Any suggestions would be
appreciated.

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


From sdavis2 at mail.nih.gov  Thu Apr  8 18:56:12 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 8 Apr 2010 14:56:12 -0400
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
Message-ID: <j2h264855a01004081156ge3d1a9d3nd8e1e7aa156c5a5a@mail.gmail.com>

On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> I ended up writing my own modules for reading both affy Cel and CDF files.
> Long story as to why I did not just use what was available in biopython.
> I plan on making what I have done available to the biopython and will upload
> it as a fork. I will outline what ways what I have is different below.
> My question is: Are there any improvements(features) others would like to
> see beyond what is avalible in the current CelFile.py?
> I saw some posts a month or so ago about checking for consistency in cell
> file, I think it was something about making sure the stated number of probes
> was consistent with the intensity measurements.
>
> What is different,
> when an file is read Affycel.read('file') many atributes are set. for
> example
> a = affcel()
> a.read('testfile')
> a.filename,
> a.version,
> a.header.items() ?# a dictionary of all header items
> a.num_intensity
> a.intensity
> a.num_masks
> a.masks
> a.num_outliers
> a.outliers
> a.numb_modified
> a.modified
>
> I plan to add the ability return/call intensity values with our with
> outliers or mask values.
> All data is currently store in numpy structured arrays,
> currently a.intensity returns the structured array, but I plan on making it
> an option to easily choose how this is returned.
> also what to make an optional normalized intensity array so that if the data
> is normalized it can be stored with the affycel instance. My use case was
> that I was opening about 80 cel files and reading them in was slow. this
> allowed me to read each file as an instance of affycel stored in a list that
> I then pickled. It was then much faster to open them.
>
> Are improvements to the CelFile.py are of value to biopython?
>
> I hope to have the code pushed up to my fork on github late tonight. Just
> thought I would ask if there was any suggestion before I did.
>
> Also have an CDF file reader, but only have done some basic testing. I don't
> have a lot of use for this, do other biopython users?
>
> I am kinda working in a vacuum and am trying to get more involved in
> projects to improve my skills and knowledge. Any suggestions would be
> appreciated.

Just out of curiosity, is your work based on the affy sdk, or are you
parsing stuff yourself?

Sean


From vincent at vincentdavis.net  Thu Apr  8 19:03:38 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Thu, 8 Apr 2010 13:03:38 -0600
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <j2h264855a01004081156ge3d1a9d3nd8e1e7aa156c5a5a@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
	<j2h264855a01004081156ge3d1a9d3nd8e1e7aa156c5a5a@mail.gmail.com>
Message-ID: <q2r77e831101004081203i932807f2n513bf1de5708725e@mail.gmail.com>

Parsing it myself, But based directly an the affy documentation found here.
http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:

> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis <vincent at vincentdavis.net>
> wrote:
> > I ended up writing my own modules for reading both affy Cel and CDF
> files.
> > Long story as to why I did not just use what was available in biopython.
> > I plan on making what I have done available to the biopython and will
> upload
> > it as a fork. I will outline what ways what I have is different below.
> > My question is: Are there any improvements(features) others would like to
> > see beyond what is avalible in the current CelFile.py?
> > I saw some posts a month or so ago about checking for consistency in cell
> > file, I think it was something about making sure the stated number of
> probes
> > was consistent with the intensity measurements.
> >
> > What is different,
> > when an file is read Affycel.read('file') many atributes are set. for
> > example
> > a = affcel()
> > a.read('testfile')
> > a.filename,
> > a.version,
> > a.header.items()  # a dictionary of all header items
> > a.num_intensity
> > a.intensity
> > a.num_masks
> > a.masks
> > a.num_outliers
> > a.outliers
> > a.numb_modified
> > a.modified
> >
> > I plan to add the ability return/call intensity values with our with
> > outliers or mask values.
> > All data is currently store in numpy structured arrays,
> > currently a.intensity returns the structured array, but I plan on making
> it
> > an option to easily choose how this is returned.
> > also what to make an optional normalized intensity array so that if the
> data
> > is normalized it can be stored with the affycel instance. My use case was
> > that I was opening about 80 cel files and reading them in was slow. this
> > allowed me to read each file as an instance of affycel stored in a list
> that
> > I then pickled. It was then much faster to open them.
> >
> > Are improvements to the CelFile.py are of value to biopython?
> >
> > I hope to have the code pushed up to my fork on github late tonight. Just
> > thought I would ask if there was any suggestion before I did.
> >
> > Also have an CDF file reader, but only have done some basic testing. I
> don't
> > have a lot of use for this, do other biopython users?
> >
> > I am kinda working in a vacuum and am trying to get more involved in
> > projects to improve my skills and knowledge. Any suggestions would be
> > appreciated.
>
> Just out of curiosity, is your work based on the affy sdk, or are you
> parsing stuff yourself?
>
> Sean
>


From sdavis2 at mail.nih.gov  Thu Apr  8 19:40:01 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 8 Apr 2010 15:40:01 -0400
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <q2r77e831101004081203i932807f2n513bf1de5708725e@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
	<j2h264855a01004081156ge3d1a9d3nd8e1e7aa156c5a5a@mail.gmail.com>
	<q2r77e831101004081203i932807f2n513bf1de5708725e@mail.gmail.com>
Message-ID: <i2t264855a01004081240lf48e1c42md7e67849260a51cd@mail.gmail.com>

On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> Parsing it myself, But based directly an the affy documentation found here.
> http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/

So, are you covering both binary and text formats for .CEL files?  I
think that modern .CEL files (those produced by GCOS) are binary and
represent the majority of .CEL files produced today.  Some of the I/O
issues that you discuss are almost definitely dealt with by using the
binary .CEL files.

I'm certainly not an expert on Affy, so take all these
questions/comments with a grain of salt.

Sean


> On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
>> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis <vincent at vincentdavis.net>
>> wrote:
>> > I ended up writing my own modules for reading both affy Cel and CDF
>> files.
>> > Long story as to why I did not just use what was available in biopython.
>> > I plan on making what I have done available to the biopython and will
>> upload
>> > it as a fork. I will outline what ways what I have is different below.
>> > My question is: Are there any improvements(features) others would like to
>> > see beyond what is avalible in the current CelFile.py?
>> > I saw some posts a month or so ago about checking for consistency in cell
>> > file, I think it was something about making sure the stated number of
>> probes
>> > was consistent with the intensity measurements.
>> >
>> > What is different,
>> > when an file is read Affycel.read('file') many atributes are set. for
>> > example
>> > a = affcel()
>> > a.read('testfile')
>> > a.filename,
>> > a.version,
>> > a.header.items() ?# a dictionary of all header items
>> > a.num_intensity
>> > a.intensity
>> > a.num_masks
>> > a.masks
>> > a.num_outliers
>> > a.outliers
>> > a.numb_modified
>> > a.modified
>> >
>> > I plan to add the ability return/call intensity values with our with
>> > outliers or mask values.
>> > All data is currently store in numpy structured arrays,
>> > currently a.intensity returns the structured array, but I plan on making
>> it
>> > an option to easily choose how this is returned.
>> > also what to make an optional normalized intensity array so that if the
>> data
>> > is normalized it can be stored with the affycel instance. My use case was
>> > that I was opening about 80 cel files and reading them in was slow. this
>> > allowed me to read each file as an instance of affycel stored in a list
>> that
>> > I then pickled. It was then much faster to open them.
>> >
>> > Are improvements to the CelFile.py are of value to biopython?
>> >
>> > I hope to have the code pushed up to my fork on github late tonight. Just
>> > thought I would ask if there was any suggestion before I did.
>> >
>> > Also have an CDF file reader, but only have done some basic testing. I
>> don't
>> > have a lot of use for this, do other biopython users?
>> >
>> > I am kinda working in a vacuum and am trying to get more involved in
>> > projects to improve my skills and knowledge. Any suggestions would be
>> > appreciated.
>>
>> Just out of curiosity, is your work based on the affy sdk, or are you
>> parsing stuff yourself?
>>
>> Sean
>>
> _______________________________________________
> Biopython mailing list ?- ?Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From vincent at vincentdavis.net  Thu Apr  8 19:43:57 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Thu, 8 Apr 2010 13:43:57 -0600
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <i2t264855a01004081240lf48e1c42md7e67849260a51cd@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
	<j2h264855a01004081156ge3d1a9d3nd8e1e7aa156c5a5a@mail.gmail.com>
	<q2r77e831101004081203i932807f2n513bf1de5708725e@mail.gmail.com>
	<i2t264855a01004081240lf48e1c42md7e67849260a51cd@mail.gmail.com>
Message-ID: <k2m77e831101004081243v46a53399y408f1db30c3b6115@mail.gmail.com>

No I was not reading the binary files. That said I am interested in perusing
that if there is interest.
Do you have a link to the SDK?

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Thu, Apr 8, 2010 at 1:40 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:

> On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis <vincent at vincentdavis.net>
> wrote:
> > Parsing it myself, But based directly an the affy documentation found
> here.
> >
> http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/
>
> So, are you covering both binary and text formats for .CEL files?  I
> think that modern .CEL files (those produced by GCOS) are binary and
> represent the majority of .CEL files produced today.  Some of the I/O
> issues that you discuss are almost definitely dealt with by using the
> binary .CEL files.
>
> I'm certainly not an expert on Affy, so take all these
> questions/comments with a grain of salt.
>
> Sean
>
>
> > On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis <sdavis2 at mail.nih.gov>
> wrote:
> >
> >> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis <vincent at vincentdavis.net
> >
> >> wrote:
> >> > I ended up writing my own modules for reading both affy Cel and CDF
> >> files.
> >> > Long story as to why I did not just use what was available in
> biopython.
> >> > I plan on making what I have done available to the biopython and will
> >> upload
> >> > it as a fork. I will outline what ways what I have is different below.
> >> > My question is: Are there any improvements(features) others would like
> to
> >> > see beyond what is avalible in the current CelFile.py?
> >> > I saw some posts a month or so ago about checking for consistency in
> cell
> >> > file, I think it was something about making sure the stated number of
> >> probes
> >> > was consistent with the intensity measurements.
> >> >
> >> > What is different,
> >> > when an file is read Affycel.read('file') many atributes are set. for
> >> > example
> >> > a = affcel()
> >> > a.read('testfile')
> >> > a.filename,
> >> > a.version,
> >> > a.header.items()  # a dictionary of all header items
> >> > a.num_intensity
> >> > a.intensity
> >> > a.num_masks
> >> > a.masks
> >> > a.num_outliers
> >> > a.outliers
> >> > a.numb_modified
> >> > a.modified
> >> >
> >> > I plan to add the ability return/call intensity values with our with
> >> > outliers or mask values.
> >> > All data is currently store in numpy structured arrays,
> >> > currently a.intensity returns the structured array, but I plan on
> making
> >> it
> >> > an option to easily choose how this is returned.
> >> > also what to make an optional normalized intensity array so that if
> the
> >> data
> >> > is normalized it can be stored with the affycel instance. My use case
> was
> >> > that I was opening about 80 cel files and reading them in was slow.
> this
> >> > allowed me to read each file as an instance of affycel stored in a
> list
> >> that
> >> > I then pickled. It was then much faster to open them.
> >> >
> >> > Are improvements to the CelFile.py are of value to biopython?
> >> >
> >> > I hope to have the code pushed up to my fork on github late tonight.
> Just
> >> > thought I would ask if there was any suggestion before I did.
> >> >
> >> > Also have an CDF file reader, but only have done some basic testing. I
> >> don't
> >> > have a lot of use for this, do other biopython users?
> >> >
> >> > I am kinda working in a vacuum and am trying to get more involved in
> >> > projects to improve my skills and knowledge. Any suggestions would be
> >> > appreciated.
> >>
> >> Just out of curiosity, is your work based on the affy sdk, or are you
> >> parsing stuff yourself?
> >>
> >> Sean
> >>
> > _______________________________________________
> > Biopython mailing list  -  Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
>


From vincent at vincentdavis.net  Thu Apr  8 20:21:32 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Thu, 8 Apr 2010 14:21:32 -0600
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
Message-ID: <m2h77e831101004081321xc3fef5efh86e4db4e61607406@mail.gmail.com>

Maybe I should have started this discussion differently.
Is there any need for improvements to the ability to read CEL files or CDF
files and if so what are they? I am interested in  contributing.

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Thu, Apr 8, 2010 at 12:33 PM, Vincent Davis <vincent at vincentdavis.net>wrote:

> I ended up writing my own modules for reading both affy Cel and CDF files.
> Long story as to why I did not just use what was available in biopython.
> I plan on making what I have done available to the biopython and will
> upload it as a fork. I will outline what ways what I have is different
> below.
> My question is: Are there any improvements(features) others would like to
> see beyond what is avalible in the current CelFile.py?
> I saw some posts a month or so ago about checking for consistency in cell
> file, I think it was something about making sure the stated number of probes
> was consistent with the intensity measurements.
>
> What is different,
> when an file is read Affycel.read('file') many atributes are set. for
> example
> a = affcel()
> a.read('testfile')
> a.filename,
> a.version,
> a.header.items()  # a dictionary of all header items
> a.num_intensity
> a.intensity
> a.num_masks
> a.masks
> a.num_outliers
> a.outliers
>  a.numb_modified
> a.modified
>
> I plan to add the ability return/call intensity values with our with
> outliers or mask values.
> All data is currently store in numpy structured arrays,
> currently a.intensity returns the structured array, but I plan on making it
> an option to easily choose how this is returned.
> also what to make an optional normalized intensity array so that if the
> data is normalized it can be stored with the affycel instance. My use case
> was that I was opening about 80 cel files and reading them in was slow. this
> allowed me to read each file as an instance of affycel stored in a list that
> I then pickled. It was then much faster to open them.
>
> Are improvements to the CelFile.py are of value to biopython?
>
> I hope to have the code pushed up to my fork on github late tonight. Just
> thought I would ask if there was any suggestion before I did.
>
> Also have an CDF file reader, but only have done some basic testing. I
> don't have a lot of use for this, do other biopython users?
>
> I am kinda working in a vacuum and am trying to get more involved in
> projects to improve my skills and knowledge. Any suggestions would be
> appreciated.
>
>   *Vincent Davis
> 720-301-3003 *
> vincent at vincentdavis.net
>  my blog <http://vincentdavis.net> | LinkedIn<http://www.linkedin.com/in/vincentdavis>
>
>


From sdavis2 at mail.nih.gov  Thu Apr  8 22:31:43 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 8 Apr 2010 18:31:43 -0400
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <k2m77e831101004081243v46a53399y408f1db30c3b6115@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
	<j2h264855a01004081156ge3d1a9d3nd8e1e7aa156c5a5a@mail.gmail.com>
	<q2r77e831101004081203i932807f2n513bf1de5708725e@mail.gmail.com>
	<i2t264855a01004081240lf48e1c42md7e67849260a51cd@mail.gmail.com>
	<k2m77e831101004081243v46a53399y408f1db30c3b6115@mail.gmail.com>
Message-ID: <s2g264855a01004081531sc04a7630xf3fc08a0032c0d43@mail.gmail.com>

On Thu, Apr 8, 2010 at 3:43 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> No I was not reading the binary files. That said I am interested in perusing
> that if there is interest.
> Do you have a link to the SDK?

I believe this will get you close:

http://www.affymetrix.com/partners_programs/programs/developer/fusion/index.affx?terms=no

I hope my questions are not taken the wrong way, but I have learned
from the bioconductor project that dealing with vendor file formats is
often a non-trivial pursuit.  It isn't always easy to think of all the
edge cases.

Sean


> ?*Vincent Davis
> 720-301-3003 *
> vincent at vincentdavis.net
> ?my blog <http://vincentdavis.net> |
> LinkedIn<http://www.linkedin.com/in/vincentdavis>
>
>
> On Thu, Apr 8, 2010 at 1:40 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
>> On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis <vincent at vincentdavis.net>
>> wrote:
>> > Parsing it myself, But based directly an the affy documentation found
>> here.
>> >
>> http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/
>>
>> So, are you covering both binary and text formats for .CEL files? ?I
>> think that modern .CEL files (those produced by GCOS) are binary and
>> represent the majority of .CEL files produced today. ?Some of the I/O
>> issues that you discuss are almost definitely dealt with by using the
>> binary .CEL files.
>>
>> I'm certainly not an expert on Affy, so take all these
>> questions/comments with a grain of salt.
>>
>> Sean
>>
>>
>> > On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis <sdavis2 at mail.nih.gov>
>> wrote:
>> >
>> >> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis <vincent at vincentdavis.net
>> >
>> >> wrote:
>> >> > I ended up writing my own modules for reading both affy Cel and CDF
>> >> files.
>> >> > Long story as to why I did not just use what was available in
>> biopython.
>> >> > I plan on making what I have done available to the biopython and will
>> >> upload
>> >> > it as a fork. I will outline what ways what I have is different below.
>> >> > My question is: Are there any improvements(features) others would like
>> to
>> >> > see beyond what is avalible in the current CelFile.py?
>> >> > I saw some posts a month or so ago about checking for consistency in
>> cell
>> >> > file, I think it was something about making sure the stated number of
>> >> probes
>> >> > was consistent with the intensity measurements.
>> >> >
>> >> > What is different,
>> >> > when an file is read Affycel.read('file') many atributes are set. for
>> >> > example
>> >> > a = affcel()
>> >> > a.read('testfile')
>> >> > a.filename,
>> >> > a.version,
>> >> > a.header.items() ?# a dictionary of all header items
>> >> > a.num_intensity
>> >> > a.intensity
>> >> > a.num_masks
>> >> > a.masks
>> >> > a.num_outliers
>> >> > a.outliers
>> >> > a.numb_modified
>> >> > a.modified
>> >> >
>> >> > I plan to add the ability return/call intensity values with our with
>> >> > outliers or mask values.
>> >> > All data is currently store in numpy structured arrays,
>> >> > currently a.intensity returns the structured array, but I plan on
>> making
>> >> it
>> >> > an option to easily choose how this is returned.
>> >> > also what to make an optional normalized intensity array so that if
>> the
>> >> data
>> >> > is normalized it can be stored with the affycel instance. My use case
>> was
>> >> > that I was opening about 80 cel files and reading them in was slow.
>> this
>> >> > allowed me to read each file as an instance of affycel stored in a
>> list
>> >> that
>> >> > I then pickled. It was then much faster to open them.
>> >> >
>> >> > Are improvements to the CelFile.py are of value to biopython?
>> >> >
>> >> > I hope to have the code pushed up to my fork on github late tonight.
>> Just
>> >> > thought I would ask if there was any suggestion before I did.
>> >> >
>> >> > Also have an CDF file reader, but only have done some basic testing. I
>> >> don't
>> >> > have a lot of use for this, do other biopython users?
>> >> >
>> >> > I am kinda working in a vacuum and am trying to get more involved in
>> >> > projects to improve my skills and knowledge. Any suggestions would be
>> >> > appreciated.
>> >>
>> >> Just out of curiosity, is your work based on the affy sdk, or are you
>> >> parsing stuff yourself?
>> >>
>> >> Sean
>> >>
>> > _______________________________________________
>> > Biopython mailing list ?- ?Biopython at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biopython
>> >
>>
> _______________________________________________
> Biopython mailing list ?- ?Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From reece at berkeley.edu  Thu Apr  8 23:38:10 2010
From: reece at berkeley.edu (Reece Hart)
Date: Thu, 08 Apr 2010 16:38:10 -0700
Subject: [Biopython] SeqIO.parse exception on Google App Engine
Message-ID: <4BBE68E2.2030803@berkeley.edu>

Hi-

I'm trying to fetch a Genbank record and parse it in the Google App Engine
environment. A command line version works fine, but when using exactly the
same code under Google App Engine, SeqIO throws throws the following
exception:
   ...
   File "/local/home/reece/tmp/demo1/Bio/GenBank/Scanner.py", line 746, 
in parse_footer
     self.line = self.line.rstrip(os.linesep)
AttributeError: 'module' object has no attribute 'linesep'

The environment:
- Ubuntu Lucid beta1
- Python 2.6.5
- Biopython 1.53
- GAE 1.3.2

Test case:
I put together a simple test case that retrieves a raw (text) Genbank
record using Bio.Entrez (efetch); this works in both environments.
Parsing that record works on the command line, but not under GAE.

- curl http://harts.net/reece/tmp/demo1.tgz | tar -xvzf-
- cd demo1
- update symlink ./Bio to a Biopython tree
eg$ ln -s /usr/share/pyshared/Bio Bio
My intent is to prepend Bio to sys.paths much the way I would expect this
to be deployed (i.e., without updating sys.path).


Command line test:
$ ./lookup
fetch_text:LOCUS       NM_004006              13993 bp    mRNA    
linear   PRI 25-MAR-2010
fetch_parse:NM_004006.2 / NM_004006 / Homo sapiens dystrophin (DMD), 
transcript variant Dp427m,

GAE test:
In the demo1 directory:
$ dev_appserver.py .
and, in another terminal:
$ curl http://localhost:8080/
You'll see the exception in the http reply and in the appserver log


Thanks for any help/advice/pointers,
Reece

P.S. I'm learning Python and GAE at the same time, so silly errors are
possible (nay, likely).


From chapmanb at 50mail.com  Fri Apr  9 01:19:45 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 8 Apr 2010 21:19:45 -0400
Subject: [Biopython] SeqIO.parse exception on Google App Engine
In-Reply-To: <4BBE68E2.2030803@berkeley.edu>
References: <4BBE68E2.2030803@berkeley.edu>
Message-ID: <20100409011945.GE2011@kunkel>

Hi Reece;

> I'm trying to fetch a Genbank record and parse it in the Google App Engine
> environment. A command line version works fine, but when using exactly the
> same code under Google App Engine, SeqIO throws throws the following
> exception:
>   ...
>   File "/local/home/reece/tmp/demo1/Bio/GenBank/Scanner.py", line
> 746, in parse_footer
>     self.line = self.line.rstrip(os.linesep)
> AttributeError: 'module' object has no attribute 'linesep'

The python on Google App Engine is a bit crippled and lacks some of
the functionality of a full python install. It looks like one issue
must be that os.linesep is not defined on GAE. A quick fix is to
modify this to "\n", or just do:

os.linesep = "\n"

at the top of the Scanner.py file.

It would be really useful if you were able to submit a patch or list
of areas where Biopython fails on app engine and we can think about
how to suitably modify the code base to work on GAE and still be
compatible with Windows.

I did a bit of work on this using Biopython in Google App Engine
last year; code is on GitHub here:

http://github.com/chapmanb/biosqlweb

that might be helpful as a starting place for other ideas.

Good luck and let us know how your GAE experience goes,
Brad


From reece at berkeley.edu  Fri Apr  9 02:34:48 2010
From: reece at berkeley.edu (Reece Hart)
Date: Thu, 08 Apr 2010 19:34:48 -0700
Subject: [Biopython] SeqIO.parse exception on Google App Engine
In-Reply-To: <20100409011945.GE2011@kunkel>
References: <4BBE68E2.2030803@berkeley.edu> <20100409011945.GE2011@kunkel>
Message-ID: <4BBE9248.2080502@berkeley.edu>

Hi Brad. Thanks for the quick reply.

On 04/08/2010 06:19 PM, Brad Chapman wrote:
> A quick fix is to
> modify this to "\n", or just do:
>
> os.linesep = "\n"
>
> at the top of the Scanner.py file.
>    
It turns out that this fix also works within the module that does the 
parse. To wit:
from Bio import SeqIO
os.linesep = '\n'
rec = SeqIO.parse(...)

> I did a bit of work on this using Biopython in Google App Engine
> last year; code is on GitHub here:
> http://github.com/chapmanb/biosqlweb
> that might be helpful as a starting place for other ideas.
>    
Yes, thank you for this. This is precisely where I started only a few 
days ago...

Cheers,
Reece


From reece at berkeley.edu  Fri Apr  9 04:46:36 2010
From: reece at berkeley.edu (Reece Hart)
Date: Thu, 08 Apr 2010 21:46:36 -0700
Subject: [Biopython] GenBank.Scanner use of os.linesep
Message-ID: <4BBEB12C.8030907@berkeley.edu>

Hi All-

I recently discovered that the GenBank parser doesn't work on Google App 
Engine because os.linesep is undefined (GenBank/Scanner.py:746):

    745    #            if self.line[-1] == "\n" : self.line = 
self.line[:-1]
    746                self.line = self.line.rstrip(os.linesep)
    747                misc_lines.append(self.line)

Defining os.linesep is sufficient to fix the problem (thanks to Brad 
Chapman).

It seems to me that this use of os.linesep is probably mistaken here. If 
the file comes from efetch, the line separator will be \n regardless of 
platform [1] and that is what should be used in rstrip. It's possible 
that the file might come from a dog-foresaken CRLF platform and 
therefore contain that line separator.

So, I humbly propose that 746 be changed to either rstrip('\n') or, 
perhaps, rstrip('\n\r'). Although the need for the latter is probably 
rare, I don't see that it costs anything to cover that case by adding \r.

I'm new to this community, so I don't know whether we now have ferocious 
debate about the merits of line terminators or, rather, I submit a lame 
one-liner patch against the git HEAD.

Thanks for Biopython.

Cheers,
Reece


[1] For reference, here's a web request that should be equivalent to the 
efetch. On line 5, 0a is LF is \n.
apt12j$ curl -s 
'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=238018044&rettype=gb' 
| hexdump -C | head
00000000  4c 4f 43 55 53 20 20 20  20 20 20 20 4e 4d 5f 30  |LOCUS       
NM_0|
00000010  30 34 30 30 36 20 20 20  20 20 20 20 20 20 20 20  
|04006           |
00000020  20 20 20 31 33 39 39 33  20 62 70 20 20 20 20 6d  |   13993 
bp    m|
00000030  52 4e 41 20 20 20 20 6c  69 6e 65 61 72 20 20 20  |RNA    
linear   |
00000040  50 52 49 20 32 35 2d 4d  41 52 2d 32 30 31 30 0a  |PRI 
25-MAR-2010.|
00000050  44 45 46 49 4e 49 54 49  4f 4e 20 20 48 6f 6d 6f  |DEFINITION  
Homo|

-- 
Reece Hart, Ph.D.
Chief Scientist, Genome Commons                http://genomecommons.org/
Center for Computational Biology               324G Stanley Hall
UC Berkeley / QB3                              Berkeley, CA 94720


From biopython at maubp.freeserve.co.uk  Fri Apr  9 08:54:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 Apr 2010 09:54:53 +0100
Subject: [Biopython] GenBank.Scanner use of os.linesep
In-Reply-To: <4BBEB12C.8030907@berkeley.edu>
References: <4BBEB12C.8030907@berkeley.edu>
Message-ID: <u2q320fb6e01004090154v83ff8badq743a4369cdf78bee@mail.gmail.com>

On Fri, Apr 9, 2010 at 5:46 AM, Reece Hart <reece at berkeley.edu> wrote:
> Hi All-
>
> I recently discovered that the GenBank parser doesn't work on Google App
> Engine because os.linesep is undefined (GenBank/Scanner.py:746):
>
> ? 745 ? ?# ? ? ? ? ? ?if self.line[-1] == "\n" : self.line = self.line[:-1]
> ? 746 ? ? ? ? ? ? ? ?self.line = self.line.rstrip(os.linesep)
> ? 747 ? ? ? ? ? ? ? ?misc_lines.append(self.line)
>
> Defining os.linesep is sufficient to fix the problem (thanks to Brad
> Chapman).
>
> It seems to me that this use of os.linesep is probably mistaken here.

I agree.

> If the
> file comes from efetch, the line separator will be \n regardless of platform
> [1] and that is what should be used in rstrip. It's possible that the file
> might come from a dog-foresaken CRLF platform and therefore contain that
> line separator.

I think it would break in a more common setting - passing a file on
Windows with CRLF, since Python will turn that into just \n.

> So, I humbly propose that 746 be changed to either rstrip('\n') or, perhaps,
> rstrip('\n\r'). Although the need for the latter is probably rare, I don't
> see that it costs anything to cover that case by adding \r.

A plain rstrip() would also work and get rid of any trailing whitespace.
I've checked that in.

> I'm new to this community, so I don't know whether we now have ferocious
> debate about the merits of line terminators or, rather, I submit a lame
> one-liner patch against the git HEAD.

For something this trivial, your verbal patch is fine. Would you like
to be added to the NEWS and CONTRIB file?

Peter


From biopython at maubp.freeserve.co.uk  Fri Apr  9 12:08:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 Apr 2010 13:08:03 +0100
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
Message-ID: <k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>

On Thu, Apr 8, 2010 at 9:04 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
>> Greetings All!
>>
>> It looks like line 364 of Bio.AlignIO.StockholmIO reads:
>>
>> seqs[id] += seq.replace(".","-")
>>
>> So when you load into memory alignments that mark gaps created to
>> allow alignment to inserts with ".", (such as PFam alignments or the
>> output of hmmer) that information is lost.
>>
>> I know there must be a good reason for this, but I am finding it a
>> problem on my end..
>>
>> -Bryan Lunt
>
> Hi Bryan,
>
> Yes, is it done deliberately. The dot is a problem - it has a quite
> specific meaning of "same as above" on other alignment file
> formats, while "-" is an almost universal shorthand for gap/insertion.
> Consider the use case of Stockholm to PHYLIP/FASTA/Clustal
> conversion.
>
> Have you got a sample output file we can use as a unit test or
> at least discuss? As I recall, on the PFAM alignments I looked
> at there was no data loss by doing the dot to dash mapping.

According to http://sonnhammer.sbc.su.se/Stockholm.html
>> Sequence letters may include any characters except
>> whitespace. Gaps may be indicated by "." or "-".

So a Stockholm file using a mixture of "." and "-" would be
valid but a bit odd. Why would anyone do that?

Peter


From cjfields at illinois.edu  Fri Apr  9 12:51:35 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 9 Apr 2010 07:51:35 -0500
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
Message-ID: <64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu>

On Apr 9, 2010, at 7:08 AM, Peter wrote:

> On Thu, Apr 8, 2010 at 9:04 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
>>> Greetings All!
>>> 
>>> It looks like line 364 of Bio.AlignIO.StockholmIO reads:
>>> 
>>> seqs[id] += seq.replace(".","-")
>>> 
>>> So when you load into memory alignments that mark gaps created to
>>> allow alignment to inserts with ".", (such as PFam alignments or the
>>> output of hmmer) that information is lost.
>>> 
>>> I know there must be a good reason for this, but I am finding it a
>>> problem on my end..
>>> 
>>> -Bryan Lunt
>> 
>> Hi Bryan,
>> 
>> Yes, is it done deliberately. The dot is a problem - it has a quite
>> specific meaning of "same as above" on other alignment file
>> formats, while "-" is an almost universal shorthand for gap/insertion.
>> Consider the use case of Stockholm to PHYLIP/FASTA/Clustal
>> conversion.
>> 
>> Have you got a sample output file we can use as a unit test or
>> at least discuss? As I recall, on the PFAM alignments I looked
>> at there was no data loss by doing the dot to dash mapping.
> 
> According to http://sonnhammer.sbc.su.se/Stockholm.html
>>> Sequence letters may include any characters except
>>> whitespace. Gaps may be indicated by "." or "-".
> 
> So a Stockholm file using a mixture of "." and "-" would be
> valid but a bit odd. Why would anyone do that?
> 
> Peter

Just curious, b/c this is a point of contention in BioPerl.  How does BioPython internally set what symbols correspond to residues/gaps/frameshifts/other?  BioPerl retains the original sequence but uses regexes for validation and methods that return symbol-related information (e.g. gap counts).  

(BTW, the contention here isn't that we use regexes, but that we set them globally).


chris


From biopython at maubp.freeserve.co.uk  Fri Apr  9 13:21:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 Apr 2010 14:21:03 +0100
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
	<64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu>
Message-ID: <l2u320fb6e01004090621m94ab4301g1a262340eafc2fe1@mail.gmail.com>

On Fri, Apr 9, 2010 at 1:51 PM, Chris Fields <cjfields at illinois.edu> wrote:
>
>
> Just curious, b/c this is a point of contention in BioPerl. ?How does BioPython
> internally set what symbols correspond to residues/gaps/frameshifts/other?
> BioPerl retains the original sequence but uses regexes for validation and
> methods that return symbol-related information (e.g. gap counts).
>
> (BTW, the contention here isn't that we use regexes, but that we set them globally).
>
> chris

Hi Chris,

The short answer is gaps are by default "-", and stop codons are "*", but
beyond that it would be down to user code to interpret odd symbols.

Our sequences have an alphabet object which can specify the letters (as
a set of expected characters), with explicit support for a single gap
character (usually "-"), and for proteins a single stop codon symbol (usually
"*"). This could in theory be extended to define other symbols too. The gap
char does get treated specially in some of the alignment code (e.g. for
calling a consensus), but I don't think we have anything built in regarding
frameshifts.

Peter


From biopython at maubp.freeserve.co.uk  Fri Apr  9 13:30:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 Apr 2010 14:30:55 +0100
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <alpine.DEB.1.10.1004091501460.3715@gramsci.bo.biodec.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
	<alpine.DEB.1.10.1004091501460.3715@gramsci.bo.biodec.com>
Message-ID: <o2r320fb6e01004090630vfe4b740azb287ebf8b77fa702@mail.gmail.com>

On Fri, Apr 9, 2010 at 2:09 PM, Ivan Rossi <ivan at biodec.com> wrote:
>
> On Fri, 9 Apr 2010, Peter wrote:
>
>> So a Stockholm file using a mixture of "." and "-" would be
>> valid but a bit odd. Why would anyone do that?
>
> IIRC the "." are used for "gaps" at the extremes of sequences in a MSA. When
> you do local sequence alignments, like blast and most HMMs do, gaps at the
> extremes of sequences do not pay the usual penalty for gap opening. So in
> Stockholm format distinguishes between gaps for what you paid a price during
> the alignment ("-") and gaps-for-free (".") which are there just to pad each
> row to the MSA width.

So internal gaps (true gaps), versus leading or trailing padding. That makes
sense - and is certainly how PFAM does things according to their FAQ:

Quoting from http://pfam.sanger.ac.uk/help#tabview=tab3
>>> What is the difference between the - and . characters in your full alignments ?
>>>
>>> The '-' and '.' characters both represent gap characters. However they
>>> do tell you some extra information about how the HMM has generated
>>> the alignment. The '-' symbols are where the alignment of the sequence
>>> has used a delete state in the HMM to jump past a match state. This
>>> means that the sequence is missing a column that the HMM was
>>> expecting to be there. The '.' character is used to pad gaps where one
>>> sequence in the alignment has sequence from the HMMs insert state.
>>> See the alignment below where both characters are used. The HMM
>>> states emitting each column are shown. Note that residues emitted
>>> from the Insert (I) state are in lower case.

I wonder why doesn't this get mentioned anywhere on the format definitions:
http://sonnhammer.sbc.su.se/Stockholm.html
http://en.wikipedia.org/wiki/Stockholm_format

Peter


From cjfields at illinois.edu  Fri Apr  9 13:28:42 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 9 Apr 2010 08:28:42 -0500
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <l2u320fb6e01004090621m94ab4301g1a262340eafc2fe1@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
	<64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu>
	<l2u320fb6e01004090621m94ab4301g1a262340eafc2fe1@mail.gmail.com>
Message-ID: <9D6E3C31-B273-4B37-BFE8-8C951C025CBB@illinois.edu>


On Apr 9, 2010, at 8:21 AM, Peter wrote:

> On Fri, Apr 9, 2010 at 1:51 PM, Chris Fields <cjfields at illinois.edu> wrote:
>> 
>> 
>> Just curious, b/c this is a point of contention in BioPerl.  How does BioPython
>> internally set what symbols correspond to residues/gaps/frameshifts/other?
>> BioPerl retains the original sequence but uses regexes for validation and
>> methods that return symbol-related information (e.g. gap counts).
>> 
>> (BTW, the contention here isn't that we use regexes, but that we set them globally).
>> 
>> chris
> 
> Hi Chris,
> 
> The short answer is gaps are by default "-", and stop codons are "*", but
> beyond that it would be down to user code to interpret odd symbols.
> 
> Our sequences have an alphabet object which can specify the letters (as
> a set of expected characters), with explicit support for a single gap
> character (usually "-"), and for proteins a single stop codon symbol (usually
> "*"). This could in theory be extended to define other symbols too. The gap
> char does get treated specially in some of the alignment code (e.g. for
> calling a consensus), but I don't think we have anything built in regarding
> frameshifts.
> 
> Peter

Within LocatableSeq we define the following:

$GAP_SYMBOLS = '\-\.=~';
$FRAMESHIFT_SYMBOLS = '\\\/';
$OTHER_SYMBOLS = '\?';
$RESIDUE_SYMBOLS = '0-9A-Za-z\*';

Combined these can be used in a regex to validate sequence, or separately used for other purposes (counting gaps, frameshifts, etc.).  The OTHER_SYMBOLS is rally a catch-all for anything residue-like (counted in the sequence).  All of these can be redefined, but currently that's global, so it can have consequences in rare cases when mixing sequences from different formats.  We may localize them to work around that (part of GSoC project for alignment reimplementation).

We had a Symbol class at one point but I believe it was considered too 'heavy,' though this may be more a consequence of Perl's hammered-on OO.  

chris


From reece at berkeley.edu  Fri Apr  9 15:18:36 2010
From: reece at berkeley.edu (Reece Hart)
Date: Fri, 09 Apr 2010 08:18:36 -0700
Subject: [Biopython] GenBank.Scanner use of os.linesep
In-Reply-To: <u2q320fb6e01004090154v83ff8badq743a4369cdf78bee@mail.gmail.com>
References: <4BBEB12C.8030907@berkeley.edu>
	<u2q320fb6e01004090154v83ff8badq743a4369cdf78bee@mail.gmail.com>
Message-ID: <4BBF454C.4020502@berkeley.edu>

Peter-

> A plain rstrip() would also work and get rid of any trailing whitespace.
> I've checked that in.
> For something this trivial, your verbal patch is fine. Would you like
> to be added to the NEWS and CONTRIB file?
>    

Thanks for making this change so quickly. Please don't bother with the 
NEWS and CONTRIB file changes.

Cheers,
Reece


From davidpkilgore at gmail.com  Fri Apr  9 15:44:12 2010
From: davidpkilgore at gmail.com (Kizzo Kilgore)
Date: Fri, 9 Apr 2010 08:44:12 -0700
Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore
Message-ID: <v2j6e4c2e9e1004090844mdbf57d9cg57073975c6894023@mail.gmail.com>

Hello

I just wanted to introduce myself to the Biopython project/community,
and my intentions for participating as a student in this year's
Google's Summer of Code.  I have posted a rough draft of my proposal
to the GSOC applications site for mentors to see.  It is not complete
but I am currently working on it, so as to make final improvements
before the deadline.  I haven't had time (due to school/work) to fix
any of the bugs in the bug tracking system that has been pointed to
before, but please no that I am no stranger to source code, and that I
will make a great addition to the Biopython community after the
summer.  Please leave me feedback either by shooting me an email or
leaving a message in the GSOC applications site.  Also, be sure to
check out my website shown in the proposal for additional
qualifications.  Thank you.

-- 
Kizzo


From lunt at ctbp.ucsd.edu  Fri Apr  9 15:55:31 2010
From: lunt at ctbp.ucsd.edu (Bryan Lunt)
Date: Fri, 9 Apr 2010 08:55:31 -0700
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
Message-ID: <i2lb34be8bd1004090855id9b805d2iaf5d5b0da560f9fe@mail.gmail.com>

Hello Peter,
The HMMER suit of tools, and the Pfam website use "-" to indicate that
an HMM visited a deletion state, and "." to indicate that the HMM on a
different sequence visited an insertion state, and this gap is just
added to maintain alignment.


>foo
AA...BBB---CCC
>bar
AAbazBBBDDDCCC

In this example, the sequence "foo" doesn't have the DDD section of
the profile HMM,
the second sequence has not only the full model, but also contains an
insert, "baz" that is not part of the HMM, for example, an extra-long
loop.

I hope this helps...
-Bryan

On Fri, Apr 9, 2010 at 5:08 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Apr 8, 2010 at 9:04 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
>>> Greetings All!
>>>
>>> It looks like line 364 of Bio.AlignIO.StockholmIO reads:
>>>
>>> seqs[id] += seq.replace(".","-")
>>>
>>> So when you load into memory alignments that mark gaps created to
>>> allow alignment to inserts with ".", (such as PFam alignments or the
>>> output of hmmer) that information is lost.
>>>
>>> I know there must be a good reason for this, but I am finding it a
>>> problem on my end..
>>>
>>> -Bryan Lunt
>>
>> Hi Bryan,
>>
>> Yes, is it done deliberately. The dot is a problem - it has a quite
>> specific meaning of "same as above" on other alignment file
>> formats, while "-" is an almost universal shorthand for gap/insertion.
>> Consider the use case of Stockholm to PHYLIP/FASTA/Clustal
>> conversion.
>>
>> Have you got a sample output file we can use as a unit test or
>> at least discuss? As I recall, on the PFAM alignments I looked
>> at there was no data loss by doing the dot to dash mapping.
>
> According to http://sonnhammer.sbc.su.se/Stockholm.html
>>> Sequence letters may include any characters except
>>> whitespace. Gaps may be indicated by "." or "-".
>
> So a Stockholm file using a mixture of "." and "-" would be
> valid but a bit odd. Why would anyone do that?
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Fri Apr  9 16:09:16 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 Apr 2010 17:09:16 +0100
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <i2lb34be8bd1004090855id9b805d2iaf5d5b0da560f9fe@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
	<i2lb34be8bd1004090855id9b805d2iaf5d5b0da560f9fe@mail.gmail.com>
Message-ID: <l2r320fb6e01004090909r2a13fd17qf81b397a9df7bef@mail.gmail.com>

Hi Bryan,

On Fri, Apr 9, 2010 at 4:55 PM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
>
> Hello Peter,
> The HMMER suit of tools, and the Pfam website use "-" to indicate that
> an HMM visited a deletion state, and "." to indicate that the HMM on a
> different sequence visited an insertion state, and this gap is just
> added to maintain alignment.
>
>>foo
> AA...BBB---CCC
>>bar
> AAbazBBBDDDCCC
>
> In this example, the sequence "foo" doesn't have the DDD section of
> the profile HMM,
> the second sequence has not only the full model, but also contains an
> insert, "baz" that is not part of the HMM, for example, an extra-long
> loop.
>
> I hope this helps...
> -Bryan

Yes, it does. I think this HMMER/PFAM convention should be noted
on the definition of the Stockholm format - that might have prevented
this problem in Biopython since none of the examples I'd looked at
when writing the parser had this behaviour. Note your example is
more subtle than the different between internal gaps and leading or
trailing padding described by Ivan earlier:
http://lists.open-bio.org/pipermail/biopython/2010-April/006396.html

Could you point out a suitable (small) example from PFAM we can
use for a unit test, or email me an example (off list)?

Now, as to how to deal with this: We could extend the Biopython
Alphabet objects to explicitly support multiple types of gaps (the
current setup only really copes with a single gap character). Using
this information we could handle some special cases like Stockholm
to PHYLIP would require merging either gap onto a dash. This
doesn't sound that straight forward though.

Or, we can avoid explicit declarations about the sequence (just
ignore the Biopython Alphabet object capabilities and use one
of the generic alphabets), and leave the problem in the hands of
the end user. This is bound to cause some unpleasant surprises
one day, but might be the best solution.

Peter


From chapmanb at 50mail.com  Fri Apr  9 20:21:32 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 9 Apr 2010 16:21:32 -0400
Subject: [Biopython] affy CEL and CDF reader
In-Reply-To: <m2h77e831101004081321xc3fef5efh86e4db4e61607406@mail.gmail.com>
References: <k2q77e831101004081133y3410b229me8065384e717458f@mail.gmail.com>
	<m2h77e831101004081321xc3fef5efh86e4db4e61607406@mail.gmail.com>
Message-ID: <20100409202132.GA20004@sobchak.mgh.harvard.edu>

Vincent;
Thanks for the work on the Affy Cel/CDF parsers. I don't know
anything at all about the formats so can't help much with the
technical questions, but wanted to help with a few more general 
points you raise.

> > I ended up writing my own modules for reading both affy Cel and CDF files.

This and the following discussion are a bit hard to follow. When I
read through this thread I wasn't sure exactly what improvements
you've made, how they affect back compatibility of the code, and how
they help make the parser better going forward.

A lot of this work is very specialized, so you are trying to catch the
attention of the few people who know enough to help. If you can organize
your code and e-mail in a way that makes it easy for them to comment
and contribute, you'll increase the number of valuable responses you
receive.

It's an under appreciated skill, but very valuable for grabbing busy
people's attention and getting feedback.

> > Are improvements to the CelFile.py are of value to biopython?

Absolutely.

> Is there any need for improvements to the ability to read CEL files or CDF
> files and if so what are they? I am interested in  contributing.

Yes. Make it faster, more complete, easier to use.

There are general answers you can apply across the board. We definitely
are looking for contributions and happy to have you interested.

Brad


From chapmanb at 50mail.com  Fri Apr  9 20:39:12 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 9 Apr 2010 16:39:12 -0400
Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore
In-Reply-To: <v2j6e4c2e9e1004090844mdbf57d9cg57073975c6894023@mail.gmail.com>
References: <v2j6e4c2e9e1004090844mdbf57d9cg57073975c6894023@mail.gmail.com>
Message-ID: <20100409203912.GB20004@sobchak.mgh.harvard.edu>

Kizzo;

> I just wanted to introduce myself to the Biopython project/community,
> and my intentions for participating as a student in this year's
> Google's Summer of Code.  I have posted a rough draft of my proposal
> to the GSOC applications site for mentors to see.  

Glad you are interested in this and thanks for getting together a
proposal. I wish you would have dropped us a line a bit earlier as
we would have been happy to help with getting the application
together.

> It is not complete
> but I am currently working on it, so as to make final improvements
> before the deadline.  I haven't had time (due to school/work) to fix
> any of the bugs in the bug tracking system that has been pointed to
> before, but please no that I am no stranger to source code, and that I
> will make a great addition to the Biopython community after the
> summer.  

Great. I noticed that you worked on GSoC with OpenCog last year. Is
this the most recent code base from that work?

https://code.launchpad.net/~kizzobot/opencog/python-bindings

Have you still been involved with that community after the work? Did
they decide not to do GSoC this year?

Thanks again,
Brad


From davidpkilgore at gmail.com  Fri Apr  9 20:52:57 2010
From: davidpkilgore at gmail.com (Kizzo Kilgore)
Date: Fri, 9 Apr 2010 13:52:57 -0700
Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore
In-Reply-To: <20100409203912.GB20004@sobchak.mgh.harvard.edu>
References: <v2j6e4c2e9e1004090844mdbf57d9cg57073975c6894023@mail.gmail.com>
	<20100409203912.GB20004@sobchak.mgh.harvard.edu>
Message-ID: <j2o6e4c2e9e1004091352x41b07219w565680434f7db1bd@mail.gmail.com>

On Fri, Apr 9, 2010 at 1:39 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Kizzo;
>
>> I just wanted to introduce myself to the Biopython project/community,
>> and my intentions for participating as a student in this year's
>> Google's Summer of Code. ?I have posted a rough draft of my proposal
>> to the GSOC applications site for mentors to see.
>
> Glad you are interested in this and thanks for getting together a
> proposal. I wish you would have dropped us a line a bit earlier as
> we would have been happy to help with getting the application
> together.
>
>> It is not complete
>> but I am currently working on it, so as to make final improvements
>> before the deadline. ?I haven't had time (due to school/work) to fix
>> any of the bugs in the bug tracking system that has been pointed to
>> before, but please no that I am no stranger to source code, and that I
>> will make a great addition to the Biopython community after the
>> summer.
>
> Great. I noticed that you worked on GSoC with OpenCog last year. Is
> this the most recent code base from that work?
>
> https://code.launchpad.net/~kizzobot/opencog/python-bindings
>

The core developers merged my bindings in with the main branch a long
time ago, and yes that's the most recent codebase from that work.

> Have you still been involved with that community after the work? Did
> they decide not to do GSoC this year?
>

Oh yes, I'm still a regular on their IRC channel and mailing lists.
OpenCog is closer to my passion, and I already had 2 proposals for
OpenCog this summer ready, but unfortunately the project didn't get
accepted for GSoC this year.  I plan to work more with OpenCog as a
potential PhD project, so am still am involved with OpenCog.

> Thanks again,
> Brad
>


-- 
Kizzo


From vincent at vincentdavis.net  Sat Apr 10 05:43:06 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Fri, 9 Apr 2010 23:43:06 -0600
Subject: [Biopython] Bio.Application now subprocess?
Message-ID: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>

I was considering writing a module for using the command line Affymetrix
Power Tools Software
LINK<http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx>
Mostly
to convert between CEL file types but there are lots of other features
<http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx>If
I read correctly will be replaced using subprocess. Are there any modules
currently using subprcess rather than Bio.Application?
Anything I should know but don't (as if you know what I know) or consider

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


From biopython at maubp.freeserve.co.uk  Sat Apr 10 10:28:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 10 Apr 2010 11:28:19 +0100
Subject: [Biopython] Bio.Application now subprocess?
In-Reply-To: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>
References: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>
Message-ID: <j2n320fb6e01004100328o673cba08k69119bb8e83735c4@mail.gmail.com>

On Sat, Apr 10, 2010 at 6:43 AM, Vincent Davis <vincent at vincentdavis.net> wrote:
> I was considering writing a module for using the command line Affymetrix
> Power Tools Software
> LINK<http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx>
> Mostly
> to convert between CEL file types but there are lots of other features
> <http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx>If
> I read correctly will be replaced using subprocess. Are there any modules
> currently using subprcess rather than Bio.Application?
> Anything I should know but don't (as if you know what I know) or consider

Hi Vincent,

The idea is to use a Bio.Application based wrapper to build a command
line string, and invoke that with the subprocess module (i.e. use BOTH).
The tutorial has several examples of this (e.g. alignment tools and BLAST).

What have you been reading that makes you think Bio.Application is
being replaced with subprocess? We should probably clarify it.

Peter


From vincent at vincentdavis.net  Sat Apr 10 13:12:34 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Sat, 10 Apr 2010 07:12:34 -0600
Subject: [Biopython] Bio.Application now subprocess?
In-Reply-To: <j2n320fb6e01004100328o673cba08k69119bb8e83735c4@mail.gmail.com>
References: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>
	<j2n320fb6e01004100328o673cba08k69119bb8e83735c4@mail.gmail.com>
Message-ID: <m2w77e831101004100612mf230ae52tef9579e57b1147af@mail.gmail.com>

Let me say it was late at night when I started reading thorough this and I
am very new to it so....
The first function defines in Bio/Applications.py
def generic_run(commandline):
"""Run an application with the given commandline (DEPRECATED)......We now
recommend you invoke subprocess directly, using
str(commandline)............."""

The second
class ApplicationResult:
""""""Make results of a program available through a standard interface
(DEPRECATED)................."""

I think these should be moved tp the bottom if possible maybe below a
comment section that indicates the item below are or are going to be
deprecated.

The last line in
class AbstractCommandline(object):
""".......................
You would typically run the command line via a standard Python operating
    system call (e.g. using the subprocess module)."""

I started to read though this example but thought I would read more about
subprocess module, At this point it is not clear to me what bio/Applications
is doing for me. subprocess seems simple. But I  have a lot to learn and I
assume that if I start by getting basic functionality with subprocess then
it will make more sence

One of the parts that is not clear to me is for example in Emboss
class WaterCommandline(_EmbossCommandLine):
..........
self.parameters = \
         [_Option(["-asequence","asequence"], ["input", "file"], None,
1, "First sequence to align")

Not really sure where the parts to the _option line are documented, I assume
in the ...for p in parameters:......
Just not clear, I guess I need to study it more.


  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Sat, Apr 10, 2010 at 4:28 AM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Sat, Apr 10, 2010 at 6:43 AM, Vincent Davis <vincent at vincentdavis.net>
> wrote:
> > I was considering writing a module for using the command line Affymetrix
> > Power Tools Software
> > LINK<
> http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx
> >
> > Mostly
> > to convert between CEL file types but there are lots of other features
> > <
> http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx
> >If
> > I read correctly will be replaced using subprocess. Are there any modules
> > currently using subprcess rather than Bio.Application?
> > Anything I should know but don't (as if you know what I know) or consider
>
> Hi Vincent,
>
> The idea is to use a Bio.Application based wrapper to build a command
> line string, and invoke that with the subprocess module (i.e. use BOTH).
> The tutorial has several examples of this (e.g. alignment tools and BLAST).
>
> What have you been reading that makes you think Bio.Application is
> being replaced with subprocess? We should probably clarify it.
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Sat Apr 10 13:58:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 10 Apr 2010 14:58:28 +0100
Subject: [Biopython] Bio.Application now subprocess?
In-Reply-To: <m2w77e831101004100612mf230ae52tef9579e57b1147af@mail.gmail.com>
References: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>
	<j2n320fb6e01004100328o673cba08k69119bb8e83735c4@mail.gmail.com>
	<m2w77e831101004100612mf230ae52tef9579e57b1147af@mail.gmail.com>
Message-ID: <l2x320fb6e01004100658r648bb34ax26daf2a6758632f8@mail.gmail.com>

On Sat, Apr 10, 2010 at 2:12 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> Let me say it was late at night when I started reading thorough this and I
> am very new to it so....
> The first function defines in Bio/Applications.py
> def generic_run(commandline):

OK, so you are looking at the API docs and/or the code.
Bits of Bio/Applications.py are deprecated, and I think
you are right - we can try and make the status clearer.

Peter


From rodrigo_faccioli at uol.com.br  Sat Apr 10 17:23:19 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Sat, 10 Apr 2010 14:23:19 -0300
Subject: [Biopython] Bio.Application now subprocess?
Message-ID: <q2t3715adb71004101023g8dd9102ha64748311c15d9f1@mail.gmail.com>

I've developed a class for this proposed. It might help you. Please, see the
link below.

http://github.com/rodrigofaccioli/PythonStudies/blob/master/src/FcfrpExecuteProgram.py

Thanks,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218


From vincent at vincentdavis.net  Sat Apr 10 17:30:05 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Sat, 10 Apr 2010 11:30:05 -0600
Subject: [Biopython] Bio.Application now subprocess?
In-Reply-To: <q2t3715adb71004101023g8dd9102ha64748311c15d9f1@mail.gmail.com>
References: <q2t3715adb71004101023g8dd9102ha64748311c15d9f1@mail.gmail.com>
Message-ID: <z2z77e831101004101030p2680e3f7md91c83528f2b3f57@mail.gmail.com>

>
> On Sat, Apr 10, 2010 at 11:23 AM, Rodrigo Faccioli <
> rodrigo_faccioli at uol.com.br> wrote:
>
>> I've developed a class for this proposed. It might help you. Please, see
>> the
>> link below.
>
>
> http://github.com/rodrigofaccioli/PythonStudies/blob/master/src/FcfrpExecuteProgram.py
>>
>>
>
> Thanks, This might be a good place for me to start. Nit sure how this is
different than Bio/Applications.py other than it is much simpler from a
quick look.


  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Sat, Apr 10, 2010 at 11:23 AM, Rodrigo Faccioli <
rodrigo_faccioli at uol.com.br> wrote:

> I've developed a class for this proposed. It might help you. Please, see
> the
> link below.
>
>
> http://github.com/rodrigofaccioli/PythonStudies/blob/master/src/FcfrpExecuteProgram.py
>
> Thanks,
>
> --
> Rodrigo Antonio Faccioli
> Ph.D Student in Electrical Engineering
> University of Sao Paulo - USP
> Engineering School of Sao Carlos - EESC
> Department of Electrical Engineering - SEL
> Intelligent System in Structure Bioinformatics
> http://laips.sel.eesc.usp.br
> Phone: 55 (16) 3373-9366 Ext 229
> Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From biopython at maubp.freeserve.co.uk  Sat Apr 10 19:02:08 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 10 Apr 2010 20:02:08 +0100
Subject: [Biopython] Bio.Application now subprocess?
In-Reply-To: <l2x320fb6e01004100658r648bb34ax26daf2a6758632f8@mail.gmail.com>
References: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>
	<j2n320fb6e01004100328o673cba08k69119bb8e83735c4@mail.gmail.com>
	<m2w77e831101004100612mf230ae52tef9579e57b1147af@mail.gmail.com>
	<l2x320fb6e01004100658r648bb34ax26daf2a6758632f8@mail.gmail.com>
Message-ID: <m2t320fb6e01004101202yf8946bb3k609e0814b96b3cb8@mail.gmail.com>

On Sat, Apr 10, 2010 at 2:58 PM, Peter wrote:
>
> OK, so you are looking at the API docs and/or the code.
> Bits of Bio/Applications.py are deprecated, and I think
> you are right - we can try and make the status clearer.
>

Hi Vincent,

I updated that a bit, hopefully it is clearer that a typical
user doesn't need to look at Bio.Applications at all.
Rather you might use the alignment tool wrappers in
Bio.Align.Applications, or the EMBOSS wrappers in
Bio.Emboss.Applications (etc) which internally use
the classes defined in Bio.Applications.

The *only* reason you'd use Bio.Applications directly
now is to write a new command line tool wrapper.

[Historically you might have used the old generic_run
function in Bio.Applications, but that is deprecated now]

Peter


From biopython at maubp.freeserve.co.uk  Sat Apr 10 20:33:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 10 Apr 2010 21:33:57 +0100
Subject: [Biopython] Bio.Application now subprocess?
In-Reply-To: <1101855478758905131@unknownmsgid>
References: <q2o77e831101004092243nb9202bf4tcba1ad5a42a6207f@mail.gmail.com>
	<j2n320fb6e01004100328o673cba08k69119bb8e83735c4@mail.gmail.com>
	<m2w77e831101004100612mf230ae52tef9579e57b1147af@mail.gmail.com>
	<l2x320fb6e01004100658r648bb34ax26daf2a6758632f8@mail.gmail.com>
	<m2t320fb6e01004101202yf8946bb3k609e0814b96b3cb8@mail.gmail.com>
	<1101855478758905131@unknownmsgid>
Message-ID: <t2i320fb6e01004101333zfb1b64c2i7bacb5b407fadaa@mail.gmail.com>

On Sat, Apr 10, 2010 at 8:27 PM, Vincent Davis wrote:
>
> So that was/is my plan to use it to writes command lone tools for the
> affymetrix apt dev commandline app. unless this is redundant in a way
> I am not aware of.
> Thanks

Ah - right, now this makes sense. Are you on the dev mailing
list (CC'd)? That would be a better place to ask. I'd start by
looking at Bio.Align.Applications (less subclasses there) as a
model.

Peter


From chapmanb at 50mail.com  Mon Apr 12 12:37:31 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 12 Apr 2010 08:37:31 -0400
Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore
In-Reply-To: <j2o6e4c2e9e1004091352x41b07219w565680434f7db1bd@mail.gmail.com>
References: <v2j6e4c2e9e1004090844mdbf57d9cg57073975c6894023@mail.gmail.com>
	<20100409203912.GB20004@sobchak.mgh.harvard.edu>
	<j2o6e4c2e9e1004091352x41b07219w565680434f7db1bd@mail.gmail.com>
Message-ID: <20100412123731.GJ20004@sobchak.mgh.harvard.edu>

Kizzo;

> > Have you still been involved with that community after the work? Did
> > they decide not to do GSoC this year?
> 
> Oh yes, I'm still a regular on their IRC channel and mailing lists.
> OpenCog is closer to my passion, and I already had 2 proposals for
> OpenCog this summer ready, but unfortunately the project didn't get
> accepted for GSoC this year.  I plan to work more with OpenCog as a
> potential PhD project, so am still am involved with OpenCog.

That's great to hear. One of the most important parts of GSoC for
myself and many mentors is the chance to get additional folks
involved in open source.

Reviews of the applications have started, and the main aspect which
would improve your proposal is to develop a specific project plan 
with detailed descriptions of week to week goals. For each week you 
should have:

- Description of the specific weekly goal. 
- Details on the PyCogent and Biopython code you expect to be working with
- Possible issues or areas of expansion you expect might impact the
  timeline
- Expected work on documentation and testing. You want to have this
  integrated throughout the proposal.

See the examples in the NESCent application documentation to get an
idea of the level of detail in accepted projects from previous years:

https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#When_you_apply
http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&single=true&gid=0&output=html

The content we'd like to see in the proposal is interconversion of
core object (Sequence, Alignment, Phylogeny) in the first half of
the summer, and applications of this interconversion to developing
biological workflows in the second half of the summer. Feel free to
be creative and pick work that is of interest to your studies.

Since you can't edit the proposal currently, please prepare this in
a publicly accessible Google Doc and provide a link from the public
comments so other mentors can view it.

Thanks,
Brad


From biopython at maubp.freeserve.co.uk  Mon Apr 12 13:35:44 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 12 Apr 2010 14:35:44 +0100
Subject: [Biopython] StockholmIO replaces "." with "-", why?
In-Reply-To: <y2ob34be8bd1004091150j154f6444s3cf1a2731f190fe3@mail.gmail.com>
References: <i2vb34be8bd1004071757lc24e582aweed55444a3abd8cf@mail.gmail.com>
	<g2k320fb6e01004080104iee3f12bbv2bebcbe3e33185fe@mail.gmail.com>
	<k2p320fb6e01004090508k8902d659v170c9c8d6bedbb8@mail.gmail.com>
	<i2lb34be8bd1004090855id9b805d2iaf5d5b0da560f9fe@mail.gmail.com>
	<l2r320fb6e01004090909r2a13fd17qf81b397a9df7bef@mail.gmail.com>
	<y2ob34be8bd1004091150j154f6444s3cf1a2731f190fe3@mail.gmail.com>
Message-ID: <w2w320fb6e01004120635jad7fd9c3r9088db0f1da8138e@mail.gmail.com>

On Fri, Apr 9, 2010 at 7:50 PM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
> Hello Peter,
>
> Thanks for your help recently on this!
> I have here two files that I like to use as examples, because they are
> fairly small, (203 sequences)
>
> The Pfam page summarizing this family is :
> http://pfam.sanger.ac.uk/family/PF07750
>
> Cheers!
> -Bryan Lunt

I see what you mean - using that webpage to get the full alignment
(in any of the supported file formats) using the mixed gap option
(dot or dash) does show both symbols in a meaningful way.

Peter


From tiagoantao at gmail.com  Mon Apr 12 23:39:29 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 13 Apr 2010 00:39:29 +0100
Subject: [Biopython] ASN.1 and Entrez SNP
Message-ID: <m2o6d941f121004121639h33efb822ha9079afcad374615@mail.gmail.com>

Hi,

Just a simple question:
Entrez SNP seems to return ASN.1 format only.
Is there any way to parse this in biopython? I've looked at SeqIO and
found nothing...
I can think of tools to process this outside, but I am just curious if
this is processed natively with Biopython (being an exposed NCBI
format...)

Many thanks,
Tiago
PS - You can easily try this with:
hdl = Entrez.efetch(db="snp", id="3739022")
print hdl.read()

-- 
"If you want to get laid, go to college.  If you want an education, go
to the library." - Frank Zappa


From biopython at maubp.freeserve.co.uk  Tue Apr 13 08:22:42 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 13 Apr 2010 09:22:42 +0100
Subject: [Biopython] ASN.1 and Entrez SNP
In-Reply-To: <m2o6d941f121004121639h33efb822ha9079afcad374615@mail.gmail.com>
References: <m2o6d941f121004121639h33efb822ha9079afcad374615@mail.gmail.com>
Message-ID: <v2u320fb6e01004130122xd06aecccmec11066651ad7607@mail.gmail.com>

2010/4/13 Tiago Ant?o <tiagoantao at gmail.com>:
> Hi,
>
> Just a simple question:
> Entrez SNP seems to return ASN.1 format only.
> Is there any way to parse this in biopython? I've looked at SeqIO and
> found nothing...
> I can think of tools to process this outside, but I am just curious if
> this is processed natively with Biopython (being an exposed NCBI
> format...)
>
> Many thanks,
> Tiago
> PS - You can easily try this with:
> hdl = Entrez.efetch(db="snp", id="3739022")
> print hdl.read()

Hi Tiago,

No, we don't support ASN.1, and I don't see any good reason to - I
think it would only be NCBI ASN.1 we'd we interested in, and I think
that all their resources are available in other easier to use formats
like XML these days.

See also http://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One

Instead ask Entrez to give you the SNP data as XML:

Entrez.efetch(db="snp", id="3739022", retmode="xml")

Hopefully the SNP XML file has everything in it.

You have a choice of Python XML parsers to use. However, the
Bio.Entrez parser doesn't like this XML. This appears to be related
(or caused by) a known NCBI bug. See
http://bugzilla.open-bio.org/show_bug.cgi?id=2771

Peter


From bala.biophysics at gmail.com  Tue Apr 13 14:49:03 2010
From: bala.biophysics at gmail.com (Bala subramanian)
Date: Tue, 13 Apr 2010 16:49:03 +0200
Subject: [Biopython] removing redundant sequence
Message-ID: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>

Friends,
Sorry if this question was asked before. Is there any function in Biopython
that can remove redundant sequence records from a fasta file.

Thanks,
Bala


From biopython at maubp.freeserve.co.uk  Tue Apr 13 15:02:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 13 Apr 2010 16:02:52 +0100
Subject: [Biopython] removing redundant sequence
In-Reply-To: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
References: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
Message-ID: <t2j320fb6e01004130802u78271ebp8b48b32b488c6e2b@mail.gmail.com>

On Tue, Apr 13, 2010 at 3:49 PM, Bala subramanian
<bala.biophysics at gmail.com> wrote:
> Friends,
> Sorry if this question was asked before. Is there any function in Biopython
> that can remove redundant sequence records from a fasta file.
>
> Thanks,
> Bala

No, but you should be able to do this with Biopython - depending on
what exactly you are asking for.

When you say "redundant" do you mean 100% perfect identify?

How big is your FASTA file - are you working with next-gen sequencing
data and millions of reads?. If it is small enough you can keep all
the data in memory to compare sequences to each other. Otherwise
you might try using a checksum (e.g. SEGUID) to spot duplicates.

Peter


From schafer at rostlab.org  Tue Apr 13 15:08:31 2010
From: schafer at rostlab.org (=?ISO-8859-1?Q?Christian_Sch=E4fer?=)
Date: Tue, 13 Apr 2010 17:08:31 +0200
Subject: [Biopython] removing redundant sequence
In-Reply-To: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
References: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
Message-ID: <4BC488EF.3000505@rostlab.org>

Hey,

I think not. But you can use an external tool like cd-hit or uniqueprot 
and implement a wrapper function for that in your code.

Chris

On 04/13/2010 04:49 PM, Bala subramanian wrote:
> Friends,
> Sorry if this question was asked before. Is there any function in Biopython
> that can remove redundant sequence records from a fasta file.
>
> Thanks,
> Bala
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


From p.j.a.cock at googlemail.com  Thu Apr 15 15:03:02 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 15 Apr 2010 16:03:02 +0100
Subject: [Biopython] Draft abstract for BOSC 2010 Biopython Project Update
Message-ID: <i2g320fb6e01004150803s58e14fa6nef112d3b522e5dfe@mail.gmail.com>

Hi all,

I should have circulated this earlier, but here is a draft abstract
for a "Biopython Project Update" talk at BOSC 2010, to be submitted
*today*.
http://www.open-bio.org/wiki/BOSC_2010

I'm hoping to attend BOSC again this year and give the talk, but
haven't sorted out the finances - Brad has offered to present if I
can't go, hence the talk author list. If anyone else wants to help
with slides etc (or as a standby speaker) please let me know.

This is based on the abstract from last year, included in this PDF:
http://www.open-bio.org/w/images/c/c7/BOSC2009_program_20090601.pdf

In the PDF version of the abstract I've made the logo smaller this time ;)

Comments welcome,

Thanks,

Peter

--

Biopython Project Update
Peter Cock, Brad Chapman

In this talk we present the current status of the Biopython project
(www.biopython.org), described in a application note published last
year (Cock et al., 2009). Biopython celebrated its 10th Birthday last
year, and has now been cited or referred to in over 150 scientific
publications (a list is included on our website).

At the end of 2009, following an extended evaluation period, Biopython
successfully migrated from using CVS for source code control to using
git, hosted on github.com. This has helped our existing developers to
work and test new features on publicly viewable branches before being
merged, and has also encouraged new contributors to work on additions
or improvements. Currently about fifty people have their own Biopython
repository on GitHub.

In summer 2009 we had two Google Summer of Code (GSoC) project
students working on phylogenetic code for Biopython in conjunction
with the National Evolutionary Synthesis Center (NESCent). Eric
Talevich?s work on phylogenetic trees including phyloXML support (Han
and Zamesk, 2009) was merged and included with Biopython 1.54, and he
continues to be actively involved with Biopython. We hope to include
Nick Matzke?s module for biogeographical data from the Global
Biodiversity Information Facility (GBIF) later this year. For summer
2010 we have Biopython related GSoC projects submitted via both
NESCent and the Open Bioinformatics Foundation (OBF), and hope to have
students working on Biopython once again.

Since BOSC 2009, Biopython has seen four releases. Biopython 1.51
(August 2009) was an important milestone in dropping support for
Python 2.3 and our legacy parsing infra-structure (Martel/Mindy), but
was most noteworthy for FASTQ support (Cock et al., 2010). Biopython
1.52 (September 2009) introduced indexing of most sequence file
formats for random access, and made interconverting sequence and
alignment files easier. Biopython 1.53 (December 2009) included
wrappers for the new NCBI BLAST+ command line tools, and much improved
support for running under Jython. Our latest release is Biopython 1.54
(April/May 2010), new features include Bio.Phylo for phylogenetic
trees (GSoC project), and support for Standard Flowgram Format (SFF)
files used for 454 Life Sciences (Roche) sequencing.

Biopython is free open source software available from
www.biopython.org under the Biopython License Agreement (an MIT style
license, http://www.biopython.org/DIST/LICENSE).

References

Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke,
A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., de Hoon,
M.J. (2009) Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
doi:10.1093/bioinformatics/btp163

Han, M.V. and Zmasek, C.M. (2009) phyloXML: XML for evolutionary
biology and comparative genomics. BMC Bioinformatics 10:356.
doi:10.1186/1471-2105-10-356

Cock, P.J.A., Fields, C.J., Goto N., Heuer, M.L., and Rice, P.M.
(2010) The Sanger FASTQ file format for sequences with quality scores,
and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38(6)
1767-71. doi:10.1093/nar/gkp1137


From mok at bioxray.dk  Thu Apr 15 15:15:01 2010
From: mok at bioxray.dk (Morten Kjeldgaard)
Date: Thu, 15 Apr 2010 17:15:01 +0200
Subject: [Biopython] Entrez.efetch bug?
Message-ID: <4BC72D75.1040505@bioxray.dk>

Hi,

I am getting an error with Entrez.efetch() with Biopython version 1.51. This
is my handle:

handle = Entrez.efetch(db='protein', id='114391',rettype='gp')

When I subsequently do this:

 record = Entrez.read(handle)

I get a syntax error from Expat:

ExpatError: syntax error: line 1, column 0

However, if I do the following, it works:

record = handle.read()

but then I need to parse the resulting record using the Genbank parser,
which is a nuisance since I normally should get this for free from the
Entrez module.

Comments, anyone?


-- Morten


From biopython at maubp.freeserve.co.uk  Thu Apr 15 15:31:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Apr 2010 16:31:28 +0100
Subject: [Biopython] Entrez.efetch bug?
In-Reply-To: <4BC72D75.1040505@bioxray.dk>
References: <4BC72D75.1040505@bioxray.dk>
Message-ID: <x2p320fb6e01004150831g77f069e8kf46a84175c35c641@mail.gmail.com>

On Thu, Apr 15, 2010 at 4:15 PM, Morten Kjeldgaard <mok at bioxray.dk> wrote:
> Hi,
>
> I am getting an error with Entrez.efetch() with Biopython version 1.51. This
> is my handle:
>
> handle = Entrez.efetch(db='protein', id='114391',rettype='gp')
>

In the above, you've asked Entrez to give you a plain text GenPept file
(a protein GenBank file).

> When I subsequently do this:
>
> ?record = Entrez.read(handle)
>
> I get a syntax error from Expat:
>
> ExpatError: syntax error: line 1, column 0
>

The Bio.Entrez.read() and Bio.Entrez.parse() functions expect XML.

> However, if I do the following, it works:
>
> record = handle.read()

Well, yes, you get a big string stored as the variable record.

> but then I need to parse the resulting record using the Genbank parser,
> which is a nuisance since I normally should get this for free from the
> Entrez module.
>
> Comments, anyone?

Try this:

from Bio import Entrez
from Bio import SeqIO
handle = Entrez.efetch(db='protein', id='114391',rettype='gp')
record = SeqIO.read(handle, 'genbank')

Peter


From mok at bioxray.dk  Thu Apr 15 21:28:24 2010
From: mok at bioxray.dk (Morten Kjeldgaard)
Date: Thu, 15 Apr 2010 23:28:24 +0200
Subject: [Biopython] Entrez.efetch bug?
In-Reply-To: <x2p320fb6e01004150831g77f069e8kf46a84175c35c641@mail.gmail.com>
References: <4BC72D75.1040505@bioxray.dk>
	<x2p320fb6e01004150831g77f069e8kf46a84175c35c641@mail.gmail.com>
Message-ID: <26E933F7-D7D2-48EC-82B4-4B654403F177@bioxray.dk>


On 15/04/2010, at 17.31, Peter wrote:

> record = SeqIO.read(handle, 'genbank')

d'Oh!! :-) Thanks, just the hint I needed.

Cheers,
Morten


From davidpkilgore at gmail.com  Mon Apr 19 06:54:55 2010
From: davidpkilgore at gmail.com (Kizzo Kilgore)
Date: Sun, 18 Apr 2010 23:54:55 -0700
Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore
In-Reply-To: <20100412123731.GJ20004@sobchak.mgh.harvard.edu>
References: <v2j6e4c2e9e1004090844mdbf57d9cg57073975c6894023@mail.gmail.com>
	<20100409203912.GB20004@sobchak.mgh.harvard.edu>
	<j2o6e4c2e9e1004091352x41b07219w565680434f7db1bd@mail.gmail.com>
	<20100412123731.GJ20004@sobchak.mgh.harvard.edu>
Message-ID: <x2u6e4c2e9e1004182354i60852680m315e6ba9b6813119@mail.gmail.com>

I have taken the time to carefully look over the links and examples
you suggested, and came up with my own draft week by week plan for the
summer.  It is not perfect, or even complete, as I am in the closing
weeks of school and things are getting really busy, but I managed to
pull this together.  You can visit the following public Google Docs
link to get the Gnumeric spreadsheet of my timeline.  If you would
like me to, I will also convert it to some other format if you like
(and if I can), or I can attach a copy of the file itself (or post it
on my website) if for some reason the link does not work.  Thank you.

https://docs.google.com/leaf?id=0B4KRpw_6YxAjMzU3NDgxMWYtZGIxZi00YmY3LTk5MGQtNDlmMjYyYTRhN2M0&hl=en

On Mon, Apr 12, 2010 at 5:37 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Kizzo;
>
>> > Have you still been involved with that community after the work? Did
>> > they decide not to do GSoC this year?
>>
>> Oh yes, I'm still a regular on their IRC channel and mailing lists.
>> OpenCog is closer to my passion, and I already had 2 proposals for
>> OpenCog this summer ready, but unfortunately the project didn't get
>> accepted for GSoC this year. ?I plan to work more with OpenCog as a
>> potential PhD project, so am still am involved with OpenCog.
>
> That's great to hear. One of the most important parts of GSoC for
> myself and many mentors is the chance to get additional folks
> involved in open source.
>
> Reviews of the applications have started, and the main aspect which
> would improve your proposal is to develop a specific project plan
> with detailed descriptions of week to week goals. For each week you
> should have:
>
> - Description of the specific weekly goal.
> - Details on the PyCogent and Biopython code you expect to be working with
> - Possible issues or areas of expansion you expect might impact the
> ?timeline
> - Expected work on documentation and testing. You want to have this
> ?integrated throughout the proposal.
>
> See the examples in the NESCent application documentation to get an
> idea of the level of detail in accepted projects from previous years:
>
> https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#When_you_apply
> http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&single=true&gid=0&output=html
>
> The content we'd like to see in the proposal is interconversion of
> core object (Sequence, Alignment, Phylogeny) in the first half of
> the summer, and applications of this interconversion to developing
> biological workflows in the second half of the summer. Feel free to
> be creative and pick work that is of interest to your studies.
>
> Since you can't edit the proposal currently, please prepare this in
> a publicly accessible Google Doc and provide a link from the public
> comments so other mentors can view it.
>
> Thanks,
> Brad
>


-- 
Kizzo


From mjldehoon at yahoo.com  Mon Apr 19 07:08:04 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 19 Apr 2010 00:08:04 -0700 (PDT)
Subject: [Biopython] Fw: Entrez.efetch
In-Reply-To: <910794.43889.qm@web56207.mail.re3.yahoo.com>
Message-ID: <870000.56671.qm@web62402.mail.re1.yahoo.com>

> I sent the mail to the biopython at biopython.org
> but it was not delivered.

It will be delivered if you subscribe to the mailing list.

--- On Mon, 4/19/10, olumide olufuwa <ludax5 at yahoo.com> wrote:

> From: olumide olufuwa <ludax5 at yahoo.com>
> Subject: Fw: [Biopython]Entrez.efetch
> To: biopython-owner at lists.open-bio.org
> Cc: "Biopython mailing list" <biopython at biopython.org>
> Date: Monday, April 19, 2010, 2:50 AM
> 
> 
> Hello Michel,
> I sent the mail to the biopython at biopython.org
> but it was not delivered. I have edited the message. 
> 
> 
> The code that 
> accepts UNIPROT ID, retrieves the record using
> Entrez.efetch and then it
>  parsed to obtain the Pubmed ID which i use to search
> Medline for the 
> Title, Abstract and other information about the entry. 
> The code:
> 
> query_id=str(raw_input("please
> 
> 
>  enter your UNIPROT_ID: ")) #Request UNIPROT ID from user
> Entrez.email="ludax5 at yahoo.com"
> prothandle=Entrez.efetch(db="protein",
> 
> 
>  id=query_id, rettype="gb" #queries Protein DB with the
> given ID
> #The
>  program returns an error here if a wrong ID is given.
> Details of the 
> error is given below
> seq_record=SeqIO.read(prothandle, "gb")
> for
> 
>  record in seq_record.annotations['references']: # To
> obtain Pubmed id 
> from the seqrecord
> ?? key_word=record.pubmed_id
> ?? if key_word:
> ????
>  handle=Entrez.efetch(db="pubmed",
> 
>  id=key_word, rettype="medline")
> ???? 
> medRecords=Medline.parse(handle)
> ???? for rec in medRecords: #prints 
> title and Abstract
> ???????? if rec.has_key('AB') and 
> rec.has_key('TI'):
> ?????????? print "TITLE: ",rec['TI']
> ??????????
>  print "ABSTRACT: ",rec['AB']
> ?????????? print ' '
> 
> 
> THE 
> PROBLEM: The program gives an error if a wrong ID is
> entered or an ID 
> other than UNIPROT ID e.g PDB ID, GSS ID etc. 
> 
> 
> 
> An Example Run:
> 
> 
> please enter your UNIPROT_ID:
>  1wio #A PDB ID is given instead
> 
> 
> Traceback (most recent call last):
> ? File "file.py", line 11, in 
> <module>
> ??? seq_record=SeqIO.read(prothandle, "gb")
> ? File 
> "/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.py",
> line 522, in 
> read
> ??? raise ValueError("No records found in handle")
> ValueError:
> 
>  No records found in handle
> 
> I want to avoid this error, thus i 
> want the program to print "INCORRECT ID GIVEN"? when a
> wrong or an 
> incorrect ID is given.
> 
> 
> Thanks a lot.
> lummy
> 
> 
> 
> 


From olumideolufuwa at yahoo.com  Mon Apr 19 07:30:24 2010
From: olumideolufuwa at yahoo.com (Olumide Olufuwa)
Date: Mon, 19 Apr 2010 00:30:24 -0700 (PDT)
Subject: [Biopython] Entrez.efetch
In-Reply-To: <mailman.0.1271661752.7278.biopython@lists.open-bio.org>
Message-ID: <221701.32474.qm@web45106.mail.sp1.yahoo.com>


Hello there,
?
I wrote a program, I am not awesome in biopython but this is what it does: The program code that 
accepts user defined UNIPROT ID, retrieves the record using Entrez.efetch and then it
 is parsed to obtain the Pubmed ID which i use to search Medline for Title, Abstract and other information about the entry. 
The code is simply:

query_id=str(raw_input("please


 enter your UNIPROT_ID: ")) #Request UNIPROT ID from user
Entrez.email="ludax5 at yahoo.com"
prothandle=Entrez.efetch(db="protein",


 id=query_id, rettype="gb" #queries Protein DB with the given ID
#The
 program returns an error here if a wrong ID is given. Details of the 
error is given below
seq_record=SeqIO.read(prothandle, "gb")
for

 record in seq_record.annotations['references']: # To obtain Pubmed id 
from the seqrecord
?? key_word=record.pubmed_id
?? if key_word:
????

 handle=Entrez.efetch(db="pubmed",

 id=key_word, rettype="medline")
???? 
medRecords=Medline.parse(handle)
???? for rec in medRecords: #prints 
title and Abstract
???????? if rec.has_key('AB') and 
rec.has_key('TI'):
?????????? print "TITLE: ",rec['TI']
??????????
 print "ABSTRACT: ",rec['AB']
?????????? print ' '


THE 
PROBLEM: The program gives an error if a wrong ID is entered or an ID 
other than UNIPROT ID e.g PDB ID, GSS ID etc. 


An Example Run with a wrong ID is shown below:


please enter your UNIPROT_ID:
 1wio #A PDB ID is given instead


Traceback (most recent call last):
? File "file.py", line 11, in 
<module>
??? seq_record=SeqIO.read(prothandle, "gb")
? File 
"/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.py", line 522, in 
read
??? raise ValueError("No records found in handle")
ValueError:


 No records found in handle

I want to avoid this error, thus i 
want the program to print "INCORRECT ID GIVEN"? when a wrong or an 
incorrect ID is given.


Thanks a lot.
lummy


From mjldehoon at yahoo.com  Mon Apr 19 07:45:59 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 19 Apr 2010 00:45:59 -0700 (PDT)
Subject: [Biopython] Entrez.efetch
In-Reply-To: <221701.32474.qm@web45106.mail.sp1.yahoo.com>
Message-ID: <902706.80063.qm@web62402.mail.re1.yahoo.com>

Put a try:/except: block around the call to SeqIO.read, as in:

try:
    seq_record=SeqIO.read(prothandle, "gb")
except ValueError:
    print "INCORRECT ID GIVEN"


--Michiel

--- On Mon, 4/19/10, Olumide Olufuwa <olumideolufuwa at yahoo.com> wrote:

> From: Olumide Olufuwa <olumideolufuwa at yahoo.com>
> Subject: [Biopython] Entrez.efetch
> To: biopython at lists.open-bio.org
> Date: Monday, April 19, 2010, 3:30 AM
> 
> Hello there,
> ?
> I wrote a program, I am not awesome in biopython but this
> is what it does: The program code that 
> accepts user defined UNIPROT ID, retrieves the record using
> Entrez.efetch and then it
>  is parsed to obtain the Pubmed ID which i use to search
> Medline for Title, Abstract and other information about the
> entry. 
> The code is simply:
> 
> query_id=str(raw_input("please
> 
> 
> 
>  enter your UNIPROT_ID: ")) #Request UNIPROT ID from user
> Entrez.email="ludax5 at yahoo.com"
> prothandle=Entrez.efetch(db="protein",
> 
> 
> 
>  id=query_id, rettype="gb" #queries Protein DB with the
> given ID
> #The
>  program returns an error here if a wrong ID is given.
> Details of the 
> error is given below
> seq_record=SeqIO.read(prothandle, "gb")
> for
> 
>  record in seq_record.annotations['references']: # To
> obtain Pubmed id 
> from the seqrecord
> ?? key_word=record.pubmed_id
> ?? if key_word:
> ????
> 
>  handle=Entrez.efetch(db="pubmed",
> 
>  id=key_word, rettype="medline")
> ???? 
> medRecords=Medline.parse(handle)
> ???? for rec in medRecords: #prints 
> title and Abstract
> ???????? if rec.has_key('AB') and 
> rec.has_key('TI'):
> ?????????? print "TITLE: ",rec['TI']
> ??????????
>  print "ABSTRACT: ",rec['AB']
> ?????????? print ' '
> 
> 
> THE 
> PROBLEM: The program gives an error if a wrong ID is
> entered or an ID 
> other than UNIPROT ID e.g PDB ID, GSS ID etc. 
> 
> 
> 
> An Example Run with a wrong ID is shown below:
> 
> 
> please enter your UNIPROT_ID:
>  1wio #A PDB ID is given instead
> 
> 
> Traceback (most recent call last):
> ? File "file.py", line 11, in 
> <module>
> ??? seq_record=SeqIO.read(prothandle, "gb")
> ? File 
> "/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.py",
> line 522, in 
> read
> ??? raise ValueError("No records found in handle")
> ValueError:
> 
> 
>  No records found in handle
> 
> I want to avoid this error, thus i 
> want the program to print "INCORRECT ID GIVEN"? when a
> wrong or an 
> incorrect ID is given.
> 
> 
> Thanks a lot.
> lummy
> 
> 
> 
> 
> ? ? ? 
> _______________________________________________
> Biopython mailing list? -? Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
> 


From fkauff at biologie.uni-kl.de  Tue Apr 20 14:27:30 2010
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Tue, 20 Apr 2010 16:27:30 +0200
Subject: [Biopython] Code for protein alpha helix prediction
Message-ID: <4BCDB9D2.4050207@biologie.uni-kl.de>

Hi all,

I've recently been asked to help with screening protein sequences for 
certain features, something I don't really know much about... Yet!

My questions: Is there some code in Biopython that allows for a quick 
check whether an amino acid sequece is likely to be a alpha helix? 
Couldn't find any. Or is there an algorithm that could be 
straightforwardly implemented in python, or a commandline tool that 
could be called from within a python script?

Thanks in advance,
Frank


From rodrigo_faccioli at uol.com.br  Tue Apr 20 15:34:47 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Tue, 20 Apr 2010 12:34:47 -0300
Subject: [Biopython] Code for protein alpha helix prediction
In-Reply-To: <4BCDB9D2.4050207@biologie.uni-kl.de>
References: <4BCDB9D2.4050207@biologie.uni-kl.de>
Message-ID: <j2q3715adb71004200834l263718c1u3a3590a4750198a7@mail.gmail.com>

Hi Frank,

I'm not sure if I understood your question. I'm computer scientist and I'm
researching globular protein structure prediction. In fact, I've studied the
application of Evolutionary Algorithms for it. Therefore, our goals are
different.

if I understood your question, you have a Fasta file of your protein.  So,
you need to communicate with databases such as NCBI, scop and CATH. In this
way, I recommend you use Entrez BioPython module. Other suggestion is the
use of BioPython Blast module.

Sorry if my answer is not what you is looking for.

Thanks,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218


On Tue, Apr 20, 2010 at 11:27 AM, Frank Kauff <fkauff at biologie.uni-kl.de>wrote:

> Hi all,
>
> I've recently been asked to help with screening protein sequences for
> certain features, something I don't really know much about... Yet!
>
> My questions: Is there some code in Biopython that allows for a quick check
> whether an amino acid sequece is likely to be a alpha helix? Couldn't find
> any. Or is there an algorithm that could be straightforwardly implemented in
> python, or a commandline tool that could be called from within a python
> script?
>
> Thanks in advance,
> Frank
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From biopython at maubp.freeserve.co.uk  Tue Apr 20 15:43:02 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Apr 2010 16:43:02 +0100
Subject: [Biopython] Code for protein alpha helix prediction
In-Reply-To: <4BCDB9D2.4050207@biologie.uni-kl.de>
References: <4BCDB9D2.4050207@biologie.uni-kl.de>
Message-ID: <g2t320fb6e01004200843h7940bdb6w30b7ebe392759ba9@mail.gmail.com>

On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff <fkauff at biologie.uni-kl.de> wrote:
> Hi all,
>
> I've recently been asked to help with screening protein sequences for
> certain features, something I don't really know much about... Yet!
>
> My questions: Is there some code in Biopython that allows for a quick check
> whether an amino acid sequece is likely to be a alpha helix? Couldn't find
> any. Or is there an algorithm that could be straightforwardly implemented in
> python, or a commandline tool that could be called from within a python
> script?

Hi Frank,

There are lots of tools for predicting secondary structure (alpha helices,
beta sheets etc) both de novo, and guided by reference sequences with
known structures. Some of these are online web services.

I'm pretty sure there is nothing for this built into Biopython, so for scripting
this for a large number of sequences then (as you have also suggested),
my first approach would be to look for command line tools which you could
call from Python. I've never needed to do this myself, and have no specific
recommendations regarding which tools to try first.

If you do find some useful algorithms which could easily be implemented
in Python, they could be worth including - maybe under Bio.SeqUtils?

Peter


From darnells at dnastar.com  Tue Apr 20 18:16:22 2010
From: darnells at dnastar.com (Steve Darnell)
Date: Tue, 20 Apr 2010 13:16:22 -0500
Subject: [Biopython] Code for protein alpha helix prediction
In-Reply-To: <g2t320fb6e01004200843h7940bdb6w30b7ebe392759ba9@mail.gmail.com>
References: <4BCDB9D2.4050207@biologie.uni-kl.de>
	<g2t320fb6e01004200843h7940bdb6w30b7ebe392759ba9@mail.gmail.com>
Message-ID: <A4009967D1886D4286A9B7931FD58610021BE668@FS1.dnastar.com>

Frank,

One of the most accurate (and popular) algorithms is PSIPRED.  A
stand-alone command line version is available:
http://bioinfadmin.cs.ucl.ac.uk/downloads/psipred/

If memory serves, it requires a local installation of blast and the nr
database.  A position weight matrix generated from PSI-BLAST acts as
input to a neural network, which makes the secondary structure
predictions.

The Rosetta Design group had a poll last year of people's favorite
tools.  There are plenty of others to try if PSIPRED doesn't meet your
needs.

http://rosettadesigngroup.com/blog/456/fairest-secondary-structure-predi
ction-algorithm/

I am not a PSIPRED developer, just a satisfied user.

Regards,
Steve

-----Original Message-----
From: biopython-bounces at lists.open-bio.org
[mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter
Sent: Tuesday, April 20, 2010 10:43 AM
To: Frank Kauff
Cc: BioPython Mailing List
Subject: Re: [Biopython] Code for protein alpha helix prediction

On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff <fkauff at biologie.uni-kl.de>
wrote:
> Hi all,
>
> I've recently been asked to help with screening protein sequences for
> certain features, something I don't really know much about... Yet!
>
> My questions: Is there some code in Biopython that allows for a quick
check
> whether an amino acid sequece is likely to be a alpha helix? Couldn't
find
> any. Or is there an algorithm that could be straightforwardly
implemented in
> python, or a commandline tool that could be called from within a
python
> script?

Hi Frank,

There are lots of tools for predicting secondary structure (alpha
helices,
beta sheets etc) both de novo, and guided by reference sequences with
known structures. Some of these are online web services.

I'm pretty sure there is nothing for this built into Biopython, so for
scripting
this for a large number of sequences then (as you have also suggested),
my first approach would be to look for command line tools which you
could
call from Python. I've never needed to do this myself, and have no
specific
recommendations regarding which tools to try first.

If you do find some useful algorithms which could easily be implemented
in Python, they could be worth including - maybe under Bio.SeqUtils?

Peter
_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython


From fkauff at biologie.uni-kl.de  Wed Apr 21 11:50:30 2010
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Wed, 21 Apr 2010 13:50:30 +0200
Subject: [Biopython] Code for protein alpha helix prediction
In-Reply-To: <A4009967D1886D4286A9B7931FD58610021BE668@FS1.dnastar.com>
References: <4BCDB9D2.4050207@biologie.uni-kl.de>
	<g2t320fb6e01004200843h7940bdb6w30b7ebe392759ba9@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610021BE668@FS1.dnastar.com>
Message-ID: <4BCEE686.3080803@biologie.uni-kl.de>

Thanks everybody!

Now I have plenty of tools to look at - the standalone version of 
psipred certainly fulfills the easy-to-use and quick-to-try-out 
requirements.

Frank

On 04/20/2010 08:16 PM, Steve Darnell wrote:
> Frank,
>
> One of the most accurate (and popular) algorithms is PSIPRED.  A
> stand-alone command line version is available:
> http://bioinfadmin.cs.ucl.ac.uk/downloads/psipred/
>
> If memory serves, it requires a local installation of blast and the nr
> database.  A position weight matrix generated from PSI-BLAST acts as
> input to a neural network, which makes the secondary structure
> predictions.
>
> The Rosetta Design group had a poll last year of people's favorite
> tools.  There are plenty of others to try if PSIPRED doesn't meet your
> needs.
>
> http://rosettadesigngroup.com/blog/456/fairest-secondary-structure-predi
> ction-algorithm/
>
> I am not a PSIPRED developer, just a satisfied user.
>
> Regards,
> Steve
>
> -----Original Message-----
> From: biopython-bounces at lists.open-bio.org
> [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter
> Sent: Tuesday, April 20, 2010 10:43 AM
> To: Frank Kauff
> Cc: BioPython Mailing List
> Subject: Re: [Biopython] Code for protein alpha helix prediction
>
> On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff<fkauff at biologie.uni-kl.de>
> wrote:
>    
>> Hi all,
>>
>> I've recently been asked to help with screening protein sequences for
>> certain features, something I don't really know much about... Yet!
>>
>> My questions: Is there some code in Biopython that allows for a quick
>>      
> check
>    
>> whether an amino acid sequece is likely to be a alpha helix? Couldn't
>>      
> find
>    
>> any. Or is there an algorithm that could be straightforwardly
>>      
> implemented in
>    
>> python, or a commandline tool that could be called from within a
>>      
> python
>    
>> script?
>>      
> Hi Frank,
>
> There are lots of tools for predicting secondary structure (alpha
> helices,
> beta sheets etc) both de novo, and guided by reference sequences with
> known structures. Some of these are online web services.
>
> I'm pretty sure there is nothing for this built into Biopython, so for
> scripting
> this for a large number of sequences then (as you have also suggested),
> my first approach would be to look for command line tools which you
> could
> call from Python. I've never needed to do this myself, and have no
> specific
> recommendations regarding which tools to try first.
>
> If you do find some useful algorithms which could easily be implemented
> in Python, they could be worth including - maybe under Bio.SeqUtils?
>
> Peter
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>    


From fkauff at biologie.uni-kl.de  Wed Apr 21 11:59:31 2010
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Wed, 21 Apr 2010 13:59:31 +0200
Subject: [Biopython] Code for protein alpha helix prediction
In-Reply-To: <g2t320fb6e01004200843h7940bdb6w30b7ebe392759ba9@mail.gmail.com>
References: <4BCDB9D2.4050207@biologie.uni-kl.de>
	<g2t320fb6e01004200843h7940bdb6w30b7ebe392759ba9@mail.gmail.com>
Message-ID: <4BCEE8A3.3010008@biologie.uni-kl.de>

Hi Peter,

for the start, it seems psipred is the easiest one to use and to 
implement. I'll start with that, and once the parser for the output goes 
beyond the quick-and-dirty level, we can think about including it.

Frank

On 04/20/2010 05:43 PM, Peter wrote:
> On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff<fkauff at biologie.uni-kl.de>  wrote:
>    
>> Hi all,
>>
>> I've recently been asked to help with screening protein sequences for
>> certain features, something I don't really know much about... Yet!
>>
>> My questions: Is there some code in Biopython that allows for a quick check
>> whether an amino acid sequece is likely to be a alpha helix? Couldn't find
>> any. Or is there an algorithm that could be straightforwardly implemented in
>> python, or a commandline tool that could be called from within a python
>> script?
>>      
> Hi Frank,
>
> There are lots of tools for predicting secondary structure (alpha helices,
> beta sheets etc) both de novo, and guided by reference sequences with
> known structures. Some of these are online web services.
>
> I'm pretty sure there is nothing for this built into Biopython, so for scripting
> this for a large number of sequences then (as you have also suggested),
> my first approach would be to look for command line tools which you could
> call from Python. I've never needed to do this myself, and have no specific
> recommendations regarding which tools to try first.
>
> If you do find some useful algorithms which could easily be implemented
> in Python, they could be worth including - maybe under Bio.SeqUtils?
>
> Peter
>    


From bala.biophysics at gmail.com  Wed Apr 21 14:25:35 2010
From: bala.biophysics at gmail.com (Bala subramanian)
Date: Wed, 21 Apr 2010 16:25:35 +0200
Subject: [Biopython] removing redundant sequence
In-Reply-To: <t2j320fb6e01004130802u78271ebp8b48b32b488c6e2b@mail.gmail.com>
References: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
	<t2j320fb6e01004130802u78271ebp8b48b32b488c6e2b@mail.gmail.com>
Message-ID: <r2o288df32a1004210725yeb09e79fs57d7c3352da025cb@mail.gmail.com>

Peter,
Sorry for the delayed reply. Yes i want to remove those sequences that are
100% identical but they have different identifier. I created a sample fasta
file with two redundant sequences. But when i use checksums seguid to spot
the redundancies, it spots only the first one.

In [36]: for record in SeqIO.parse(open('t'),'fasta'):
   ....:     print record.id, seguid(record.seq)
   ....:
   ....:
A04321 44lpJ2F4Eb74aKigVa5Sut/J0M8
*AF02161a asaPdDgrYXwwJItOY/wlQFGTmGw
AF02161b asaPdDgrYXwwJItOY/wlQFGTmGw*
AF021618 JvRNzgmeXDBbA9SL5+OQaH2V/zA
AF021622 JvRNzgmeXDBbA9SL5+OQaH2V/zA
AF021627 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
AF021628 2GT4z2fXZdv9f51ng74C8o0rQXM
AF021629 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
*AF02163a fOKCIiGvk6NaPDYY6oKx74tvcxY
AF02163b fOKCIiGvk6NaPDYY6oKx74tvcxY
*
In [37]: hivdict=SeqIO.to_dict(SeqIO.parse(open('t'),'fasta'),lambda
rec:seguid(rec.seq))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/home/cbala/test/<ipython console> in <module>()

/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.pyc in
to_dict(sequences, key_function)
    585         key = key_function(record)
    586         if key in d :
--> 587             raise ValueError("Duplicate key '%s'" % key)
    588         d[key] = record
    589     return d

ValueError: Duplicate key 'asaPdDgrYXwwJItOY/wlQFGTmGw'


On Tue, Apr 13, 2010 at 5:02 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Tue, Apr 13, 2010 at 3:49 PM, Bala subramanian
> <bala.biophysics at gmail.com> wrote:
> > Friends,
> > Sorry if this question was asked before. Is there any function in
> Biopython
> > that can remove redundant sequence records from a fasta file.
> >
> > Thanks,
> > Bala
>
> No, but you should be able to do this with Biopython - depending on
> what exactly you are asking for.
>
> When you say "redundant" do you mean 100% perfect identify?
>
> How big is your FASTA file - are you working with next-gen sequencing
> data and millions of reads?. If it is small enough you can keep all
> the data in memory to compare sequences to each other. Otherwise
> you might try using a checksum (e.g. SEGUID) to spot duplicates.
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Wed Apr 21 15:10:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Apr 2010 16:10:45 +0100
Subject: [Biopython] removing redundant sequence
In-Reply-To: <r2o288df32a1004210725yeb09e79fs57d7c3352da025cb@mail.gmail.com>
References: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
	<t2j320fb6e01004130802u78271ebp8b48b32b488c6e2b@mail.gmail.com>
	<r2o288df32a1004210725yeb09e79fs57d7c3352da025cb@mail.gmail.com>
Message-ID: <u2n320fb6e01004210810ve1c9a2f8qa09d0970c23e2062@mail.gmail.com>

On Wed, Apr 21, 2010 at 3:25 PM, Bala subramanian
<bala.biophysics at gmail.com> wrote:
> Peter,
> Sorry for the delayed reply. Yes i want to remove those sequences that are
> 100% identical but they have different identifier. I created a sample fasta
> file with two redundant sequences. But when i use checksums seguid to spot
> the redundancies, it spots only the first one.
>
> In [36]: for record in SeqIO.parse(open('t'),'fasta'):
> ? ....: ? ? print record.id, seguid(record.seq)
> ? ....:
> ? ....:
> A04321 44lpJ2F4Eb74aKigVa5Sut/J0M8
> *AF02161a asaPdDgrYXwwJItOY/wlQFGTmGw
> AF02161b asaPdDgrYXwwJItOY/wlQFGTmGw*
> AF021618 JvRNzgmeXDBbA9SL5+OQaH2V/zA
> AF021622 JvRNzgmeXDBbA9SL5+OQaH2V/zA
> AF021627 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
> AF021628 2GT4z2fXZdv9f51ng74C8o0rQXM
> AF021629 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
> *AF02163a fOKCIiGvk6NaPDYY6oKx74tvcxY
> AF02163b fOKCIiGvk6NaPDYY6oKx74tvcxY
> *
> In [37]: hivdict=SeqIO.to_dict(SeqIO.parse(open('t'),'fasta'),lambda
> rec:seguid(rec.seq))
> ---------------------------------------------------------------------------
> ValueError ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Traceback (most recent call last)
>
> /home/cbala/test/<ipython console> in <module>()
>
> /usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.pyc in
> to_dict(sequences, key_function)
> ? ?585 ? ? ? ? key = key_function(record)
> ? ?586 ? ? ? ? if key in d :
> --> 587 ? ? ? ? ? ? raise ValueError("Duplicate key '%s'" % key)
> ? ?588 ? ? ? ? d[key] = record
> ? ?589 ? ? return d
>
> ValueError: Duplicate key 'asaPdDgrYXwwJItOY/wlQFGTmGw'

Hi Bala,

You know there are duplicate sequences in your file, so if you try
to use the SEGUID as a key, there will be duplicate keys. Thus
you get this error message. If you want to use Bio.SeqIO.to_dict
you have to have unique keys.

What you should do is loop over the records and keep a record
of the checksums you have saved, and use that to ignore duplicates.
I would use a python set rather than a python list for speed.

You could do this with a for loop. However, I would probably use an
iterator based approach with a generator function - I think it is more
elegant but perhaps not so easy for a beginner:

from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

def remove_dup_seqs(records):
    """"SeqRecord iterator to removing duplicate sequences."""
    checksums = set()
    for record in records:
        checksum = seguid(record.seq)
        if checksum in checksums:
            print "Ignoring %s" % record.id
            continue
        checksums.add(checksum)
        yield record

records = remove_dup_seqs(SeqIO.parse("with_dups.fasta", "fasta"))
count = SeqIO.write(records, "no_dups.fasta", "fasta")
print "Saved %i records" % count

Note I've used filename with Bio.SeqIO which requires Biopython 1.54b
or later - for older versions use handles. See also:

http://news.open-bio.org/news/2010/04/biopython-seqio-and-alignio-easier/

Peter


From silvio.tschapke at googlemail.com  Wed Apr 21 18:34:54 2010
From: silvio.tschapke at googlemail.com (Silvio Tschapke)
Date: Wed, 21 Apr 2010 20:34:54 +0200
Subject: [Biopython] Entrez.efetch rettype retmode
Message-ID: <l2sd3ddc94e1004211134pe3590354kf65fdd8f76aa5cad@mail.gmail.com>

Hello.

I am new to Biopython and I tried to download a whole record with efetch.
The problem is that I get an error message in the output:
""Report 'full' not found in 'pmc' presentation""
Maybe I haven't understood the whole principle.

But isn't it the goal of pmc to provide full text? I have read the help-page
of efetch but it doesn't help me a lot.


----
handle = Entrez.efetch(db="pmc", id="2531137", rettype="full",
retmode="text")
string = str(handle.read())


f = open('./output.txt', 'w')
f.write(string)
----

Thanks for your help!


From robert.campbell at queensu.ca  Wed Apr 21 20:14:10 2010
From: robert.campbell at queensu.ca (Robert Campbell)
Date: Wed, 21 Apr 2010 16:14:10 -0400
Subject: [Biopython] Entrez.efetch rettype retmode
In-Reply-To: <l2sd3ddc94e1004211134pe3590354kf65fdd8f76aa5cad@mail.gmail.com>
References: <l2sd3ddc94e1004211134pe3590354kf65fdd8f76aa5cad@mail.gmail.com>
Message-ID: <20100421161410.4fd950ec@adelie.biochem.queensu.ca>

Hello Silvio,

On Wed, 21 Apr 2010 20:34:54 +0200 Silvio Tschapke
<silvio.tschapke at googlemail.com> wrote:

> Hello.
> 
> I am new to Biopython and I tried to download a whole record with efetch.
> The problem is that I get an error message in the output:
> ""Report 'full' not found in 'pmc' presentation""
> Maybe I haven't understood the whole principle.
> 
> But isn't it the goal of pmc to provide full text? I have read the help-page
> of efetch but it doesn't help me a lot.
> 
> 
> ----
> handle = Entrez.efetch(db="pmc", id="2531137", rettype="full",
> retmode="text")
> string = str(handle.read())

The documentation on efetch
(http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html)
specifies that:

  pmc - PubMed Central contains a number of articles classified as "open
  access" for which you may download the full text as XML. For the remaining
  articles in PMC you may download only the abstracts as XML.
 
So you just need to change your retmode='text' to retmode='xml' and omit the
rettype option altogether.  You will find that not all articles are free to
download this way though.  I tried a random one and got an error message that
the particular journal didn't allow download of full text as XML.

Cheers,
Rob
-- 
Robert L. Campbell, Ph.D.
Senior Research Associate/Adjunct Assistant Professor 
Botterell Hall Rm 644
Department of Biochemistry, Queen's University, 
Kingston, ON K7L 3N6  Canada
Tel: 613-533-6821            Fax: 613-533-2497
<robert.campbell at queensu.ca>    http://pldserver1.biochem.queensu.ca/~rlc


From laserson at mit.edu  Thu Apr 22 01:07:19 2010
From: laserson at mit.edu (Uri Laserson)
Date: Wed, 21 Apr 2010 21:07:19 -0400
Subject: [Biopython] Bug in GenBank/EMBL parser?
Message-ID: <l2x165c1bda1004211807o1cdcea19w46608a52fdf2a679@mail.gmail.com>

Hi,

I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which
supposedly conforms to the EMBL standard).

The short story is that whenever there is a feature, the parser checks
whether there are qualifiers in the feature with an assert statement, and
does not allow features with no qualifiers.  However, the IMGT flatfile is
full of entries that have features with no qualifiers (only coordinates).

Who is wrong here?  Does the EMBL specification require that a feature have
qualifiers?  Or is this a bug to be fixed in the parser.

To be more concrete, the parser broke on the following record:

ID   A03907 IMGT/LIGM annotation : keyword level; unassigned DNA; HUM; 412
BP.
XX
AC   A03907;
XX
DT   11-MAR-1998 (Rel. 8, arrived in LIGM-DB )
DT   10-JUN-2008 (Rel. 200824-2, Last updated, Version 3)
XX
DE   H.sapiens antibody D1.3 variable region protein  ;
DE   unassigned DNA; rearranged configuration; Ig-Heavy; regular; group
IGHV.
XX
KW   antigen receptor; Immunoglobulin superfamily (IgSF);
KW   Immunoglobulin (IG); IG-Heavy; variable; diversity; joining;
KW   rearranged.
XX
OS   Homo sapiens (human)
OC   cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa;
Eumetazoa;
OC   Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata;
OC   Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda;
OC   Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates;
OC   Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae;
OC   Homo/Pan/Gorilla group; Homo.
XX
RN   [1]
RP   1-412
RA   ;
RT   "Recombinant antibodies and methods for their production.";
RL   Patent number EP0239400-A/10, 30-SEP-1987.
RL   MEDICAL RESEARCH COUNCIL.
XX
DR   EMBL; A03907.
XX
FH   Key             Location/Qualifiers (from EMBL)
FH
FT   source          1..412
FT                   /organism="Homo sapiens"
FT                   /mol_type="unassigned DNA"
FT                   /db_xref="taxon:9606"
FT   V_region        8..>412
FT                   /note="antibody D1.3 V region"
FT   sig_peptide     8..64
FT   CDS             8..>412
FT                   /product="antibody D1.3 V region (VDJ)"
FT                   /protein_id="CAA00308.1"
FT
/translation="MAVLALLFCLVTFPSCILSQVQLKESGPGLVAPSQSLSITCTVSG
FT
FSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSL
FT                   HTDDTARYYCARERDYRLDYWGQGTTLTVSS"
FT   D_segment       356..371
FT   J_segment       372..>412
FT                   /note="J(H)2 region"
XX
SQ   Sequence 412 BP; 105 A; 109 C; 104 G; 94 T; 0 other;
     tcagagcatg gctgtcctgg cattactctt ctgcctggta acattcccaa gctgtatcct
 60
     ttcccaggtg cagctgaagg agtcaggacc tggcctggtg gcgccctcac agagcctgtc
120
     catcacatgc accgtctcag ggttctcatt aaccggctat ggtgtaaact gggttcgcca
180
     gcctccagga aagggtctgg agtggctggg aatgatttgg ggtgatggaa acacagacta
240
     taattcagct ctcaaatcca gactgagcat cagcaaggac aactccaaga gccaagtttt
300
     cttaaaaatg aacagtctgc acactgatga cacagccagg tactactgtg ccagagagag
360
     agattatagg cttgactact ggggccaagg caccactctc acagtctcct ca
412
//

And the traceback was:

ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (311, 0))

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)

/Volumes/External/home/laserson/research/church/vdj-ome/ref-data/IMGT/<ipython
console> in <module>()

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc
in parse_records(self, handle, do_features)
    418         #This is a generator function
    419         while True :
--> 420             record = self.parse(handle, do_features)
    421             if record is None : break
    422             assert record.id is not None

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc
in parse(self, handle, do_features)
    401                     feature_cleaner = FeatureValueCleaner())
    402
--> 403         if self.feed(handle, consumer, do_features) :
    404             return consumer.data
    405         else :

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc
in feed(self, handle, consumer, do_features)
    373         #Features (common to both EMBL and GenBank):
    374         if do_features :
--> 375             self._feed_feature_table(consumer,
self.parse_features(skip=False))
    376         else :
    377             self.parse_features(skip=True) # ignore the data

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc
in parse_features(self, skip)
    170
feature_lines.append(line[self.FEATURE_QUALIFIER_INDENT:].rstrip())
    171                     line = self.handle.readline()
--> 172                 features.append(self.parse_feature(feature_key,
feature_lines))
    173         self.line = line
    174         return features

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc
in parse_feature(self, feature_key, lines)
    267                 else :
    268                     #Unquoted continuation
--> 269                     assert len(qualifiers) > 0
    270                     assert key==qualifiers[-1][0]
    271                     #if debug : print "Unquoted Cont %s:%s" % (key,
line)

AssertionError:

Which is tracked to an assert statement in Scanner.py at line 269.  It
appears that the assumption in the code is that there is an unquoted
continuation of a feature qualifier.

Finally, I am using biopython 1.51 that I built from source using python 2.5
(from an EPD install 4.3.0).  I am on a Mac running OS X 10.5.8 (Leopard)

Thanks!
Uri


From biopython at maubp.freeserve.co.uk  Thu Apr 22 08:56:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 Apr 2010 09:56:52 +0100
Subject: [Biopython] Bug in GenBank/EMBL parser?
In-Reply-To: <l2x165c1bda1004211807o1cdcea19w46608a52fdf2a679@mail.gmail.com>
References: <l2x165c1bda1004211807o1cdcea19w46608a52fdf2a679@mail.gmail.com>
Message-ID: <h2p320fb6e01004220156t7add106dn6e49af03a2b04c6c@mail.gmail.com>

On Thu, Apr 22, 2010 at 2:07 AM, Uri Laserson <laserson at mit.edu> wrote:
> Hi,
>
> I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which
> supposedly conforms to the EMBL standard).
>
> The short story is that whenever there is a feature, the parser checks
> whether there are qualifiers in the feature with an assert statement, and
> does not allow features with no qualifiers. ?However, the IMGT flatfile is
> full of entries that have features with no qualifiers (only coordinates).
>
> Who is wrong here? ?Does the EMBL specification require that a feature have
> qualifiers? ?Or is this a bug to be fixed in the parser.

Hi Uri,

Thank you for your detailed report,

Since you have raised this, I went back over the EMBL documentation.
All their example features qualifiers (and from personal experience all
EMBL files from the EMBL and GenBank files from the NCBI) do have
qualifiers. However, in Section 7.2 they are called "Optional qualifiers".
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#7.2

So it does look like an unwarranted assumption in the Biopython
parser (even though it has been a safe assumption on "official" EMBL
and GenBank files thus far), which we should fix.

Could you file a bug please?
http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython

This also affect Biopython 1.54b (the latest release) and the current
code in the repository. I would hope we can solve this before
Biopython 1.54 proper is released.

Regards,

Peter


From chapmanb at 50mail.com  Thu Apr 22 12:18:10 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 22 Apr 2010 08:18:10 -0400
Subject: [Biopython] removing redundant sequence
In-Reply-To: <u2n320fb6e01004210810ve1c9a2f8qa09d0970c23e2062@mail.gmail.com>
References: <q2s288df32a1004130749re9f12d3ejcd60ffd3d05cce2b@mail.gmail.com>
	<t2j320fb6e01004130802u78271ebp8b48b32b488c6e2b@mail.gmail.com>
	<r2o288df32a1004210725yeb09e79fs57d7c3352da025cb@mail.gmail.com>
	<u2n320fb6e01004210810ve1c9a2f8qa09d0970c23e2062@mail.gmail.com>
Message-ID: <20100422121810.GV29724@sobchak.mgh.harvard.edu>

Bala;

> > I created a sample fasta
> > file with two redundant sequences. But when i use checksums seguid to spot
> > the redundancies, it spots only the first one.

> What you should do is loop over the records and keep a record
> of the checksums you have saved, and use that to ignore duplicates.
> I would use a python set rather than a python list for speed.
> 
> You could do this with a for loop. However, I would probably use an
> iterator based approach with a generator function - I think it is more
> elegant but perhaps not so easy for a beginner:
[... Nice code example from Peter ..]

This is a nice problem example and discussion. Bala, it sounds like
Peter provided some useful example code to solve this. Once you use
this to get together a program that solves your problem, it would be
very helpful if you could write it up as a Cookbook entry:

http://biopython.org/wiki/Category:Cookbook

That would help others in the future who will be tackling similar
issues. Thanks much,
Brad


From cloudycrimson at gmail.com  Fri Apr 23 07:56:45 2010
From: cloudycrimson at gmail.com (Karthik Raja)
Date: Fri, 23 Apr 2010 13:26:45 +0530
Subject: [Biopython] Qblast : no hits
Message-ID: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>

Hello freinds,


I have a  problem with qblast. I have sequences from the mass

spectromerty equipment that needs to be BLASTed to find the protein it
belongs

to. When I blast these sequences in the NCBI website it takes some time

(longer than usual ) but does gives me hits. When i blast them using the

following code in biopython they dont give me any hits.


CODE:

****************************************************************************

>>> from Bio.Blast import NCBIWWW

>>> result_handle = NCBIWWW.qblast("blastp", "nr", "AFAQVRCSGLARGGGYVLR")

>>> blast_results = result_handle.read()

>>> save_file = open( "testseq.xml", "w")

>>> save_file.write(blast_results)

>>> save_file.close()

****************************************************************************


OUTPUT:

****************************************************************************

<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI

BlastOutput/EN" "NCBI_BlastOutput.dtd">

<BlastOutput>

  <BlastOutput_program>blastp</BlastOutput_program>

  <BlastOutput_version>BLASTP 2.2.23+</BlastOutput_version>

  <BlastOutput_reference>Alejandro A. Sch&auml;ffer, L. Aravind, Thomas L.

Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and


Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein

database searches with composition-based statistics and other

refinements", Nucleic Acids Res. 29:2994-3005.</BlastOutput_reference>

  <BlastOutput_db>nr</BlastOutput_db>

  <BlastOutput_query-ID>12361</BlastOutput_query-ID>

  <BlastOutput_query-def>unnamed protein product</BlastOutput_query-def>

  <BlastOutput_query-len>19</BlastOutput_query-len>

  <BlastOutput_param>

    <Parameters>

      <Parameters_matrix>BLOSUM62</Parameters_matrix>

      <Parameters_expect>10</Parameters_expect>

      <Parameters_gap-open>11</Parameters_gap-open>

      <Parameters_gap-extend>1</Parameters_gap-extend>

      <Parameters_filter>F</Parameters_filter>

    </Parameters>

  </BlastOutput_param>

<BlastOutput_iterations>

<Iteration>

  <Iteration_iter-num>1</Iteration_iter-num>

  <Iteration_query-ID>12361</Iteration_query-ID>

  <Iteration_query-def>unnamed protein product</Iteration_query-def>

  <Iteration_query-len>19</Iteration_query-len>

<Iteration_hits>

</Iteration_hits>

  <Iteration_stat>

    <Statistics>

      <Statistics_db-num>10888645</Statistics_db-num>

      <Statistics_db-len>-585703444</Statistics_db-len>

      <Statistics_hsp-len>0</Statistics_hsp-len>

      <Statistics_eff-space>0</Statistics_eff-space>

      <Statistics_kappa>0.041</Statistics_kappa>

      <Statistics_lambda>0.267</Statistics_lambda>

      <Statistics_entropy>0.14</Statistics_entropy>

    </Statistics>

  </Iteration_stat>

</Iteration>

</BlastOutput_iterations>

</BlastOutput>

*****************************************************************************


Is this because a normal blast code doesn wait long till the results are

given? I mean the RTOE error. if yes, how to control the "time of
execution"?

Or else what is the problem with my code?

If you guys know anything on this issue, please give me your ideas.

Thanking you in advance.

Sincerely,

Karthik


From biopython at maubp.freeserve.co.uk  Fri Apr 23 09:49:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Apr 2010 10:49:55 +0100
Subject: [Biopython] Qblast : no hits
In-Reply-To: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
Message-ID: <s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>

Hello Karthik

On Fri, Apr 23, 2010 at 8:56 AM, Karthik Raja <cloudycrimson at gmail.com> wrote:
> Hello freinds,
>
> I have a ?problem with qblast. I have sequences from the mass
> spectromerty equipment that needs to be BLASTed to find the protein it
> belongs to. When I blast these sequences in the NCBI website it takes
> some time (longer than usual ) but does gives me hits. When i blast
> them using the following code in biopython they dont give me any hits.
>
> CODE:
>
> ****************************************************************************
>
>>>> from Bio.Blast import NCBIWWW
>>>> result_handle = NCBIWWW.qblast("blastp", "nr", "AFAQVRCSGLARGGGYVLR")
>>>> blast_results = result_handle.read()
>>>> save_file = open( "testseq.xml", "w")
>>>> save_file.write(blast_results)
>>>> save_file.close()
>
> ****************************************************************************
>
> Is this because a normal blast code doesn wait long till the results are
> given? I mean the RTOE error. if yes, how to control the "time of
> execution"?

What error? It looks like your example ran fine.

> Or else what is the problem with my code?
>
> If you guys know anything on this issue, please give me your ideas.

Differences between a manual BLAST search on the NCBI website
and a script search via QBLAST are almost always down to different
parameter settings. The NCBI have often adjusted the defaults on
the website, and they no longer match the defaults on QBLAST.
You should check things like the expectation cut off, the matrix,
gap penalties etc. The simplest option would be just to copy the
current defaults from the website into your python code.

We probably need to put this into the Biopython FAQ ...

Regards,

Peter


From cjfields at illinois.edu  Fri Apr 23 12:00:07 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 23 Apr 2010 07:00:07 -0500
Subject: [Biopython] Qblast : no hits
In-Reply-To: <s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
Message-ID: <C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>

On Apr 23, 2010, at 4:49 AM, Peter wrote:

>> ...
> 
> Differences between a manual BLAST search on the NCBI website
> and a script search via QBLAST are almost always down to different
> parameter settings. The NCBI have often adjusted the defaults on
> the website, and they no longer match the defaults on QBLAST.
> You should check things like the expectation cut off, the matrix,
> gap penalties etc. The simplest option would be just to copy the
> current defaults from the website into your python code.
> 
> We probably need to put this into the Biopython FAQ ...
> 
> Regards,
> 
> Peter

Same for BioPerl.

chris


From cloudycrimson at gmail.com  Sat Apr 24 03:27:10 2010
From: cloudycrimson at gmail.com (Karthik Raja)
Date: Sat, 24 Apr 2010 08:57:10 +0530
Subject: [Biopython] Qblast : no hits
In-Reply-To: <C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
Message-ID: <l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>

Hello Peter,

I did try changing the paramters according to the WWW BLAST and its gives an
error saying "no RID or no RTOE found". Its the same error i was trying to
tell you in the 1st post. Its the "request time of execution". Is there any
way to change this RTOE i.e. to increase it? Any idea?

On Fri, Apr 23, 2010 at 5:30 PM, Chris Fields <cjfields at illinois.edu> wrote:

> On Apr 23, 2010, at 4:49 AM, Peter wrote:
>
> >> ...
> >
> > Differences between a manual BLAST search on the NCBI website
> > and a script search via QBLAST are almost always down to different
> > parameter settings. The NCBI have often adjusted the defaults on
> > the website, and they no longer match the defaults on QBLAST.
> > You should check things like the expectation cut off, the matrix,
> > gap penalties etc. The simplest option would be just to copy the
> > current defaults from the website into your python code.
> >
> > We probably need to put this into the Biopython FAQ ...
> >
> > Regards,
> >
> > Peter
>
> Same for BioPerl.
>
> chris
>


From p.j.a.cock at googlemail.com  Sat Apr 24 11:40:27 2010
From: p.j.a.cock at googlemail.com (Peter)
Date: Sat, 24 Apr 2010 12:40:27 +0100
Subject: [Biopython] Qblast : no hits
In-Reply-To: <l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
	<l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
Message-ID: <6540A260-554B-488A-AED7-B0559883F7F7@googlemail.com>


On 24 Apr 2010, at 04:27, Karthik Raja <cloudycrimson at gmail.com> wrote:

> Hello Peter,
>
> I did try changing the paramters according to the WWW BLAST and its  
> gives an
> error saying "no RID or no RTOE found". Its the same error i was  
> trying to
> tell you in the 1st post. Its the "request time of execution". Is  
> there any
> way to change this RTOE i.e. to increase it? Any idea?
>
> On Fri, Apr 23, 2010 at 5:30 PM, Chris Fields  
> <cjfields at illinois.edu> wrote:
>
>> On Apr 23, 2010, at 4:49 AM, Peter wrote:
>>
>>>> ...
>>>
>>> Differences between a manual BLAST search on the NCBI website
>>> and a script search via QBLAST are almost always down to different
>>> parameter settings. The NCBI have often adjusted the defaults on
>>> the website, and they no longer match the defaults on QBLAST.
>>> You should check things like the expectation cut off, the matrix,
>>> gap penalties etc. The simplest option would be just to copy the
>>> current defaults from the website into your python code.
>>>
>>> We probably need to put this into the Biopython FAQ ...
>>>
>>> Regards,
>>>
>>> Peter
>>
>> Same for BioPerl.
>>
>> chris
>>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


From biopython at maubp.freeserve.co.uk  Sat Apr 24 11:49:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 24 Apr 2010 12:49:55 +0100
Subject: [Biopython] Qblast : no hits
In-Reply-To: <l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
	<l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
Message-ID: <m2r320fb6e01004240449qcb2a7cc8lc2798fba8d596036@mail.gmail.com>

Hi all,

Sorry for the blank email just now.

On Sat, Apr 24, 2010 at 4:27 AM, Karthik Raja wrote:
> Hello Peter,
>
> I did try changing the paramters according to the WWW BLAST
> and its gives an error saying "no RID or no RTOE found". Its the
> same error i was trying to tell you in the 1st post. Its the "request
> time of execution". Is there any way to change this RTOE i.e. to
> increase it? Any idea?

Please show us an example with this problem (i.e. the python
code and the traceback).

What is meant to happen is we send the query to the NCBI, and
they reply with reference details (RID and RTOE) which are
used to fetch the results after BLAST has finished running.

My guess for what is happening is your parameters are for
some reason invalid, and the NCBI is giving an error page
(so no RID and no RTOE). Biopython tries to spot any error
message in this situation, but in your case could not.

Peter


From cloudycrimson at gmail.com  Sun Apr 25 03:24:59 2010
From: cloudycrimson at gmail.com (Karthik Raja)
Date: Sun, 25 Apr 2010 08:54:59 +0530
Subject: [Biopython] Fwd:  Qblast : no hits
In-Reply-To: <s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
	<l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
	<m2r320fb6e01004240449qcb2a7cc8lc2798fba8d596036@mail.gmail.com>
	<s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
Message-ID: <y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>

 Hello Peter,

As said i did try changing the parameters of qblast according to the set in
the web blast.
The parameters that I changed are
1. Martrix
2. Word size
3. Expect

There is a check box option in the web page that allows us to check it if we
want the web blast to adjust according short sequences. I am not sure how to
bring that option into the qblast.

*Below given are the code and the traceback. *
 >>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast ("blastp", "nr", "SSRVQDGMGLYTARRVR",
auto_format=None, composition_based_statistics=None, db_genetic_code=None,
endpoints=None, entrez_query='(none)', expect=200000, filter=None,
gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None,
layout=None, lcase_mask=None, matrix_name= 'PAM30', nucl_penalty=None,
nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None,
query_file=None, query_believe_defline=None, query_from=None, query_to=None,
searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None,
word_size=2, alignments=500, alignment_view=None, descriptions=500,
entrez_links_new_window=None, expect_low=None, expect_high=None,
format_entrez_query=None, format_object=None, format_type='XML',
ncbi_gi=None, results_file=None, show_overview=None)

*Traceback (most recent call last):
*  File "<pyshell#2>", line 1, in <module>
    result_handle = NCBIWWW.qblast *("blastp", "nr",
"SSRVQDGMGLYTARRVR",*auto_format=None,
composition_based_statistics=None, db_genetic_code=None,
endpoints=None, entrez_query='(none)', *expect=200000*, filter=None,
gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None,
layout=None, lcase_mask=None, *matrix_name= 'PAM30'*, nucl_penalty=None,
nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None,
query_file=None, query_believe_defline=None, query_from=None, query_to=None,
searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, *
word_size=2*, alignments=500, alignment_view=None, descriptions=500,
entrez_links_new_window=None, expect_low=None, expect_high=None,
format_entrez_query=None, format_object=None, format_type='XML',
ncbi_gi=None, results_file=None, show_overview=None)
  File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 117, in
qblast
    rid, rtoe = _parse_qblast_ref_page(handle)
  File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 203, in
_parse_qblast_ref_page
    raise ValueError("No RID and no RTOE found in the 'please wait' page."
ValueError: No RID and no RTOE found in the 'please wait' page. (there was
probably a problem with your request)

Here are a few examples of my MS sequences.


   1. *IMYTALPVIGKRHFRPSFTR *
   2. *RSSRGRGR *
   3. *AGPGPRRAKAAPYR *
   4. *ASRSYSSERRAR *
   5. *AASAAPPRAGRPDRGPLALAGR *
   6. *GSDGKSRGR *
   7. *TYGWRAEPR *
   8. *PPEPAREPRLSPRR *
   9. *GVLTALRR *
   10. *AGMRLPSRRQSFPAPVSR *

*Sincerely, *
*Karthikraja*

On Sat, Apr 24, 2010 at 5:19 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> Hi all,
>
> Sorry for the blank email just now.
>
> On Sat, Apr 24, 2010 at 4:27 AM, Karthik Raja wrote:
> > Hello Peter,
> >
> > I did try changing the paramters according to the WWW BLAST
> > and its gives an error saying "no RID or no RTOE found". Its the
> > same error i was trying to tell you in the 1st post. Its the "request
> > time of execution". Is there any way to change this RTOE i.e. to
> > increase it? Any idea?
>
> Please show us an example with this problem (i.e. the python
> code and the traceback).
>
> What is meant to happen is we send the query to the NCBI, and
> they reply with reference details (RID and RTOE) which are
> used to fetch the results after BLAST has finished running.
>
> My guess for what is happening is your parameters are for
> some reason invalid, and the NCBI is giving an error page
> (so no RID and no RTOE). Biopython tries to spot any error
> message in this situation, but in your case could not.
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Sun Apr 25 12:45:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 25 Apr 2010 13:45:05 +0100
Subject: [Biopython] Fwd: Qblast : no hits
In-Reply-To: <y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
	<l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
	<m2r320fb6e01004240449qcb2a7cc8lc2798fba8d596036@mail.gmail.com>
	<s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
	<y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
Message-ID: <t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>

On Sun, Apr 25, 2010 at 4:24 AM, Karthik Raja wrote:

> *Below given are the code and the traceback. *

Great - I can run that and get the same traceback.

Here is a shorter version which does the same thing - removing all the
parameters you don't actually set:

from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR",
entrez_query='(none)', expect=200000, hitlist_size=50,
matrix_name='PAM30', word_size=2, alignments=500, descriptions=500,
format_type='XML')

Getting shorter still:

result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR",
matrix_name='PAM30')

The problem is the matrix name - remove that and the error goes away.
So progress :)

Doing a little digging, this is the error message from the NCBI is:

Message ID#35 Error: Cannot validate the Blast options:  Gap existence
and extension values of 11 and 1 not supported for PAM30
supported values are:
32767, 32767
7, 2
6, 2
5, 2
10, 1
9, 1
8, 1

As I guessed earlier, Biopython needed a little update to recognise
this error message and pass it to the user. I've done that.

In your case, you need to pick gap parameters appropriate for PAM30.

Peter


From cloudycrimson at gmail.com  Mon Apr 26 08:38:59 2010
From: cloudycrimson at gmail.com (Karthik Raja)
Date: Mon, 26 Apr 2010 14:08:59 +0530
Subject: [Biopython] Fwd: Qblast : no hits
In-Reply-To: <t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
	<l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
	<m2r320fb6e01004240449qcb2a7cc8lc2798fba8d596036@mail.gmail.com>
	<s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
	<y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
	<t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>
Message-ID: <s2yddd5adac1004260138t175044w2ec39fe395d02c4a@mail.gmail.com>

Hello Peter,

I tried out what you suggested and it works perfectly. I checked the result
XML file and there was no problem at all.
But I still have one more small issue that I am sure you can help me with.
The main reason i wanted to use python was that I could put all the query
sequences in a file and blast it. So when I tried the above code to blast a
sequence that I have put in a fasta file, it gives an error. Same kinda
error. Below are the code and traceback.

>>> fasta_string = open("test.fasta").read()
>>> result_handle = NCBIWWW.qblast("blastp", "nr",
fasta_string,entrez_query='(none)', expect=200000, hitlist_size=50,
word_size=2, alignments=500, descriptions=500,format_type='XML')

*Traceback (most recent call last):
*  File "<pyshell#28>", line 2, in <module>
    word_size=2, alignments=500, descriptions=500,format_type='XML')
  File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 117, in
qblast
    rid, rtoe = _parse_qblast_ref_page(handle)
  File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 203, in
_parse_qblast_ref_page
    raise ValueError("No RID and no RTOE found in the 'please wait' page."
ValueError: No RID and no RTOE found in the 'please wait' page. (there was
probably a problem with your request)

Please let me know if you could sense in the problem with the code.

Sincerely,
Karthik

On Sun, Apr 25, 2010 at 6:15 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Sun, Apr 25, 2010 at 4:24 AM, Karthik Raja wrote:
>
> > *Below given are the code and the traceback. *
>
> Great - I can run that and get the same traceback.
>
> Here is a shorter version which does the same thing - removing all the
> parameters you don't actually set:
>
> from Bio.Blast import NCBIWWW
> result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR",
> entrez_query='(none)', expect=200000, hitlist_size=50,
> matrix_name='PAM30', word_size=2, alignments=500, descriptions=500,
> format_type='XML')
>
> Getting shorter still:
>
> result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR",
> matrix_name='PAM30')
>
> The problem is the matrix name - remove that and the error goes away.
> So progress :)
>
> Doing a little digging, this is the error message from the NCBI is:
>
> Message ID#35 Error: Cannot validate the Blast options:  Gap existence
> and extension values of 11 and 1 not supported for PAM30
> supported values are:
> 32767, 32767
> 7, 2
> 6, 2
> 5, 2
> 10, 1
> 9, 1
> 8, 1
>
> As I guessed earlier, Biopython needed a little update to recognise
> this error message and pass it to the user. I've done that.
>
> In your case, you need to pick gap parameters appropriate for PAM30.
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Mon Apr 26 10:02:24 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 11:02:24 +0100
Subject: [Biopython] Fwd: Qblast : no hits
In-Reply-To: <s2yddd5adac1004260138t175044w2ec39fe395d02c4a@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2o320fb6e01004230249q6bd3e271p6f875d722196698c@mail.gmail.com>
	<C1919A4F-2779-498A-83E9-0AC04856B9F8@illinois.edu>
	<l2jddd5adac1004232027x1c213fd7s12d03c916037160f@mail.gmail.com>
	<m2r320fb6e01004240449qcb2a7cc8lc2798fba8d596036@mail.gmail.com>
	<s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
	<y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
	<t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>
	<s2yddd5adac1004260138t175044w2ec39fe395d02c4a@mail.gmail.com>
Message-ID: <h2h320fb6e01004260302j54a5e67dra7fa3feaa10a49bf@mail.gmail.com>

Hi Karthik,

On Mon, Apr 26, 2010 at 9:38 AM, Karthik Raja <cloudycrimson at gmail.com> wrote:
> Hello Peter,
>
> I tried out what you suggested and it works perfectly. I checked the result
> XML file and there was no problem at all.

That's good :)

> But I still have one more small issue that I am sure you can help me with.
> The main reason i wanted to use python was that I could put all the query
> sequences in a file and blast it.

I wouldn't recommend that approach.

For a modest number of queries, I would suggest doing one online BLAST
query at a time. This will spread out the load on the NCBI, and means each
time your XML results won't be too big. Trying to do too many queries at
risks hitting an NCBI CPU limit, or having problems downloading a very
large XML result file.

For a large number of queries, I would suggest using standalone BLAST
(installed and run locally) - especially if you want to use very lenient
parameters giving lots of results (meaning large output files).

> So when I tried the above code to blast a
> sequence that I have put in a fasta file, it gives an error. Same kinda
> error. Below are the code and traceback.
>
>>>> fasta_string = open("test.fasta").read()
>>>> result_handle = NCBIWWW.qblast("blastp", "nr",
> fasta_string,entrez_query='(none)', expect=200000, hitlist_size=50,
> word_size=2, alignments=500, descriptions=500,format_type='XML')
>
> *Traceback (most recent call last):
> ...
> ValueError: No RID and no RTOE found in the 'please wait' page. (there was
> probably a problem with your request)
>
> Please let me know if you could sense in the problem with the code.
>
> Sincerely,
> Karthik

The code works fine - I just tried it using a FASTA file with four proteins.
I would guess there is a problem with your FASTA file - perhaps there
is a bad sequence in it, or too many sequences. Since you don't have
the latest code we can't see the NCBI error message in the traceback,
which would help a lot.

I see you are running on Windows, so the easiest way to try this is
to backup C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py
and replace it with the new version from our repository:
http://biopython.open-bio.org/SRC/biopython/Bio/Blast/NCBIWWW.py
or:
http://github.com/biopython/biopython/raw/master/Bio/Blast/NCBIWWW.py

Or, could you send me the FASTA file to try it here (please send it to me
directly, not the mailing list).

Regards,

Peter


From nick_leake77 at hotmail.com  Mon Apr 26 15:36:28 2010
From: nick_leake77 at hotmail.com (Nick Leake)
Date: Mon, 26 Apr 2010 11:36:28 -0400
Subject: [Biopython] parsing a fasta with multiple entries
Message-ID: <SNT113-W361DAA45FF077EF8E9A663F9040@phx.gbl>


Hello,
 
I'm having trouble parsing a fasta file with multiple sequences - it is a fasta that has most of the transposable elements in fruit flies found at http://www.fruitfly.org/p_disrupt/TE.html#NAT right side, third box down.  I want to be able to access the DNA sequences for manipulation and later removal from a chromosomal region.  I originally thought that I could follow the same fasta format example shown in the biopython tutorial.  However, that failed to work.  I think it might be because there are multiple entries.  
 
Basically, I just want parse the information and have dictionaries hold the transposon elements name and sequence for later use.  Can I do that with biopython or should I make my own parser? Any help would be greatly appreciated.  I'm still very much a python novice and get frustrated by not knowing how to ask my questions appropriately. 		 	   		  
_________________________________________________________________
The New Busy think 9 to 5 is a cute idea. Combine multiple calendars with Hotmail. 
http://www.windowslive.com/campaign/thenewbusy?tile=multicalendar&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_5


From biopython at maubp.freeserve.co.uk  Mon Apr 26 15:52:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 16:52:28 +0100
Subject: [Biopython] help with parsing EMBL
In-Reply-To: <w2n320fb6e01004260837z8a682697w1f2c5e10ab8bfa63@mail.gmail.com>
References: <w2n320fb6e01004260837z8a682697w1f2c5e10ab8bfa63@mail.gmail.com>
Message-ID: <g2m320fb6e01004260852t85f84edam5a2bbd9529ddb3ef@mail.gmail.com>

Hi Nick,

On Mon, Apr 26, 2010 Nick Leake wrote:
> Hello,
>
> I'm having trouble parsing an embl file (attached) with multiple
> sequences. ?I want to be able to access the DNA sequences for
> manipulation and removal from a chromosomal region. ?I originally
> thought that I could follow the same fasta format example shown in the
> biopython tutorial. ?However, that failed to work. ?Next, I tried to
> convert the file to a fastq or a fasta to just follow the examples -
> again, failed. ?So, I looked around and found some embl parsing code:
>
> from Bio import SeqIO
>
> p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl")
> p.next()
> record=p.next()
>
> print record
>
> This kinda works, but fails to read all entries.

Well, yes:

from Bio import SeqIO
#that imports the library

p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl")
#that sets up the EMBL parser (although EMBL files are text so it is a bit
#odd to open it in binary read mode)

p.next() #reads the first record and discards it

record=p.next() #reads the second record and stores as variable record

You only ever try and look at the second record. See below...

> ... ?In addition, I don't know what code I need to 'grab' the DNA
> information for manipulations and remove these sequences from
> a given DNA segment. ? ?Can I get a little guidance to
> what I need to do or where I can look to help solve my problem?

What you probably want to start with is a simple for loop,

from Bio import SeqIO
for record in SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41"),"embl"):
    print record.id, record.seq

However, this runs into a problem:

Traceback (most recent call last):
  ...
ValueError: Expected sequence length 2, found 2483.

Looking at your file (which was too big to send to the list), your EMBL
file is invalid. Specifically this is failing on the record which starts:

ID   FROGGER    standard; DNA; INV; 2 BP.

That ID line says the sequence is just 2 base pairs, but in fact the
seems to be 2483bp. The ID line should probably be edited like this:

ID   FROGGER    standard; DNA; INV; 2483 BP.

Fixing that shows up another similar problem,

ID   TV1    standard; DNA; INV; 1728 BP.

should probably be:

ID   TV1    standard; DNA; INV; 1730 BP.

Then there is this record:

ID   DDBARI1    standard; DNA; INV; 1676 BP.

Several parts of the record suggest it should be 1676bp (not just the ID
line, but also for example the SQ line), but there is actually 1677bp of
sequence present.

After making those three edits by hand, Biopython should parse it.
I suspect your EMBL file has been manually edited. Where did it
come from?

Peter


From biopython at maubp.freeserve.co.uk  Mon Apr 26 15:54:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 16:54:54 +0100
Subject: [Biopython] Fwd: help with parsing EMBL
In-Reply-To: <w2n320fb6e01004260837z8a682697w1f2c5e10ab8bfa63@mail.gmail.com>
References: <w2n320fb6e01004260837z8a682697w1f2c5e10ab8bfa63@mail.gmail.com>
Message-ID: <l2w320fb6e01004260854td3e000cbkb51c19792f024328@mail.gmail.com>

Hi all,

I'm forwarding this email from Nick Leake about parsing EMBL files,
but without his 1.3MB attachment. I'll reply to his questions in a
follow up email...

Peter

---------- Forwarded message ----------
From:?Nick Leake
To:?<biopython at lists.open-bio.org>
Date:?Mon, 26 Apr 2010 09:35:45 -0400
Subject:?help with parsing

Hello,


I'm having trouble parsing an embl file (attached) with multiple
sequences. ?I want to be able to access the DNA sequences for
manipulation and removal from a chromosomal region. ?I originally
thought that I could follow the same fasta format example shown in the
biopython tutorial. ?However, that failed to work. ?Next, I tried to
convert the file to a fastq or a fasta to just follow the examples -
again, failed. ?So, I looked around and found some embl parsing code:


from Bio import SeqIO

p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl")
p.next()
record=p.next()

print record


This kinda works, but fails to read all entries. ?Also, there is no
'record' argument for output. ?In addition, I don't know what code I
need to 'grab' the DNA information for manipulations and remove these
sequences from a given DNA segment. ? ?Can I get a little guidance to
what I need to do or where I can look to help solve my problem?

Any help would be greatly appreciated. ?I'm still very much a python
novice and get frustrated by not knowing how to ask my questions
appropriately.

_________________________________________________________________
The New Busy is not the too busy. Combine all your e-mail accounts with Hotmail.
http://www.windowslive.com/campaign/thenewbusy?tile=multiaccount&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_4

---------- Forwarded message ----------
From:?biopython-request at lists.open-bio.org
To:
Date:?Mon, 26 Apr 2010 09:44:02 -0400
Subject:?confirm 29081d7dc4252dd9c96c13f5018658d3414acbdc
If you reply to this message, keeping the Subject: header intact,
Mailman will discard the held message. ?Do this if the message is
spam. ?If you reply to this message and include an Approved: header
with the list password in it, the message will be approved for posting
to the list. ?The Approved: header can also appear in the first line
of the body of the reply.


From biopython at maubp.freeserve.co.uk  Mon Apr 26 15:59:02 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 16:59:02 +0100
Subject: [Biopython] parsing a fasta with multiple entries
In-Reply-To: <SNT113-W361DAA45FF077EF8E9A663F9040@phx.gbl>
References: <SNT113-W361DAA45FF077EF8E9A663F9040@phx.gbl>
Message-ID: <h2s320fb6e01004260859m218775ffj9dbebe1220358480@mail.gmail.com>

On Mon, Apr 26, 2010 at 4:36 PM, Nick Leake <nick_leake77 at hotmail.com> wrote:
>
> Hello,
>
> I'm having trouble parsing a fasta file with multiple sequences - it is a fasta
> that has most of the transposable elements in fruit flies found at
> http://www.fruitfly.org/p_disrupt/TE.html#NAT right side, third box down.

Hi Nick,

You mean this file?
http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.fasta

> I want to be able to access the DNA sequences for manipulation and later
> removal from a chromosomal region. ?I originally thought that I could follow
> the same fasta format example shown in the biopython tutorial. ?However,
> that failed to work. ?I think it might be because there are multiple entries.

The Bio.SeqIO.read() function is for when there is a single record. The
Bio.SeqIO.parse() function is for when you have multiple records. Could
you clarify which bit of the tutorial was confusing? We'd like to make it
better.

> Basically, I just want parse the information and have dictionaries hold the
> transposon elements name and sequence for later use. ?Can I do that with
> biopython or should I make my own parser? Any help would be greatly
> appreciated. ?I'm still very much a python novice and get frustrated by not
> knowing how to ask my questions appropriately.

You should be able to use the Bio.SeqIO.index() function for this.

>>> from Bio import SeqIO
>>> data = SeqIO.index("D_mel_transposon_sequence_set.fasta", "fasta")
>>> data.keys()[:10]
['gb|U14101|TART-B', 'gb|AF162798|Dbuz\\BuT1',
'gb|U26847|Dvir\\Helena', 'gb|X67681|Bari1', 'gb|M69216|hobo',
'gb|U29466|Dkoe\\Gandalf', 'gb|Z27119|flea',
'gb|AB022762|aurora-element', 'gb|nnnnnnnn|Stalker3T',
'gb|AF518730|Dwil\\Vege']
>>> data["gb|nnnnnnnn|Stalker3T"]
SeqRecord(seq=Seq('TGTAGTGTATCTACCCTCAATATGTArAGTAGAGTTAATATGTAAGTAAGTAAT...ACA',
SingleLetterAlphabet()), id='gb|nnnnnnnn|Stalker3T',
name='gb|nnnnnnnn|Stalker3T', description='gb|nnnnnnnn|Stalker3T
STALKER3 372bp', dbxrefs=[])
>>> print data["gb|nnnnnnnn|Stalker3T"].seq
TGTAGTGTATCTACCCTCAATATGTArAGTAGAGTTAATATGTAAGTAAGTAATATGTAAAGTAGAGTTAATATGTAAGTAAGCAAAAGACCACCAACACTTACATGAACACTCCAGCTCTTGAAATACGATCGAGCGCTTAAACATAAGCCGATCGCGGAGCGTGAGAGTGCCGAGCATACACCTAGCAGCTCAAGTGATTAAGATAAGATAAGATAAGATAACAAACACGTAGTCTTAAGCGCGTCATGTGCGGGTGGCTGTACCCAAGAACAGCAAAGTGAATTCATTCGAATAAACCGCTTCAAGCAGAGCAGAGCCAAGTCTATTATATCAACTTCAAAAATACCGTATAACCTTGAACCTATTACA

Peter


From biopython at maubp.freeserve.co.uk  Mon Apr 26 16:02:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 17:02:18 +0100
Subject: [Biopython] help with parsing EMBL
In-Reply-To: <g2m320fb6e01004260852t85f84edam5a2bbd9529ddb3ef@mail.gmail.com>
References: <w2n320fb6e01004260837z8a682697w1f2c5e10ab8bfa63@mail.gmail.com>
	<g2m320fb6e01004260852t85f84edam5a2bbd9529ddb3ef@mail.gmail.com>
Message-ID: <r2z320fb6e01004260902j46228ecl3b938d0853b4cc56@mail.gmail.com>

On Mon, Apr 26, 2010 at 4:52 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi Nick,
>
> On Mon, Apr 26, 2010 Nick Leake wrote:
>> Hello,
>>
>> I'm having trouble parsing an embl file (attached) with multiple
>> sequences. ...
>
> After making those three edits by hand, Biopython should parse it.
> I suspect your EMBL file has been manually edited. Where did it
> come from?

>From Nick's other email about the FASTA file,
http://lists.open-bio.org/pipermail/biopython/2010-April/006451.html
I can can see that the funny EMBL file came from the Berkeley Drosophil
 Genome Project (BDGP)'s Natural Transposable Element Project:
http://www.fruitfly.org/p_disrupt/TE.html

Specifically this file:
http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.embl

I'll email them to alert them about the three obvious errors I discussed.

Peter


From biopython at maubp.freeserve.co.uk  Mon Apr 26 16:28:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 17:28:31 +0100
Subject: [Biopython] help with parsing EMBL
In-Reply-To: <r2z320fb6e01004260902j46228ecl3b938d0853b4cc56@mail.gmail.com>
References: <w2n320fb6e01004260837z8a682697w1f2c5e10ab8bfa63@mail.gmail.com>
	<g2m320fb6e01004260852t85f84edam5a2bbd9529ddb3ef@mail.gmail.com>
	<r2z320fb6e01004260902j46228ecl3b938d0853b4cc56@mail.gmail.com>
Message-ID: <h2j320fb6e01004260928v7fd51eb0j599a506b9cbe3186@mail.gmail.com>

On Mon, Apr 26, 2010 at 5:02 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> From Nick's other email about the FASTA file,
> http://lists.open-bio.org/pipermail/biopython/2010-April/006451.html
> I can can see that the funny EMBL file came from the Berkeley Drosophil
> ?Genome Project (BDGP)'s Natural Transposable Element Project:
> http://www.fruitfly.org/p_disrupt/TE.html
>
> Specifically this file:
> http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.embl
>
> I'll email them to alert them about the three obvious errors I discussed.

There is also something odd going on with the features, which the
Biopython parser seems to be ignoring...

Peter


From biopython at maubp.freeserve.co.uk  Mon Apr 26 22:04:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 23:04:15 +0100
Subject: [Biopython] parsing a fasta with multiple entries
In-Reply-To: <SNT113-W64C744E14961EDDDAF60CDF9040@phx.gbl>
References: <SNT113-W361DAA45FF077EF8E9A663F9040@phx.gbl>
	<h2s320fb6e01004260859m218775ffj9dbebe1220358480@mail.gmail.com>
	<SNT113-W64C744E14961EDDDAF60CDF9040@phx.gbl>
Message-ID: <l2m320fb6e01004261504ye631c395if33c0a6493c7393d@mail.gmail.com>

On Mon, Apr 26, 2010 at 8:05 PM, Nick Leake wrote:
> Thanks Peter,
>
> All of the information is?very helpful.? I apologize for sending?second
> email.? I was thinking that?the first email was going to be discarded for
> having the attachment - which in hindsight is an obvious fact.? At that
> time, I had only seen the initial email for rejecting the first.

I managed to reply before sending the original email (without
attachment) to the list - so partly my fault.

>>> I want to be able to access the DNA sequences for manipulation and
>>> later removal from a chromosomal region. ?I originally thought that I
>>> could follow the same fasta format example shown in the biopython
>>> tutorial. ?However, that failed to work. ?I think it might be because
>>> there are multiple entries.
>>
>> The Bio.SeqIO.read() function is for when there is a single record. The
>> Bio.SeqIO.parse() function is for when you have multiple records. Could
>> you clarify which bit of the tutorial was confusing? We'd like to make it
>> better.
>
> The tutorial I used was from
> http://www.biopython.org/DIST/docs/tutorial/Tutorial.html

OK, good - that is the current version.

> I will admit I didn't really know the difference from the Bio.SeqIO.read()
> verse the Bio.SeqIO.parse() functions even though they should be
> intuitive.? Still, the mentioned tutorial doen't seem to have a multiple
> entry parsed example.?This is where my naivet??and confusion on
> the matter probably started.

It does (the file ls_orchid.fasta used in several examples has 94
entries), but I guess there is a lot of information in there and it can be
overwhelming.

Your problems with the funny EMBL file probably didn't help :(

Peter


From p.j.a.cock at googlemail.com  Mon Apr 26 22:30:54 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 26 Apr 2010 23:30:54 +0100
Subject: [Biopython] Google Summer of Code - accepted students
In-Reply-To: <4BD60D63.1040400@cornell.edu>
References: <4BD60D63.1040400@cornell.edu>
Message-ID: <m2l320fb6e01004261530zd9f1a723ge8958362426bb7be@mail.gmail.com>

---------- Forwarded message ----------
From: Robert Buels <rmb32 at cornell.edu>
Date: Mon, Apr 26, 2010 at 11:02 PM
Subject: Google Summer of Code - accepted students
To: rmb32 at cornell.edu


Hi all,

I'm pleased to announce the acceptance of OBF's 2010 Google Summer of
Code students, listed in alphabetical order with their project titles
and primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including
Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification,
Classification, and Visualization of Posttranslational Modification of
Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation &
Duplication Inference Algorithm for Binary and Non-binary Species Tree

Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending
Bio.PDB: broadening the usefulness of BioPython's Structural Biology
module

Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring

Congratulations to our accepted students!

All told, we had 52 applications submitted for the 6 slots (5
originally assigned, plus 1 extra) allotted to us by Google.
Proposals were extremely competitive: 6 out of 52 translates to an
11.5% acceptance rate. ?We received a lot of really excellent
proposals, the decisions were not easy.

Thanks very much to all the students who applied, we very much
appreciate your hard work.

Here's to a great 2010 Summer of Code, I'm sure these students will do
some wonderful work.

Rob Buels
OBF GSoC 2010 Administrator


From rmb32 at cornell.edu  Mon Apr 26 22:02:11 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Mon, 26 Apr 2010 15:02:11 -0700
Subject: [Biopython] Google Summer of Code - accepted students
Message-ID: <4BD60D63.1040400@cornell.edu>

Hi all,

I'm pleased to announce the acceptance of OBF's 2010 Google Summer of 
Code students, listed in alphabetical order with their project titles 
and primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including 
Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, 
Classification, and Visualization of Posttranslational Modification of 
Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & 
Duplication Inference Algorithm for Binary and Non-binary Species Tree

Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending 
Bio.PDB: broadening the usefulness of BioPython's Structural Biology module

Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring

Congratulations to our accepted students!

All told, we had 52 applications submitted for the 6 slots (5 originally 
assigned, plus 1 extra) allotted to us by Google.  Proposals were 
extremely competitive: 6 out of 52 translates to an 11.5% acceptance 
rate.  We received a lot of really excellent proposals, the decisions 
were not easy.

Thanks very much to all the students who applied, we very much 
appreciate your hard work.

Here's to a great 2010 Summer of Code, I'm sure these students will do 
some wonderful work.

Rob Buels
OBF GSoC 2010 Administrator


From anaryin at gmail.com  Tue Apr 27 04:29:36 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 27 Apr 2010 12:29:36 +0800
Subject: [Biopython] Google Summer of Code - accepted students
In-Reply-To: <m2l320fb6e01004261530zd9f1a723ge8958362426bb7be@mail.gmail.com>
References: <4BD60D63.1040400@cornell.edu>
	<m2l320fb6e01004261530zd9f1a723ge8958362426bb7be@mail.gmail.com>
Message-ID: <g2lb537e3711004262129tdda1424eke077314b53a90a2a@mail.gmail.com>

Hello all!

Thanks for the confidence! I'm sure it's going to work alright! If
anyone has any comments to add to my application feel free either to
email me!

Regards!

Jo?o [...] Rodrigues

On Monday, April 26, 2010, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> ---------- Forwarded message ----------
> From: Robert Buels <rmb32 at cornell.edu>
> Date: Mon, Apr 26, 2010 at 11:02 PM
> Subject: Google Summer of Code - accepted students
> To: rmb32 at cornell.edu
>
>
> Hi all,
>
> I'm pleased to announce the acceptance of OBF's 2010 Google Summer of
> Code students, listed in alphabetical order with their project titles
> and primary mentors:
>
> Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including
> Implementation of Multiple Sequence Alignment Algorithms
>
> Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification,
> Classification, and Visualization of Posttranslational Modification of
> Proteins
>
> Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby
>
> Sara Rayburn (PM Christian Zmasek) - Implementing Speciation &
> Duplication Inference Algorithm for Binary and Non-binary Species Tree
>
> Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending
> Bio.PDB: broadening the usefulness of BioPython's Structural Biology
> module
>
> Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring
>
> Congratulations to our accepted students!
>
> All told, we had 52 applications submitted for the 6 slots (5
> originally assigned, plus 1 extra) allotted to us by Google.
> Proposals were extremely competitive: 6 out of 52 translates to an
> 11.5% acceptance rate. ?We received a lot of really excellent
> proposals, the decisions were not easy.
>
> Thanks very much to all the students who applied, we very much
> appreciate your hard work.
>
> Here's to a great 2010 Summer of Code, I'm sure these students will do
> some wonderful work.
>
> Rob Buels
> OBF GSoC 2010 Administrator
>
> _______________________________________________
> Biopython mailing list ?- ?Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

-- 
Jo?o [...] Rodrigues
@ http://stanford.edu/~joaor/


From rmb32 at cornell.edu  Tue Apr 27 05:52:57 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Mon, 26 Apr 2010 22:52:57 -0700
Subject: [Biopython] Google Summer of Code - accepted students
Message-ID: <4BD67BB9.3000804@cornell.edu>

Hi all,

I'm pleased to announce the acceptance of OBF's 2010 Google Summer of
Code students, listed in alphabetical order with their project titles
and primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including
Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification,
Classification, and Visualization of Posttranslational Modification of
Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation &
Duplication Inference Algorithm for Binary and Non-binary Species Tree

Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending
Bio.PDB: broadening the usefulness of BioPython's Structural Biology module

Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring

Congratulations to our accepted students!

All told, we had 52 applications submitted for the 6 slots (5 originally
assigned, plus 1 extra) allotted to us by Google.  Proposals were
extremely competitive: 6 out of 52 translates to an 11.5% acceptance
rate.  We received a lot of really excellent proposals, the decisions
were not easy.

Thanks very much to all the students who applied, we very much
appreciate your hard work.

Here's to a great 2010 Summer of Code, I'm sure these students will do
some wonderful work.

Rob Buels
OBF GSoC 2010 Administrator


From biopython at maubp.freeserve.co.uk  Tue Apr 27 09:45:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Apr 2010 10:45:20 +0100
Subject: [Biopython] Bug in GenBank/EMBL parser?
In-Reply-To: <h2p320fb6e01004220156t7add106dn6e49af03a2b04c6c@mail.gmail.com>
References: <l2x165c1bda1004211807o1cdcea19w46608a52fdf2a679@mail.gmail.com>
	<h2p320fb6e01004220156t7add106dn6e49af03a2b04c6c@mail.gmail.com>
Message-ID: <z2v320fb6e01004270245s6d83b011jd4e135e8444beee8@mail.gmail.com>

On Thu, Apr 22, 2010 at 9:56 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Apr 22, 2010 at 2:07 AM, Uri Laserson <laserson at mit.edu> wrote:
>> Hi,
>>
>> I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which
>> supposedly conforms to the EMBL standard).
>>
>> The short story is that whenever there is a feature, the parser checks
>> whether there are qualifiers in the feature with an assert statement, and
>> does not allow features with no qualifiers. ?However, the IMGT flatfile is
>> full of entries that have features with no qualifiers (only coordinates).
>>
>> Who is wrong here? ?Does the EMBL specification require that a feature have
>> qualifiers? ?Or is this a bug to be fixed in the parser.
>
> Hi Uri,
>
> Thank you for your detailed report,
>
> Since you have raised this, I went back over the EMBL documentation.
> All their example features qualifiers (and from personal experience all
> EMBL files from the EMBL and GenBank files from the NCBI) do have
> qualifiers. However, in Section 7.2 they are called "Optional qualifiers".
> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#7.2
>
> So it does look like an unwarranted assumption in the Biopython
> parser (even though it has been a safe assumption on "official" EMBL
> and GenBank files thus far), which we should fix.

Bug filed and now fixed,
http://bugzilla.open-bio.org/show_bug.cgi?id=3062

It turned out to be an invalid EMBL file where the features were over-
indented. Biopython was quite happy to parse valid EMBL or GenBank
files with features without qualifiers (although I don't recall seeing any
examples from EMBL or the NCBI like this).

Peter


From silvio.tschapke at googlemail.com  Wed Apr 28 09:24:25 2010
From: silvio.tschapke at googlemail.com (Silvio Tschapke)
Date: Wed, 28 Apr 2010 11:24:25 +0200
Subject: [Biopython] save efetch results in different files
Message-ID: <t2jd3ddc94e1004280224me86a2493t372d45b28954b69c@mail.gmail.com>

Hi all,

I'd like to download hundreds of pubmed entries in one turn, but save every
entry in a single file for further processing with e.g. NLTK.
Is this possible? Or what is the common way to do this? Or do I have to call
efetch for every single pmid? I dont know how.
Could you also explain me what handle.read() does? Entrez.read(handle) I
understand, because it is documented, but handle.read() not. What kind of
type is a handle?


search_results = Entrez.read(Entrez.esearch(db="pubmed",
                                            term="Biopython",
                                            usehistory="y"))

batch_size = 10


for start in range(0,count,batch_size):
    end = min(count, start+batch_size)
    print "Going to download record %i to %i" % (start+1, end)
    fetch_handle = Entrez.efetch(db="pubmed", rettype="xml",
                                 retstart=start, retmax=batch_size,
                                 webenv=search_results["WebEnv"],
                                 query_key=search_results["QueryKey"])


for pmid in search_results["IdList"]:
    out_handle = open(pmid+".txt", "w")
    HERE I HAVE TO ACCESS THE ENTRY FROM THE fetch_handle FOR THE
CORRESPONDING pmid

    #data = Entrez.read(fetch_handle)
    #data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
    out_handle.close()


Cheers,
Silvio


From biopython at maubp.freeserve.co.uk  Wed Apr 28 09:57:48 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Apr 2010 10:57:48 +0100
Subject: [Biopython] save efetch results in different files
In-Reply-To: <t2jd3ddc94e1004280224me86a2493t372d45b28954b69c@mail.gmail.com>
References: <t2jd3ddc94e1004280224me86a2493t372d45b28954b69c@mail.gmail.com>
Message-ID: <g2w320fb6e01004280257jbf297b99qac07bc2b7f9519ae@mail.gmail.com>

On Wed, Apr 28, 2010 at 10:24 AM, Silvio Tschapke
<silvio.tschapke at googlemail.com> wrote:
> Hi all,
>
> I'd like to download hundreds of pubmed entries in one turn, but save every
> entry in a single file for further processing with e.g. NLTK.
> Is this possible? Or what is the common way to do this? Or do I have to call
> efetch for every single pmid? I dont know how.

Personally I would probably save each pubmed result to a separate file
named using the pmid - a Unix filesystem should cope fine with a few
thousand files in a single directory. This is simple and lets you add more
entries at a later date, and you have simple access to any record.

The other approach of combining separate entries into multiple files sounds
overly complicated (although possible), while another approach would be a
single large file containing all the records in one. These would require a
index if you needed random access to the entries by pmid.

> Could you also explain me what handle.read() does? Entrez.read(handle) I
> understand, because it is documented, but handle.read() not. What kind of
> type is a handle?

It is *like* a standard handle that you'd get in python from open(filename).
This is an object supporting read() giving all the remaining data as a string,
readline() giving the next line etc.

Peter


From laserson at mit.edu  Wed Apr 28 18:49:40 2010
From: laserson at mit.edu (Uri Laserson)
Date: Wed, 28 Apr 2010 14:49:40 -0400
Subject: [Biopython] SPARK error messages to be sent to stderr?
Message-ID: <r2y165c1bda1004281149jbd33bb38hff43801adb023676@mail.gmail.com>

The spark error messages when there is a parsing problem are currently
getting sent to stdout:

(line 181 in Bio/Parsers/spark.py)
print "Syntax error at or near `%s' token" % token

Can this be changed to:
print >>sys.stderr, "Syntax error at or near `%s' token" % token

This way the error messages can be handled separately.

Thanks!
Uri


From laserson at mit.edu  Wed Apr 28 19:12:28 2010
From: laserson at mit.edu (Uri Laserson)
Date: Wed, 28 Apr 2010 15:12:28 -0400
Subject: [Biopython] Can the GenBank/EMBL parser recover from errors?
Message-ID: <t2o165c1bda1004281212waa7dbf7ar5d1b4ca3d4b7664e@mail.gmail.com>

Hi,

I am trying to parse a large file of EMBL records that I know has some
errors in it.  However, rather than having the parser break when it gets to
the error, I'd rather it just skip that record, and move on to the next one.
 I was wondering if this functionality is already built in somewhere.  One
way I can do this is like this:

iterator = SeqIO.parse(ip,'embl').__iter__()
while True:
    try:
        record = iterator.next()
    # Now I specify all the parsing errors I want to catch:
    except LocationParserError:
        # Reinitialize iterator at current file position. The iterator
        # then skips to the beginning of the next record and continues.
        iterator = SeqIO.parse(ip,'embl').__iter__()
    except StopIteration:
        break

This way, whenever there is a parsing error, I just reinitialize the
iterator at the current file position, and it seeks to the beginning of the
next record.  However, this requires me to write out the for loop manually
(using StopIteration).  Does anyone know of a cleaner/more elegant way of
doing this?

Thanks!
Uri

-- 
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu


From laserson at mit.edu  Wed Apr 28 21:38:52 2010
From: laserson at mit.edu (Uri Laserson)
Date: Wed, 28 Apr 2010 17:38:52 -0400
Subject: [Biopython] Bug in GenBank/EMBL parser?
In-Reply-To: <z2v320fb6e01004270245s6d83b011jd4e135e8444beee8@mail.gmail.com>
References: <l2x165c1bda1004211807o1cdcea19w46608a52fdf2a679@mail.gmail.com> 
	<h2p320fb6e01004220156t7add106dn6e49af03a2b04c6c@mail.gmail.com> 
	<z2v320fb6e01004270245s6d83b011jd4e135e8444beee8@mail.gmail.com>
Message-ID: <y2h165c1bda1004281438kb5e5049n913fa4af53e184a@mail.gmail.com>

This fixed the main problem with parsing IMGT files that have increased
indentation.  I also filed an additional bug/enhancement with a proposed
patch, which should make biopython compatible with IMGT and still conform to
the INSDC format: http://bugzilla.open-bio.org/show_bug.cgi?id=3069

<http://bugzilla.open-bio.org/show_bug.cgi?id=3069>Uri

On Tue, Apr 27, 2010 at 05:45, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Thu, Apr 22, 2010 at 9:56 AM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
> > On Thu, Apr 22, 2010 at 2:07 AM, Uri Laserson <laserson at mit.edu> wrote:
> >> Hi,
> >>
> >> I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which
> >> supposedly conforms to the EMBL standard).
> >>
> >> The short story is that whenever there is a feature, the parser checks
> >> whether there are qualifiers in the feature with an assert statement,
> and
> >> does not allow features with no qualifiers.  However, the IMGT flatfile
> is
> >> full of entries that have features with no qualifiers (only
> coordinates).
> >>
> >> Who is wrong here?  Does the EMBL specification require that a feature
> have
> >> qualifiers?  Or is this a bug to be fixed in the parser.
> >
> > Hi Uri,
> >
> > Thank you for your detailed report,
> >
> > Since you have raised this, I went back over the EMBL documentation.
> > All their example features qualifiers (and from personal experience all
> > EMBL files from the EMBL and GenBank files from the NCBI) do have
> > qualifiers. However, in Section 7.2 they are called "Optional
> qualifiers".
> >
> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#7.2
> >
> > So it does look like an unwarranted assumption in the Biopython
> > parser (even though it has been a safe assumption on "official" EMBL
> > and GenBank files thus far), which we should fix.
>
> Bug filed and now fixed,
> http://bugzilla.open-bio.org/show_bug.cgi?id=3062
>
> It turned out to be an invalid EMBL file where the features were over-
> indented. Biopython was quite happy to parse valid EMBL or GenBank
> files with features without qualifiers (although I don't recall seeing any
> examples from EMBL or the NCBI like this).
>
> Peter
>


-- 
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu


From p.j.a.cock at googlemail.com  Wed Apr 28 22:11:43 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 28 Apr 2010 23:11:43 +0100
Subject: [Biopython] Can the GenBank/EMBL parser recover from errors?
In-Reply-To: <t2o165c1bda1004281212waa7dbf7ar5d1b4ca3d4b7664e@mail.gmail.com>
References: <t2o165c1bda1004281212waa7dbf7ar5d1b4ca3d4b7664e@mail.gmail.com>
Message-ID: <r2k320fb6e01004281511x897d58f7gca81d2b8f91e303@mail.gmail.com>

On Wednesday, April 28, 2010, Uri Laserson <laserson at mit.edu> wrote:
> Hi,
>
> I am trying to parse a large file of EMBL records that I know has some
> errors in it. ?However, rather than having the parser break when it gets to
> the error, I'd rather it just skip that record, and move on to the next one.
> ?I was wondering if this functionality is already built in somewhere. ?One
> way I can do this is like this:
>
> iterator = SeqIO.parse(ip,'embl').__iter__()
> while True:
>  ? ?try:
>  ? ? ? ?record = iterator.next()
>  ? ?# Now I specify all the parsing errors I want to catch:
>  ? ?except LocationParserError:
>  ? ? ? ?# Reinitialize iterator at current file position. The iterator
>  ? ? ? ?# then skips to the beginning of the next record and continues.
>  ? ? ? ?iterator = SeqIO.parse(ip,'embl').__iter__()
>  ? ?except StopIteration:
>  ? ? ? ?break
>
> This way, whenever there is a parsing error, I just reinitialize the
> iterator at the current file position, and it seeks to the beginning of the
> next record. ?However, this requires me to write out the for loop manually
> (using StopIteration). ?Does anyone know of a cleaner/more elegant way of
> doing this?
>
> Thanks!

Hi Uri,

There is no obvious way to handle this within the Bio.SeqIO.parse framework.

I'd suggest you use Bio.SeqIO.index instead (assuming the file isn't
so corrupt that it can't be scanned to identify each record). Just
wrap each record access in an error handler.

Peter


From cloudycrimson at gmail.com  Thu Apr 29 06:58:26 2010
From: cloudycrimson at gmail.com (Karthik Raja)
Date: Thu, 29 Apr 2010 12:28:26 +0530
Subject: [Biopython] Fwd: Qblast : no hits
In-Reply-To: <p2s320fb6e01004260457kb6f53bd7pd48cdcf63ddeb043@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<m2r320fb6e01004240449qcb2a7cc8lc2798fba8d596036@mail.gmail.com>
	<s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
	<y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
	<t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>
	<s2yddd5adac1004260138t175044w2ec39fe395d02c4a@mail.gmail.com>
	<h2h320fb6e01004260302j54a5e67dra7fa3feaa10a49bf@mail.gmail.com>
	<q2gddd5adac1004260425q3db04bd9rfdc08b80997ffa63@mail.gmail.com>
	<l2o320fb6e01004260452oe343dda3tfeca71b3d72b7403@mail.gmail.com>
	<p2s320fb6e01004260457kb6f53bd7pd48cdcf63ddeb043@mail.gmail.com>
Message-ID: <y2jddd5adac1004282358pfa8fac50v963dcafa32416320@mail.gmail.com>

hello Peter,

Sorry for the late reply. I am writing to thank you. The suggestions you
gave were of massive work in our research by reducing the BLASTing time.
Thank you for taking interest,

Sincerely,
Karthikaja
On Mon, Apr 26, 2010 at 5:27 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

>  On Mon, Apr 26, 2010 at 12:52 PM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
> > On Mon, Apr 26, 2010 at 12:25 PM, Karthik Raja <cloudycrimson at gmail.com>
> wrote:
> >> Hi Peter,
> >>
> >> I will seriously consider using the stand alone blast option. And thank
> you
> >> so much for the links. :)  I have replaced the repository.
> >>
> >> You suspected a problem with the sequences but they work very well when
> >> given directly in the code. I have attached my fasta file. Please tell
> me
> >> how it works with you.
> >>
> >> Karthikraja.
> >
> > You seem to have made a mistake with the FASTA file, there should be
> > a read name on the ">" lines with the sequence on the subsequence lines.
> > E.g. More like this:
> >
> >>Seq1
> > IMYTALPVIGKRHFRPSFTR
> >>Seq2
> > RSSRGRGR
> > (etc)
> >
> > As is, your file is valid but describes seven records each with no
> sequence
> > (instead their names are IMYTALPVIGKRHFRPSFTR, RSSRGRGR, etc).
>
> P.S. The updated Biopython should have given you this error message:
>
> ValueError: Error message from NCBI: Message ID#32 Error: Query
> contains no data: Query contains no sequence data
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Thu Apr 29 09:08:00 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Apr 2010 10:08:00 +0100
Subject: [Biopython] save efetch results in different files
In-Reply-To: <s2ld3ddc94e1004280956w8f442c60u988a324394a1cc48@mail.gmail.com>
References: <t2jd3ddc94e1004280224me86a2493t372d45b28954b69c@mail.gmail.com>
	<g2w320fb6e01004280257jbf297b99qac07bc2b7f9519ae@mail.gmail.com>
	<s2ld3ddc94e1004280956w8f442c60u988a324394a1cc48@mail.gmail.com>
Message-ID: <p2h320fb6e01004290208wb59d11fax1e4dd751b1ae67cf@mail.gmail.com>

On Wed, Apr 28, 2010 at 5:56 PM, Silvio Tschapke wrote:
>
> On Wed, Apr 28, 2010 at 11:57 AM, Peter wrote:
>>
>> On Wed, Apr 28, 2010 at 10:24 AM, Silvio Tschapke wrote:
>> > Hi all,
>> >
>> > I'd like to download hundreds of pubmed entries in one turn, but save
>> > every entry in a single file for further processing with e.g. NLTK.
>> > Is this possible? Or what is the common way to do this? Or do I have to
>> > call efetch for every single pmid? I dont know how.
>>
>> Personally I would probably save each pubmed result to a separate file
>> named using the pmid - a Unix filesystem should cope fine with a few
>> thousand files in a single directory. This is simple and lets you add more
>> entries at a later date, and you have simple access to any record.
>
> This is what I thought..to save each pubmed result to a separate file named
> using the pmid, as you can see in the code snippet.
> But it isn't working so far. Could you help me with the efetch_handle? I
> have called efetch one time with all pmids. So the efetch_handle contains
> all results. But now I need to pull out every single result from this handle
> to save it in a separate file with its pmid. And I don't know how to do it.
> Or isn't there another way..do I have to call efetch for every pmid and than
> save it into a file inside the loop?
> Because Biopython recommends to not do many queries per second I
> thought it would be better to only call efetch one time for all pmids.

The simplest answer is to make one efetch call per PMID, giving a single
record at a time which you can save to individual files. You can still do
this with the esearch+efetch history support. This does mean making
many small queries to the NCBI, rather than batching them together -
but the NCBI do not have any explicit guidelines on batch sizes.

Note - you would be making over 100 queries, so make sure you don't
run this during USA office hours!

The more complex approach (which the NCBI might prefer) is to
download batches of records together (e.g. 50 PMID results at once).
If you wanted to save these to separate files, you would have to divide
the text up yourself. I think you just need to look for lines starting
"PMID-" so this shouldn't be too hard.

Peter


From cloudycrimson at gmail.com  Fri Apr 30 10:50:08 2010
From: cloudycrimson at gmail.com (Karthik Raja)
Date: Fri, 30 Apr 2010 16:20:08 +0530
Subject: [Biopython] Fwd: Qblast : no hits
In-Reply-To: <y2jddd5adac1004282358pfa8fac50v963dcafa32416320@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<s2qddd5adac1004242020mab0152e3j589a75ad78d3b979@mail.gmail.com>
	<y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
	<t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>
	<s2yddd5adac1004260138t175044w2ec39fe395d02c4a@mail.gmail.com>
	<h2h320fb6e01004260302j54a5e67dra7fa3feaa10a49bf@mail.gmail.com>
	<q2gddd5adac1004260425q3db04bd9rfdc08b80997ffa63@mail.gmail.com>
	<l2o320fb6e01004260452oe343dda3tfeca71b3d72b7403@mail.gmail.com>
	<p2s320fb6e01004260457kb6f53bd7pd48cdcf63ddeb043@mail.gmail.com>
	<y2jddd5adac1004282358pfa8fac50v963dcafa32416320@mail.gmail.com>
Message-ID: <l2vddd5adac1004300350q7fd71962ucfebf885901923fa@mail.gmail.com>

hello Peter,

I have done blast for 25 sequences and have got 10 hits for each sequence. I
have stored the results in an XML file. Now i need to *parse* it and the
information in the cookbook isn helping me.

>>> from Bio.Blast import NCBIWWW
>>> result_handle = open("finaltest3.xml")
>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
>>> for blast_record in blast_records:

I am using the above code. Please tell me how to proceed to get information
namely "sequence, seq id, e value and alignment".

And I also have another doubt. While using q blast, is it possible to
restrict the results to only human and mouse hits? If yes, it will be great
if you could give me an example code or link.

Sincerely,
Karthik.


On Thu, Apr 29, 2010 at 12:28 PM, Karthik Raja <cloudycrimson at gmail.com>wrote:

>
> hello Peter,
>
> Sorry for the late reply. I am writing to thank you. The suggestions you
> gave were of massive work in our research by reducing the BLASTing time.
> Thank you for taking interest,
>
> Sincerely,
> Karthikaja
>   On Mon, Apr 26, 2010 at 5:27 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:
>
>>  On Mon, Apr 26, 2010 at 12:52 PM, Peter <biopython at maubp.freeserve.co.uk>
>> wrote:
>> > On Mon, Apr 26, 2010 at 12:25 PM, Karthik Raja <cloudycrimson at gmail.com>
>> wrote:
>> >> Hi Peter,
>> >>
>> >> I will seriously consider using the stand alone blast option. And thank
>> you
>> >> so much for the links. :)  I have replaced the repository.
>> >>
>> >> You suspected a problem with the sequences but they work very well when
>> >> given directly in the code. I have attached my fasta file. Please tell
>> me
>> >> how it works with you.
>> >>
>> >> Karthikraja.
>> >
>> > You seem to have made a mistake with the FASTA file, there should be
>> > a read name on the ">" lines with the sequence on the subsequence lines.
>> > E.g. More like this:
>> >
>> >>Seq1
>> > IMYTALPVIGKRHFRPSFTR
>> >>Seq2
>> > RSSRGRGR
>> > (etc)
>> >
>> > As is, your file is valid but describes seven records each with no
>> sequence
>> > (instead their names are IMYTALPVIGKRHFRPSFTR, RSSRGRGR, etc).
>>
>> P.S. The updated Biopython should have given you this error message:
>>
>> ValueError: Error message from NCBI: Message ID#32 Error: Query
>> contains no data: Query contains no sequence data
>>
>> Peter
>>
>
>


From biopython at maubp.freeserve.co.uk  Fri Apr 30 11:15:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 30 Apr 2010 12:15:05 +0100
Subject: [Biopython] Fwd: Qblast : no hits
In-Reply-To: <l2vddd5adac1004300350q7fd71962ucfebf885901923fa@mail.gmail.com>
References: <q2kddd5adac1004230056jda860530w81401348d56ba90c@mail.gmail.com>
	<y2uddd5adac1004242024h8e7d6f61vd921ffe76720aefc@mail.gmail.com>
	<t2t320fb6e01004250545ybcbea8dco6fdad35a457a88b9@mail.gmail.com>
	<s2yddd5adac1004260138t175044w2ec39fe395d02c4a@mail.gmail.com>
	<h2h320fb6e01004260302j54a5e67dra7fa3feaa10a49bf@mail.gmail.com>
	<q2gddd5adac1004260425q3db04bd9rfdc08b80997ffa63@mail.gmail.com>
	<l2o320fb6e01004260452oe343dda3tfeca71b3d72b7403@mail.gmail.com>
	<p2s320fb6e01004260457kb6f53bd7pd48cdcf63ddeb043@mail.gmail.com>
	<y2jddd5adac1004282358pfa8fac50v963dcafa32416320@mail.gmail.com>
	<l2vddd5adac1004300350q7fd71962ucfebf885901923fa@mail.gmail.com>
Message-ID: <o2n320fb6e01004300415y236acb44z562fb6f11435c3f4@mail.gmail.com>

On Fri, Apr 30, 2010 at 11:50 AM, Karthik Raja <cloudycrimson at gmail.com> wrote:
> hello Peter,
>
> I have done blast for 25 sequences and have got 10 hits for each sequence. I
> have stored the results in an XML file. Now i need to *parse* it and the
> information in the cookbook isn helping me.
>
>>>> from Bio.Blast import NCBIWWW
>>>> result_handle = open("finaltest3.xml")
>>>> from Bio.Blast import NCBIXML
>>>> blast_records = NCBIXML.parse(result_handle)
>>>> for blast_record in blast_records:
>
> I am using the above code. Please tell me how to proceed to get information
> namely "sequence, seq id, e value and alignment".

That should be fairly clear from the tutorial, look at the section titled
"The BLAST record class".

> And I also have another doubt. While using q blast, is it possible to
> restrict the results to only human and mouse hits? If yes, it will be great
> if you could give me an example code or link.

You can ask the NCBI to filter the BLAST results for you with an
Entrez query, one of the optional arguments to the Biopython
qblast function. Something like "mouse[ORGN] OR human[ORGN]"
should work. You can try out the Entrez query on the website to
make sure you have the right syntax and terms.

Peter