From chapmanb at 50mail.com  Sun Apr  1 15:13:56 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 01 Apr 2012 15:13:56 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
References: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
Message-ID: <87zkavtgcr.fsf@fastmail.fm>


Lenna;
Thanks for the introduction and glad to hear about your interest in the
variant project. I'm looking forward to seeing your proposal.

The workflow for the variant project involves a biologist querying a VCF
or GVF file with variants from an experiment. They should be able to
easily subset and filter by file components:

- Variant type: Homozygous/Heterozygous variants
- Metrics: depth, strand bias, allele frequency..
- Variants annotated in coding regions causing amino acid changes

As well as rapid subsetting by chromosomal region.

My suggestion would be to leverage external tools as much as possible to
do file manipulation and focus on an API that lets users filter and
extract information already contained in the INFO field.
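
To make that workflow concrete, here is a minimal sketch of this kind of
filtering in plain Python, with no external libraries. The function
names (parse_info, parse_vcf_line, is_heterozygous, filter_variants),
the sample records, and the use of the DP key in INFO for depth are
assumptions made for illustration; none of this is an existing Biopython
API.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=14;AF=0.5;DB' into a dict."""
    info = {}
    if info_field == ".":
        return info
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flag entries carry no value
    return info


def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a plain dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    record = {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt.split(","),
        "info": parse_info(info),
        "genotypes": [],
    }
    if len(fields) > 9:  # FORMAT column followed by one column per sample
        keys = fields[8].split(":")
        for sample in fields[9:]:
            values = dict(zip(keys, sample.split(":")))
            record["genotypes"].append(values.get("GT", "."))
    return record


def is_heterozygous(gt):
    """True for diploid genotypes with two different called alleles."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def filter_variants(lines, chrom, start, end, min_depth):
    """Yield heterozygous variants in chrom:start-end (1-based, inclusive)
    whose INFO depth (DP) is at least min_depth."""
    for line in lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        rec = parse_vcf_line(line)
        if rec["chrom"] != chrom or not (start <= rec["pos"] <= end):
            continue
        if int(rec["info"].get("DP", 0)) < min_depth:
            continue
        if any(is_heterozygous(gt) for gt in rec["genotypes"]):
            yield rec


# Made-up records, purely for demonstration.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=20;AF=0.5\tGT\t0/1",
    "chr1\t200\t.\tC\tT\t50\tPASS\tDP=5;AF=0.5\tGT\t0/1",
    "chr1\t300\t.\tG\tA\t50\tPASS\tDP=30;AF=1.0\tGT\t1/1",
    "chr2\t150\t.\tT\tC\t50\tPASS\tDP=40;AF=0.5\tGT\t0|1",
]

hits = list(filter_variants(EXAMPLE_VCF, "chr1", 1, 1000, min_depth=10))
```

With the example records, only the chr1 variant at position 100 passes
all three filters. For real files, the region lookup in particular is
the part best delegated to an external indexing tool rather than a
linear scan like this one.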

Hope this is helpful as a place to get started. We can provide
additional feedback once you have your proposal ready. Thanks again,
Brad

> Hi all,
> 
> I realize time is short, but I am still in the planning phase of my
> GSoC proposal! I wanted to take a moment to formally introduce myself
> to the dev list.
> 
> I am affiliated with Purdue University, located in Indiana, USA and
> best known for engineering (Neil Armstrong is a famous graduate). I
> hold a bachelor of arts in biology from Mount Holyoke College in
> Massachusetts. I have extensive wet lab experience with genetics; I'm
> currently working in a lab genotyping mice (the research is intestinal
> lipid metabolism). In August, I begin a PhD in interdisciplinary life
> science at Purdue, and I anticipate that my research will fall
> somewhere in the field of bioinformatics/computational biology. I hope
> to use biopython extensively!
> 
> In my spare time, other than programming, I enjoy ballroom dance,
> science fiction novels, board games, and sailing.
> 
> I've been programming for about 6 years and using python for 4; other
> languages with which I'm familiar include Perl/CGI, HTML/CSS, PHP, SQL
> (primarily MySQL and SQLite), and C++/C. I place a high value on
> object oriented design and execution.
> 
> I understand the basics of formal grammar and have some experience
> with lex/flex as well as PLY (python lex/yacc). My work so far with
> biopython has been on the CIF parsing module. One of my primary goals
> for the genomic variants project would be to implement as much
> polymorphism and abstraction as possible, for the benefit of both
> users and future developers.
> 
> I'm working on a proposal for the genomic variants project, and while
> I understand the basics of molecular biology and genetics, I lack
> firsthand experience with the type of workflow that would occur in the
> context of genomic variants. If anyone can supply a few examples, it
> would be greatly appreciated.
> 
> I hope to have a proposal draft ready for feedback by Monday.
> 
> Regards,
> 
> Lenna Peterson
> github.com/lennax
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From chapmanb at 50mail.com  Sun Apr  1 15:28:32 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 01 Apr 2012 15:28:32 -0400
Subject: [Biopython-dev] GSoC Student Applicant
In-Reply-To: <CADEGkF6hKe6jPNn7dsKL8S2FBt0Ae96ziReN--KDHrEwu-FfaA@mail.gmail.com>
References: <CADEGkF7D=QTJcChbPE71HRiLi0VXiVZap-sJA2+W38TGPziYpA@mail.gmail.com>
	<874ntgtca7.fsf@fastmail.fm>
	<CADEGkF7fT0ExxfOMQgA8EKWZ-DfqS=K3qAXUmewUxYYeXZO6tg@mail.gmail.com>
	<87r4wa6fxx.fsf@fastmail.fm>
	<CADEGkF6hKe6jPNn7dsKL8S2FBt0Ae96ziReN--KDHrEwu-FfaA@mail.gmail.com>
Message-ID: <87wr5ztfof.fsf@fastmail.fm>


Bow;

> Thank you for the comments and suggestions. I've added a little bit
> more details to my personal profile and put it up front. My project
> details have also been broken down into single weeks. And I've edited
> the commenting permission.

Thanks for the updates, this is coming along well. My most general
suggestion is to spend more time expanding the week-by-week
timeline. As an example, take this weekly goal:

* Write iterator and random-access parser for EMBOSS water

It would be great to see more specific plans for what exactly you
will deliver and implement during the week. Something like:

- Write iterator for EMBOSS water, expanding test suite to ensure
  produced AlignIO objects are compatible with previous BLAST and HMMER
  iterators.

- Expand index functionality to handle EMBOSS water format for random
  access. Test edge cases: initial records, final records, empty
  records.

- Document 'water' parsing with a use case emphasizing differences from
  BLAST and HMMER searching.

Peter probably has more specific thoughts on the actual content but it's
important to think through things in this manner. This will make it
easier to approach weeks during the summer since you'll already have
tasks broken down, and will also demonstrate you've thought about
potential problems and roadblocks and have solutions to overcome them.

> As for my other obligations, I didn't mean to give that impression. I
> added a little bit more detail about the project itself, but I'm not
> sure about the time that I should write. I estimate that at most, for
> each week day, I spend 8 hours doing my Master's project in my lab's
> campus. Since the project started, I usually use the remainder of the
> time (~6 hours/day) for my own personal programming projects. I plan
> to use the personal programming time slot for my GSoC instead, if
> accepted. Should I be this thorough in the proposal?

This is exactly my worry. You're proposing working two full time jobs
all summer long. Not to denigrate your work ethic, but 80 hour weeks are
hard and leave you no time for important things like having a life
outside of work. My suggestion would be to see if you can scale back
your Master's commitments for the summer if accepted into GSoC. This
would definitely improve your proposal since reviewers will worry about
the time commitment.

Hope this all helps,
Brad

From chapmanb at 50mail.com  Sun Apr  1 16:30:26 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 01 Apr 2012 16:30:26 -0400
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <4F74855B.9000603@med.nyu.edu>
References: <4F74855B.9000603@med.nyu.edu>
Message-ID: <87obrbtct9.fsf@fastmail.fm>


Andrew;
Thanks for putting this together. It looks great, is well integrated
with AlignIO and it's awesome to see a test suite.

I dug through the code and my small suggestions would be:

- Could you refactor some of the larger functions into separate smaller
  components? A couple of these spread over a ton of lines and it can be
  a bit difficult to follow the logic throughout:

  https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L172
  https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L399

  As a practical example, here you have a large block which checks the
  SQLite index matches the MAF file and everything looks okay:

  https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L199

  This would be clearer if factored into something like:

  if os.path.isfile(sqlite_file):
     try:
        self._record_count = self._verify_record_count(con)
     except ...

- Would you be able to put together a small example for the
  Cookbook or Tutorial documentation? This would be a great way to help
  others get started with the functionality and advertise it.
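
  To expand on the refactoring suggested for the index check above, a
  self-contained sketch might look like the following. Everything in it
  is hypothetical: the IndexChecker class, the meta_data table layout
  and its record_count key are assumptions made for illustration, and
  only the helper name _verify_record_count comes from the snippet
  above. The real MafIndex code may well be organized differently.

```python
import sqlite3


class IndexChecker:
    """Hypothetical stand-in for the index-loading part of an indexer."""

    def __init__(self, con, expected_count):
        self._expected_count = expected_count
        try:
            self._record_count = self._verify_record_count(con)
        except (sqlite3.Error, ValueError) as err:
            # One place to report any problem with the stored index.
            raise ValueError("Index is corrupt or out of date: %s" % err)

    def _verify_record_count(self, con):
        """Return the stored record count, or raise if it does not match."""
        row = con.execute(
            "SELECT value FROM meta_data WHERE key = 'record_count'"
        ).fetchone()
        if row is None:
            raise ValueError("no record_count stored in index")
        stored = int(row[0])
        if stored != self._expected_count:
            raise ValueError(
                "expected %i records, index has %i"
                % (self._expected_count, stored)
            )
        return stored


def build_demo_index(record_count):
    """Create an in-memory SQLite database with the assumed layout."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE meta_data (key TEXT, value TEXT)")
    con.execute(
        "INSERT INTO meta_data VALUES ('record_count', ?)",
        (str(record_count),),
    )
    return con


checker = IndexChecker(build_demo_index(3), expected_count=3)
```

  Each check then lives in a small named method that can be tested on
  its own, and __init__ reads as a short list of verification steps.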

Thanks again for this,
Brad

> Hi all,
> 
> I would like to start a discussion about what is needed to make the 
> AlignIO.MafIO parser and indexer ready for the next release. If anyone 
> is unfamiliar with MAF (Multiple Alignment Format), it is the file 
> format that eukaryote genome-to-genome multiple alignments produced by 
> multiz are stored in.
> 
> The exact specs are here:
>    http://genome.ucsc.edu/FAQ/FAQformat.html#format5
> 
> Some use cases are discussed in this paper, which implements (I believe) 
> most of the same functionality of the MafIO class in Galaxy:
>    http://www.ncbi.nlm.nih.gov/pubmed/21775304
> 
> The branch of my biopython fork that contains the class:
>    https://github.com/polyatail/biopython/tree/alignio-maf
> 
> The class is implemented as a reader/writer compatible with the AlignIO 
> API, but implements its own indexer (MafIO.MafIndex) based on 
> SeqIO.index_db(). At the time, this seemed like the best way to 
> implement this, as MAF is explicitly designed for genome-to-genome 
> alignments while other formats are not. If we can assume a MAF file 
> contains such an alignment, we can index it by genome coordinates and 
> allow random access to intervals.
> 
> This is especially useful since it is often desirable to retrieve the 
> spliced multiple alignment of a multi-exonic transcript, which can be 
> used to determine sequence conservation, construct a phylogenetic tree 
> for a particular gene, or pull out orthologs of a large number of genes 
> at once.
> 
> The code consists of the reader, writer, and indexer classes in 
> AlignIO/MafIO.py, test files in Tests/MAF, and unit tests specific to 
> the indexer in Tests/test_MafIO_index.py. I would really appreciate any 
> feedback and suggestions, and if anyone has an opportunity to use this 
> feature it would be great to get some feedback on its operation.
> 
> 
> Thanks!
> Andrew

From redmine at redmine.open-bio.org  Sun Apr  1 21:40:27 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 2 Apr 2012 01:40:27 +0000
Subject: [Biopython-dev] [Biopython - Feature #3336] (New) Make Phylo.draw
	more customizable
Message-ID: <redmine.issue-3336.20120402014027@redmine.open-bio.org>


Issue #3336 has been reported by Eric Talevich.

----------------------------------------
Feature #3336: Make Phylo.draw more customizable
https://redmine.open-bio.org/issues/3336

Author: Eric Talevich
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


On and off the mailing lists, I've received requests to make the plots rendered by Phylo.draw more customizable. For example:
http://lists.open-bio.org/pipermail/biopython/2012-March/007851.html

Since Phylo.draw is based on matplotlib/pyplot, it should be possible for essentially everything about the plot to be customizable by the user using pyplot's standard mechanisms -- e.g. adjust the font sizes with rcParams["font.size"].

Other requested features:

* Accept **kwargs in Phylo.draw, and pass it along to pyplot -- but where?
* Format the confidence/support values differently (currently everything is treated as a float), including or perhaps with the addition of arbitrary branch labels (e.g. estimated number of mutations on a branch)
* Return a mapping of clade objects to a tuple or dict of pyplot elements (LineCollection, PatchCollection, etc.)




From arklenna at gmail.com  Sun Apr  1 22:10:45 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Sun, 1 Apr 2012 22:10:45 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <87zkavtgcr.fsf@fastmail.fm>
References: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
	<87zkavtgcr.fsf@fastmail.fm>
Message-ID: <CAK610_5UdpnxrO0ejQ4JxwgbNEOPvg+Yjouz14PHnPK_uRp1xg@mail.gmail.com>

Hi Brad,

Thank you so much for your suggestions. My initial evaluation of the
strengths of existing software has led me to strongly agree with your
recommendation to focus on the usability of the API.

I submit this draft of my proposal to the dev list for feedback:

https://docs.google.com/document/d/116FDQLtNnYWnm0kojad4YmQrM3cjOO8D2Vr82aW6xyA/edit


Lenna



From p.j.a.cock at googlemail.com  Mon Apr  2 04:26:16 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 2 Apr 2012 09:26:16 +0100
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <87obrbtct9.fsf@fastmail.fm>
References: <4F74855B.9000603@med.nyu.edu>
	<87obrbtct9.fsf@fastmail.fm>
Message-ID: <CAKVJ-_51Oku5hm+VTLccA2h2f=saz-4g79kVRdFTryNtUFK5SA@mail.gmail.com>

On Sun, Apr 1, 2012 at 9:30 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Andrew;
> Thanks for putting this together. It looks great, is well integrated
> with AlignIO and it's awesome to see a test suite.

Indeed, +1 on tests :)

Apologies for not replying earlier - this was flagged in my email
client all of last week.

> I dug through the code and my small suggestions would be:
>
> - Could you refactor some of the larger functions into separate smaller
>   components? A couple of these spread over a ton of lines and it can be
>   a bit difficult to follow the logic throughout:
>
> ...
>
>   As a practical example, here you have a large block which checks the
>   SQLite index matches the MAF file and everything looks okay:

Maybe I should do the same with the SeqIO SQLite code.

> - Would you be able to put together a small example for the
>   Cookbook or Tutorial documentation? This would be a great way to help
>   others get started with the functionality and advertise it.

He already has - very organised :)
http://biopython.org/wiki/Multiple_Alignment_Format

Is there any more documentation about reverse complemented sequences
and how they are handled, both in the simple iterators and, more
importantly, when indexing? What I'm getting at here is the non-typical
treatment of start and end being relative to the reverse
complemented sequence for minus strand alignments. Most other
tools/formats always count from the first base on the
forward strand.
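
To illustrate the convention in question: in the MAF format as
documented by UCSC, the start of a minus-strand row is counted from the
start of the reverse-complemented source sequence, so converting to
forward-strand coordinates needs the source length (srcSize). The sketch
below shows that arithmetic only; the function name maf_to_forward is
made up for this example and is not part of MafIO.

```python
def maf_to_forward(start, size, strand, src_size):
    """Convert a MAF (start, size, strand) triple to forward-strand
    coordinates, returned as a 0-based, half-open (start, end) pair."""
    if strand == "+":
        return start, start + size
    if strand == "-":
        # start is measured on the reverse complement, so flip both ends
        return src_size - (start + size), src_size - start
    raise ValueError("strand must be '+' or '-', got %r" % strand)


# A 20 bp aligned region starting at offset 10 of a 100 bp source sequence.
plus_coords = maf_to_forward(10, 20, "+", 100)   # (10, 30)
minus_coords = maf_to_forward(10, 20, "-", 100)  # (70, 90)
```

An index keyed on forward-strand positions would need this conversion
for any minus-strand rows before storing or querying intervals.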

Peter


From andrew.sczesnak at med.nyu.edu  Mon Apr  2 20:15:18 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Mon, 02 Apr 2012 20:15:18 -0400
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <87obrbtct9.fsf@fastmail.fm>
References: <4F74855B.9000603@med.nyu.edu> <87obrbtct9.fsf@fastmail.fm>
Message-ID: <4F7A4116.5000602@med.nyu.edu>

Hi Brad,

Thank you for the feedback. I've tried to work on some of your 
suggestions and will continue doing so.

> - Could you refactor some of the larger functions into separate smaller
>    components? A couple of these spread over a ton of lines and it can be
>    a bit difficult to follow the logic throughout:

Definitely--I see what you mean. I split __init__ into a couple 
functions. I'm still worried about the 100 lines of get_spliced(). It's 
big mostly because I overdid it on the comments, but hopefully that 
helps explain the logic enough that someone else could work on it 
without pulling their hair out.

> - Would you be able to put together a small example for the
>    Cookbook or Tutorial documentation? This would be a great way to help
>    others get started with the functionality and advertise it.

Absolutely. I have a few more ideas for cool demos that integrate with 
other parts of Biopython. What's the best place to put draft text for 
the tutorial?


Thanks,
Andrew

From andrew.sczesnak at med.nyu.edu  Mon Apr  2 20:33:51 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Mon, 02 Apr 2012 20:33:51 -0400
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <CAKVJ-_51Oku5hm+VTLccA2h2f=saz-4g79kVRdFTryNtUFK5SA@mail.gmail.com>
References: <4F74855B.9000603@med.nyu.edu>	<87obrbtct9.fsf@fastmail.fm>
	<CAKVJ-_51Oku5hm+VTLccA2h2f=saz-4g79kVRdFTryNtUFK5SA@mail.gmail.com>
Message-ID: <4F7A456F.3020306@med.nyu.edu>

Hi Peter,

Thank you for the feedback. I will try to make sure this code is well 
tested before the next release.

> Is there any more documentation about reverse complemented sequences
> and how they are handled, both in the simple iterators and, more
> importantly, when indexing? What I'm getting at here is the non-typical
> treatment of start and end being relative to the reverse
> complemented sequence for minus strand alignments. Most other
> tools/formats always count from the first base on the
> forward strand.

I'm not sure I'm understanding you, but I hope I am. In theory it seems 
like strandedness would be an issue, however in practice the reference 
species in a multiz MAF file is always the plus strand. To make sure the 
user isn't trying to pass a MAF file containing blocks with mixed 
strands to MafIndex.get_spliced(), there's a check in there to make sure 
all strands for the reference species are the same. We also assume that 
coordinates specified in a block are always in the ascending direction 
(i.e. they are given as 'start' and 'size' and we assume the coordinates 
are [start, start + size]).

There could be an issue, however, if the best alignment for a particular 
species swaps strands between alignment blocks and/or exons of a 
transcript. However, it might be safe to say that the user is interested 
in the best alignment however it occurs, and not necessarily strand 
consistency.

WRT MultipleSeqAlignment objects produced by get_spliced(), all 
annotation properties are lost upon slicing, so it is up to the user to 
keep track of what's what. I do remember we had talked about a way to 
maintain these annotations, even after slicing. Any thoughts?


Thanks,
Andrew

From p.j.a.cock at googlemail.com  Tue Apr  3 05:03:55 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 3 Apr 2012 10:03:55 +0100
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <CAKVJ-_4WZe3ETSaeH=ym7YQbVhEWMfJBo6=G=Tg-S-qJxQN80g@mail.gmail.com>
References: <CAKVJ-_4WZe3ETSaeH=ym7YQbVhEWMfJBo6=G=Tg-S-qJxQN80g@mail.gmail.com>
Message-ID: <CAKVJ-_5JLAwymdA-XgfucAA5hhr7yVqjh5De7Kwr0s4hcN+MRw@mail.gmail.com>

On Wed, Mar 21, 2012 at 3:27 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hello all,
>
> I'm pleased to see that the GSoC SearchIO project idea I put up
> has sparked some interest:
>
> http://biopython.org/wiki/Google_Summer_of_Code
>
> ...

Just a reminder that the GSoC application deadline is this Friday,
6 April. The application website has been open since 26 March,
so I would encourage you to upload your current proposal soon
in case there are server load problems on the last day (you will
still be able to revise the proposal after uploading it).
http://www.google-melange.com/gsoc/homepage/google/gsoc2012

Also, in particular for those of you interested in the SearchIO
project which I would mentor, I will be away Thursday 5 and
Friday 6 April, so you will not be able to ask me for any last
minute feedback.

Good luck,

Peter

From chapmanb at 50mail.com  Tue Apr  3 09:06:36 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 03 Apr 2012 09:06:36 -0400
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <4F7A4116.5000602@med.nyu.edu>
References: <4F74855B.9000603@med.nyu.edu> <87obrbtct9.fsf@fastmail.fm>
	<4F7A4116.5000602@med.nyu.edu>
Message-ID: <87hax1hsmb.fsf@fastmail.fm>


Andrew;

> Definitely--I see what you mean. I split __init__ into a couple 
> functions. I'm still worried about the 100 lines of get_spliced(). It's 
> big mostly because I overdid it on the comments, but hopefully that 
> helps explain the logic enough that someone else could work on it 
> without pulling their hair out.

Definitely agreed. It's well-commented which makes it much easier for
others to dig in. Thanks for taking a look at the refactoring.

> Absolutely. I have a few more ideas for cool demos that integrate with 
> other parts of Biopython. What's the best place to put draft text for 
> the tutorial?

Apologies that I'd totally missed your cookbook entry. That looks great,
but more documentation is always better. If you are okay with LaTeX, the
Tutorial is in Doc/Tutorial.tex so you can edit directly. The wiki is
also a good place for docs if you prefer to go that way.

Thanks again for all the work on this. Looking forward to having it in,
Brad

From chapmanb at 50mail.com  Tue Apr  3 10:53:33 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 03 Apr 2012 10:53:33 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <CAK610_5UdpnxrO0ejQ4JxwgbNEOPvg+Yjouz14PHnPK_uRp1xg@mail.gmail.com>
References: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
	<87zkavtgcr.fsf@fastmail.fm>
	<CAK610_5UdpnxrO0ejQ4JxwgbNEOPvg+Yjouz14PHnPK_uRp1xg@mail.gmail.com>
Message-ID: <87r4w4hno2.fsf@fastmail.fm>


Lenna;
Thanks for getting this together, that's a great start. I left some
specific comments but my general suggestion is to get more detailed
about the code specifics. During the summer, you will use the weekly
timeline as a todo list, so having lots of details makes the process so
much easier. Instead of seeing a general item like: "Implement X" you want
"Implement X by extending API from last week to support get_Y using
sqlite3 index table. Test cases A, B, C and D to avoid...".

Having this kind of checklist todo helps make it easy to get started
each week and ensure everything is on track. The additional benefit for
selection is that it helps convince reviewers you've thought about the
technical details and foreseen any potential problems.

Hope this helps,
Brad

> Hi Brad,
> 
> Thank you so much for your suggestions. My initial evaluation of the
> strengths of existing software has led me to strongly agree with your
> recommendation to focus on the usability of the API.
> 
> I submit this draft of my proposal to the dev list for feedback:
> 
> https://docs.google.com/document/d/116FDQLtNnYWnm0kojad4YmQrM3cjOO8D2Vr82aW6xyA/edit
> 
> 
> Lenna

From w.arindrarto at gmail.com  Tue Apr  3 11:22:04 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 3 Apr 2012 17:22:04 +0200
Subject: [Biopython-dev] GSoC Student Applicant
In-Reply-To: <87wr5ztfof.fsf@fastmail.fm>
References: <CADEGkF7D=QTJcChbPE71HRiLi0VXiVZap-sJA2+W38TGPziYpA@mail.gmail.com>
	<874ntgtca7.fsf@fastmail.fm>
	<CADEGkF7fT0ExxfOMQgA8EKWZ-DfqS=K3qAXUmewUxYYeXZO6tg@mail.gmail.com>
	<87r4wa6fxx.fsf@fastmail.fm>
	<CADEGkF6hKe6jPNn7dsKL8S2FBt0Ae96ziReN--KDHrEwu-FfaA@mail.gmail.com>
	<87wr5ztfof.fsf@fastmail.fm>
Message-ID: <CADEGkF5MetS62j2Vf4ReiKMKo_gt=S94jU7huNraVnWFwERRXg@mail.gmail.com>

On Sun, Apr 1, 2012 at 21:28, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Bow;
>
>> Thank you for the comments and suggestions. I've added a little bit
>> more details to my personal profile and put it up front. My project
>> details have also been broken down into single weeks. And I've edited
>> the commenting permission.
>
> Thanks for the updates, this is coming along well. My most general
> suggestion is to spend more time expanding the week-by-week
> timeline. As an example, take this weekly goal:
>
> * Write iterator and random-access parser for EMBOSS water
>
> It would be great to see more specific plans for what exactly you
> deliver and implement during the week. Something like:
>
> - Write iterator for EMBOSS water, expanding test suite to ensure
>   produced AlignIO objects are compatible with previous BLAST and HMMER
>   iterators.
>
> - Expand index functionality to handle EMBOSS water format for random
>   access. Test edge cases: initial records, final records, empty
>   records.
>
> - Document 'water' parsing with a use case emphasizing differences from
>   BLAST and HMMER searching.
>
> Peter probably has more specific thoughts on the actual content but it's
> important to think through things in this manner. This will make it
> easier to approach weeks during the summer since you'll already have
> tasks broken down, and will also demonstrate you've thought about
> potential problems and roadblocks and have solutions to overcome them.

Thanks for the further feedback, Brad. I am in the process of adding more
detailed descriptions of my weekly tasks.

>> As for my other obligations, I didn't mean to give that impression. I
>> added a little bit more detail about the project itself, but I'm not
>> sure about the time that I should write. I estimate that at most, for
>> each week day, I spend 8 hours doing my Master's project in my lab's
>> campus. Since the project started, I usually use the remainder of the
>> time (~6 hours/day) for my own personal programming projects. I plan
>> to use the personal programming time slot for my GSoC instead, if
>> accepted. Should I be this thorough in the proposal?
>
> This is exactly my worry. You're proposing working two full time jobs
> all summer long. Not to denigrate your work ethic, but 80 hour weeks are
> hard and leave you no time for important things like having a life
> outside of work. My suggestion would be to see if you can scale back
> your Master's commitments for the summer if accepted into GSoC. This
> would definitely improve your proposal since reviewers will worry about
> the time commitment.
>
> Hope this all helps,
> Brad

Ah, that's ok, I understand your concern :). I talked with my
supervisor yesterday regarding this and he understood that I can scale
back the time spent for my current project if accepted. I've revised
this detail as well in the proposal.

Thanks again,
Bow


From p.j.a.cock at googlemail.com  Tue Apr  3 11:32:08 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 3 Apr 2012 16:32:08 +0100
Subject: [Biopython-dev] GSoC Student Applicant
In-Reply-To: <CADEGkF5MetS62j2Vf4ReiKMKo_gt=S94jU7huNraVnWFwERRXg@mail.gmail.com>
References: <CADEGkF7D=QTJcChbPE71HRiLi0VXiVZap-sJA2+W38TGPziYpA@mail.gmail.com>
	<874ntgtca7.fsf@fastmail.fm>
	<CADEGkF7fT0ExxfOMQgA8EKWZ-DfqS=K3qAXUmewUxYYeXZO6tg@mail.gmail.com>
	<87r4wa6fxx.fsf@fastmail.fm>
	<CADEGkF6hKe6jPNn7dsKL8S2FBt0Ae96ziReN--KDHrEwu-FfaA@mail.gmail.com>
	<87wr5ztfof.fsf@fastmail.fm>
	<CADEGkF5MetS62j2Vf4ReiKMKo_gt=S94jU7huNraVnWFwERRXg@mail.gmail.com>
Message-ID: <CAKVJ-_7q=7bLoMS7S_gx=uQbPLx2dTWpRJzPejM_4zrV6Wetsg@mail.gmail.com>

On Tue, Apr 3, 2012 at 4:22 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> On Sun, Apr 1, 2012 at 21:28, Brad Chapman <chapmanb at 50mail.com> wrote:
>>
>> This is exactly my worry. You're proposing working two full time jobs
>> all summer long. Not to denigrate your work ethic, but 80 hour weeks are
>> hard and leave you no time for important things like having a life
>> outside of work. My suggestion would be to see if you can scale back
>> your Master's commitments for the summer if accepted into GSoC. This
>> would definitely improve your proposal since reviewers will worry about
>> the time commitment.
>>
>> Hope this all helps,
>> Brad
>
> Ah, that's ok, I understand your concern :). I talked with my
> supervisor yesterday regarding this and he understood that I can scale
> back the time spent for my current project if accepted. I've revised
> this detail as well in the proposal.
>
> Thanks again,
> Bow

Excellent - I'm pleased your supervisor is being supportive. That
should help address this concern :)

Peter

From mjldehoon at yahoo.com  Tue Apr  3 14:27:26 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 3 Apr 2012 11:27:26 -0700 (PDT)
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <CAKVJ-_5JLAwymdA-XgfucAA5hhr7yVqjh5De7Kwr0s4hcN+MRw@mail.gmail.com>
Message-ID: <1333477646.41091.YahooMailClassic@web161201.mail.bf1.yahoo.com>

While I think that the SearchIO module is a good idea, you may want to consider choosing a different name for this module. For Bio.Seq/Bio.SeqIO and Bio.Align/Bio.AlignIO, roughly speaking the class definitions are in the former and the parser is in the latter module. I don't quite understand why these two are separated into distinct modules, as to me conceptually the two belong together. Bio.SearchIO in my understanding will combine both the parsers and the class definitions, which is a good thing, but then I would prefer a name without "IO" in it.

Best,
-Michiel.


--- On Tue, 4/3/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> From: Peter Cock <p.j.a.cock at googlemail.com>
> Subject: Re: [Biopython-dev] GSoC SearchIO project
> To: "Biopython-Dev Mailing List" <biopython-dev at lists.open-bio.org>
> Date: Tuesday, April 3, 2012, 5:03 AM
> On Wed, Mar 21, 2012 at 3:27 PM,
> Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> > Hello all,
> >
> > I'm pleased to see that the GSoC SearchIO project idea
> I put up
> > has sparked some interest:
> >
> > http://biopython.org/wiki/Google_Summer_of_Code
> >
> > ...
> 
> Just a reminder that the GSoC application deadline is this
> Friday,
> 6 April. The application website has been open since 26
> March,
> so I would encourage you to upload your current proposal
> soon
> in case there are server load problems on the last day (you
> will
> still be able to revise the proposal after uploading it).
> http://www.google-melange.com/gsoc/homepage/google/gsoc2012
> 
> Also, in particular for those of you interested in the
> SearchIO
> project which I would mentor, I will be away Thursday 5 and
> Friday 6 April, so you will not be able to ask me for any
> last
> minute feedback.
> 
> Good luck,
> 
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
> 

From p.j.a.cock at googlemail.com  Tue Apr  3 15:44:48 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 3 Apr 2012 20:44:48 +0100
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <1333477646.41091.YahooMailClassic@web161201.mail.bf1.yahoo.com>
References: <CAKVJ-_5JLAwymdA-XgfucAA5hhr7yVqjh5De7Kwr0s4hcN+MRw@mail.gmail.com>
	<1333477646.41091.YahooMailClassic@web161201.mail.bf1.yahoo.com>
Message-ID: <CAKVJ-_6mtOmmHapwmLT+8yc-8ADKWfWsp1yWN_HavZ59KeR71Q@mail.gmail.com>

On Tue, Apr 3, 2012 at 7:27 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> While I think that the SearchIO module is a good idea, you
> may want to consider choosing a different name for this
> module. For Bio.Seq/Bio.SeqIO and Bio.Align/Bio.AlignIO,
> roughly speaking the class definitions are in the former and
> the parser is in the latter module. I don't quite understand
> why these two are separated into distinct modules, as to
> me conceptually the two belong together. Bio.SearchIO in
> my understanding will combine both the parsers and the
> class definitions, which is a good thing, but then I would
> prefer a name without "IO" in it.
>
> Best,
> -Michiel.

Yes, I was thinking to have both the parsers and the new
objects under the same module namespace.

The reason for using SearchIO (despite not being PEP8
compatible - something I regret in the naming of SeqIO
and the pattern it set) is to match SeqIO and AlignIO and
BioPerl. Anyone familiar with BioPerl will immediately see
what it is for - and some of the student applicants have
already used BioPerl's SearchIO. Personally I find this
quite a compelling argument.

That said, the name SearchIO isn't the clearest in the
world for a newcomer - however I haven't come up
with anything significantly better myself. Perhaps there
is a better name out there, which would justify breaking
the pattern? I've considered pairwise and palign, but
neither feels right.

Given a clean slate (Biopython 2?), then yes, I would
agree with consolidating Bio.Align and Bio.AlignIO as
one namespace, probably "align" (lower case). The
situation with Bio.Seq, Bio.SeqRecord and Bio.SeqIO
isn't quite so simple - perhaps "seq" (lower case)?
Then (in the absence of any other ideas), SearchIO
would become "search" (lower case).

Peter

From redmine at redmine.open-bio.org  Tue Apr  3 17:13:13 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 3 Apr 2012 21:13:13 +0000
Subject: [Biopython-dev] [Biopython - Bug #3337] (New) 'Bio.trie.trie' is
	not picklable
Message-ID: <redmine.issue-3337.20120403211313@redmine.open-bio.org>


Issue #3337 has been reported by Sergei Lebedev.

----------------------------------------
Bug #3337: 'Bio.trie.trie' is not picklable
https://redmine.open-bio.org/issues/3337

Author: Sergei Lebedev
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


Is there a reason for this, or has nobody had the need (or time) to implement the pickle interface?


----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From MatatTHC at gmx.de  Wed Apr  4 04:46:47 2012
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Wed, 4 Apr 2012 10:46:47 +0200
Subject: [Biopython-dev] SeqIO circular
In-Reply-To: <CAKVJ-_7MpLRCModFfMdRPcVDjk42nVCJ--OwNBnAJv3wNcns_A@mail.gmail.com>
References: <CALNFT0jq=VTwSDv-4x7ZrHoQRLajCUHY8NGPMw9cDuGnwwNiuw@mail.gmail.com>
	<CAKVJ-_7MpLRCModFfMdRPcVDjk42nVCJ--OwNBnAJv3wNcns_A@mail.gmail.com>
Message-ID: <CALNFT0jTxFSbqn+f3hS-KZ2Z09xsgoKPFSow1BO3PdDGrJ7hag@mail.gmail.com>

Hi,

Is there any news on this? May I help somehow? I have to admit
that I barely speak Perl and have no experience with BioPerl, but if
someone tells me where to look I might still try it.

Matthias

2012/3/29 Peter Cock <p.j.a.cock at googlemail.com>:
> On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt <MatatTHC at gmx.de> wrote:
>> Hi,
>>
>> Is it possible to get the property if a genome is circular / linear
>> from SeqIO applied to genbank files? I could not find it.
>>
>> There is also a related bugreport:
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2578
>>
>> I used the old parser before and switched to SeqIO which I really like
>> for the possibilities to parse different formats... but I really need
>> the information.
>
> Does anyone happen to have a BioPerl + BioSQL setup installed
> and working? IIRC, checking against that to make sure that however
> we store the circular property is compatible was the only real hurdle.
>
> Peter

From arklenna at gmail.com  Wed Apr  4 20:04:30 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Wed, 4 Apr 2012 20:04:30 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <87r4w4hno2.fsf@fastmail.fm>
References: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
	<87zkavtgcr.fsf@fastmail.fm>
	<CAK610_5UdpnxrO0ejQ4JxwgbNEOPvg+Yjouz14PHnPK_uRp1xg@mail.gmail.com>
	<87r4w4hno2.fsf@fastmail.fm>
Message-ID: <CAK610_76xH9q2TcyP0CdRjSZSM9aokiiWkkX8r1uzzCFscxPcA@mail.gmail.com>

On Tue, Apr 3, 2012 at 10:53 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Lenna;
> Thanks for getting this together, that's a great start. I left some
> specific comments but my general suggestion is to get more detailed
> about the code specifics. During the summer, you use the weekly timeline
> as a todo list, so having lots of details makes the process so much
> easier. Instead of seeing a general item like: "Implement X" you want
> "Implement X by extending API from last week to support get_Y using
> sqlite3 index table. Test cases A, B, C and D to avoid...".
>
> Having this kind of checklist todo helps make it easy to get started
> each week and ensure everything is on track. The additional benefit for
> selection is that it helps convince reviewers you've thought about the
> technical details and foreseen any potential problems.
>
> Hope this helps,
> Brad
>

Hi all,

I'm linking to a revision of my GSoC proposal:

https://docs.google.com/document/d/1txW9bwMYC6avlJxqs7x-mB-M09gxy2Uc6XqhR04y6ac/edit

Thank you to everyone for your feedback.


Peter,

I didn't realize Biopython has never been tested on IronPython. As I
have no familiarity with .NET or Windows, I'll have to rescind my
offer to test it. Sorry to get your hopes up!


Reece,

I've revised the prose sections and almost completely rewritten the
timeline. This version provides more information about my background,
a more detailed description of the overall project, and more specific
goals.


Brad,

I've tried to go into as much detail as my knowledge of VCF and GVF
structure allows. I laid out a more specific structure for both the
backend and frontend structures for the data. I've revised the unit
tests to be more specific and less dependent on interaction with other
modules and I've tried to anticipate some cases that may produce
unexpected behavior. I also highlighted specific places where the
design should be generalizable.


James,

I hope my revised project description is more focused. Regarding CNV
etc., I did not mean to specifically exclude them by mentioning SNPs,
and I've reworded that paragraph to be more general. I get the
impression that CNV and other structural variants are considerably
more complex to represent and manipulate. I'd be more than happy to
read more about breakpoint theory etc. and to prototype any specific
workflows you might suggest.


Lenna

From eric.talevich at gmail.com  Wed Apr  4 22:53:10 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 4 Apr 2012 22:53:10 -0400
Subject: [Biopython-dev] Enhancements to Phylo.draw; pyplot best practices
Message-ID: <CAMC681nyVV6mNNSgT1zZeek+NwE3UwxprRArgGNFMeX-b3yPpA@mail.gmail.com>

Hi all,

I'm considering some enhancements to the Phylo.draw function to make it
more customizable for power users. Since the function is based on
matplotlib/pylab/pyplot, it's possible for quite a bit to be left to the
user; however, I'm not fully versed in what pyplot is capable of.

Relevant feature request in Redmine:
https://redmine.open-bio.org/issues/3336

Ideas:

1. Make the draw function return a mapping of clades to a collection of
pyplot graphical elements -- the objects emitted by pyplot during each step
of rendering the plot. Each clade in the tree is mapped to a horizontal
line, a vertical line, a text label (taxon name, normally), and another
text label for the branch (confidence/support, normally). The user can then
set the attributes of these objects as they wish, minimizing the need for
further extensions to Phylo.draw.

Example:
{<Bio.Phylo.PhyloXML.Clade>: {
        "hline": <matplotlib.collections.LineCollection>,
        "vline": <matplotlib.collections.LineCollection>,
        "taxon_label": <matplotlib.text.Text>,
        "branch_label": <matplotlib.text.Text> },
 ...

If the user needs access to the figure or axis object as well, it's already
easy enough to create these beforehand and pass the 'axis' object to
Phylo.draw.


2. Add an argument 'branch_labels' to Phylo.draw. This will accept either
(a) a dict which maps the tree's Clade objects to string labels, or (b) a
function which accepts a Clade object and returns a string. Default: a
function that formats the clade's 'confidence' or 'confidences' attribute,
matching the current behavior.

Examples:
>>> draw(mytree, branch_labels={mytree.root: "Root", ...})
>>> draw(mytree, branch_labels=lambda clade: "%d" % clade.confidence)
>>> draw(mytree, branch_labels=lambda clade: clade.taxonomy.rank)
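
For illustration, here is a minimal sketch of how the dict-or-function
argument could be normalized internally. It is only a sketch under stated
assumptions: make_labeler, default_label and the stand-in Clade class are
hypothetical names, not part of Bio.Phylo, and clades missing from a dict
are assumed to get an empty label.

```python
class Clade(object):
    """Hypothetical stand-in for a Bio.Phylo clade (illustration only)."""

    def __init__(self, name=None, confidence=None):
        self.name = name
        self.confidence = confidence


def default_label(clade):
    """Format the clade's confidence, or return '' if it is unset."""
    if clade.confidence is None:
        return ""
    return "%d" % clade.confidence


def make_labeler(branch_labels=None):
    """Return a function mapping a clade to its branch label string.

    Accepts None (use the default formatter), a dict keyed by clade
    objects, or any callable that takes a clade.
    """
    if branch_labels is None:
        return default_label
    if isinstance(branch_labels, dict):
        # Assumption: clades missing from the dict simply get no label.
        return lambda clade: branch_labels.get(clade, "")
    if callable(branch_labels):
        return branch_labels
    raise TypeError("branch_labels must be a dict or a callable")


root = Clade(name="root")
leaf = Clade(name="A", confidence=95)

print(make_labeler()(leaf))                          # default formatter
print(make_labeler({root: "Root"})(root))            # dict lookup
print(make_labeler(lambda c: c.name.upper())(leaf))  # user function
```

Resolving the argument to a single callable up front means the drawing
code only ever has to call one function per clade, whichever form the
user passed in.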


3. Accept **kwargs in Phylo.draw; pass it right along to pyplot at some
point.

Question: What basic pyplot function accepts **kwargs? pyplot.figure and
pyplot.set_subplot don't seem appropriate. An alternative is to use
pyplot.rcParams, either leaving it all to the user or treating the **kwargs
keys as the corresponding entries in rcParams. Syntax gets a little tricky.

(Not a top priority for me, actually, since rcParams works.)
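
To show where the syntax gets tricky, here is a small sketch. The function
draw_stub is a hypothetical stand-in, not Phylo.draw, and it only collects
the settings; a real implementation would still have to apply them through
matplotlib before plotting.

```python
def draw_stub(tree, **kwargs):
    """Hypothetical stand-in for Phylo.draw that collects rcParams overrides.

    A real implementation would apply these settings through matplotlib
    (for example by updating pyplot.rcParams) before plotting.
    """
    return dict(kwargs)


# rcParams keys such as "font.size" contain dots, so they are not valid
# keyword names and cannot be written as draw_stub(tree, font.size=10).
# Unpacking a dict is the workaround:
settings = draw_stub(None, **{"font.size": 10, "lines.linewidth": 2})
print(sorted(settings.items()))
```

The dict-unpacking call works, but it is arguably no more convenient than
having the user set pyplot.rcParams directly.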


Thoughts? All clear?

Thanks,
Eric

From chapmanb at 50mail.com  Thu Apr  5 06:47:09 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 05 Apr 2012 06:47:09 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <CAK610_76xH9q2TcyP0CdRjSZSM9aokiiWkkX8r1uzzCFscxPcA@mail.gmail.com>
References: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
	<87zkavtgcr.fsf@fastmail.fm>
	<CAK610_5UdpnxrO0ejQ4JxwgbNEOPvg+Yjouz14PHnPK_uRp1xg@mail.gmail.com>
	<87r4w4hno2.fsf@fastmail.fm>
	<CAK610_76xH9q2TcyP0CdRjSZSM9aokiiWkkX8r1uzzCFscxPcA@mail.gmail.com>
Message-ID: <871uo2cv6a.fsf@fastmail.fm>


Lenna;

> I'm linking to a revision of my GSoC proposal:
> 
> https://docs.google.com/document/d/1txW9bwMYC6avlJxqs7x-mB-M09gxy2Uc6XqhR04y6ac/edit
> 
> Thank you to everyone for your feedback.

This is coming along great, thanks for all the work on it. I've added a
couple of specific suggestions about iterative parsing, which PyVCF
does, and using external tools to make the coding region evaluation work
easier.

One other practical suggestion: you should add a link to the latest
version of your google doc at the top of your proposal on the GSoC
Melange site. You won't be able to edit there after Friday but can
update your google document in case of reviewer suggestions.

Thanks again and best of luck during the review process,
Brad

> 
> 
> Peter,
> 
> I didn't realize Biopython has never been tested on IronPython. As I
> have no familiarity with .NET or Windows, I'll have to rescind my
> offer to test it. Sorry to get your hopes up!
> 
> 
> Reece,
> 
> I've revised the prose sections and almost completely rewritten the
> timeline. This version provides more information about my background,
> a more detailed description of the overall project, and more specific
> goals.
> 
> 
> Brad,
> 
> I've tried to go into as much detail as my knowledge of VCF and GVF
> structure allows. I laid out a more specific structure for both the
> backend and frontend structures for the data. I've revised the unit
> tests to be more specific and less dependent on interaction with other
> modules and I've tried to anticipate some cases that may produce
> unexpected behavior. I also highlighted specific places where the
> design should be generalizable.
> 
> 
> James,
> 
> I hope my revised project description is more focused. Regarding CNV
> etc., I did not mean to specifically exclude them by mentioning SNPs,
> and I've reworded that paragraph to be more general. I get the
> impression that CNV and other structural variants are considerably
> more complex to represent and manipulate. I'd be more than happy to
> read more about breakpoint theory etc. and to prototype any specific
> workflows you might suggest.
> 
> 
> Lenna

From arklenna at gmail.com  Thu Apr  5 22:50:52 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Thu, 5 Apr 2012 22:50:52 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <871uo2cv6a.fsf@fastmail.fm>
References: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
	<87zkavtgcr.fsf@fastmail.fm>
	<CAK610_5UdpnxrO0ejQ4JxwgbNEOPvg+Yjouz14PHnPK_uRp1xg@mail.gmail.com>
	<87r4w4hno2.fsf@fastmail.fm>
	<CAK610_76xH9q2TcyP0CdRjSZSM9aokiiWkkX8r1uzzCFscxPcA@mail.gmail.com>
	<871uo2cv6a.fsf@fastmail.fm>
Message-ID: <CAK610_6PNyQVwhbF7HcTL0k9=cAhLL1t-jhu=KRULWT+DuvO7A@mail.gmail.com>

On Thu, Apr 5, 2012 at 6:47 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Lenna;
>
>> I'm linking to a revision of my GSoC proposal:
>>
>> https://docs.google.com/document/d/1txW9bwMYC6avlJxqs7x-mB-M09gxy2Uc6XqhR04y6ac/edit
>>
>> Thank you to everyone for your feedback.
>
> This is coming along great, thanks for all the work on it. I've added a
> couple of specific suggestions about iterative parsing, which PyVCF
> does, and using external tools to make the coding region evaluation work
> easier.
>
> One other practical suggestion: you should add a link to the latest
> version of your google doc at the top of your proposal on the GSoC
> Melange site. You won't be able to edit there after Friday but can
> update your google document in case of reviewer suggestions.
>
> Thanks again and best of luck during the review process,
> Brad
>


Brad -

Thank you again for your detailed feedback. As per your suggestion, I
have updated my proposal on GSoC Melange to include a link to the
latest version of my proposal.

Lenna

From mjldehoon at yahoo.com  Sat Apr  7 00:43:56 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 6 Apr 2012 21:43:56 -0700 (PDT)
Subject: [Biopython-dev] GSoC SearchIO project
Message-ID: <1333773836.9513.YahooMailClassic@web161204.mail.bf1.yahoo.com>

--- On Tue, 4/3/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> The reason for using SearchIO (despite not being PEP8
> compatible - something I regret in the naming of SeqIO
> and the pattern it set) is to match SeqIO and AlignIO and
> BioPerl. Anyone familiar with BioPerl will immediately see
> what it is for - and some of the student applicants have
> already used BioPerl's SearchIO. Personally I find this
> quite a compelling argument.

Sorry but I am not convinced. I doubt that somebody familiar with BioPerl's Align and AlignIO modules will have trouble finding the parser in Biopython if in Biopython there is only a Bio.Align module. Also this means that some modules in Biopython are split up in Module and ModuleIO, whereas most others are not. In this particular case, for consistency you would have to create a Bio.Search and a Bio.SearchIO module. I'd rather have a clean module organization in Biopython instead of strictly following what BioPerl did.

> That said, the name SearchIO isn't the clearest in the
> world for a newcomer - however I haven't come up
> with anything significantly better myself. Perhaps there
> is a better name out there, which would justify breaking
> the pattern? I've considered pairwise and palign, but
> neither feels right.

How about including this module as a submodule in Bio.Align? If we think of Bio.Align as a general module for alignments, then pairwise alignments fit in it too. It depends a bit on the exact API, but I expect that we can come up with something elegant.

> Given a clean slate (Biopython 2?), then yes, I would
> agree with consolidating Bio.Align and Bio.AlignIO as
> one namespace, probably "align" (lower case). The
> situation with Bio.Seq, Bio.SeqRecord and Bio.SeqIO
> isn't quite so simple - perhaps "seq" (lower case)?

There are two steps here: consolidation of some modules, and changing the names of modules to comply with PEP8. The consolidation can happen without waiting for a Biopython 2, as long as there are clear deprecation warnings in the modules that will be removed. Compliance with PEP8 is a bit trickier, since it means relearning all module names, and some systems (Windows?) may not distinguish between lower and upper case.
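
To make the consolidation step concrete, here is a minimal sketch of what such a deprecation shim might look like. It uses only the standard library warnings module; old_parse and new_parse are hypothetical names for illustration and do not correspond to existing Biopython functions.

```python
import warnings


def new_parse(handle):
    """Hypothetical parser living in the consolidated module."""
    return [line.strip() for line in handle if line.strip()]


def old_parse(handle):
    """Hypothetical old entry point, kept only for backwards compatibility."""
    warnings.warn(
        "old_parse is deprecated and will be removed in a future release; "
        "use new_parse instead.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this shim
    )
    return new_parse(handle)


# DeprecationWarning is hidden by default in many contexts, so record it
# explicitly here to show that it is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    records = old_parse(["a\n", "\n", "b\n"])

print(records)
print(caught[0].category.__name__)
```

The old name keeps working for a release or two while pointing users at the new location, so the consolidation does not have to wait for a clean break.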

> Then (in the absence of any other ideas), SearchIO
> would become "search" (lower case).

If we already know now that we will drop the IO from SearchIO at some point, then SearchIO doesn't seem to be a good name.

Best,
-Michiel.


From eric.talevich at gmail.com  Sat Apr  7 12:13:16 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 7 Apr 2012 12:13:16 -0400
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <1333773836.9513.YahooMailClassic@web161204.mail.bf1.yahoo.com>
References: <1333773836.9513.YahooMailClassic@web161204.mail.bf1.yahoo.com>
Message-ID: <CAMC681kbybwpcd96PVV=y34nY6jSdnHMqS2XG+_BuoScy42q9A@mail.gmail.com>

On Sat, Apr 7, 2012 at 12:43 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> --- On Tue, 4/3/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > The reason for using SearchIO (despite not being PEP8
> > compatible - something I regret in the naming of SeqIO
> > and the pattern it set) is to match SeqIO and AlignIO and
> > BioPerl. Anyone familiar with BioPerl will immediately see
> > what it is for - and some of the student applicants have
> > already used BioPerl's SearchIO. Personally I find this
> > quite a compelling argument.
>
> Sorry but I am not convinced. I doubt that somebody familiar with
> BioPerl's Align and AlignIO modules will have trouble finding the parser in
> Biopython if in Biopython there is only a Bio.Align module. Also this means
> that some modules in Biopython are split up in Module and ModuleIO, whereas
> most others are not. In this particular case, for consistency you would
> have to create a Bio.Search and a Bio.SearchIO module. I'd rather have a
> clean module organization in Biopython instead of strictly following what
> BioPerl did.
>

How about Bio.Search, for now?

We had a similar discussion at the end of GSoC 2009, when we decided to
merge Tree and TreeIO (names inspired by BioPerl) to create Phylo (because
not all trees are phylogenies, although there is also a Perl module called
Bio::Phylo). Since the *IO namespaces have only 4 public functions, plus a
<Format>IO.py module for each supported I/O format, it's not too cluttered.

Likewise, at the end of this GSoC it may be more clear whether the new
sub-package should have a different name. (SearchIO seems to have been
plenty effective at drawing attention to the project.) But in any case, I
support putting all the new work under one sub-package, rather than two.


> > That said, the name SearchIO isn't the clearest in the
> > world for a newcomer - however I haven't come up
> > with anything significantly better myself. Perhaps there
> > is a better name out there, which would justify breaking
> > the pattern? I've considered pairwise and palign, but
> > neither feels right.
>
> How about including this module as a submodule in Bio.Align? If we think
> of Bio.Align as a general module for alignments, then pairwise alignments
> fit in it too. It depends a bit on the exact API, but I expect that we can
> come up with something elegant.
>
>
Does anything in Bio.Align already operate on SeqFeature objects?

Given that BLAST or HMMer output could be interpreted as (1) a series of
annotated features/regions on target sequences, or (2) a series of pairwise
alignments [*], perhaps it would be most effective to support those aspects
separately, through (1) Bio.Search or Bio.Feature [**], and (2) Bio.Align
or Bio.AlignIO.

[*] The multiple sequence alignment produced by HMMer is in a format we
already handle (Stockholm). Some people want to convert BLAST output to a
multiple sequence alignment, too, and while I suppose we could support that
in a literal sense, the result would be worse than the output of pretty
much any other alignment program so I don't think we should.

[**] A Bio.Feature module could involve GFF parsing and the variant
parsers, too. It would contain I/O functions that emit SeqFeatures, of
course.

From redmine at redmine.open-bio.org  Sat Apr  7 13:31:37 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 7 Apr 2012 17:31:37 +0000
Subject: [Biopython-dev] [Biopython - Feature #3338] (New) Convert a protein
	alignment and nucleotide sequences to codon alignment
Message-ID: <redmine.issue-3338.20120407173137@redmine.open-bio.org>


Issue #3338 has been reported by Eric Talevich.

----------------------------------------
Feature #3338: Convert a protein alignment and nucleotide sequences to codon alignment
https://redmine.open-bio.org/issues/3338

Author: Eric Talevich
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


As discussed on the mailing list:
http://lists.open-bio.org/pipermail/biopython/2012-April/007913.html

This could be implemented in two ways:
1. Wrap PAL2NAL (pal2nal.pl) under Bio.Align.Applications
2. Implement this functionality directly in Python

While PAL2NAL has some convenience features like aligning protein sequences to CDS sequences that don't exactly match, it would be straightforward (and simpler for the user, in most cases) to implement a fussier version of it from scratch somewhere in Biopython.
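
As a rough illustration of the "fussier" approach, here is a minimal pure-Python sketch. It assumes plain strings rather than Biopython objects, requires each CDS to be exactly three nucleotides per ungapped residue, does not handle a trailing stop codon, and does not check the translation. The function name protein_to_codon_alignment is hypothetical.

```python
def protein_to_codon_alignment(aligned_proteins, nucleotide_seqs, gap="-"):
    """Thread ungapped CDS sequences onto a gapped protein alignment.

    Strict by design: each CDS must have exactly three nucleotides per
    residue of the corresponding ungapped protein sequence.
    """
    if len(aligned_proteins) != len(nucleotide_seqs):
        raise ValueError("Need one nucleotide sequence per protein")
    codon_rows = []
    for protein, cds in zip(aligned_proteins, nucleotide_seqs):
        n_residues = len(protein) - protein.count(gap)
        if len(cds) != 3 * n_residues:
            raise ValueError("CDS length %d does not match %d residues"
                             % (len(cds), n_residues))
        codons = []
        pos = 0
        for residue in protein:
            if residue == gap:
                # A protein gap becomes a three-column gap.
                codons.append(gap * 3)
            else:
                codons.append(cds[pos:pos + 3])
                pos += 3
        codon_rows.append("".join(codons))
    return codon_rows


proteins = ["MK-L", "M-RL"]
cds = ["ATGAAACTG", "ATGCGTCTG"]
for row in protein_to_codon_alignment(proteins, cds):
    print(row)
```

Raising an error on any length mismatch is the "fussy" part: the user finds out immediately when a protein and its CDS do not correspond, instead of getting a silently shifted alignment.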

So, where would we put this function?

Related:
* From a codon alignment, it would again be straightforward to calculate dN/dS ratios for pairs of sequences, much like PAML's yn00 (although that program does more stuff, too). Do we want to do that? Where?
* Are there ways Biopython could support codon alignments better, as distinct from nucleotide alignments?




From eric.talevich at gmail.com  Sat Apr  7 14:42:02 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 7 Apr 2012 14:42:02 -0400
Subject: [Biopython-dev] Enhancements to Phylo.draw;
	pyplot best practices
In-Reply-To: <CAMC681nyVV6mNNSgT1zZeek+NwE3UwxprRArgGNFMeX-b3yPpA@mail.gmail.com>
References: <CAMC681nyVV6mNNSgT1zZeek+NwE3UwxprRArgGNFMeX-b3yPpA@mail.gmail.com>
Message-ID: <CAMC681k3p-mRzUcdFmQW0_wsx64ENOgDdGtWtpzVROedax1EXg@mail.gmail.com>

On Wed, Apr 4, 2012 at 10:53 PM, Eric Talevich <eric.talevich at gmail.com> wrote:

> Hi all,
>
> I'm considering some enhancements to the Phylo.draw function to make it
> more customizable for power users. Since the function is based on
> matplotlib/pylab/pyplot, it's possible for quite a bit to be left to the
> user; however, I'm not fully versed in what pyplot is capable of.
>
> Relevant feature request in Redmine:
> https://redmine.open-bio.org/issues/3336
>
> Ideas:

[...]
> 2. Add an argument 'branch_labels' to Phylo.draw. This will accept either
> (a) a dict which maps the tree's Clade objects to string labels, or (b) a
> function which accepts a Clade object and returns a string. Default: a
> function that formats the clade's 'confidence' or 'confidences' attribute,
> matching the current behavior.
>
> Examples:
> >>> draw(mytree, branch_labels={mytree.root: "Root", ...})
> >>> draw(mytree, branch_labels=lambda clade: "%d" % clade.confidence)
> >>> draw(mytree, branch_labels=lambda clade: clade.taxonomy.rank)
>
>
Just committed this feature:
https://github.com/biopython/biopython/commit/72990549a1b769ab19ab0bd33a8c35fdf031ac2d

From lgautier at gmail.com  Sun Apr  8 13:16:31 2012
From: lgautier at gmail.com (Laurent Gautier)
Date: Sun, 08 Apr 2012 19:16:31 +0200
Subject: [Biopython-dev] Sphinx documentation online ?
Message-ID: <4F81C7EF.7030505@gmail.com>

Hi,

I have seen email exchanges and issues on the tracker regarding moving
the documentation to Sphinx, but I could not find an instance of the 
documentation for biopython online (I was looking for one to 
cross-reference it with documentation I am writing).

Is this still work-in-progress, or is there an instance online and I 
missed it ?

Best,


Laurent

From eric.talevich at gmail.com  Sun Apr  8 15:25:00 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 8 Apr 2012 15:25:00 -0400
Subject: [Biopython-dev] Sphinx documentation online ?
In-Reply-To: <4F81C7EF.7030505@gmail.com>
References: <4F81C7EF.7030505@gmail.com>
Message-ID: <CAMC681=pQCtiicFy882D1FSc71XYQrWD-5ouU1pzAo47G0gJgQ@mail.gmail.com>

On Sun, Apr 8, 2012 at 1:16 PM, Laurent Gautier <lgautier at gmail.com> wrote:

> Hi,
>
> I have seen email exchanges and issues on the tracker regarding moving
> the documentation to Sphinx, but I could not find an instance of the
> documentation for biopython online (I was looking for one to
> cross-reference it with documentation I am writing).
>
> Is this still work-in-progress, or is there an instance online and I
> missed it ?
>
>
Hi Laurent,

I proposed this a while ago and played with Sphinx a little bit, but didn't
get very far. We're still using Epydoc for our generated API documentation:
http://biopython.org/DIST/docs/api/

I do hope to get back to this at some point, or perhaps assist someone else
with migrating Biopython to Sphinx.

-Eric

From lgautier at gmail.com  Sun Apr  8 16:46:45 2012
From: lgautier at gmail.com (Laurent Gautier)
Date: Sun, 08 Apr 2012 22:46:45 +0200
Subject: [Biopython-dev] Sphinx documentation online ?
In-Reply-To: <CAMC681=pQCtiicFy882D1FSc71XYQrWD-5ouU1pzAo47G0gJgQ@mail.gmail.com>
References: <4F81C7EF.7030505@gmail.com>
	<CAMC681=pQCtiicFy882D1FSc71XYQrWD-5ouU1pzAo47G0gJgQ@mail.gmail.com>
Message-ID: <4F81F935.9030702@gmail.com>

On 2012-04-08 21:25, Eric Talevich wrote:
> On Sun, Apr 8, 2012 at 1:16 PM, Laurent Gautier <lgautier at gmail.com 
> <mailto:lgautier at gmail.com>> wrote:
>
>     Hi,
>
>     I have seen email exchanges and issues on the tracker regarding
>     moving the documentation to Sphinx, but I could not find an
>     instance of the documentation for biopython online (I was looking
>     for one to cross-reference it with documentation I am writing).
>
>     Is this still work-in-progress, or is there an instance online and
>     I missed it ?
>
>
> Hi Laurent,
>
> I proposed this a while ago and played with Sphinx a little bit, but 
> didn't get very far. We're still using Epydoc for our generated API 
> documentation:
> http://biopython.org/DIST/docs/api/
>
> I do hope to get back to this at some point, or perhaps assist someone 
> else with migrating Biopython to Sphinx.
>
> -Eric
>
>

Hi Eric,

Thanks for the answer. I did see the Epydoc output, but I was after Sphinx
to be able to cross-reference documentation (see
http://sphinx.pocoo.org/ext/intersphinx.html ).
I'll make do with it for the time being.
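
For readers unfamiliar with intersphinx, a minimal sketch of the kind of
conf.py fragment involved is shown here. The mapping to the Python standard
library docs is only an illustration; a Biopython entry would work the same
way, but only once a Sphinx-built inventory is published somewhere, which
(per this thread) is not yet the case.

```python
# conf.py fragment for a Sphinx project (illustrative sketch)
extensions = ["sphinx.ext.intersphinx"]

# Map a short name to (base URL of the target docs, inventory location).
# None means "fetch objects.inv from the base URL".
intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
}
```

With this in place, a role such as :py:class:`collections.Counter` in the
local documentation resolves to the matching page in the mapped project.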

Best,


Laurent


From eric.talevich at gmail.com  Mon Apr  9 14:25:04 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 9 Apr 2012 14:25:04 -0400
Subject: [Biopython-dev] Method to weight sequences in an alignment
Message-ID: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>

Folks,

I've written a function to weight sequences according to the simple scheme
used in PSI-BLAST [*]. It operates on Bio.Align.MultipleSeqAlignment
objects or lists of plain strings, and could be added as a method with
minimal changes (for Python 2.5 compatibility, mainly). Any interest in
adding it to Biopython?

The code is below.

Cheers,
Eric

[*] Henikoff & Henikoff (1994): Position-based sequence weights.
http://www.ncbi.nlm.nih.gov/pubmed/7966282

----

from collections import Counter  # available from Python 2.7


def sequence_weights(aln):
    """Weight aligned sequences to emphasize more divergent members.

    Returns a list of floating-point numbers between 0 and 1, corresponding to
    the proportional weight of each sequence in the alignment. The first number
    is the weight of the first sequence in the alignment, and so on. Weights
    sum to 1.0.

    Method: At each column position, award each different residue an equal
    share of the weight, and then divide that weight equally among the
    sequences sharing the same residue.  For each sequence, sum the
    contributions from each position to give a sequence weight.

    See Henikoff & Henikoff (1994): Position-based sequence weights.
    """
    def col_weight(column):
        """Represent the diversity at a position.

        Award each different residue an equal share of the weight, and then
        divide that weight equally among the sequences sharing the same
        residue.

        So, if in a position of a multiple alignment, r different residues
        are represented, a residue represented in only one sequence contributes
        a score of 1/r to that sequence, whereas a residue represented in s
        sequences contributes a score of 1/(r*s) to each of the s sequences.
        """
        # Skip columns with all gaps or unique inserts
        if len([c for c in column if c not in '-.']) < 2:
            return [0] * len(column)
        # Count the number of occurrences of each residue type
        # (Treat gaps as a separate, 21st character)
        counts = Counter(column)
        # Get residue weights: 1/(r*s), where
        # r = nb. residue types, s = count of a particular residue type
        n_residues = len(counts)    # r
        # .items() works on both Python 2 and Python 3
        freqs = dict((aa, 1.0 / (n_residues * count))
                     for aa, count in counts.items())
        weights = [freqs[aa] for aa in column]
        return weights

    seq_weights = [0] * len(aln)
    col_weights = map(col_weight, zip(*aln))
    # Sum the contributions from each position along each sequence
    # -> total weight
    for col in col_weights:
        for idx, row_val in enumerate(col):
            seq_weights[idx] += row_val
    # Normalize
    scale = 1.0 / sum(seq_weights)
    seq_weights = [scale * wt for wt in seq_weights]
    return seq_weights
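
As a worked illustration of the per-column rule described in the docstring,
here is a minimal standalone sketch. The name column_weights is a
hypothetical helper used only for this example (it is not part of the
function above), and it leaves out the gap-skipping step for brevity.

```python
from collections import Counter


def column_weights(column):
    """Henikoff-style weights for one alignment column (illustrative helper)."""
    counts = Counter(column)
    r = len(counts)  # number of distinct residue types in the column
    # Each residue type gets 1/r, split equally among the s sequences sharing it
    return [1.0 / (r * counts[aa]) for aa in column]


# Column with residues A, A, C: r = 2, so each A gets 1/(2*2) and C gets 1/(2*1)
weights = column_weights("AAC")
print(weights)  # [0.25, 0.25, 0.5]
```

Each column's weights sum to 1, so before normalization the per-sequence
totals sum to the number of columns that were counted.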

From mjldehoon at yahoo.com  Mon Apr  9 19:27:31 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 9 Apr 2012 16:27:31 -0700 (PDT)
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <CAMC681kbybwpcd96PVV=y34nY6jSdnHMqS2XG+_BuoScy42q9A@mail.gmail.com>
Message-ID: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com>

Hi Eric, Peter,

> How about Bio.Search, for now?


I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells users something about what the module is for. Bio.Search could be anything (search PubMed? search the Entrez databases? search Google? anyway Bio.Search does not suggest that this module is about pairwise alignments). But Peter previously mentioned that he doesn't like Bio.Pairwise; can we convince you?

>> How about including this module as a submodule in Bio.Align?
> Does anything in Bio.Align already operate on SeqFeature objects?
I was thinking more of having this module as a submodule of Bio.Align for the purpose of module organization, rather than reusing or integrating it with Bio.Align. However, if we can make use of Bio.Align, then that could be a good thing.

Best,
-Michiel.

From chapmanb at 50mail.com  Mon Apr  9 20:58:19 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 09 Apr 2012 20:58:19 -0400
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com>
References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com>
Message-ID: <87lim4h07o.fsf@fastmail.fm>


Michiel;

> Hi Eric, Peter,
> 
> > How about Bio.Search, for now?
> 
> I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells
> users something about what the module is for. Bio.Search could be
> anything (search PubMed? search the Entrez databases? search Google?
> anyway Bio.Search does not suggest that this module is about pairwise
> alignments). But Peter previously mentioned that he doesn't like
> Bio.Pairwise; can we convince you?

I agree with Peter on this one. The module is primarily about searching
a sequence database with an input via multiple methods, not about
pairwise alignment of two sequences, which is what Bio.Align.Pairwise
suggests to me.

Brad

From redmine at redmine.open-bio.org  Tue Apr 10 16:29:09 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 10 Apr 2012 20:29:09 +0000
Subject: [Biopython-dev] [Biopython - Bug #3340] (New) Example using
	Bio.Clustalw in Tutorial
Message-ID: <redmine.issue-3340.20120410202908@redmine.open-bio.org>


Issue #3340 has been reported by Peter Cock.

----------------------------------------
Bug #3340: Example using Bio.Clustalw in Tutorial
https://redmine.open-bio.org/issues/3340

Author: Peter Cock
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Documentation
Target version: 
URL: 


The module Bio.Clustalw was deprecated and removed, yet is still used in the Tutorial's 'Creating your own substitution matrix from an alignment' example.


----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From p.j.a.cock at googlemail.com  Thu Apr 12 12:01:47 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 12 Apr 2012 17:01:47 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
Message-ID: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>

Hello all,

The BOSC abstract deadline (tomorrow) has rather crept up on me,
despite Nomi's reminder emails (My excuse is I've been thinking
more about GSoC!). For anyone thinking of submitting a talk, the
abstract limit is just a page - see:
http://www.open-bio.org/wiki/BOSC_2012

I'm hoping to attend BOSC, but will probably not be at ISMB 2012.
I'd be delighted for another Biopython developer to give the project
update talk (and as in previous years, we'll help out with the abstract,
slides, etc). Anyone interested? Giving a talk can be very helpful in
getting travel funding ;)

I know Eric might be a candidate as he will be in Long Beach
(congratulations on getting your ISMB poster accepted Eric!).

Note that the dedicated "Bioinformatics Open Source Project Updates"
track is new this year. The talks are likely to be at the shorter end of
the talk length range specified (i.e. closer to 5 minutes than 20 mins)
but that will partly depend on quite how full the final schedule turns
out to be.

The idea (speaking with my BOSC hat on) with the update talks is
to try to highlight what is new and exciting, with only a minimal
introduction for the higher profile projects - most of the audience
will know roughly what BioPerl etc are, and won't be interested
to hear it again ;)

So for the Biopython talk we'd probably want to cover things like
GSoC, work with PyPy and Python3, major new functionality, any
Biopython papers, etc, and a bit on future plans. The talk should be
short but sweet :)

Regards,

Peter

From redmine at redmine.open-bio.org  Thu Apr 12 14:52:35 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 12 Apr 2012 18:52:35 +0000
Subject: [Biopython-dev] [Biopython - Feature #3341] (New) Improve SffIO to
	parse 3 extra lines present in some SFF files ("Run Name:,
	Analysis Name:, Full Path:)"
Message-ID: <redmine.issue-3341.20120412185235@redmine.open-bio.org>


Issue #3341 has been reported by Martin Mokrejš.

----------------------------------------
Feature #3341: Improve SffIO to parse 3 extra lines present in some SFF files ("Run Name:, Analysis Name:, Full Path:)"
https://redmine.open-bio.org/issues/3341

Author: Martin Mokrejš
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


Some files have 3 extra lines per record in the SFF file. Such files are already in the Biopython test data:
biopython/Tests/Roche/E3MFGYR02_random_10_reads.sff
biopython/Tests/Roche/paired.sff

The three lines "Run Name:", "Analysis Name:" and "Full Path:" are not parsed into the object and so are not written out later. Hence the SFF round trip (read in -> write out) breaks (biopython-1.58). These three lines do not appear in every SFF file, and so far I haven't seen them in files extracted from SRA. They seem to appear only in original Roche SFF files.


>E3MFGYR02JWQ7T
  Run Prefix:   R_2008_01_09_16_16_00_
  Region #:     2
  XY Location:  3946_2103

  Run Name:       R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331
  Analysis Name:  /data/2008_02_08/R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331/D_2008_02_08_23_45_24_d41_AnalysisPipe
  Full Path:      /data/R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331/D_2008_02_08_23_45_24_d41_AnalysisPipe

  Read Header Len:  32
  Name Length:      14
  # of Bases:       265
  Clip Qual Left:   5
  Clip Qual Right:  264
  Clip Adap Left:   0
  Clip Adap Right:  0
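
To make the request concrete, here is a minimal sketch that collects the
"Key: value" lines from a text dump like the one above and reports which of
the three optional fields are present. It works on the text dump only, not
on the binary SFF file, and it is not how SffIO is implemented; the function
name parse_sff_text_header is hypothetical.

```python
def parse_sff_text_header(text):
    """Collect 'Key: value' pairs from one record of an SFF text dump."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and the '>read_name' line
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Excerpt of the record shown in this issue
record = """>E3MFGYR02JWQ7T
  Run Prefix:   R_2008_01_09_16_16_00_
  Region #:     2
  XY Location:  3946_2103

  Run Name:       R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331

  Read Header Len:  32
"""
fields = parse_sff_text_header(record)
optional = ["Run Name", "Analysis Name", "Full Path"]
present = [key for key in optional if key in fields]
print(present)
```

A round-trip writer would need to carry whichever of these optional fields
were found, and omit them when they were absent in the input.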




From eric.talevich at gmail.com  Thu Apr 12 18:37:12 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 12 Apr 2012 18:37:12 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
Message-ID: <CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>

On Thu, Apr 12, 2012 at 12:01 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hello all,
>
> The BOSC abstract deadline (tomorrow) has rather crept up on me,
> despite Nomi's reminder emails (My excuse is I've been thinking
> more about GSoC!). For anyone thinking of submitting a talk, the
> abstract limit is just a page - see:
> http://www.open-bio.org/wiki/BOSC_2012
>
> I'm hoping to attend BOSC, but will probably not be at ISMB 2012.
> I'd be delighted for another Biopython developer to give the project
> update talk (and as in previous years, we'll help out with the abstract,
> slides, etc). Anyone interested? Giving a talk can be very helpful in
> getting travel funding ;)
>
> I know Eric might be a candidate as he will be in Long Beach
> (congratulations on getting your ISMB poster accepted Eric!).
>
> Note that the dedicated "Bioinformatics Open Source Project Updates"
> track is new this year. The talks are likely to be at the shorter end of
> the talk length range specified (i.e. closer to 5 minutes than 20 mins)
> but that will partly depend on quite how full the final schedule turns
> out to be.
>
> The idea (speaking with my BOSC hat on) with the update talks is
> to try to highlight what is new and exciting, with only a minimal
> introduction for the higher profile projects - most of the audience
> will know roughly what BioPerl etc are, and won't be interested
> to hear it again ;)
>
> So for the Biopython talk we'd probably want to cover things like
> GSoC, work with PyPy and Python3, major new functionality, any
> Biopython papers, etc, and a bit on future plans. The talk should be
> short but sweet :)
>
> Regards,
>
> Peter


OK, here are some potential talking points I scraped from past announcements:

* SeqIO.index_db:
Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to
carry the index_db concept to other modules.

* Installation improvements:
pip support (v.1.57); easy_install will automatically handle the numpy
dependency (v.1.59, Feb '12)

* Portability:
Python 3 compatibility (except for a couple C extension modules);
still supporting Jython; now mostly supporting Pypy (except for
modules that use numpy or C extensions)

* Merged Brandon Invergo's independent project pypaml under
Bio.Phylo.PAML (v.1.58, Aug '11). With SeqIO's new sequential Phylip
support (v.1.59) and the existing support for phylogeny I/O under
Phylo, we can now easily assemble and run complete workflows involving
PAML.
(Similarly for PhyML, with SeqIO's "phylip-relaxed" and
Bio.Phylo.Applications.PhymlCommandline.)

* GenomeDiagram improvements:
New, pretty features. Eye candy for the slides.

* TogoWS

* Next release & future plans:
- Restored mmCIF support, via Lenna Peterson, a prospective GSoC student
- Brad's GFF parser
- Deeper future: see the other mailing list thread

* GSoC 2011 results:
- Mikael Trellet -- Interface
- Michele Silva -- Mocapy++ Python module; also ported two
applications to Biopython
- Justinas D. -- Python-based extension system for Mocapy++

* Summer of Struct:
João and Eric are working to refactor and merge the vast amount of
Bio.PDB-related code produced during previous GSoCs. (Includes a
planned SeqIO-style API for structures in PDB, mmCIF and PDBML
formats.) Improvements have been trickling in since the last BOSC;
here comes the flood.


From chapmanb at 50mail.com  Thu Apr 12 20:23:03 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 12 Apr 2012 20:23:03 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
Message-ID: <877gxkh448.fsf@fastmail.fm>


Eric and Peter;
Eric -- I'm glad you're taking this on. It'll be great to have a
Biopython presentation at BOSC. The points you mentioned all sound
great, although I would drop some of the more boring ones like the
installation stuff (I can pick on that, since it's mine).

My only other suggestion is to focus the talk around the people who've
provided the improvements. One of the awesome things about Biopython is
the wide contributor base and we still manage to pull everything into a
coherent package thanks to Peter's guiding hand. It would be cool to
emphasize this community as part of the update.

Thanks again for doing this,
Brad



From arklenna at gmail.com  Thu Apr 12 23:26:35 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Thu, 12 Apr 2012 23:26:35 -0400
Subject: [Biopython-dev] [biopython] Fix flex library dependency of
	MMCIFlex; closes 2619 (#31)
In-Reply-To: <CAKVJ-_6VAMv6CAT2_An-syJ2ezOC4vT6kwj2vt0wHeCc07KHaw@mail.gmail.com>
References: <biopython/biopython/pull/31@github.com>
	<CAKVJ-_5FqKjoWqKd9SRD0=Sz3=_4BGDWpr1qomBxnpy6NaLnsw@mail.gmail.com>
	<CAK610_6ENA5W=4AUQv6XU8dd0pC0AszhGubN4JPAuLo65ec5oQ@mail.gmail.com>
	<CAKVJ-_647+L7j1TanQKhbDgL-2++hahxH5MQXEdEMyiiJv+VxQ@mail.gmail.com>
	<CAKVJ-_7sPgGaR8q9YF6+Ng2JeCXfJ4D05GDajObvjXHHwQ53Fg@mail.gmail.com>
	<CAK610_7TU99wF7NNeh5ukpdDVv8mhK+hDCtT1N-UOUb72=nPSg@mail.gmail.com>
	<CAKVJ-_4-xL4ZmWpMPfD7Xf1Et4yLmDHTL1az5ALVz9nJ-8hvgg@mail.gmail.com>
	<CAK610_7CHv88EjZcqZEdqo4Z_51FYJcZmGD_vhZ-iTDU-ULVuA@mail.gmail.com>
	<CAKVJ-_4hyUBpckiBQ4wUy_Ow9QT7pMy2tOhCbfoPeWVEbAfQwQ@mail.gmail.com>
	<CAKVJ-_6VAMv6CAT2_An-syJ2ezOC4vT6kwj2vt0wHeCc07KHaw@mail.gmail.com>
Message-ID: <CAK610_4kx7=v_AkTrg-DDsP7OhB+C6XGnBt+BSjMqMkNLeJDrA@mail.gmail.com>

On Thu, Mar 29, 2012 at 10:05 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hi Lenna,
>
> Have you tried your branch on Windows yet?
>
> It worked for me under my Python 2.5 setup using mingw32,
>
> C:\repositories\biopython>c:\python26\python setup.py install
> ...
> building 'Bio.PDB.mmCIF.MMCIFlex' extension
> creating build\temp.win32-2.5\Release\bio\pdb
> creating build\temp.win32-2.5\Release\bio\pdb\mmcif
> C:\cygwin\usr\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IBio
> -Ic:\python25\include -Ic:\python25\PC -c Bio/PDB/mmCIF/lex.yy.c -o
> build\temp.win32-2.5\Release\bio\pdb\mmcif\lex.yy.o
> lex.yy.c:1046: warning: 'yyunput' defined but not used
> C:\cygwin\usr\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IBio
> -Ic:\python25\include -Ic:\python25\PC -c
> Bio/PDB/mmCIF/MMCIFlexmodule.c -o
> build\temp.win32-2.5\Release\bio\pdb\mmcif\mmciflexmodule.o
> writing build\temp.win32-2.5\Release\bio\pdb\mmcif\MMCIFlex.def
> C:\cygwin\usr\bin\gcc.exe -mno-cygwin -shared -s
> build\temp.win32-2.5\Release\bio\pdb\mmcif\lex.yy.o
> build\temp.win32-2.5\Release\bio\pdb\mmcif\mmciflexmodule.o
> build\temp.win32-2.5\Release\bio\pdb\mmcif\MMCIFlex.def
> -Lc:\python25\libs -Lc:\python25\PCBuild -lpython25 -lmsvcr71 -o
> build\lib.win32-2.5\Bio\PDB\mmCIF\MMCIFlex.pyd
> ...
>
> That worked fine and test_MMCIF.py is happy. However, MSVC v9 is not:
>
> C:\repositories\biopython>c:\python26\python setup.py install
> ...
> building 'Bio.PDB.mmCIF.MMCIFlex' extension
> C:\Program Files\Microsoft Visual Studio 9.0\VC\BIN\cl.exe /c /nologo
> /Ox /MD /W3 /GS- /DNDEBUG -IBio -Ic:\python26\include -Ic:\python26\PC
> /TcBio/PDB/mmCIF/lex.yy.c
> /Fobuild\temp.win32-2.6\Release\Bio/PDB/mmCIF/lex.yy.obj
> lex.yy.c
> Bio/PDB/mmCIF/lex.yy.c(12) : fatal error C1083: Cannot open include
> file: 'unistd.h': No such file or directory
> error: command '"C:\Program Files\Microsoft Visual Studio
> 9.0\VC\BIN\cl.exe"' failed with exit status 2
>
> The same with Python 2.7 and the Microsoft compiler. Switching
> from this in Bio/PDB/mmCIF/lex.yy.c:
>
> #include <unistd.h>
>
> to this:
>
> #include <io.h>
>
> lets it compile (although with some warnings) and test_MMCIF.py passes.
> It should be conditional of course, but I'm unclear whether that is the
> appropriate fix.
>
> Peter


Hi Peter,

I installed flex on my Windows VM and used it to generate lex.yy.c. It
puts #include <unistd.h> inside an #ifdef so it may work with MSVC. It
produces a working module for both Debian and Mac OS X (I do get
"defined but not used" warnings for generated functions). I've
cherry-picked it into my pull request.

I know you're quite busy right now with BOSC and GSoC, but let me know
if you get a chance to test it on MSVC.

Lenna

From p.j.a.cock at googlemail.com  Fri Apr 13 07:31:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 13 Apr 2012 12:31:30 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
Message-ID: <CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>

On Thu, Apr 12, 2012 at 11:37 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> OK, here are some potential talking points I scraped from past announcements:
>
> * SeqIO.index_db:
> Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to
> carry the index_db concept to other modules.

Biopython 1.57 was already covered at BOSC 2011.

> * Installation improvements:
> pip support (v.1.57); easy_install will automatically handle the numpy
> dependency (v.1.59, Feb '12)

Brad commented on this, perhaps a line in the abstract?

> * Portability:
> Python 3 compatibility (except for a couple C extension modules);
> still supporting Jython; now mostly supporting Pypy (except for
> modules that use numpy or C extensions)

This is something I would want to cover.

> * Merged Brandon Invergo's independent project pypaml under
> Bio.Phylo.PAML (v.1.58, Aug '11). With SeqIO's new sequential Phylip
> support (v.1.59) and the existing support for phylogeny I/O under
> Phylo, we can now easily assemble and run complete workflows involving
> PAML. (Similarly for PhyML, with SeqIO's "phylip-relaxed" and
> Bio.Phylo.Applications.PhymlCommandline.)

Yep.

> * GenomeDiagram improvements:
> New, pretty features. Eye candy for the slides.

Yep. Maybe even an example in the abstract?

> * TogoWS

Yep.

> * Next release & future plans:
> - Restored mmCIF support, via Lenna Peterson, a prospective GSoC student
> - Brad's GFF parser
> - Deeper future: see the other mailing list thread

Good points - although I don't want to over promise ;)

> * GSoC 2011 results:
> - Mikael Trellet -- Interface
> - Michele Silva -- Mocapy++ Python module; also ported two
> applications to Biopython
> - Justinas D. -- Python-based extension system for Mocapy++

We should have a summary of what they did somewhere, perhaps
as an OBF blog post? I'm hoping to get this year's GSoC students
to write weekly progress reports on a blog or at least by email to
the mailing list.

> * Summer of Struct:
> João and Eric are working to refactor and merge the vast amount of
> Bio.PDB-related code produced during previous GSoCs. (Includes a
> planned SeqIO-style API for structures in PDB, mmCIF and PDBML
> formats.) Improvements have been trickling in since the last BOSC;
> here comes the flood.

:)

Here's a draft abstract - note we have to fit in a page. Having a logo
or some eye catching image is very effective for standing out in the
abstract book (on screen or on paper).

Comments welcome - but keep in mind the one page limit.

Eric - feel free to turn this into a Google Doc if you prefer.

Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Biopython_BOSC_abstract_2012_draft.pdf
Type: application/pdf
Size: 199737 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120413/9ebaac7d/attachment-0001.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Biopython_BOSC_abstract_2012_draft.tex
Type: application/x-tex
Size: 5037 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120413/9ebaac7d/attachment-0001.tex>

From eric.talevich at gmail.com  Fri Apr 13 10:31:08 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 13 Apr 2012 10:31:08 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
Message-ID: <CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>

Thanks for this. I'll keep it as LaTeX, since it already looks nice.

1. Several parts say "[to be revised prior to BOSC]" -- I take it we
have the option of updating our abstract shortly before BOSC, and this
is a note to the conference organizers that we intend to do so? To
save space and reduce distraction, should this be a footnote instead?

2. To save space: Do we need the line "Bioinformatics Open Source
Conference (BOSC) ..." after the author names?

3. Again to save space, and make room to cite the Phylo paper: can we
drop the citation for TogoWS, and add a few words of description in
the main text where it's mentioned? (We don't cite PAML, HMMer, etc.)

4. How do you feel about dropping inline citations, and just have a
list of \nocite references at the bottom? In a one-page abstract, it
should be easy enough for readers to figure out what's what.

-E



From p.j.a.cock at googlemail.com  Fri Apr 13 10:42:37 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 13 Apr 2012 15:42:37 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
Message-ID: <CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>

On Fri, Apr 13, 2012 at 3:31 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> Thanks for this. I'll keep it as LaTeX, since it already looks nice.
>
> 1. Several parts say "[to be revised prior to BOSC]" -- I take it we
> have the option of updating our abstract shortly before BOSC, and this
> is a note to the conference organizers that we intend to do so? To
> save space and reduce distraction, should this be a footnote instead?

It is common for BOSC abstracts to be revised following review prior to
acceptance (almost like a tiny paper), and yes, that was my intention.
Do you think something like [to be revised during abstract review]
might be clearer? I think this makes a lot of sense for the project
update talks in particular - by that stage, for example, we'll have the
GSoC students selected.

> 2. To save space: Do we need the line "Bioinformatics Open Source
> Conference (BOSC) ..." after the author names?

I like it to make the page self contained, useful if we post it as a lone
PDF file. The text could be smaller certainly if required - likewise the
logo could be shrunk a little.

> 3. Again to save space, and make room to cite the Phylo paper: can we
> drop the citation for TogoWS, and add a few words of description in
> the main text where it's mentioned? (We don't cite PAML, HMMer, etc.)

Fair point, I was thinking in terms of audience recognition. PAML
and HMMer are quite well known and relatively old/mature.

If the Phylo paper is accepted in time to be added to the abstract then
of course we'd want to include it. But right now using a couple of
lines for a 'submitted' citation seemed overkill to me. But if you can
get it to fit nicely, please go ahead.

> 4. How do you feel about dropping inline citations, and just have a
> list of \nocite references at the bottom? In a one-page abstract, it
> should be easy enough for readers to figure out what's what.

If you prefer, or use the [1] style?

Peter

From eric.talevich at gmail.com  Fri Apr 13 11:40:06 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 13 Apr 2012 11:40:06 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
Message-ID: <CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>

On Fri, Apr 13, 2012 at 10:42 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Fri, Apr 13, 2012 at 3:31 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> Thanks for this. I'll keep it as LaTeX, since it already looks nice.
>>
>> 1. Several parts say "[to be revised prior to BOSC]" -- I take it we
>> have the option of updating our abstract shortly before BOSC, and this
>> is a note to the conference organizers that we intend to do so? To
>> save space and reduce distraction, should this be a footnote instead?
>
> It is common for BOSC abstracts to be revised following review prior to
> acceptance (almost like a tiny paper), and yes, that was my intention.
> Do you think something like [to be revised during abstract review]
> might be clearer? I think this makes a lot of sense for the project
> update talks in particular - by that stage, for example, we'll have the
> GSoC students selected.
>
>> 2. To save space: Do we need the line "Bioinformatics Open Source
>> Conference (BOSC) ..." after the author names?
>
> I like it to make the page self contained, useful if we post it as a lone
> PDF file. The text could be smaller certainly if required - likewise the
> logo could be shrunk a little.
>
>> 3. Again to save space, and make room to cite the Phylo paper: can we
>> drop the citation for TogoWS, and add a few words of description in
>> the main text where it's mentioned? (We don't cite PAML, HMMer, etc.)
>
> Fair point, I was thinking in terms of audience recognition. PAML
> and HMMer are quite well known and relatively old/mature.
>
> If the Phylo paper is accepted in time to be added to the abstract then
> of course we'd want to include it. But right now using a couple of
> lines for a 'submitted' citation seemed overkill to me. But if you can
> get it to fit nicely, please go ahead.
>
>> 4. How do you feel about dropping inline citations, and just have a
>> list of \nocite references at the bottom? In a one-page abstract, it
>> should be easy enough for readers to figure out what's what.
>
> If you prefer, or use the [1] style?
>
> Peter

Here's an updated draft. How does it look?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Biopython_BOSC_abstract_2012_draft.pdf
Type: application/pdf
Size: 262728 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120413/7c3bda8f/attachment-0001.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Biopython_BOSC_abstract_2012_draft.tex
Type: application/x-tex
Size: 5573 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120413/7c3bda8f/attachment-0001.tex>

From p.j.a.cock at googlemail.com  Fri Apr 13 11:57:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 13 Apr 2012 16:57:27 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
	<CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
Message-ID: <CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>

On Fri, Apr 13, 2012 at 4:40 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> Here's an updated draft. How does it look?

Looks fine to me - anyone else? A fresh pair of eyes would be good.

Also does anyone else want to be named as a talk co-author (and
promise to contribute with slides/figures/help for preparing the talk)?
Or should we just put "Eric et al" since he'll be the one on stage?

Peter

From anaryin at gmail.com  Fri Apr 13 12:02:04 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Fri, 13 Apr 2012 18:02:04 +0200
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
	<CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
	<CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
Message-ID: <CAJ9sUYM84rSr+F7VmF3nk83RiGz9dcE-sYZyuV5Ne1qSqaVNzQ@mail.gmail.com>

Third paragraph: 'summer' should read 'Summer'.

Looks good to me! I can help with the slides/figures/help, particularly on the
refactoring part of Bio.PDB to Bio.Struct. Let me know when and I can
easily get on Skype.

cheers!

João


From zhigang.wu at email.ucr.edu  Fri Apr 13 12:25:34 2012
From: zhigang.wu at email.ucr.edu (Zhigang Wu)
Date: Fri, 13 Apr 2012 09:25:34 -0700
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
	<CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
	<CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
Message-ID: <CADhJE9sNOM4+5UOfLgF1VxhkyvT9NfT7ZAWNW6v+4ropGa0tHw@mail.gmail.com>

I may have caught a grammar mistake.

Should we correct "Biopython 1.60 is expected *to have been* released by
BOSC 2012" to "Biopython 1.60 is expected *to be* released by BOSC 2012"?

I may be wrong; I am not a native speaker. :-)

Zhigang


On Fri, Apr 13, 2012 at 8:57 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Fri, Apr 13, 2012 at 4:40 PM, Eric Talevich <eric.talevich at gmail.com>
> wrote:
> >
> > Here's an updated draft. How does it look?
>
> Looks fine to me - anyone else? A fresh pair of eyes would be good.
>
> Also does anyone else want to be named as a talk co-author (and
> promise to contribute with slides/figures/help for preparing the talk)?
> Or should we just put "Eric et al" since he'll be the one on stage?
>
> Peter

From arklenna at gmail.com  Fri Apr 13 12:31:53 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Fri, 13 Apr 2012 12:31:53 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CADhJE9sNOM4+5UOfLgF1VxhkyvT9NfT7ZAWNW6v+4ropGa0tHw@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
	<CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
	<CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
	<CADhJE9sNOM4+5UOfLgF1VxhkyvT9NfT7ZAWNW6v+4ropGa0tHw@mail.gmail.com>
Message-ID: <CAK610_5h6HdVdaNBVb8bQo8Z_oSkzJV2ooGjw+_qhDmLYVKBQw@mail.gmail.com>

On Fri, Apr 13, 2012 at 12:25 PM, Zhigang Wu <zhigang.wu at email.ucr.edu> wrote:
> I may have caught a grammar mistake.
>
> Should we correct "Biopython 1.60 is expected *to have been* released by
> BOSC 2012" to "Biopython 1.60 is expected *to be* released by BOSC 2012"?
>
> I may be wrong; I am not a native speaker. :-)
>
> Zhigang
>

Hi Zhigang,

Actually, either way is correct - the original way is called the
future perfect tense.

Here's a description of the grammar if you are interested:
http://www.englishpage.com/verbpage/futureperfect.html

Lenna


From eric.talevich at gmail.com  Fri Apr 13 13:17:31 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 13 Apr 2012 13:17:31 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
	<CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
	<CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
Message-ID: <CAMC681=A9SDrjpmM3sWfwBXnidMa5q9Qac617-tkXJ5-1Vs_YA@mail.gmail.com>

On Fri, Apr 13, 2012 at 11:57 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Fri, Apr 13, 2012 at 4:40 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> Here's an updated draft. How does it look?
>
> Looks fine to me - anyone else? A fresh pair of eyes would be good.
>
> Also does anyone else want to be named as a talk co-author (and
> promise to contribute with slides/figures/help for preparing the talk)?
> Or should we just put "Eric et al" since he'll be the one on stage?
>
> Peter

I added João as the fourth author and submitted it.

Cheers,
Eric


From p.j.a.cock at googlemail.com  Fri Apr 13 15:32:32 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 13 Apr 2012 20:32:32 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAMC681=A9SDrjpmM3sWfwBXnidMa5q9Qac617-tkXJ5-1Vs_YA@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
	<CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
	<CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
	<CAMC681=A9SDrjpmM3sWfwBXnidMa5q9Qac617-tkXJ5-1Vs_YA@mail.gmail.com>
Message-ID: <CAKVJ-_5eXqj=0xaNkCjYa+n8m6CmWN5nDLW-7+1LbvdhkHMpiQ@mail.gmail.com>

On Fri, Apr 13, 2012 at 6:17 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> I added João as the fourth author and submitted it.
>
> Cheers,
> Eric

Thanks Eric,

If there are any other comments or changes, we'll try to integrate
them along with any reviewers' comments.

Peter


From tiagoantao at gmail.com  Mon Apr 16 05:35:21 2012
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 16 Apr 2012 10:35:21 +0100
Subject: [Biopython-dev] plink phasing and others
Message-ID: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>

Hi,

During the last few months I have been in a hell hole writing code
like mad. Maybe some of this code is of interest to share.

I currently have:

1. Code to parse plink output. Pretty trivial stuff, but I bet lots of
people are doing this
2. Code to process admixture results. Admixture is far less used than STRUCTURE
3. Code to deal with phasing formats. Beagle, PHASE and shapeit
4. PCA
5. Some gene ontology stuff

My GO stuff is pretty specific, so I guess it might not be of interest.
All the other components deal with fairly widely used tools.
Admixture and PCA are standard popgen analyses. The admixture code could
probably be changed to also support STRUCTURE. I am not sure, but PCA
might only work on Linux.
Plink and phasing are of more general interest. These would live outside
Bio.PopGen.

None of this code has unusual requirements, with one
exception: admixture and PCA require matplotlib.

So that people have an understanding of the impact of these things, I
put the number of scholar citations:
plink - 3315
smartpca - 1673
admixture - 57
structure - 7448
beagle - >300
fastphase - 1935

Unfortunately there is little code to do automated analysis using these tools.

I could start migrating some of this code to biopython (would have to
write documentation, and comment the code better ;) )

-- 
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From p.j.a.cock at googlemail.com  Mon Apr 16 06:26:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 11:26:30 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
References: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
Message-ID: <CAKVJ-_4u6ZEbz8hWYcC_ZG71+XOwtjqgyO+E4rViQYcpAWARTQ@mail.gmail.com>

2012/4/16 Tiago Antão <tiagoantao at gmail.com>:
> Hi,
>
> During the last few months I have been in a hell hole writing code
> like mad. Maybe some of this code is of interest to share.
>
> I currently have:
>
> 1. Code to parse plink output. Pretty trivial stuff, but I bet lots of
> people are doing this
> 2. Code to process admixture results. Admixture is far less used than STRUCTURE
> 3. Code to deal with phasing formats. Beagle, PHASE and shapeit
> 4. PCA
> 5. Some gene ontology stuff
>
> My GO stuff is pretty specific, so I guess it might not be of interest.
> All the other components deal with fairly widely used tools.
> Admixture and PCA are standard popgen analyses. The admixture code could
> probably be changed to also support STRUCTURE. I am not sure, but PCA
> might only work on Linux.
> Plink and phasing are of more general interest. These would live outside
> Bio.PopGen.
>
> None of this code has unusual requirements, with one
> exception: admixture and PCA require matplotlib.
>
> So that people have an understanding of the impact of these things, I
> put the number of scholar citations:
> plink - 3315
> smartpca - 1673
> admixture - 57
> structure - 7448
> beagle - >300
> fastphase - 1935
>
> Unfortunately there is little code to do automated analysis using these tools.
>
> I could start migrating some of this code to biopython (would have to
> write documentation, and comment the code better ;) )

Sounds good. The GO stuff would/should be more general than just
PopGen, and I know other people are looking at this on branches.

When you said PCA, that was principal component analysis, right?
What are you adding on top of NumPy/SciPy/matplotlib?

Peter


From tiagoantao at gmail.com  Mon Apr 16 08:05:34 2012
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 16 Apr 2012 13:05:34 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To: <CAKVJ-_4u6ZEbz8hWYcC_ZG71+XOwtjqgyO+E4rViQYcpAWARTQ@mail.gmail.com>
References: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
	<CAKVJ-_4u6ZEbz8hWYcC_ZG71+XOwtjqgyO+E4rViQYcpAWARTQ@mail.gmail.com>
Message-ID: <CAA9RGEPn+xfg=ZY2nwL9bXYABDDjJQN+0R3-wWbYv02sYLo79A@mail.gmail.com>

2012/4/16 Peter Cock <p.j.a.cock at googlemail.com>:
> Sounds good. The GO stuff would/should be more general than just
> PopGen, and I know other people are looking at this on branches.

What I do here is things like tree traversal (e.g. find all parent
nodes) and stuff like that. After that I do enrichment analysis
(Fisher's exact test, FDR, that stuff). Nothing of real interest for
now. I think we can ignore my code here (for now).
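
For readers unfamiliar with the "find all parent nodes" step, here is a
minimal sketch. It assumes the ontology is held as a plain dict mapping each
term to its direct parents; the name all_ancestors and the toy term
identifiers are hypothetical and are not taken from the code discussed in
this thread.

```python
def all_ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                # First visit: record it and explore its own parents later.
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy DAG (hypothetical identifiers): D has two parents, both children of A.
toy_parents = {"D": ["B", "C"], "B": ["A"], "C": ["A"], "A": []}
ancestors_of_d = all_ancestors("D", toy_parents)
```

An explicit stack is used instead of recursion so that deep ontologies do not
hit Python's recursion limit, and the seen set keeps shared ancestors from
being visited twice.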

> When you said PCA, that was principal component analysis, right?

Yep, I am using eigenstrat/smartpca.

> What are you adding on top of NumPy/SciPy/matplotlib?

PCA plots and admixture plots.
Here is an example of both:
http://2.bp.blogspot.com/-6J6Gsas4uIs/TuELU3Gf4ZI/AAAAAAAAEWQ/CymvlzkX6hQ/s1600/PIIS0002929711004885.gr2_lrg.hi.jpg
Top: PCA
Bottom: admixture


-- 
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From p.j.a.cock at googlemail.com  Mon Apr 16 09:50:18 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 14:50:18 +0100
Subject: [Biopython-dev] [biopython] Fix flex library dependency of
 MMCIFlex; closes 2619 (#31)
In-Reply-To: <CAK610_4kx7=v_AkTrg-DDsP7OhB+C6XGnBt+BSjMqMkNLeJDrA@mail.gmail.com>
References: <biopython/biopython/pull/31@github.com>
	<CAKVJ-_5FqKjoWqKd9SRD0=Sz3=_4BGDWpr1qomBxnpy6NaLnsw@mail.gmail.com>
	<CAK610_6ENA5W=4AUQv6XU8dd0pC0AszhGubN4JPAuLo65ec5oQ@mail.gmail.com>
	<CAKVJ-_647+L7j1TanQKhbDgL-2++hahxH5MQXEdEMyiiJv+VxQ@mail.gmail.com>
	<CAKVJ-_7sPgGaR8q9YF6+Ng2JeCXfJ4D05GDajObvjXHHwQ53Fg@mail.gmail.com>
	<CAK610_7TU99wF7NNeh5ukpdDVv8mhK+hDCtT1N-UOUb72=nPSg@mail.gmail.com>
	<CAKVJ-_4-xL4ZmWpMPfD7Xf1Et4yLmDHTL1az5ALVz9nJ-8hvgg@mail.gmail.com>
	<CAK610_7CHv88EjZcqZEdqo4Z_51FYJcZmGD_vhZ-iTDU-ULVuA@mail.gmail.com>
	<CAKVJ-_4hyUBpckiBQ4wUy_Ow9QT7pMy2tOhCbfoPeWVEbAfQwQ@mail.gmail.com>
	<CAKVJ-_6VAMv6CAT2_An-syJ2ezOC4vT6kwj2vt0wHeCc07KHaw@mail.gmail.com>
	<CAK610_4kx7=v_AkTrg-DDsP7OhB+C6XGnBt+BSjMqMkNLeJDrA@mail.gmail.com>
Message-ID: <CAKVJ-_79nCQ15EBn7g0cirRsLUodZCCjy3=E2-8xEgt4VUmniQ@mail.gmail.com>

On Fri, Apr 13, 2012 at 4:26 AM, Lenna Peterson <arklenna at gmail.com> wrote:
>
> Hi Peter,
>
> I installed flex on my Windows VM and used it to generate lex.yy.c. It
> puts #include <unistd.h> inside an #ifdef so it may work with MSVC. It
> produces a working module for both Debian and Mac OS X (I do get
> "defined but not used" warnings for generated functions). I've
> cherry-picked it into my pull request.
>

I've now tested that on my Windows machine (and Mac and Linux),
and applied the changes to the master branch. Thanks!

We must remember to drop an email to the Debian and RedHat
packaging teams since their old patch to setup.py isn't needed
now (they could control the flex problem by declaring it a build
time dependency).

Peter

From tiagoantao at gmail.com  Mon Apr 16 11:00:13 2012
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 16 Apr 2012 16:00:13 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
References: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
Message-ID: <CAA9RGEPKpK4Ti9_odipquk6qq1g0jhy3ARDvjZgv9BS-OTstpA@mail.gmail.com>

Just a few practical things:

1. we still do not allow matplotlib dependencies, correct?
2. to what part of the name space should plink and phasing be added?
3. Are we on epydoc or sphinx? Or moving from one to the other?
doctest is acceptable, right?
4. What is the current best way to run external applications? There
was an application wrapper class in the past...

From p.j.a.cock at googlemail.com  Mon Apr 16 11:18:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 16:18:10 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To: <CAA9RGEPKpK4Ti9_odipquk6qq1g0jhy3ARDvjZgv9BS-OTstpA@mail.gmail.com>
References: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
	<CAA9RGEPKpK4Ti9_odipquk6qq1g0jhy3ARDvjZgv9BS-OTstpA@mail.gmail.com>
Message-ID: <CAKVJ-_78-0YsHOSZ7J0SXcBdmQysdDx0EC7HTLdtEwi9nq2mYw@mail.gmail.com>

2012/4/16 Tiago Antão <tiagoantao at gmail.com>:
> Just a few practical things:
>
> 1. we still do not allow matplotlib dependencies, correct?

They would be run time dependencies, right? Not compile/build time?
We already have things like 'soft' dependencies on ReportLab and
NetworkX, and even matplotlib. It does complicate the unit tests a
bit, since they have to skip gracefully when a dependency is missing.
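
A minimal sketch of such a run-time ("soft") dependency follows. The names
MissingDependencyError and require are hypothetical helpers written for this
illustration; only importlib.import_module is a real standard library call.

```python
import importlib


class MissingDependencyError(ImportError):
    """Raised when an optional, run-time only dependency is unavailable."""


def require(module_name, purpose):
    """Import and return an optional module, or explain why it is needed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise MissingDependencyError(
            "Install %s if you want to use %s." % (module_name, purpose)
        )


# A module that is always present stands in for an installed dependency.
json_module = require("json", "this demonstration")
```

A plotting function would call require("matplotlib.pyplot", "admixture plots")
when it is invoked rather than at import time, and a unit test can catch the
exception and report the test as skipped instead of failed.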

>
> 2. to what part of the name space should plink and phasing be added?

Unclear to me right now.

> 3. Are we on epydoc or sphinx? Or moving from one to the other?
> doctest is acceptable, right?

We're still using LaTeX for the tutorial, and epydoc for the API docs.

Using doctest is acceptable and encouraged for documentation,
but be wary of cross platform differences. If you have a doctest
which has dependencies, see test_wise.py rather than adding it
to run_tests.py.

> 4. What is the current best way to run external applications? There
> was an application wrapper class in the past...

For simple Unix style applications controlled via the command line,
use the Bio.Application framework as in Bio.Align.Applications or
Bio.Sequencing.Applications, Bio.Phylo.Applications, or
Bio.Emboss.Applications (etc?).
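
The general idea behind such wrappers can be sketched as below. This is a
hypothetical stand-in written for illustration, not Biopython's
Bio.Application code: the SimpleCommandline class, the "mytool" program and
its options are all invented here.

```python
import shlex


class SimpleCommandline:
    """Hold a program name plus options; str() yields the command line."""

    def __init__(self, program, **options):
        self.program = program
        self.options = dict(options)

    def __str__(self):
        parts = [self.program]
        # Sort the options so the generated string is reproducible.
        for name, value in sorted(self.options.items()):
            parts.append("-%s" % name)
            parts.append(str(value))
        # Quote each part so spaces in file names survive the shell.
        return " ".join(shlex.quote(part) for part in parts)


cline = SimpleCommandline("mytool", infile="my data.txt", out="results")
command_line = str(cline)
```

Keeping the options as data until str() is called means the same object can
be inspected, modified, or handed to whatever actually runs the command.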

Peter


From p.j.a.cock at googlemail.com  Mon Apr 16 11:20:59 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 16:20:59 +0100
Subject: [Biopython-dev] Enhancements to Phylo.draw;
	pyplot best practices
In-Reply-To: <CAMC681k3p-mRzUcdFmQW0_wsx64ENOgDdGtWtpzVROedax1EXg@mail.gmail.com>
References: <CAMC681nyVV6mNNSgT1zZeek+NwE3UwxprRArgGNFMeX-b3yPpA@mail.gmail.com>
	<CAMC681k3p-mRzUcdFmQW0_wsx64ENOgDdGtWtpzVROedax1EXg@mail.gmail.com>
Message-ID: <CAKVJ-_5QdK=eTdnvyvyPFkCHJ+2X+tCUSTotZcWGw5p5k-k3GA@mail.gmail.com>

On Sat, Apr 7, 2012 at 7:42 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> On Wed, Apr 4, 2012 at 10:53 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
>> Hi all,
>>
>> I'm considering some enhancements to the Phylo.draw function to make it
>> more customizable for power users. Since the function is based on
>> matplotlib/pylab/pyplot, it's possible for quite a bit to be left to the
>> user; however, I'm not fully versed in what pyplot is capable of.
>>
>> Relevant feature request in Redmine:
>> https://redmine.open-bio.org/issues/3336
>>
>> Ideas:
>
> [...]
>
> Just committed this feature:
> https://github.com/biopython/biopython/commit/72990549a1b769ab19ab0bd33a8c35fdf031ac2d

Hi Eric,

That seems to have caused a test failure on one of our buildslaves:

======================================================================
ERROR: Run the tree layout algorithm, but don't display it.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildslave/BuildBot/lin2664/build/Tests/test_Phylo_depend.py", line 51, in test_draw
    Phylo.draw(dollo, do_show=False)
  File "/home/buildslave/BuildBot/lin2664/build/build/lib.linux-x86_64-2.6/Bio/Phylo/_utils.py", line 366, in draw
    fig = plt.figure()
  File "/usr/local/lib/python2.6/site-packages/matplotlib/pyplot.py", line 270, in figure
    **kwargs)
  File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wxagg.py", line 120, in new_figure_manager
    backend_wx._create_wx_app()
  File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wx.py", line 1377, in _create_wx_app
    wxapp = wx.PySimpleApp()
  File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py", line 8078, in __init__
    wx.App.__init__(self, redirect, filename, useBestVisual, clearSigInt)
  File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py", line 7946, in __init__
    raise SystemExit(msg)
SystemExit: Unable to access the X Display, is $DISPLAY set properly?

----------------------------------------------------------------------

http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/534/steps/shell/logs/stdio
http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/535/steps/shell/logs/stdio

Interestingly the same machine is passing the tests under other Python versions.
That would seem to rule out the $DISPLAY environment variable being the cause.
My hunch would be this is something about the Python 2.6 install, perhaps it
is missing some library (wxPython maybe).

Logged in as the buildslave on this machine I can see that both Python 2.6 & 2.7
have the same version of matplotlib installed, but only one is failing the test:

$ python2.5
Python 2.5.5 (r255:77872, Jan 14 2011, 17:09:55)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named matplotlib

$ python2.6
Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib
>>> matplotlib.__version__
'1.0.0'

$ python2.7
Python 2.7 (r27:82500, Jul 13 2010, 14:02:41)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib
>>> matplotlib.__version__
'1.0.0'
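
Since the traceback goes through backend_wxagg, one thing worth comparing
between the two installs is which matplotlib backend each one selects. A
minimal sketch, assuming the test only needs to build the figure and never
display it: use_noninteractive_backend is a hypothetical helper, while
matplotlib.use and matplotlib.get_backend are matplotlib's own functions.

```python
def use_noninteractive_backend():
    """Select the Agg backend, which renders in memory and needs no X display.

    Returns the active backend name, or None if matplotlib is not installed.
    """
    try:
        import matplotlib
    except ImportError:
        return None
    # Best called before the first "import matplotlib.pyplot".
    matplotlib.use("Agg")
    return matplotlib.get_backend()


backend = use_noninteractive_backend()
```

Whether this is the right fix for the buildslave is an open question; the
sketch only shows how a test could avoid depending on wx and $DISPLAY.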


Peter

From tiagoantao at gmail.com  Mon Apr 16 11:31:50 2012
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 16 Apr 2012 16:31:50 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To: <CAKVJ-_78-0YsHOSZ7J0SXcBdmQysdDx0EC7HTLdtEwi9nq2mYw@mail.gmail.com>
References: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
	<CAA9RGEPKpK4Ti9_odipquk6qq1g0jhy3ARDvjZgv9BS-OTstpA@mail.gmail.com>
	<CAKVJ-_78-0YsHOSZ7J0SXcBdmQysdDx0EC7HTLdtEwi9nq2mYw@mail.gmail.com>
Message-ID: <CAA9RGEPBVfr=fvG4REOk0ZNrWPvS14ycJ-1mTkrYBMGNfZUbPw@mail.gmail.com>

2012/4/16 Peter Cock <p.j.a.cock at googlemail.com>:
> For simple Unix style applications controlled via the command line,
> use the Bio.Application framework as in Bio.Align.Applications or
> Bio.Sequencing.Applications, Bio.Phylo.Applications, or
> Bio.Emboss.Applications (etc?).

I wonder whether people have ever needed to abstract the computing
infrastructure? The current code does local (blocking) execution, but
we see environments with BAS or grids where other models are used. I
am not suggesting any specific solution, but the current approach
does not seem very scalable to me. No?


-- 
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From p.j.a.cock at googlemail.com  Mon Apr 16 12:08:20 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 17:08:20 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To: <CAA9RGEPBVfr=fvG4REOk0ZNrWPvS14ycJ-1mTkrYBMGNfZUbPw@mail.gmail.com>
References: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
	<CAA9RGEPKpK4Ti9_odipquk6qq1g0jhy3ARDvjZgv9BS-OTstpA@mail.gmail.com>
	<CAKVJ-_78-0YsHOSZ7J0SXcBdmQysdDx0EC7HTLdtEwi9nq2mYw@mail.gmail.com>
	<CAA9RGEPBVfr=fvG4REOk0ZNrWPvS14ycJ-1mTkrYBMGNfZUbPw@mail.gmail.com>
Message-ID: <CAKVJ-_4u+V9+HXoUo0fGTvnS8pi3QQ6wHbsF8KiHPSjOioL90g@mail.gmail.com>

2012/4/16 Tiago Antão <tiagoantao at gmail.com>:
> 2012/4/16 Peter Cock <p.j.a.cock at googlemail.com>:
>> For simple Unix style applications controlled via the command line,
>> use the Bio.Application framework as in Bio.Align.Applications or
>> Bio.Sequencing.Applications, Bio.Phylo.Applications, or
>> Bio.Emboss.Applications (etc?).
>
> I wonder whether people have ever needed to abstract the computing
> infrastructure? The current code does local (blocking) execution, but
> we see environments with BAS or grids where other models are used. I
> am not suggesting any specific solution, but the current approach
> does not seem very scalable to me. No?

I use the current framework with an SGE cluster: str(cline_object)
gives the command line string to submit as the job.

It would be nice to have some documented examples using
this in combination with multiprocessing or something... but
I find most of the tools I call are already multi-threaded.
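
As a rough illustration of running several command line strings at once on
one machine, here is a sketch using only the standard library. It swaps in a
thread pool from concurrent.futures rather than multiprocessing, because each
worker merely waits on a child process; run_command and run_all are
hypothetical helper names, and the Python interpreter itself plays the part
of the external tool.

```python
import shlex
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(command_line):
    """Run one command line string; return (return code, stdout text)."""
    result = subprocess.run(
        shlex.split(command_line), capture_output=True, text=True
    )
    return result.returncode, result.stdout


def run_all(command_lines, workers=2):
    """Run several command lines concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_command, command_lines))


# Stand-in jobs: in real use these would be str(cline_object) values.
jobs = [
    "%s -c %s" % (shlex.quote(sys.executable), shlex.quote("print(%d)" % n))
    for n in (1, 2, 3)
]
results = run_all(jobs)
```

The workers argument caps how many tools run at the same time, which matters
when the tools are themselves multi-threaded.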

Peter


From andrew.sczesnak at med.nyu.edu  Mon Apr 16 12:48:41 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Mon, 16 Apr 2012 12:48:41 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>
References: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>
Message-ID: <4F8C4D69.4040009@med.nyu.edu>

Hi Eric,

I was playing with Bio.Cluster recently and noticed that trees generated 
by that module are not compatible with Bio.Phylo. I think it would be 
useful if output from Cluster could be manipulated with Phylo.

At first glance, it doesn't seem like it would be that tricky to add a 
method of converting Bio.Cluster tree objects to Bio.Phylo tree objects, 
and I would be happy to work on this. Before making an attempt, I wanted 
to get your feedback on whether you think this would be useful and if 
you had anything similar in the works already.


Best,
Andrew

From eric.talevich at gmail.com  Mon Apr 16 18:15:14 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 16 Apr 2012 18:15:14 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To: <4F8C4D69.4040009@med.nyu.edu>
References: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>
	<4F8C4D69.4040009@med.nyu.edu>
Message-ID: <CAMC681mAkQDmqDmxBeHRdj_rjS7E1mJMLFiyX00AA9pUR78auQ@mail.gmail.com>

On Mon, Apr 16, 2012 at 12:48 PM, Andrew Sczesnak
<andrew.sczesnak at med.nyu.edu> wrote:
> Hi Eric,
>
> I was playing with Bio.Cluster recently and noticed that trees generated by
> that module are not compatible with Bio.Phylo. I think it would be useful if
> output from Cluster could be manipulated with Phylo.
>
> At first glance, it doesn't seem like it would be that tricky to add a
> method of converting Bio.Cluster tree objects to Bio.Phylo tree objects, and
> I would be happy to work on this. Before making an attempt, I wanted to get
> your feedback on whether you think this would be useful and if you had
> anything similar in the works already.
>
>
> Best,
> Andrew

Hi Andrew,

Interesting idea. It would be simple enough to add a "from_cluster"
function or class method to either Phylo/BaseTree.py or
Phylo/_utils.py. But as every scientist knows, just because we can
doesn't necessarily mean we should. Do you have a specific use case in
mind?

If the main idea is to use Bio.Cluster to generate trees based on a
measure of sequence distance, we could probably do more to support
that. This code might also be worth posting on the wiki "Phylo cookbook"
page (http://www.biopython.org/wiki/Phylo_cookbook) to get more eyes
on it while we consider merging it into the main distribution.

-Eric

From andrew.sczesnak at med.nyu.edu  Mon Apr 16 18:47:25 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Mon, 16 Apr 2012 18:47:25 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To: <CAMC681mAkQDmqDmxBeHRdj_rjS7E1mJMLFiyX00AA9pUR78auQ@mail.gmail.com>
References: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>
	<4F8C4D69.4040009@med.nyu.edu>
	<CAMC681mAkQDmqDmxBeHRdj_rjS7E1mJMLFiyX00AA9pUR78auQ@mail.gmail.com>
Message-ID: <4F8CA17D.4080907@med.nyu.edu>

Eric,

I can describe two use cases from my own experience. First, the MAF 
parser I've been working on can pull the multiple alignment of some gene 
between a bunch of genomes. Thinking of recipes for the cookbook, I 
thought it would be neat to walk the user through constructing a 
distance matrix by hand (though you're right--more could be done to 
support this), clustering with Bio.Cluster and visualizing the result 
with Bio.Phylo. I like this example because it integrates several 
different parts of BioPython along with a lesson about inferring 
distances between sequences.

Second, for another project, I've been generating distance matrices 
based on the shared gene content of bacterial genomes and the 
presence-or-absence of orthologous groups in each. Presently, I ferry 
the matrices to a clustering program and then visualize the resulting 
trees in yet another tool. Looking into ways of streamlining this 
brought me back to Bio.Cluster, Bio.Phylo and the incompatibility of 
their tree objects.

I wonder, what would be the most elegant way of bridging the gap?
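
One possible bridge, sketched here under stated assumptions rather than
as an agreed design, is to go through Newick text, which Bio.Phylo can
already read. The function name cluster_nodes_to_newick is
hypothetical. It takes plain (left, right, distance) tuples standing in
for Bio.Cluster nodes and assumes the Bio.Cluster numbering convention:
a non-negative index is an item, and a negative index -k refers to node
number k-1 in the list. It also assumes each node sits at height
distance/2, which is the usual UPGMA reading of the merge distance.

```python
def cluster_nodes_to_newick(nodes, names):
    """Convert (left, right, distance) tuples into a Newick string.

    nodes -- list of tuples in merge order, last one being the root
    names -- labels for the items, indexed by item number
    """

    def height(index):
        # Items are leaves at height 0; a node is placed at half its
        # merge distance (an assumption of this sketch).
        return 0.0 if index >= 0 else nodes[-index - 1][2] / 2.0

    def build(index, parent_height):
        length = parent_height - height(index)
        if index >= 0:
            return "%s:%g" % (names[index], length)
        left, right, _ = nodes[-index - 1]
        here = height(index)
        return "(%s,%s):%g" % (build(left, here), build(right, here), length)

    root_index = -len(nodes)
    left, right, _ = nodes[-1]
    top = height(root_index)
    return "(%s,%s);" % (build(left, top), build(right, top))


if __name__ == "__main__":
    example_nodes = [(0, 1, 0.2), (-1, 2, 0.6)]
    print(cluster_nodes_to_newick(example_nodes, ["a", "b", "c"]))
```

The resulting string could then be passed to Phylo.read through a
StringIO handle with the 'newick' format. A direct object-to-object
conversion would avoid the text round trip but needs the same index
bookkeeping.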


Best,
Andrew

On 04/16/2012 06:15 PM, Eric Talevich wrote:
> On Mon, Apr 16, 2012 at 12:48 PM, Andrew Sczesnak
> <andrew.sczesnak at med.nyu.edu>  wrote:
>> Hi Eric,
>>
>> I was playing with Bio.Cluster recently and noticed that trees generated by
>> that module are not compatible with Bio.Phylo. I think it would be useful if
>> output from Cluster could be manipulated with Phylo.
>>
>> At first glance, it doesn't seem like it would be that tricky to add a
>> method of converting Bio.Cluster tree objects to Bio.Phylo tree objects, and
>> I would be happy to work on this. Before making an attempt, I wanted to get
>> your feedback on whether you think this would be useful and if you had
>> anything similar in the works already.
>>
>>
>> Best,
>> Andrew
>
> Hi Andrew,
>
> Interesting idea. It would be simple enough to add a "from_cluster"
> function or class method to either Phylo/BaseTree.py or
> Phylo/_utils.py. But as every scientist knows, just because we can
> doesn't necessarily mean we should. Do you have a specific use case in
> mind?
>
> If the main idea is to use Bio.Cluster to generate trees based on a
> measure of sequence distance, we could probably do more to support
> that. This code might also be worth posting on wiki "Phylo cookbook"
> page (http://www.biopython.org/wiki/Phylo_cookbook) to get more eyes
> on it while we consider merging it into the main distribution.
>
> -Eric

From eric.talevich at gmail.com  Tue Apr 17 00:17:26 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 17 Apr 2012 00:17:26 -0400
Subject: [Biopython-dev] Enhancements to Phylo.draw;
	pyplot best practices
In-Reply-To: <CAKVJ-_5QdK=eTdnvyvyPFkCHJ+2X+tCUSTotZcWGw5p5k-k3GA@mail.gmail.com>
References: <CAMC681nyVV6mNNSgT1zZeek+NwE3UwxprRArgGNFMeX-b3yPpA@mail.gmail.com>
	<CAMC681k3p-mRzUcdFmQW0_wsx64ENOgDdGtWtpzVROedax1EXg@mail.gmail.com>
	<CAKVJ-_5QdK=eTdnvyvyPFkCHJ+2X+tCUSTotZcWGw5p5k-k3GA@mail.gmail.com>
Message-ID: <CAMC681mQpjncc-wBctPPDgqRQEbO5RVJFeZ+=ky_dZinTr939g@mail.gmail.com>

On Mon, Apr 16, 2012 at 11:20 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hi Eric,
>
> That seems to have caused a test failure on one of our buildslaves:
>
> ======================================================================
> ERROR: Run the tree layout algorithm, but don't display it.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/home/buildslave/BuildBot/lin2664/build/Tests/test_Phylo_depend.py",
> line 51, in test_draw
>     Phylo.draw(dollo, do_show=False)
>   File "/home/buildslave/BuildBot/lin2664/build/build/lib.linux-x86_64-2.6/Bio/Phylo/_utils.py",
> line 366, in draw
>     fig = plt.figure()
>   File "/usr/local/lib/python2.6/site-packages/matplotlib/pyplot.py",
> line 270, in figure
>     **kwargs)
>   File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wxagg.py",
> line 120, in new_figure_manager
>     backend_wx._create_wx_app()
>   File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wx.py",
> line 1377, in _create_wx_app
>     wxapp = wx.PySimpleApp()
>   File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py",
> line 8078, in __init__
>     wx.App.__init__(self, redirect, filename, useBestVisual, clearSigInt)
>   File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py",
> line 7946, in __init__
>     raise SystemExit(msg)
> SystemExit: Unable to access the X Display, is $DISPLAY set properly?
>
> ----------------------------------------------------------------------
>
> http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/534/steps/shell/logs/stdio
> http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/535/steps/shell/logs/stdio
>
> Interestingly the same machine is passing the tests under other Python versions.
> That would seem to rule out the $DISPLAY environment variable being the cause.
> My hunch would be this is something about the Python 2.6 install, perhaps it
> is missing some library (wxPython maybe).
>
> Logged in as the buildslave on this machine I can see that both Python 2.6 & 2.7
> have the same version of matplotlib installed, but only one is failing the test:
>
> $ python2.5
> Python 2.5.5 (r255:77872, Jan 14 2011, 17:09:55)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import matplotlib
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named matplotlib
>
> $ python2.6
> Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import matplotlib
> >>> matplotlib.__version__
> '1.0.0'
>
> $ python2.7
> Python 2.7 (r27:82500, Jul 13 2010, 14:02:41)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import matplotlib
> >>> matplotlib.__version__
> '1.0.0'
>
>
> Peter


Actually, it was this commit which added new unit tests:
https://github.com/biopython/biopython/commit/a5f995bc1d9e113d88195cc7ae6f389984d762d8

On my machine with Python 2.6 and Ubuntu, the test passes, so I'm not
sure how to debug this, exactly. Do you know a way to prevent
matplotlib from attempting to launch the Wx app, beyond turning off
interactive mode as the test already does?

One idea is to specify a matplotlib backend other than wx. For
example, using this import approach in test_Phylo_depend.py might do
the trick:

try:
    import matplotlib
except ImportError:
    raise MissingExternalDependencyError(
            "Install matplotlib if you want to use Bio.Phylo._utils.")
else:
    # Don't use the Wx backend for matplotlib, b/c that depends on Wx being
    # properly set up on the build machine. Instead, use the simpler postscript
    # backend -- we're not going to display or save the plot anyway, so it
    # doesn't matter much, as long as it's not Wx. I guess.
    matplotlib.use("ps")
    from matplotlib import pyplot


Would you be able to test this on the errant buildbot machine without
having to commit this to the trunk?


Thanks,
Eric


From p.j.a.cock at googlemail.com  Tue Apr 17 05:31:05 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 17 Apr 2012 10:31:05 +0100
Subject: [Biopython-dev] Enhancements to Phylo.draw;
	pyplot best practices
In-Reply-To: <CAMC681mQpjncc-wBctPPDgqRQEbO5RVJFeZ+=ky_dZinTr939g@mail.gmail.com>
References: <CAMC681nyVV6mNNSgT1zZeek+NwE3UwxprRArgGNFMeX-b3yPpA@mail.gmail.com>
	<CAMC681k3p-mRzUcdFmQW0_wsx64ENOgDdGtWtpzVROedax1EXg@mail.gmail.com>
	<CAKVJ-_5QdK=eTdnvyvyPFkCHJ+2X+tCUSTotZcWGw5p5k-k3GA@mail.gmail.com>
	<CAMC681mQpjncc-wBctPPDgqRQEbO5RVJFeZ+=ky_dZinTr939g@mail.gmail.com>
Message-ID: <CAKVJ-_74uXOPHy=heMi7=PMc_TrJt7UFiwHsTORpHK_TO86-TQ@mail.gmail.com>

On Tue, Apr 17, 2012 at 5:17 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> Actually, it was this commit which added new unit tests:
> https://github.com/biopython/biopython/commit/a5f995bc1d9e113d88195cc7ae6f389984d762d8
>

OK - thanks for checking.

> On my machine with Python 2.6 and Ubuntu, the test passes, so I'm not
> sure how to debug this, exactly. Do you know a way to prevent
> matplotlib from attempting to launch the Wx app, beyond turn off
> interactive mode as the test already does?

Not sure.

> One idea is to specify a matplotlib backend other than wx. For
> example, using this import approach in test_Phylo_depend.py might do
> the trick:
>
> try:
>     import matplotlib
> except ImportError:
>     raise MissingExternalDependencyError(
>             "Install matplotlib if you want to use Bio.Phylo._utils.")
> else:
>     # Don't use the Wx backend for matplotlib, b/c that depends on Wx being
>     # properly set up on the build machine. Instead, use the simpler postscript
>     # backend -- we're not going to display or save the plot anyway, so it
>     # doesn't matter much, as long as it's not Wx. I guess.
>     matplotlib.use("ps")
>     from matplotlib import pyplot
>
>
> Would you be able to test this on the errant buildbot machine without
> having to commit this to the trunk?

Yes, that works (this buildbot is one of 'my' servers so I can run this
directly). Please check that fix in.

Thanks,

Peter


From p.j.a.cock at googlemail.com  Tue Apr 17 11:23:22 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 17 Apr 2012 16:23:22 +0100
Subject: [Biopython-dev] BGZF support,
	was Re: Biopython 1.60 plans and beyond
Message-ID: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>

On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> Here are some things that I think are strong
> candidates for 1.60 (not an exclusive list!)
>
> ...
>
> BGZF support: Low level module like Python's gzip,
> support in SeqIO for indexing BGZF compressed files,
> ...

I've just rebased my bgzf branch, which I think is ready to apply to the
trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
https://github.com/peterjc/biopython/tree/bgzf2

Would anyone like to review this please? There are unittests and
plenty of docstrings - but so far nothing in the Tutorial though.

I wrote a blog post late last year explaining what this allows, and
this branch includes the changes to Bio.SeqIO for indexing BGZF
compressed sequence files that the post discussed:
http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

The probable next step after this is combining it with Andrew Sczesnak's
work on indexing MAF files (they can get pretty big) as explored by 'I.J.'
(who as far as I know hasn't signed up to the biopython-dev list, BCC'd).

Also it would be interesting to explore doing the (de)compression of
blocks on worker threads to take advantage of multiple cores.
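
A minimal sketch of the worker-thread idea, using only the standard
library and not the code in the bgzf branch: the data is cut into
fixed-size blocks, each block is compressed as its own gzip member on a
thread pool, and the members are concatenated. This is not real BGZF
(the extra header field that records each block's size is missing), and
BLOCK_SIZE, split_blocks and compress_blocks are hypothetical names.
The approach relies on zlib generally releasing the GIL while it works.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 65536  # hypothetical block size for this sketch


def split_blocks(data, size=BLOCK_SIZE):
    """Cut a bytes object into consecutive blocks of at most `size` bytes."""
    return [data[start:start + size] for start in range(0, len(data), size)]


def compress_blocks(data, workers=4):
    """Compress each block as a separate gzip member, in input order."""
    blocks = split_blocks(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        members = list(pool.map(gzip.compress, blocks))
    return b"".join(members)


payload = b"ACGT" * 100000
compressed = compress_blocks(payload)

if __name__ == "__main__":
    # A multi-member gzip stream decompresses back to the original bytes.
    print(len(payload), len(compressed), gzip.decompress(compressed) == payload)
```

Because pool.map preserves order, the members can be joined directly;
random access would additionally need the offset of each member.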

Another idea would be to switch from a plain dictionary to an
ordered dictionary for holding cached decompressed blocks,
giving a way to drop the oldest block (although not perhaps as
clever as dropping the least recently used block, the overhead is
lower). That would require including our own OrderedDict class
on the older Python platforms.
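
For illustration only, a cache along those lines could look like the
sketch below. The class name BlockCache and its methods are
hypothetical and are not taken from the bgzf branch. It evicts in
insertion order (oldest first), not least recently used, so a lookup
does not touch the ordering.

```python
from collections import OrderedDict


class BlockCache(object):
    """Bounded cache that drops the oldest inserted block when full."""

    def __init__(self, max_blocks=50):
        self.max_blocks = max_blocks
        self._blocks = OrderedDict()

    def get(self, offset):
        """Return the cached block for this offset, or None if absent."""
        return self._blocks.get(offset)

    def put(self, offset, data):
        """Store a block, evicting the oldest one if the cache is full."""
        if offset not in self._blocks and len(self._blocks) >= self.max_blocks:
            self._blocks.popitem(last=False)  # remove the first inserted
        self._blocks[offset] = data

    def __len__(self):
        return len(self._blocks)


if __name__ == "__main__":
    cache = BlockCache(max_blocks=2)
    cache.put(0, b"first")
    cache.put(100, b"second")
    cache.get(0)  # reading does not protect the block from eviction
    cache.put(200, b"third")
    print(cache.get(0), cache.get(100), cache.get(200))
```

Turning this into a least-recently-used cache would mean re-inserting a
block on every read, which is the extra overhead mentioned above.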

Peter

[*] PyPy testing is complicated by running out of file handles,
an existing issue not something directly from this work. Part
of this is down to different GC under PyPy.

From eric.talevich at gmail.com  Tue Apr 17 11:25:35 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 17 Apr 2012 11:25:35 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To: <4F8CA17D.4080907@med.nyu.edu>
References: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>
	<4F8C4D69.4040009@med.nyu.edu>
	<CAMC681mAkQDmqDmxBeHRdj_rjS7E1mJMLFiyX00AA9pUR78auQ@mail.gmail.com>
	<4F8CA17D.4080907@med.nyu.edu>
Message-ID: <CAMC681m9Vd3J-CVfJE8aMq0Pmd66OXoThdJD1LaFguWagNTqqQ@mail.gmail.com>

Andrew,

It would be useful to have a quick and portable function for
distance-based tree estimation in Bio.Phylo, since otherwise it's
necessary to use one of the wrappers for external programs in
Bio.Phylo.Applications. (And currently, only PhyML is wrapped.) Does
the hierarchical clustering algorithm in Bio.Cluster correspond to any
common tree-estimation algorithm, e.g. UPGMA? If so, then it would
make a lot of sense to provide the glue for using it that way. If you
have done some work in this direction, I would be happy to see it.

-Eric


On Mon, Apr 16, 2012 at 6:47 PM, Andrew Sczesnak
<andrew.sczesnak at med.nyu.edu> wrote:
> Eric,
>
> I can describe two use cases from my own experience. First, the MAF parser
> I've been working on can pull the multiple alignment of some gene between a
> bunch of genomes. Thinking of recipes for the cookbook, I thought it would
> be neat to walk the user through constructing a distance matrix by hand
> (though you're right--more could be done to support this), clustering with
> Bio.Cluster and visualizing the result with Bio.Phylo. I like this example
> because it integrates several different parts of BioPython along with a
> lesson about inferring distances between sequences.
>
> Second, for another project, I've been generating distance matrices based on
> the shared gene content of bacterial genomes and the presence-or-absence of
> orthologous groups in each. Presently, I ferry the matrices to a clustering
> program and then visualize the resulting trees in yet another tool. Looking
> into ways of streamlining this brought me back to Bio.Cluster, Bio.Phylo and
> the incompatibility of their tree objects.
>
> I wonder, what would be the most elegant way of bridging the gap?
>
>
> Best,
> Andrew
>

From bioinformed at gmail.com  Tue Apr 17 12:11:37 2012
From: bioinformed at gmail.com (Kevin Jacobs <jacobs@bioinformed.com>)
Date: Tue, 17 Apr 2012 12:11:37 -0400
Subject: [Biopython-dev] BGZF support,
	was Re: Biopython 1.60 plans and beyond
In-Reply-To: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>
References: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>
Message-ID: <CAD=vDiqJLx=D_t0RVt1nPCTwxjgwpTXsvQprmd_hX5ffrR7PZQ@mail.gmail.com>

On Tue, Apr 17, 2012 at 11:23 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> >
> > Here are some things that I think are strong
> > candidates for 1.60 (not an exclusive list!)
> >
> > ...
> >
> > BGZF support: Low level module like Python's gzip,
> > support in SeqIO for indexing BGZF compressed files,
> > ...
>
> I've just rebased my bgzf branch, which I think is ready to apply to the
> trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
> https://github.com/peterjc/biopython/tree/bgzf2
>
> Would anyone like to review this please? There are unittests and
> plenty of docstrings - but so far nothing in the Tutorial though.
>
>
Hi Peter,

I've implemented code to create BAM/tabix style index files and perform
lookups, so it has been high on my list to test and validate your BGZF code
(rather than having to write my own).  I'm notoriously short on time, but
this is in the critical path for several projects and I'm going to work on
it over the next week or so.

-Kevin

From redmine at redmine.open-bio.org  Tue Apr 17 21:29:29 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 18 Apr 2012 01:29:29 +0000
Subject: [Biopython-dev] [Biopython - Bug #3333] PhyloXML writer fails to
	include is_aligned attribute with mol_seq elements
References: <redmine.issue-3333.20120326222124@redmine.open-bio.org>
Message-ID: <redmine.journal-14810.20120418012929@redmine.open-bio.org>


Issue #3333 has been updated by Eric Talevich.


The answer is: I'm an idiot. The mol_seq attribute was first defined as a complex attribute in the writer (via _handle_complex), but then further down redefined as a simple attribute.

Fix:
https://github.com/biopython/biopython/commit/a93c9892268274c4969131a1d401bb8ee235524a
----------------------------------------
Bug #3333: PhyloXML writer fails to include is_aligned attribute with mol_seq elements
https://redmine.open-bio.org/issues/3333

Author: Eric Talevich
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 
URL: 


First reported here:
http://lists.open-bio.org/pipermail/biopython/2012-March/007847.html

Steps to reproduce:

1. Load a tree, convert to PhyloXML
<pre>
from Bio import Phylo
from StringIO import StringIO
tree = Phylo.read(StringIO('(a,(b,c));'), 'newick').as_phyloxml()
</pre>

2. Add a sequence
<pre>
from Bio.Phylo import PhyloXML as PX
tree.clade[0].sequences = [PX.Sequence(type='dna', mol_seq=PX.MolSeq('AAA', is_aligned=False))]
</pre>

3. Verify that the sequence information has been set -- mol_seq has is_aligned set
<pre>
print tree
</pre>

<pre>
Clade(branch_length=1.0, name='a')
    Sequence(type='dna')
        MolSeq(value='AAA', is_aligned=False)
</pre>

4. View the PhyloXML representation -- mol_seq is missing the is_aligned attribute!
<pre>
print tree.format('phyloxml')
</pre>

<pre>
...
<phy:clade>
  <phy:name>c</phy:name>
  <phy:branch_length>1.0</phy:branch_length>
  <phy:sequence type="dna">
    <phy:mol_seq>AAA</phy:mol_seq>
  </phy:sequence>
</phy:clade>
...
</pre>


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From redmine at redmine.open-bio.org  Tue Apr 17 21:52:03 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 18 Apr 2012 01:52:03 +0000
Subject: [Biopython-dev] [Biopython - Bug #3333] (Closed) PhyloXML writer
	fails to include is_aligned attribute with mol_seq elements
References: <redmine.issue-3333.20120326222124@redmine.open-bio.org>
Message-ID: <redmine.journal-14811.20120418015203@redmine.open-bio.org>


Issue #3333 has been updated by Eric Talevich.

Status changed from New to Closed
% Done changed from 0 to 100


----------------------------------------
Bug #3333: PhyloXML writer fails to include is_aligned attribute with mol_seq elements
https://redmine.open-bio.org/issues/3333

Author: Eric Talevich
Status: Closed
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 
URL: 


First reported here:
http://lists.open-bio.org/pipermail/biopython/2012-March/007847.html

Steps to reproduce:

1. Load a tree, convert to PhyloXML
<pre>
from Bio import Phylo
from StringIO import StringIO
tree = Phylo.read(StringIO('(a,(b,c));'), 'newick').as_phyloxml()
</pre>

2. Add a sequence
<pre>
from Bio.Phylo import PhyloXML as PX
tree.clade[0].sequences = [PX.Sequence(type='dna', mol_seq=PX.MolSeq('AAA', is_aligned=False))]
</pre>

3. Verify that the sequence information has been set -- mol_seq has is_aligned set
<pre>
print tree
</pre>

<pre>
Clade(branch_length=1.0, name='a')
    Sequence(type='dna')
        MolSeq(value='AAA', is_aligned=False)
</pre>

4. View the PhyloXML representation -- mol_seq is missing the is_aligned attribute!
<pre>
print tree.format('phyloxml')
</pre>

<pre>
...
<phy:clade>
  <phy:name>c</phy:name>
  <phy:branch_length>1.0</phy:branch_length>
  <phy:sequence type="dna">
    <phy:mol_seq>AAA</phy:mol_seq>
  </phy:sequence>
</phy:clade>
...
</pre>




From redmine at redmine.open-bio.org  Thu Apr 19 00:27:49 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 19 Apr 2012 04:27:49 +0000
Subject: [Biopython-dev] [Biopython - Feature #3342] (New)
	Phylo.root_with_outgroup: set the length of the outgroup branch
Message-ID: <redmine.issue-3342.20120419042749@redmine.open-bio.org>


Issue #3342 has been reported by Eric Talevich.

----------------------------------------
Feature #3342: Phylo.root_with_outgroup: set the length of the outgroup branch
https://redmine.open-bio.org/issues/3342

Author: Eric Talevich
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


Add an option to the root_with_outgroup method to specify the length of the branch leading from the new root to the outgroup. This should not change the total tree length, i.e. this length is subtracted from the branch on the other side of the root.

This option makes it possible to root the tree in other ways that split the outgroup branch, leaving a bifurcating rather than trifurcating root.

I've attached a patch that implements this feature, plus unit tests for it.

HOWEVER:

A sane API for this method would look like:

>>> tree.root_with_outgroup("apple", "orange", outgroup_branch_length=0.4)

The original function definition included *args for specifying the outgroup taxa in one shot (instead of requiring a separate call to common_ancestor). But while Python 3 permits keyword-only arguments (a defined keyword argument after *args or just *), Python 2 does not. So I made the function calling style shown above work in a very weird way: the function definition has **kwargs instead of outgroup_branch_length=None, and the necessary keyword argument is pulled out of kwargs inside the body of the function. The name of this argument is given in the docstring, so it's still partly discoverable.
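
To make the two calling styles concrete, here is a minimal standalone
sketch. It is not the attached patch: root_py2_style and root_py3_style
are hypothetical stand-ins that only return their arguments instead of
rerooting a tree.

```python
def root_py2_style(*outgroup_targets, **kwargs):
    """Accept a keyword-only option in a way Python 2 also allows."""
    outgroup_branch_length = kwargs.pop("outgroup_branch_length", None)
    if kwargs:
        # Reject misspelled or unknown keywords instead of ignoring them.
        raise TypeError(
            "unexpected keyword argument(s): %s" % ", ".join(sorted(kwargs))
        )
    return outgroup_targets, outgroup_branch_length


def root_py3_style(*outgroup_targets, outgroup_branch_length=None):
    """Python 3 only: a named keyword argument after *args."""
    return outgroup_targets, outgroup_branch_length


if __name__ == "__main__":
    print(root_py2_style("apple", "orange", outgroup_branch_length=0.4))
    print(root_py3_style("apple", "orange", outgroup_branch_length=0.4))
```

Checking for leftover keys in kwargs keeps the Python 2 form from
silently swallowing a typo in the argument name.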

Are we cool with this? Or, can anyone think of a better way to handle this?


----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin




From p.j.a.cock at googlemail.com  Fri Apr 20 04:39:02 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Apr 2012 09:39:02 +0100
Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF parser
	(#33)
In-Reply-To: <biopython/biopython/pull/33@github.com>
References: <biopython/biopython/pull/33@github.com>
Message-ID: <CAKVJ-_4hv_MCb+Cmxqt8F0E09OeP+uOAX=Pj5Ajixyr8asKqjg@mail.gmail.com>

I've had a quick look on GitHub and it isn't obvious to me how to get
pull request emails CC'd to our dev mailing list... but anyway, Lenna
has been busy:

Peter

---------- Forwarded message ----------
From: Lenna Peterson
<reply+i-4201999-d8628b2a34f52e923e8471a792110c2edfbe13a8-63959 at reply.github.com>
Date: Thu, Apr 19, 2012 at 11:35 PM
Subject: [biopython] Feature: Python implementation of MMCIF parser (#33)
To: Peter Cock <p.j.a.cock at googlemail.com>


I've written a PLY (Python lex-yacc) module that is superimposable
with the C MMCIF module.

I've also partially rewritten the C MMCIF module to be object-oriented.

### Changed files ###

* MMCIFlexmodule.c: Now object-oriented (open file in constructor,
close file in destructor, etc). Docstrings! Added file IO exception.
* MMCIF2Dict.py: Minor changes for new object oriented API
* MMCIFParser: Changed all uses of map() to list comprehensions (more
compatible with 3)
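
A small illustration of why that last change matters (the variable
names are made up for this example and do not come from MMCIFParser):
under Python 3, map() returns a lazy iterator rather than a list, so
code that indexes or reuses the result needs a real list.

```python
fields = ["1.0", "2.5", "3.25"]

# Python 3: a one-shot iterator, with no indexing and no len().
lazy_coords = map(float, fields)

# Same values as a list on both Python 2 and Python 3.
coords = [float(value) for value in fields]

if __name__ == "__main__":
    print(coords[0], len(coords))
```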

### New files ###

* MMCIFlex.py: PLY-based module for tokenizing input.

### What it needs ###
Addition of PLY dependency to setup.py.
I'm not quite sure how to handle this, as PLY wouldn't be necessary on
a platform with C Python. Thoughts? Which non-CPython implementations
are worth testing?
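
One common way to handle an optional dependency like this, shown as a
sketch and not as what the pull request does, is to try the compiled
module first and fall back to the pure Python one, so PLY is only
needed when the C extension is missing. The helper name
import_first_available is hypothetical, and the module names in the
example are placeholders chosen so the snippet runs anywhere.

```python
import importlib


def import_first_available(candidates):
    """Return the first module in `candidates` that can be imported."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of these modules is available: %r" % (candidates,))


if __name__ == "__main__":
    # Placeholder names: the first stands in for a missing C extension,
    # the second for the pure Python fallback.
    lexer = import_first_available(["_missing_c_lexer_example", "json"])
    print(lexer.__name__)
```

With this pattern setup.py would not need to require PLY outright; the
ImportError raised at the end is the place to tell the user which
package to install.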


New C module tested on Python 2.6 on Mac OS X and Debian. I hope it
still works on Windows.
On my machine, the C module processes a 30,000 line test file in 10-15
ms; the Python module takes ~150 ms.

You can merge this Pull Request by running:

  git pull https://github.com/lennax/biopython MMCIF2

Or you can view, comment on it, or merge it online at:

  https://github.com/biopython/biopython/pull/33

-- Commit Summary --

* Ply test in progress.
* Quoted values with spaces are being broken.
* Removed hard inclusion of ply.
* Fixed quoted strings with spaces.
* Changed Parser call to 2Dict. Semicolons break.
* Changed Parser call to 2Dict. Semicolons break.
* Lexes full file w/o error, FIXME loops
* Tweak: comment handling
* Changed token "NAME" to "TAG"
* Using IUCr grammar. FIXME quote/semi
* Fixed quoted strings.
* Semicolon text field fixed, FIXME included \n
* Fixed semi newlines.
* non-eol temp fix, doesn't match single chars
* Lexes full CIF file with no noticed errors.
* Added timing.
* Added states to lexer.
* Lex loops into [header, [items], ...]; \d hacks.
* Enforced semicolon rule.
* Yacc works.
* Re-added values to lexer state 'loop'
* FIXME syntax error/hangs on full file.
* Lexer gathers values, added parse precedence.
* Minor lex cleanup.
* Testing exclusionary lex redo.
* Streamlined rules, no loop yet.
* Still won't yacc 30k line file.
* Merge branch 'master' of git://github.com/biopython/biopython into ply2
* Added __name__ __main__ check.
* Parser redo, still doesn't parse 30k line file.
* Added comments to tokenizer.
* Fixed lex module's callability from yacc.
* Fixed DATA token failure.
* Multiple improvements, still no 30k.
* Moved lexer arguments to constructor.
* Moved data input to constructor, added docs
* Validated to pep8.
* Merge branch 'master' of git://github.com/biopython/biopython into ply2
* Add MMCIF2Dict from ply branch.
* Remove flex header dependency of CIF parser.
* Update MMCIFParser call of MMCIF2Dict.
* PLY lexer works with MMCIF2Dict.
* Cleanup.
* Cleaned up import.
* Updated docstring.
* Subclassed dict.
* Restored MMCIFParser call to MMCIF2Dict.
* Removed main() from lex input.
* Restored newline.
* Fix C prototype warnings.
* Modifying python lexer to be substitutable w/ C.
* Make header for generated C.
* Import C lexer or Python lexer.
* Improvements and documentation.
* Uncomment GLOBAL token definition.
* PLY lexer and C lexer should be interchangeable.
* Improve error reporting of import.
* Turn on ply lex optimize.
* Call instance of Python lexer.
* Working on implementing class in C module.
* Start unit test for MMCIF.
* Minimal unit test for MMCIFParser.
* Revert to old generated C; manually added noyywrap
* Manually added function prototypes to generated C.
* Merge branch 'ply2' into dev
* Merge branch 'ply' into dev
* Merge branch 'c-dev' into dev
* Merge branch 'master' of git://github.com/biopython/biopython into dev
* Cleaning up old files.
* More cleanup.
* Merging Parser from MMCIFlex branch.
* Parser and unit test for PyCIFRW
* Python and C lexer APIs are now identical.
* Add copyright and license notices.
* Merge branch 'master' of git://github.com/biopython/biopython into dev
* Trying GnuWin32 flex-generated C.
* Win flex generated with new mmcif.lex
* GnuWin32 flex generated C, used dos2unix for CRLF
* Added correct author to flex C module.
* Merge branch 'master' of git://github.com/biopython/biopython into dev
* Merge branch 'master' of git://github.com/biopython/biopython into dev
* Change map() to list comprehensions for 3 compat.
* Renamed python lexer to match C module.
* Added file IO exception to C module.
* Tweak lexer module import.
* Prep Python CIF lexer for pull request.
* Whitespace tweaks.

-- File Changes --

M Bio/PDB/MMCIF2Dict.py (20)
M Bio/PDB/MMCIFParser.py (8)
A Bio/PDB/mmCIF/MMCIFlex.py (253)
M Bio/PDB/mmCIF/MMCIFlexmodule.c (122)

-- Patch Links --

  https://github.com/biopython/biopython/pull/33.patch
  https://github.com/biopython/biopython/pull/33.diff

---
Reply to this email directly or view it on GitHub:
https://github.com/biopython/biopython/pull/33


From andrew.sczesnak at med.nyu.edu  Fri Apr 20 18:28:43 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Fri, 20 Apr 2012 18:28:43 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To: <CAMC681m9Vd3J-CVfJE8aMq0Pmd66OXoThdJD1LaFguWagNTqqQ@mail.gmail.com>
References: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>
	<4F8C4D69.4040009@med.nyu.edu>
	<CAMC681mAkQDmqDmxBeHRdj_rjS7E1mJMLFiyX00AA9pUR78auQ@mail.gmail.com>
	<4F8CA17D.4080907@med.nyu.edu>
	<CAMC681m9Vd3J-CVfJE8aMq0Pmd66OXoThdJD1LaFguWagNTqqQ@mail.gmail.com>
Message-ID: <4F91E31B.9030101@med.nyu.edu>

Eric,

If my understanding is correct, UPGMA is slang for agglomerative 
average-linkage hierarchical clustering which is implemented along with 
single- and complete-linkage in the module. There's no equivalent of 
neighbor-joining or maximum-likelihood and Bio.Cluster probably isn't 
that fast with large numbers of nodes so wrappers are still useful. We 
could probably add an NJ implementation for small matrices pretty easily 
if you think it's worthwhile.
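
To show the correspondence in a few lines, here is a pure Python UPGMA
sketch for small matrices. It is an illustration written for this
thread, not Bio.Cluster's implementation, and the function name upgma
is hypothetical. It repeatedly merges the closest pair, places the new
node at half the merge distance, and updates distances with a
size-weighted average, which is what average linkage does.

```python
def upgma(matrix, names):
    """Build a Newick string by UPGMA from a symmetric distance matrix."""
    count = len(names)
    # Active clusters: id -> (newick text, number of items, node height).
    clusters = dict((i, (names[i], 1, 0.0)) for i in range(count))
    # Pairwise distances keyed as (larger id, smaller id).
    dist = dict(((i, j), matrix[i][j]) for i in range(count) for j in range(i))
    next_id = count
    while len(clusters) > 1:
        (a, b), d = min(dist.items(), key=lambda item: item[1])
        text_a, size_a, height_a = clusters.pop(a)
        text_b, size_b, height_b = clusters.pop(b)
        height = d / 2.0
        merged = "(%s:%g,%s:%g)" % (
            text_a, height - height_a, text_b, height - height_b)
        # Average linkage: weight each old distance by cluster size.
        updated = {}
        for c in clusters:
            d_ac = dist[(max(a, c), min(a, c))]
            d_bc = dist[(max(b, c), min(b, c))]
            updated[(next_id, c)] = (
                (size_a * d_ac + size_b * d_bc) / float(size_a + size_b))
        dist = dict((k, v) for k, v in dist.items() if a not in k and b not in k)
        dist.update(updated)
        clusters[next_id] = (merged, size_a + size_b, height)
        next_id += 1
    (text, _, _), = clusters.values()
    return text + ";"


if __name__ == "__main__":
    distances = [[0.0, 0.2, 0.6], [0.2, 0.0, 0.6], [0.6, 0.6, 0.0]]
    print(upgma(distances, ["a", "b", "c"]))
```

The pair search is quadratic per merge, so this is only sensible for
small inputs; larger problems are better left to Bio.Cluster or an
external program.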

Either way, the glue could be useful for visualizing relationships 
between genes/samples in microarrays (what I gather Bio.Cluster is 
intended for).


Andrew

On 04/17/2012 11:25 AM, Eric Talevich wrote:
> Andrew,
>
> It would be useful to have a quick and portable function for
> distance-based tree estimation in Bio.Phylo, since otherwise it's
> necessary to use one of the wrappers for external programs in
> Bio.Phylo.Applications. (And currently, only PhyML is wrapped.) Does
> the hierarchical clustering algorithm in Bio.Cluster correspond to any
> common tree-estimation algorithm, e.g. UPGMA? If so, then it would
> make a lot of sense to provide the glue for using it that way. If you
> have done some work in this direction, I would be happy to see it.
>
> -Eric
>
>
> On Mon, Apr 16, 2012 at 6:47 PM, Andrew Sczesnak
> <andrew.sczesnak at med.nyu.edu>  wrote:
>> Eric,
>>
>> I can describe two use cases from my own experience. First, the MAF parser
>> I've been working on can pull the multiple alignment of some gene between a
>> bunch of genomes. Thinking of recipes for the cookbook, I thought it would
>> be neat to walk the user through constructing a distance matrix by hand
>> (though you're right--more could be done to support this), clustering with
>> Bio.Cluster and visualizing the result with Bio.Phylo. I like this example
>> because it integrates several different parts of BioPython along with a
>> lesson about inferring distances between sequences.
>>
>> Second, for another project, I've been generating distance matrices based on
>> the shared gene content of bacterial genomes and the presence-or-absence of
>> orthologous groups in each. Presently, I ferry the matrices to a clustering
>> program and then visualize the resulting trees in yet another tool. Looking
>> into ways of streamlining this brought me back to Bio.Cluster, Bio.Phylo and
>> the incompatibility of their tree objects.
>>
>> I wonder, what would be the most elegant way of bridging the gap?
>>
>>
>> Best,
>> Andrew
>>

From andrew.sczesnak at med.nyu.edu  Fri Apr 20 18:35:59 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Fri, 20 Apr 2012 18:35:59 -0400
Subject: [Biopython-dev] BGZF support,
	was Re: Biopython 1.60 plans and beyond
In-Reply-To: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>
References: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>
Message-ID: <4F91E4CF.8040602@med.nyu.edu>

Peter,

My colleague was writing some code using MafIndex and commented how long 
it took her to download, decompress and index the human multiz 
alignments from UCSC. It seems like it'd be great to keep the files 
compressed... perhaps if the code works well enough we can convince UCSC 
to host bgzip'd copies (or maybe make them available on one of our 
institutions' servers).

Is I.J. interested in joining the community? I'd like to look into 
adding BGZF to MafIO and wouldn't want to duplicate I.J.'s effort. If 
not, could you put me in touch?


Andrew

On 04/17/2012 11:23 AM, Peter Cock wrote:
> On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock<p.j.a.cock at googlemail.com>  wrote:
>>
>> Here are some things that I think are strong
>> candidates for 1.60 (not an exclusive list!)
>>
>> ...
>>
>> BGZF support: Low level module like Python's gzip,
>> support in SeqIO for indexing BGZF compressed files,
>> ...
>
> I've just rebased my bgzf branch, which I think is ready to apply to the
> trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
> https://github.com/peterjc/biopython/tree/bgzf2
>
> Would anyone like to review this please? There are unittests and
> plenty of docstrings - but so far nothing in the Tutorial though.
>
> I wrote a blog post late last year explaining what this allows, and
> this branch includes the changes to Bio.SeqIO to index BGZF
> compressed sequence files discussed there:
> http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
>
> The probable next step after this is combining it with Andrew Sczesnak's
> work on indexing MAF files (they can get pretty big) as explored by 'I.J.'
> (who as far as I know hasn't signed up to the biopython-dev list, BCC'd).
>
> Also it would be interesting to explore doing the (de)compression of
> blocks on worker threads to take advantage of multiple cores.
>
> Another idea would be to switch from a plain dictionary to an
> ordered dictionary for holding cached decompressed blocks,
> giving a way to drop the oldest block (although not perhaps as
> clever as dropping the least recently used block, the overhead is
> lower). That would require including our own OrderedDict class
> on the older Python platforms.
>
> Peter
>
> [*] PyPy testing is complicated by running out of file handles,
> an existing issue not something directly from this work. Part
> of this is down to different GC under PyPy.
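
The ordered-dictionary block cache suggested in the quoted message
could be sketched roughly as follows. This is not code from the bgzf
branch: the class name BlockCache, the max_blocks parameter and the
load callback are all hypothetical, and it uses collections.OrderedDict
from the standard library, which older Python versions lack.

```python
from collections import OrderedDict


class BlockCache(object):
    """Cache of decompressed blocks that drops the oldest entry first.

    Hypothetical sketch; names and default size are assumptions.
    """

    def __init__(self, max_blocks=50):
        self.max_blocks = max_blocks
        self._blocks = OrderedDict()  # block start offset -> data

    def get(self, offset, load):
        """Return the block at offset, calling load(offset) on a miss."""
        try:
            return self._blocks[offset]
        except KeyError:
            pass
        data = load(offset)
        self._blocks[offset] = data
        if len(self._blocks) > self.max_blocks:
            # Oldest inserted block goes first (FIFO, not true LRU).
            self._blocks.popitem(last=False)
        return data
```

Because a cache hit does not reorder the entries, this stays cheaper
than a least-recently-used policy, which is the trade-off the quoted
message describes.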

From arklenna at gmail.com  Fri Apr 20 20:57:21 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Fri, 20 Apr 2012 20:57:21 -0400
Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF
 parser (#33)
In-Reply-To: <CAKVJ-_4hv_MCb+Cmxqt8F0E09OeP+uOAX=Pj5Ajixyr8asKqjg@mail.gmail.com>
References: <biopython/biopython/pull/33@github.com>
	<CAKVJ-_4hv_MCb+Cmxqt8F0E09OeP+uOAX=Pj5Ajixyr8asKqjg@mail.gmail.com>
Message-ID: <CAK610_6FCExTXK6k8p8k35mRvhRpdARyVXNu8pMVs0b4kicXRw@mail.gmail.com>

On Fri, Apr 20, 2012 at 4:39 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> I've had a quick look on GitHub and it isn't obvious to me how to get
> pull request emails CC'd to our dev mailing list... but anyway, Lenna
> has been busy:
>
> Peter
>
> ---------- Forwarded message ----------
> From: Lenna Peterson
> <reply+i-4201999-d8628b2a34f52e923e8471a792110c2edfbe13a8-63959 at reply.github.com>
> Date: Thu, Apr 19, 2012 at 11:35 PM
> Subject: [biopython] Feature: Python implementation of MMCIF parser (#33)
> To: Peter Cock <p.j.a.cock at googlemail.com>
>
>
> I've written a PLY (Python lex-yacc) module that is superimposable
> with the C MMCIF module.
>
> I've also partially rewritten the C MMCIF module to be object-oriented.
>
> ### Changed files ###
>
> * MMCIFlexmodule.c: Now object-oriented (open file in constructor,
> close file in destructor, etc). Docstrings! Added file IO exception.
> * MMCIF2Dict.py: Minor changes for new object oriented API
> * MMCIFParser: Changed all uses of map() to list comprehensions (more
> compatible with 3)
>
> ### New files ###
>
> * MMCIFlex.py: PLY-based module for tokenizing input.
>
> ### What it needs ###
> Addition of PLY dependency to setup.py.
> I'm not quite sure how to handle this, as PLY wouldn't be necessary on
> a platform with C Python. Thoughts? Which non-CPython implementations
> are worth testing?
>
>
> New C module tested on Python 2.6 on Mac OS X and Debian. I hope it
> still works on Windows.
> On my machine, the C module processes a 30,000 line test file in 10-15
> ms; the Python module takes ~150 ms.


I've started testing the PLY lexer on PyPy. NumPyPy now implements
more functions needed by PDB; the only things I found to be missing
are random and linalg. This eliminates Superimposer, FragmentMapper,
and Vector.

I played around with trying to spoof "import numpy" to automatically
import numpypy (code here: https://gist.github.com/2432815) but I
don't think that's wise yet.

My last commit to this branch was a few changes to allow the MMCIF
parser to work on NumPy. PyPy won't run `setup.py test` due to global
numpy failure, but if I install this branch and `pypy test_MMCIF.py`,
it passes.

Anybody with more PyPy and/or package structuring experience have thoughts?

Lenna

From p.j.a.cock at googlemail.com  Sat Apr 21 06:32:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 21 Apr 2012 11:32:33 +0100
Subject: [Biopython-dev] [biopython] Feature: Python implementation of
	MMCIF parser (#33)
In-Reply-To: <CAK610_6FCExTXK6k8p8k35mRvhRpdARyVXNu8pMVs0b4kicXRw@mail.gmail.com>
References: <biopython/biopython/pull/33@github.com>
	<CAKVJ-_4hv_MCb+Cmxqt8F0E09OeP+uOAX=Pj5Ajixyr8asKqjg@mail.gmail.com>
	<CAK610_6FCExTXK6k8p8k35mRvhRpdARyVXNu8pMVs0b4kicXRw@mail.gmail.com>
Message-ID: <CAKVJ-_48hRJ-8+h4=Mo8e-7Sp2TP+XBFqnywBFmJV-V+cPjryA@mail.gmail.com>

On Saturday, April 21, 2012, Lenna Peterson wrote:

>
> > ### What it needs ###
> > Addition of PLY dependency to setup.py.
> > I'm not quite sure how to handle this, as PLY wouldn't be necessary on
> > a platform with C Python. Thoughts? Which non-CPython implementations
> > are worth testing?


Basically Jython (which we've tried to support for a while) and PyPy
(which I would like to officially support in future). Although a pure
python setup can be useful in other settings, e.g. Windows
development without the compilers otherwise needed.

However, neither of those have NumPy (yet), which we need for
the PDB module that would use the MMCIF parser.

>
> > New C module tested on Python 2.6 on Mac OS X and Debian. I hope it
> > still works on Windows.
> > On my machine, the C module processes a 30,000 line test file in 10-15
> > ms; the Python module takes ~150 ms.


That's a factor of ten slower, but still sounds fast enough perhaps
that we don't really need the C code for usability.

>
> I've started testing the PLY lexer on PyPy. NumPyPy now implements
> more functions needed by PDB; the only things I found to be missing
> are random and linalg. This eliminates Superimposer, FragmentMapper,
> and Vector.
>
> I played around with trying to spoof "import numpy" to automatically
> import numpypy (code here: https://gist.github.com/2432815) but I
> don't think that's wise yet.
>
> My last commit to this branch was a few changes to allow the MMCIF
> parser to work on NumPy. PyPy won't run `setup.py test` due to global
> numpy failure, but if I install this branch and `pypy test_MMCIF.py`,
> it passes.
>
> Anybody with more PyPy and/or package structuring experience have thoughts?


I filed a few bugs on missing code in PyPy's NumPy re-implementation
(now called numpypy), good to hear they are getting closer to being
enough for us to run Bio.PDB on it. Thank you for exploring this.

Right now, in your shoes, for MMCIF parsing I would focus on
the parser failures with certain input files - there is an open bug
on RedMine https://redmine.open-bio.org/issues/2626 and the
issue of multiple models (Eric can probably advise here),
https://redmine.open-bio.org/issues/2943

And I must close this bug now your earlier work has been
checked in - https://redmine.open-bio.org/issues/2619

Thanks!

Peter

>

From redmine at redmine.open-bio.org  Sat Apr 21 06:39:15 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 21 Apr 2012 10:39:15 +0000
Subject: [Biopython-dev] [Biopython - Bug #2619] (Closed)
	Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
References: <redmine.issue-2619.20081018153139@redmine.open-bio.org>
Message-ID: <redmine.journal-14814.20120421103915@redmine.open-bio.org>


Issue #2619 has been updated by Peter Cock.

Status changed from New to Closed
% Done changed from 0 to 100

Fixed with Lenna's work - see this commit and its parents:
https://github.com/biopython/biopython/commit/e5ebb85d0614a34e59e7c2118a366512dc4d1320
----------------------------------------
Bug #2619: Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
https://redmine.open-bio.org/issues/2619

Author: Chris Oldfield
Status: Closed
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.48
URL: 


MMCIFParser is a documented feature of Bio.PDB, but it is broken by default because the MMCIFlex build is commented out in the distribution setup.py.  According to  

http://osdir.com/ml/python.bio.devel/2006-02/msg00038.html

this is because it doesn't compile on Windows.  Though the function is documented, the changes needed to enable it are not, so this seems like an installation bug to me.

The fix on linux is to uncomment setup.py lines 486 on.  A general work around might be to condition the compile on the os.sys.platform variable. I'd offer a diff, but I'm new to biopython and python in general, so please forgive my ignorance.
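
A platform-conditional build of the kind suggested above might look
roughly like this. It is only a sketch: the helper name
mmcif_extension_specs and the source file paths are assumptions, not
the actual contents of Biopython's setup.py.

```python
import sys


def mmcif_extension_specs(platform=sys.platform):
    """Return (module name, C sources) pairs to build on this platform.

    Hypothetical helper; the module name and paths are assumed.
    """
    if platform.startswith("win"):
        # Skip the flex-based lexer where it is known not to compile.
        return []
    return [
        ("Bio.PDB.mmCIF.MMCIFlex",
         ["Bio/PDB/mmCIF/lex.yy.c", "Bio/PDB/mmCIF/MMCIFlexmodule.c"]),
    ]


print(mmcif_extension_specs())
```

Each returned pair would then be turned into an extension entry in
setup.py, so Windows builds simply omit the module.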

Source install of version 1.48, gentoo linux 2008, x86_64.


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From redmine at redmine.open-bio.org  Sat Apr 21 14:05:01 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 21 Apr 2012 18:05:01 +0000
Subject: [Biopython-dev] [Biopython - Bug #2626] Bio.PDB mmCIFParser parse
	exceptions
References: <redmine.issue-2626.20081023230309@redmine.open-bio.org>
Message-ID: <redmine.journal-14816.20120421180501@redmine.open-bio.org>


Issue #2626 has been updated by Lenna Peterson.

File mmCifParseCheck.py added

I've attempted to rescue this code from overzealous "text formatting".

Attached version appeared to work on one test file; haven't tested the example broken files yet. 
----------------------------------------
Bug #2626: Bio.PDB mmCIFParser parse exceptions
https://redmine.open-bio.org/issues/2626

Author: Chris Oldfield
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Other
Target version: 1.48
URL: 


I recently ran the mmCIFParser object over all of PDB's mmCIF files and found a large number of files failed to parse correctly (a short script at the end to demonstrate).  Of ~50k mmCIF files, 3891 files failed to parse and another 1980 were missing fields in the mmCIF dictionary.  

A few examples of files that failed to parse: 
http://www.rcsb.org/pdb/files/1alw.cif.gz
http://www.rcsb.org/pdb/files/1det.cif.gz
http://www.rcsb.org/pdb/files/1tmy.cif.gz

A few with missing fields:
http://www.rcsb.org/pdb/files/1mfl.cif.gz
http://www.rcsb.org/pdb/files/1tfj.cif.gz
http://www.rcsb.org/pdb/files/1zn8.cif.gz

The problem seems to be that an error in one mmCIF table, like an extra field, propagates through the rest of the parse.

x86_64 gentoo linux 2008, src BioPython install

__CODE__
import sys
from Bio.PDB import MMCIFParser
from Bio.PDB.MMCIF2Dict import MMCIF2Dict

if len(sys.argv) != 2:
    print("usage: mmCifParseCheck.py <structFile>")
    sys.exit(1)
structFile = sys.argv[1]

resultString = ""

# parse to structure object and count non-hetero residues
numRes = 0
parser = MMCIFParser()
try:
    structure = parser.get_structure('test', structFile)
    for model in structure:
        for chain in model:
            for residue in chain:
                # hetero residues carry an id flag starting with "H_"
                if residue.id[0][:2] != "H_":
                    numRes += 1
except Exception:
    resultString += "parse to structure object failed\n"
else:
    resultString += "parse to structure object succeeded\n"

# parse whole mmCIF file to dict
mmcif_dict = {}
try:
    mmcif_dict = MMCIF2Dict(structFile)
except Exception:
    resultString += "parse to dict failed\n"
else:
    resultString += "parse to dict succeeded\n"

# get a required entry (fails if the dict parse above failed)
try:
    entry_id = mmcif_dict['_entry.id']
except Exception:
    resultString += "key lookup failed\n"
else:
    resultString += "key lookup succeeded\n"

print(resultString)
print("number of non-het residues " + str(numRes))


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From redmine at redmine.open-bio.org  Sat Apr 21 14:16:07 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 21 Apr 2012 18:16:07 +0000
Subject: [Biopython-dev] [Biopython - Bug #2950] Bio.PDBIO.save writes MODEL
	records without model id
References: <redmine.issue-2950.20091117003545@redmine.open-bio.org>
Message-ID: <redmine.journal-14817.20120421181607@redmine.open-bio.org>


Issue #2950 has been updated by Lenna Peterson.


Did this commit close this bug?  https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
----------------------------------------
Bug #2950: Bio.PDBIO.save writes MODEL records without model id
https://redmine.open-bio.org/issues/2950

Author: Barry Finzel
Status: In Progress
Priority: Normal
Assignee: Konstantin Okonechnikov
Category: Main Distribution
Target version: Not Applicable
URL: 


The MODEL record format for PDB files has an integer model identifier
(e.g., "MODEL        1") not currently written to output.
Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored.


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From arklenna at gmail.com  Sun Apr 22 02:48:10 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Sun, 22 Apr 2012 02:48:10 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
	(closes 2943)
Message-ID: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>

I've implemented the parser changes (written by Paul Bathen; see bug
report) to allow the MMCIF parser to handle multiple models.

Models are now accessed by a string key of their model number, rather
than an arbitrary index (structure['1'] versus structure[0]).

I updated the MMCIF unit test for the new model access method and
added a test file with multiple models.

I'm not sure if there is documentation to be updated re: accessing the models.

issue: https://redmine.open-bio.org/issues/2943
pull request: https://github.com/biopython/biopython/pull/34

- Lenna

From MatatTHC at gmx.de  Sun Apr 22 06:06:28 2012
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Sun, 22 Apr 2012 12:06:28 +0200
Subject: [Biopython-dev] SeqIO circular
In-Reply-To: <CALNFT0jTxFSbqn+f3hS-KZ2Z09xsgoKPFSow1BO3PdDGrJ7hag@mail.gmail.com>
References: <CALNFT0jq=VTwSDv-4x7ZrHoQRLajCUHY8NGPMw9cDuGnwwNiuw@mail.gmail.com>
	<CAKVJ-_7MpLRCModFfMdRPcVDjk42nVCJ--OwNBnAJv3wNcns_A@mail.gmail.com>
	<CALNFT0jTxFSbqn+f3hS-KZ2Z09xsgoKPFSow1BO3PdDGrJ7hag@mail.gmail.com>
Message-ID: <CALNFT0hrc+T-0xWesCuK0E5X8=mcDCqXoRRJJ4ms2qAibWXhTg@mail.gmail.com>

Hi,

since this bug seems to be of low priority I decided to try my best to
help a bit and search the web a bit.
It seems that the property is stored in PrimarySeq or Seq  in bioperl.
See for instance:

http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/Seq.pm
http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/PrimarySeq.pm

Or also:
http://bugzilla.open-bio.org/show_bug.cgi?id=2578

This seems to be realised as boolean variable or function.

Regards,
Matthias

2012/4/4 Matthias Bernt <MatatTHC at gmx.de>:
> Hi,
>
> are there any news on this? May I help somehow? But I have to admit
> that I barely speak perl and have no experience with bioperl. If
> someone tells me where to look I might still try it.
>
> Matthias
>
> 2012/3/29 Peter Cock <p.j.a.cock at googlemail.com>:
>> On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt <MatatTHC at gmx.de> wrote:
>>> Hi,
>>>
>>> Is it possible to get the property if a genome is circular / linear
>>> from SeqIO applied to genbank files? I could not find it.
>>>
>>> There is also a related bugreport:
>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2578
>>>
>>> I used the old parser before and switched to SeqIO which I really like
>>> for the possibilities to parse different formats... but I really need
>>> the information.
>>
>> Does anyone happen to have a BioPerl + BioSQL setup installed
>> and working? IIRC checking that to make sure however we
>> store the circular was compatible was the only real hurdle.
>>
>> Peter

From redmine at redmine.open-bio.org  Sun Apr 22 14:46:35 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 22 Apr 2012 18:46:35 +0000
Subject: [Biopython-dev] [Biopython - Bug #2950] Bio.PDBIO.save writes MODEL
	records without model id
References: <redmine.issue-2950.20091117003545@redmine.open-bio.org>
Message-ID: <redmine.journal-14818.20120422184635@redmine.open-bio.org>


Issue #2950 has been updated by Eric Talevich.

Assignee deleted (Konstantin Okonechnikov)

Yes it did, thanks. I'll close this bug now.
----------------------------------------
Bug #2950: Bio.PDBIO.save writes MODEL records without model id
https://redmine.open-bio.org/issues/2950

Author: Barry Finzel
Status: In Progress
Priority: Normal
Assignee: 
Category: Main Distribution
Target version: Not Applicable
URL: 


The MODEL record format for PDB files has an integer model identifier
(e.g., "MODEL        1") not currently written to output.
Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored.


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From redmine at redmine.open-bio.org  Sun Apr 22 14:48:39 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 22 Apr 2012 18:48:39 +0000
Subject: [Biopython-dev] [Biopython - Bug #2951] (Closed) PDBParser assigns
	model 0 to first model no matter what...
References: <redmine.issue-2951.20091117011131@redmine.open-bio.org>
Message-ID: <redmine.journal-14819.20120422184839@redmine.open-bio.org>


Issue #2951 has been updated by Eric Talevich.

Status changed from New to Closed
% Done changed from 0 to 100

Closed with this commit, as pointed out just now by Lenna Peterson:
https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9

----------------------------------------
Bug #2951: PDBParser assigns model 0 to first model no matter what...
https://redmine.open-bio.org/issues/2951

Author: TallPaul empty
Status: Closed
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.52
URL: 


I'm not sure if this is a bug or a feature, but PDBParser assigns the first model it sees as model 0 then increments that. This means someone thinking they are studying model X is actually studying X+1, and that assumes that authors always use sequential model numbers without skips. If authors CAN skip model numbers, i.e., MODEL 2, then MODEL 4, then MODEL 5... then in biopython these would be models 0, 1, and 2 in the structure... yuck. 

If this needs to be maintained for posterity, I would suggest adding another field to capture the TRUE model number if it exists.

See lines 106 and 122 here:
http://github.com/biopython/biopython/blob/master/Bio/PDB/PDBParser.py#106

Paul


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From redmine at redmine.open-bio.org  Sun Apr 22 14:49:43 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 22 Apr 2012 18:49:43 +0000
Subject: [Biopython-dev] [Biopython - Bug #2950] (Closed) Bio.PDBIO.save
	writes MODEL records without model id
References: <redmine.issue-2950.20091117003545@redmine.open-bio.org>
Message-ID: <redmine.journal-14820.20120422184943@redmine.open-bio.org>


Issue #2950 has been updated by Eric Talevich.

Status changed from In Progress to Closed
% Done changed from 20 to 100

Closed the blocker, too. Thanks again to Konstantin.
----------------------------------------
Bug #2950: Bio.PDBIO.save writes MODEL records without model id
https://redmine.open-bio.org/issues/2950

Author: Barry Finzel
Status: Closed
Priority: Normal
Assignee: 
Category: Main Distribution
Target version: Not Applicable
URL: 


The MODEL record format for PDB files has an integer model identifier
(e.g., "MODEL        1") not currently written to output.
Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored.


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From arklenna at gmail.com  Mon Apr 23 01:35:23 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Mon, 23 Apr 2012 01:35:23 -0400
Subject: [Biopython-dev] pull request: Bio.SCOP.Raf chem dict updater
Message-ID: <CAK610_4RDX1DFf0YQ4pXW5EfUwaXG8vKUj37kco+58qn9=017w@mail.gmail.com>

I've adapted Hongbo Zhu's code to extract the three to one letter
codes directly from the PDB Chemical Component dictionary.

Existing calls of `from Raf import to_one_letter_code` should work as expected.
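
For readers unfamiliar with the idea, extracting such a mapping can be
sketched as below. This is not the code in the pull request: the
function name three_to_one is hypothetical, and it assumes each
component block lists _chem_comp.three_letter_code and
_chem_comp.one_letter_code as simple "key value" lines.

```python
def three_to_one(lines):
    """Build a three-letter -> one-letter code mapping.

    Hypothetical sketch; assumes simple "key value" lines in each
    data_ block of the component dictionary.
    """
    mapping = {}
    three = one = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("data_"):
            # A new component starts; forget the previous one.
            three = one = None
        elif len(fields) >= 2 and fields[0] == "_chem_comp.three_letter_code":
            three = fields[1]
        elif len(fields) >= 2 and fields[0] == "_chem_comp.one_letter_code":
            one = fields[1]
        # "?" marks an unknown value in mmCIF, so skip those entries.
        if three and one and three != "?" and one != "?":
            mapping[three] = one
    return mapping


sample = """data_ALA
_chem_comp.id ALA
_chem_comp.one_letter_code A
_chem_comp.three_letter_code ALA
data_HOH
_chem_comp.one_letter_code ?
_chem_comp.three_letter_code HOH
""".splitlines()
print(three_to_one(sample))
```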

pull request: https://github.com/biopython/biopython/pull/35
issue: https://redmine.open-bio.org/issues/3169

Lenna

From redmine at redmine.open-bio.org  Mon Apr 23 13:00:15 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 23 Apr 2012 17:00:15 +0000
Subject: [Biopython-dev] [Biopython - Bug #2943] (Closed) MMCIFParser only
	handling a single model.
References: <redmine.issue-2943.20091103125836@redmine.open-bio.org>
Message-ID: <redmine.journal-14823.20120423170015@redmine.open-bio.org>


Issue #2943 has been updated by Peter Cock.

Status changed from New to Closed
% Done changed from 0 to 100

This should be working on the trunk now ready for Biopython 1.60 - thanks Lenna. See this commit and those preceding it:
https://github.com/biopython/biopython/commit/2ac67cd14682a4bbad9e09654485914f9495138d

If we've missed anything please reopen this bug. Thanks Paul!
----------------------------------------
Bug #2943: MMCIFParser only handling a single model.
https://redmine.open-bio.org/issues/2943

Author: TallPaul empty
Status: Closed
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.52
URL: 


MMCIFParser as-written only handles a single model in a protein. Any protein that has multiple models with repeating chains and residues will get an exception since the residue ID will already exist. Please make the following changes in MMCIFParser.py:

Change the __doc__ setting:
#Optional __DOC__ change if the new MMCIFlex is not used nor the changes
#to MMCIF2Dict based on the new MMCIFlex.
#Mod by Paul T. Bathen to reflect MMCIFlex built solely in Python
__doc__="mmCIF parser (implemented solely in Python, no lex/flex/C code needed)" 

Regardless of the DOC changes:
Insert the following model_list line 
        occupancy_list=mmcif_dict["_atom_site.occupancy"]
        fieldname_list=mmcif_dict["_atom_site.group_PDB"]
        #Added by Paul T. Bathen Nov 2009
        model_list=mmcif_dict["_atom_site.pdbx_PDB_model_num"]
        try:
 
Make the following changes:
        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #current_model_id=0
        structure_builder=self._structure_builder
        structure_builder.init_structure(structure_id)
        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #structure_builder.init_model(current_model_id)
        structure_builder.init_seg(" ")
        #Added by Paul T. Bathen Nov 2009
        current_model_id = -1

Make the following changes in the for loop:
            #Note by Paul T. Bathen: should MMCIFParser include 
            #the HOH and WAT stmts in PDBParser immediately below?
            #if fieldname=="HETATM":
            #    if resname=="HOH" or resname=="WAT":
            #        hetero_flag="W"
            #    else:
            #        hetero_flag="H"

            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "
 
            #Added by Paul T. Bathen Nov 2009
            model_id = model_list[i]
            if current_model_id != model_id:
                current_model_id = model_id
                structure_builder.init_model(current_model_id)
            #end of addition

After these changes took place, and with the new MMCIFlex and MMCIF2Dict in place, I was able to parse and test 2beg.cif and pdb2beg.ent and both parsed with the same number of models, chains, and residues. 

Paul


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From p.j.a.cock at googlemail.com  Mon Apr 23 13:02:01 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 23 Apr 2012 18:02:01 +0100
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
Message-ID: <CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>

On Sun, Apr 22, 2012 at 7:48 AM, Lenna Peterson <arklenna at gmail.com> wrote:
> I've implemented the parser changes (written by Paul Bathen; see bug
> report) to allow the MMCIF parser to handle multiple models.
>
> Models are now accessed by a string key of their model number, rather
> than an arbitrary index (structure['1'] versus structure[0]).
>
> I updated the MMCIF unit test for the new model access method and
> added a test file with multiple models.
>
> I'm not sure if there is documentation to be updated re: accessing the models.
>
> issue: https://redmine.open-bio.org/issues/2943
> pull request: https://github.com/biopython/biopython/pull/34

I've applied that to the trunk, thank you, but on reading this, why are the
model keys strings and not integers? Does MMCIF allow odd keys or
something?

Peter

From eric.talevich at gmail.com  Mon Apr 23 16:10:27 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 23 Apr 2012 16:10:27 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
Message-ID: <CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>

On Mon, Apr 23, 2012 at 1:02 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Sun, Apr 22, 2012 at 7:48 AM, Lenna Peterson <arklenna at gmail.com> wrote:
>> I've implemented the parser changes (written by Paul Bathen; see bug
>> report) to allow the MMCIF parser to handle multiple models.
>>
>> Models are now accessed by a string key of their model number, rather
>> than an arbitrary index (structure['1'] versus structure[0]).
>>
>> I updated the MMCIF unit test for the new model access method and
>> added a test file with multiple models.
>>
>> I'm not sure if there is documentation to be updated re: accessing the models.
>>
>> issue: https://redmine.open-bio.org/issues/2943
>> pull request: https://github.com/biopython/biopython/pull/34
>
> I've applied that to the trunk, thank you, but on reading this, why are the
> model keys strings and not integers? Does MMCIF allow odd keys or
> something?
>

Ack, I didn't look at that closely enough. Check out this patch to see
the current situation:
https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9

The models associated with a structure are numbered with a sequential
integer id, starting from 0. It's always been like that in our PDB
parser and we haven't changed it. To ensure that model numbers
specified in the PDB file are preserved when writing the PDB back to
file, the above patch introduced a new attribute on the Model object
called serial_num (also an integer, equal to model.id unless specified
otherwise). That attribute is only used when writing a new PDB file;
Model.__getitem__ still uses Model.id as before.

Perhaps that's surprising now that we read the serial numbers, but it
kept backward compatibility. Plus, it preserves list-like behavior
(item access via integers), even though the models are actually stored
in a dict.

So!

In the mmCIF parser, the calls to structure_builder.init_model should
be given two arguments instead of one: an integer id counting from 0,
and then another integer (probably) containing the model "serial
number" specified in the mmCIF file. In the event that an mmCIF file
doesn't specify the model number, the serial number should be the same
as the sequential id.
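
As a rough sketch of that numbering scheme (not the parser code
itself; the helper name assign_model_ids is hypothetical), the per-atom
model numbers from the mmCIF dictionary could be turned into
(sequential id, serial number) pairs like this:

```python
def assign_model_ids(model_list):
    """Return (sequential id, serial number) pairs, one per model.

    model_list holds the per-atom model numbers as read from the file
    (strings), or None if the file does not specify them.
    Hypothetical helper for illustration only.
    """
    if model_list is None:
        # No model numbers given: serial number equals the id.
        return [(0, 0)]
    models = []
    current_serial = None
    current_id = -1
    for raw in model_list:
        serial = int(raw)
        if serial != current_serial:
            # A new model begins: ids count from 0, serials follow the file.
            current_serial = serial
            current_id += 1
            models.append((current_id, serial))
    return models


print(assign_model_ids(["2", "2", "4", "5", "5"]))
```

Each pair would correspond to one call to structure_builder.init_model
with the two arguments described above.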

Cool? This will also help us convert between PDB and mmCIF formats in
the future.

As for accessing the models by their serial number, using string keys
seems like an effective workaround, but still obviously a workaround
rather than an ideal situation. Let's discuss that a little more,
perhaps file another bug when we've reached some consensus.

Best,
Eric

From eric.talevich at gmail.com  Mon Apr 23 16:32:11 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 23 Apr 2012 16:32:11 -0400
Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF
 parser (#33)
In-Reply-To: <CAK610_6FCExTXK6k8p8k35mRvhRpdARyVXNu8pMVs0b4kicXRw@mail.gmail.com>
References: <biopython/biopython/pull/33@github.com>
	<CAKVJ-_4hv_MCb+Cmxqt8F0E09OeP+uOAX=Pj5Ajixyr8asKqjg@mail.gmail.com>
	<CAK610_6FCExTXK6k8p8k35mRvhRpdARyVXNu8pMVs0b4kicXRw@mail.gmail.com>
Message-ID: <CAMC681mRNgjVNMSWBANMh8Ztjj=cS-jBVAL6ntjXVJYZZekk3w@mail.gmail.com>

On Fri, Apr 20, 2012 at 8:57 PM, Lenna Peterson <arklenna at gmail.com> wrote:
>
> I've started testing the PLY lexer on PyPy. NumPyPy now implements
> more functions needed by PDB; the only things I found to be missing
> are random and linalg. This eliminates Superimposer, FragmentMapper,
> and Vector.
>
> I played around with trying to spoof "import numpy" to automatically
> import numpypy (code here: https://gist.github.com/2432815) but I
> don't think that's wise yet.
>
> My last commit to this branch was a few changes to allow the MMCIF
> parser to work on NumPy. PyPy won't run `setup.py test` due to global
> numpy failure, but if I install this branch and `pypy test_MMCIF.py`,
> it passes.
>
> Anybody with more PyPy and/or package structuring experience have thoughts?
>
> Lenna

Would it be more or less error-prone to simply replace every numpy
import with this (after testing each module on PyPy):

try:
    import numpy
except ImportError:
    import numpypy as numpy

Or similarly, use this as one of our compatibility utilities:

from Bio import numpy
# Some conditional junk in Bio/__init__.py or setup.py to reveal this
# module to PyPy and CPython as needed


In either case, here's the relatively short list of modules that would
need to be modified:

Bio/Affy/CelFile.py
Bio/Cluster/__init__.py
Bio/KDTree/KDTree.py
Bio/LogisticRegression.py
Bio/MarkovModel.py
Bio/MaxEntropy.py
Bio/NaiveBayes.py
Bio/PDB/Atom.py
Bio/PDB/FragmentMapper.py
Bio/PDB/MMCIFParser.py
Bio/PDB/NeighborSearch.py
Bio/PDB/PDBParser.py
Bio/PDB/ResidueDepth.py
Bio/PDB/Superimposer.py
Bio/PDB/Vector.py
Bio/SVDSuperimposer/SVDSuperimposer.py
Bio/Statistics/lowess.py
Bio/SubsMat/__init__.py
Bio/kNN.py
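
A minimal sketch of the fallback import as a reusable helper, assuming it
would live in some shared compatibility module. The name
import_first_available is hypothetical, and nothing here has been tested
on PyPy.

```python
import importlib


def import_first_available(names):
    """Return the first module in `names` that imports cleanly."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Only a failed import moves us on to the next candidate;
            # any other error is left to propagate.
            continue
    raise ImportError("none of these modules could be imported: %s"
                      % ", ".join(names))


# Intended use (commented out because numpypy only exists on PyPy):
# numpy = import_first_available(["numpy", "numpypy"])
```

Catching ImportError rather than using a bare except keeps unrelated
failures inside a module from being silently swallowed.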

From p.j.a.cock at googlemail.com  Mon Apr 23 16:47:02 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 23 Apr 2012 21:47:02 +0100
Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF
 parser (#33)
In-Reply-To: <CAMC681mRNgjVNMSWBANMh8Ztjj=cS-jBVAL6ntjXVJYZZekk3w@mail.gmail.com>
References: <biopython/biopython/pull/33@github.com>
	<CAKVJ-_4hv_MCb+Cmxqt8F0E09OeP+uOAX=Pj5Ajixyr8asKqjg@mail.gmail.com>
	<CAK610_6FCExTXK6k8p8k35mRvhRpdARyVXNu8pMVs0b4kicXRw@mail.gmail.com>
	<CAMC681mRNgjVNMSWBANMh8Ztjj=cS-jBVAL6ntjXVJYZZekk3w@mail.gmail.com>
Message-ID: <CAKVJ-_4Vh306dDL-9uTag=RO238SeMs3B_CZwWxkA7oX=9rHkw@mail.gmail.com>

On Mon, Apr 23, 2012 at 9:32 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> Would it be more or less error-prone to simply replace every numpy
> import with this (after testing each module on PyPy):
>
> try:
>     import numpy
> except:
>     import numpypy as numpy
>

Maybe, but right now do any of our NumPy using modules
pass under PyPy? I don't believe so... but I haven't tried
a PyPy nightly build lately.

It was unfortunate that originally PyPy's micronumpy
pretended to be numpy, so that you'd write "import numpy"
and think it worked but be surprised later when something
fundamental, like the dot function or 2D arrays, was missing.
That led to a few nasty try/import lines in our unit tests.

Let's wait and see how PyPy's numpy support improves
before rushing to change any of our numpy imports. I am
hopeful that Bio.PDB will be fine in their next release,
whereas things using the NumPy C API will probably not be.

Peter


From arklenna at gmail.com  Mon Apr 23 19:05:03 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Mon, 23 Apr 2012 19:05:03 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
Message-ID: <CAK610_69XxiUEaTcLaK_RqrLCG5CmXMy-ytgjSdgbZy6c-VHhw@mail.gmail.com>

On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> Ack, I didn't look at that closely enough. Check out this patch to see
> the current situation:
> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
>
> The models associated with a structure are numbered with a sequential
> integer id, starting from 0. It's always been like that in our PDB
> parser and we haven't changed it. To ensure that model numbers
> specified in the PDB file are preserved when writing the PDB back to
> file, the above patch introduced a new attribute on the Model object
> called serial_num (also an integer, equal to model.id unless specified
> otherwise). That attribute is only used when writing a new PDB file;
> Model.__getitem__ still uses Model.id as before.
>
> Perhaps that's surprising now that we read the serial numbers, but it
> kept backward compatibility. Plus, it preserves list-like behavior
> (item access via integers), even though the models are actually stored
> in a dict.
>
> So!
>
> In the mmCIF parser, the calls to structure_builder.init_model should
> be given two arguments instead of one: an integer id counting from 0,
> and then another integer (probably) containing the model "serial
> number" specified in the mmCIF file. In the event that an mmCIF file
> doesn't specify the model number, the serial number should be the same
> as the sequential id.
>
> Cool? This will also help us convert between PDB and mmCIF formats in
> the future.


Got it. I'm working on implementing the serial_number/model_number
dichotomy for MMCIF.


> As for accessing the models by their serial number, using string keys
> seems like an effective workaround, but still obviously a workaround
> rather than an ideal situation. Let's discuss that a little more,
> perhaps file another bug when we've reached some consensus.


Er, I made and then lost (still haven't *quite* gotten the hang of git
rebase) a patch that applied int() to the MMCIF model numbers. I'll
add that back so both model and serial numbers are ints.


Lenna

From arklenna at gmail.com  Tue Apr 24 00:25:12 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 24 Apr 2012 00:25:12 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
Message-ID: <CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>

On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> Ack, I didn't look at that closely enough. Check out this patch to see
> the current situation:
> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
>
> The models associated with a structure are numbered with a sequential
> integer id, starting from 0. It's always been like that in our PDB
> parser and we haven't changed it. To ensure that model numbers
> specified in the PDB file are preserved when writing the PDB back to
> file, the above patch introduced a new attribute on the Model object
> called serial_num (also an integer, equal to model.id unless specified
> otherwise). That attribute is only used when writing a new PDB file;
> Model.__getitem__ still uses Model.id as before.
>
> Perhaps that's surprising now that we read the serial numbers, but it
> kept backward compatibility. Plus, it preserves list-like behavior
> (item access via integers), even though the models are actually stored
> in a dict.
>
> So!
>
> In the mmCIF parser, the calls to structure_builder.init_model should
> be given two arguments instead of one: an integer id counting from 0,
> and then another integer (probably) containing the model "serial
> number" specified in the mmCIF file. In the event that an mmCIF file
> doesn't specify the model number, the serial number should be the same
> as the sequential id.
>
> Cool? This will also help us convert between PDB and mmCIF formats in
> the future.
>
> As for accessing the models by their serial number, using string keys
> seems like an effective workaround, but still obviously a workaround
> rather than an ideal situation. Let's discuss that a little more,
> perhaps file another bug when we've reached some consensus.
>
> Best,
> Eric


Hi Eric,

I believe I've implemented the model_id/serial_id system found in PDB:

https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d

Please let me know if you think that looks right. I couldn't find an
mmCIF file without a model column to test, but I believe in that case
it will assign model_id and serial_id to 0. Would that be the correct
behavior?

I also modified the unit test to check the model serial_num.
https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6

Currently serial_num is int() of the CIF model column. Regarding
access by string serial_num, I am concerned that the int/string access
would be too subtle (structure[0] == structure['1']; structure[1] ==
structure['2']?). Perhaps an accessor function? i.e.
structure.get_model('1')

Let me know if you think I should write get_model() or something along
those lines.
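
A minimal sketch of what such an accessor could look like. ToyModel and
ToyStructure are hypothetical stand-ins, and get_model() is only the
proposal from this message, not an existing Bio.PDB method.

```python
class ToyModel(object):
    """Hypothetical stand-in for a model with an id and a serial number."""

    def __init__(self, id, serial_num):
        self.id = id
        self.serial_num = serial_num


class ToyStructure(object):
    """Hypothetical structure holding its models in parse order."""

    def __init__(self, models):
        self.child_list = list(models)

    def __getitem__(self, model_id):
        # Existing behaviour: item access by 0-based sequential id.
        return self.child_list[model_id]

    def get_model(self, serial_num):
        """Look a model up by the serial number given in the file."""
        wanted = int(serial_num)  # accept '1' as well as 1
        for model in self.child_list:
            if model.serial_num == wanted:
                return model
        raise KeyError(serial_num)
```

Keeping the lookup in a named method leaves structure[0] unambiguous,
which is the subtlety with int versus string keys raised above.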

Lenna

From eric.talevich at gmail.com  Tue Apr 24 11:38:50 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 24 Apr 2012 11:38:50 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
	<CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>
Message-ID: <CAMC681n8ZbOy80xiPRaY9zzCw3V3Au-oWiftoS1B36zEfNqoKA@mail.gmail.com>

On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson <arklenna at gmail.com> wrote:
> Hi Eric,
>
> I believe I've implemented the model_id/serial_id system found in PDB:
>
> https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d
>
> Please let me know if you think that looks right. I couldn't find an
> mmCIF file without a model column to test, but I believe in that case
> it will assign model_id and serial_id to 0. Would that be the correct
> behavior?
>
> I also modified the unit test to check the model serial_num.
> https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6
>
> Currently serial_num is int() of the CIF model column. Regarding
> access by string serial_num, I am concerned that the int/string access
> would be too subtle (structure[0] == structure['1']; structure[1] ==
> structure['2']?). Perhaps an accessor function? i.e.
> structure.get_model('1')
>
> Let me know if you think I should write get_model() or something along
> those lines.
>
> Lenna

I left another nitpick on b453a, but besides that it looks exactly right to me.

The string/int distinction would indeed be weird, especially for newer
Python users coming from Perl or Javascript. I don't see a direct
analogue for get_model(serial_num) in the other Entities (Residue,
Chain, Model, Structure), so I'm inclined to put off the decision for
now (i.e. leave it out of this patch set).

-Eric

From p.j.a.cock at googlemail.com  Tue Apr 24 11:58:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 Apr 2012 16:58:10 +0100
Subject: [Biopython-dev] BGZF support,
	was Re: Biopython 1.60 plans and beyond
In-Reply-To: <4F91E4CF.8040602@med.nyu.edu>
References: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>
	<4F91E4CF.8040602@med.nyu.edu>
Message-ID: <CAKVJ-_4k==uN0UYa17-xPV6OMjE-Wm5Yuohf=bzGKB5vwXmKVQ@mail.gmail.com>

On Fri, Apr 20, 2012 at 11:35 PM, Andrew Sczesnak
<andrew.sczesnak at med.nyu.edu> wrote:
> Peter,
>
> My colleague was writing some code using MafIndex and commented how long it
> took her to download, decompress and index the human multiz alignments from
> UCSC. It seems like it'd be great to keep the files compressed... perhaps if
> the code works well enough we can convince UCSC to host bgzip'd copies (or
> maybe make them available on one of our institutions' servers).

That does sound good - it is a perfect example of where BGZF is a more
useful alternative to standard GZIP. Some numbers on how much of a
size penalty it imposes would help though...

> Is I.J. interested in joining the community? I'd like to look into adding
> BGZF to MafIO and wouldn't want to duplicate I.J.'s effort. If not, could
> you put me in touch?

Perhaps he's just busy at the moment (BCC'd again)?

It should be easy enough to follow the BGZF changes to Bio/SeqIO/_index.py
and I'm willing to do this myself for MAF (while going over your index work -
something I want to do anyway). The only potential catch is avoiding offset
arithmetic.
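
A minimal sketch of why offset arithmetic is a catch with BGZF. A BGZF
virtual offset packs the compressed block's start into the upper bits and
the position inside the decompressed block into the lower 16 bits. The two
helpers below are written out here for illustration.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a block start and a within-block position into one integer."""
    if not 0 <= within_block < 2 ** 16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < 2 ** 48:
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Unpack a virtual offset into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# Adding a record length to a virtual offset can overflow the low 16 bits
# and silently change the block start to a position that is not a block.
near_block_end = make_virtual_offset(1000, 65530)
bogus = split_virtual_offset(near_block_end + 10)
```

Here bogus comes out as (1001, 4), and 1001 is not a real block start, so
index code has to store offsets reported by the file handle rather than
compute them.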

Peter

From arklenna at gmail.com  Tue Apr 24 13:56:37 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 24 Apr 2012 13:56:37 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAMC681n8ZbOy80xiPRaY9zzCw3V3Au-oWiftoS1B36zEfNqoKA@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
	<CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>
	<CAMC681n8ZbOy80xiPRaY9zzCw3V3Au-oWiftoS1B36zEfNqoKA@mail.gmail.com>
Message-ID: <CAK610_635VJmQobsLU4ViueGuvsKbFG=_+SDdiDFq_Ogvqz3gg@mail.gmail.com>

On Tue, Apr 24, 2012 at 11:38 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson <arklenna at gmail.com> wrote:
> > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> >>
> >> Ack, I didn't look at that closely enough. Check out this patch to see
> >> the current situation:
> >> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
> >>
> >> The models associated with a structure are numbered with a sequential
> >> integer id, starting from 0. It's always been like that in our PDB
> >> parser and we haven't changed it. To ensure that model numbers
> >> specified in the PDB file are preserved when writing the PDB back to
> >> file, the above patch introduced a new attribute on the Model object
> >> called serial_num (also an integer, equal to model.id unless specified
> >> otherwise). That attribute is only used when writing a new PDB file;
> >> Model.__getitem__ still uses Model.id as before.
> >>
> >> Perhaps that's surprising now that we read the serial numbers, but it
> >> kept backward compatibility. Plus, it preserves list-like behavior
> >> (item access via integers), even though the models are actually stored
> >> in a dict.
> >>
> >> So!
> >>
> >> In the mmCIF parser, the calls to structure_builder.init_model should
> >> be given two arguments instead of one: an integer id counting from 0,
> >> and then another integer (probably) containing the model "serial
> >> number" specified in the mmCIF file. In the event that an mmCIF file
> >> doesn't specify the model number, the serial number should be the same
> >> as the sequential id.
> >>
> >> Cool? This will also help us convert between PDB and mmCIF formats in
> >> the future.
> >>
> >> As for accessing the models by their serial number, using string keys
> >> seems like an effective workaround, but still obviously a workaround
> >> rather than an ideal situation. Let's discuss that a little more,
> >> perhaps file another bug when we've reached some consensus.
> >>
> >> Best,
> >> Eric
> >
> >
> > Hi Eric,
> >
> > I believe I've implemented the model_id/serial_id system found in PDB:
> >
> > https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d
> >
> > Please let me know if you think that looks right. I couldn't find an
> > mmCIF file without a model column to test, but I believe in that case
> > it will assign model_id and serial_id to 0. Would that be the correct
> > behavior?
> >
> > I also modified the unit test to check the model serial_num.
> > https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6
> >
> > Currently serial_num is int() of the CIF model column. Regarding
> > access by string serial_num, I am concerned that the int/string access
> > would be too subtle (structure[0] == structure['1']; structure[1] ==
> > structure['2']?). Perhaps an accessor function? i.e.
> > structure.get_model('1')
> >
> > Let me know if you think I should write get_model() or something along
> > those lines.
> >
> > Lenna
>
> I left another nitpick on b453a, but besides that it looks exactly right to me.
>
> The string/int distinction would indeed be weird, especially for newer
> Python users coming from Perl or Javascript. I don't see a direct
> analogue for get_model(serial_num) in the other Entities (Residue,
> Chain, Model, Structure), so I'm inclined to put off the decision for
> now (i.e. leave it out of this patch set).
>
> -Eric


Eric,

Okay, I've changed the bad model num generic warning to a
PDBConstructionException.
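
A minimal sketch of that kind of check, not the code in the commit.
ToyConstructionError is a hypothetical stand-in for
PDBConstructionException, and parse_model_number is an illustrative name.

```python
class ToyConstructionError(Exception):
    """Hypothetical stand-in for Bio.PDB's PDBConstructionException."""


def parse_model_number(raw):
    """Convert a model column value to int, or fail loudly."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        # A bad model number stops the parse instead of only warning.
        raise ToyConstructionError("Invalid model number: %r" % (raw,))
```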

New pull request to get MMCIF to the same state as PDB:
https://github.com/biopython/biopython/pull/36

So are chains accessed by 0, 1, 2 or by A, B, C?

Lenna

From anaryin at gmail.com  Tue Apr 24 13:59:10 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 24 Apr 2012 19:59:10 +0200
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAK610_635VJmQobsLU4ViueGuvsKbFG=_+SDdiDFq_Ogvqz3gg@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
	<CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>
	<CAMC681n8ZbOy80xiPRaY9zzCw3V3Au-oWiftoS1B36zEfNqoKA@mail.gmail.com>
	<CAK610_635VJmQobsLU4ViueGuvsKbFG=_+SDdiDFq_Ogvqz3gg@mail.gmail.com>
Message-ID: <CAJ9sUYPzMhcfwB2DPQwiSTn=146S=2aThzeqbj1sfEJUNOgG3w@mail.gmail.com>

Hi Lenna,

IMO, chains should be accessed by A, B, C; accessing them numerically
doesn't make sense.

Congrats on the GSOC application and on the good work so far!

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao


On 24 April 2012 19:56, Lenna Peterson <arklenna at gmail.com> wrote:

>
> Eric,
>
> Okay, I've changed the bad model num generic warning to a
> PDBConstructionException.
>
> New pull request to get MMCIF to the same state as PDB:
> https://github.com/biopython/biopython/pull/36
>
> So are chains accessed by 0, 1, 2 or by A, B, C?
>
> Lenna


From eric.talevich at gmail.com  Tue Apr 24 14:20:16 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 24 Apr 2012 14:20:16 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAK610_635VJmQobsLU4ViueGuvsKbFG=_+SDdiDFq_Ogvqz3gg@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
	<CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>
	<CAMC681n8ZbOy80xiPRaY9zzCw3V3Au-oWiftoS1B36zEfNqoKA@mail.gmail.com>
	<CAK610_635VJmQobsLU4ViueGuvsKbFG=_+SDdiDFq_Ogvqz3gg@mail.gmail.com>
Message-ID: <CAMC681nrx7iX7tdSr3F0L+Mm6g5LWw20gSQOj18Qga1dvh2T0w@mail.gmail.com>

On Tue, Apr 24, 2012 at 1:56 PM, Lenna Peterson <arklenna at gmail.com> wrote:
> Eric,
>
> Okay, I've changed the bad model num generic warning to a
> PDBConstructionException.
>
> New pull request to get MMCIF to the same state as PDB:
> https://github.com/biopython/biopython/pull/36
>
> So are chains accessed by 0, 1, 2 or by A, B, C?
>
> Lenna

Cool, I just merged the pull request. Thanks!

As João said, chains are accessed by the letter ID via __getitem__
(implemented in Bio.PDB.Entity). You can get at them either way
through the child_list and child_dict attributes, too. Kind of a
thrill. I suppose we could eventually refactor the Entity-based
classes to use a single data structure (OrderedDict, namedtuple, numpy
array with named columns/rows?) in place of child_dict and child_list,
and clean up some of the redundant accessors.
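
A minimal sketch of the OrderedDict option, assuming one ordered mapping
could back both lookups. ToyEntity is a hypothetical class, not
Bio.PDB.Entity, and the two properties only mimic the existing attribute
names.

```python
from collections import OrderedDict


class ToyEntity(object):
    """Hypothetical container keeping children in one ordered mapping."""

    def __init__(self):
        self._children = OrderedDict()

    def add(self, child_id, child):
        if child_id in self._children:
            raise KeyError("duplicate child id: %r" % (child_id,))
        self._children[child_id] = child

    def __getitem__(self, child_id):
        # Lookup by id, e.g. a chain letter.
        return self._children[child_id]

    def __iter__(self):
        # Iteration follows insertion order, as child_list does today.
        return iter(self._children.values())

    def __len__(self):
        return len(self._children)

    @property
    def child_list(self):
        return list(self._children.values())

    @property
    def child_dict(self):
        return dict(self._children)
```

Both views are derived from the same mapping, so they cannot drift out
of sync.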

-E


From anaryin at gmail.com  Tue Apr 24 14:25:15 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 24 Apr 2012 20:25:15 +0200
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAMC681nrx7iX7tdSr3F0L+Mm6g5LWw20gSQOj18Qga1dvh2T0w@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
	<CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>
	<CAMC681n8ZbOy80xiPRaY9zzCw3V3Au-oWiftoS1B36zEfNqoKA@mail.gmail.com>
	<CAK610_635VJmQobsLU4ViueGuvsKbFG=_+SDdiDFq_Ogvqz3gg@mail.gmail.com>
	<CAMC681nrx7iX7tdSr3F0L+Mm6g5LWw20gSQOj18Qga1dvh2T0w@mail.gmail.com>
Message-ID: <CAJ9sUYPG57JbOq-=ax5aqquVOHagsxQNQVbf=z8UOg5LsuJ0hQ@mail.gmail.com>

I cannot agree more with Eric on this. Child dict and child list should
definitely be refactored into a single structure that is easier to
understand (and use), not least because we should take care of that memory
leak... (try running the parser over a lot of PDBs and you will see memory
usage going up).

Cheers,

João


From p.j.a.cock at googlemail.com  Tue Apr 24 16:07:03 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 Apr 2012 21:07:03 +0100
Subject: [Biopython-dev] BGZF support,
	was Re: Biopython 1.60 plans and beyond
In-Reply-To: <67433BFC-673B-4F49-A582-6F419FD6E0B7@csail.mit.edu>
References: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>
	<4F91E4CF.8040602@med.nyu.edu>
	<CAKVJ-_4k==uN0UYa17-xPV6OMjE-Wm5Yuohf=bzGKB5vwXmKVQ@mail.gmail.com>
	<67433BFC-673B-4F49-A582-6F419FD6E0B7@csail.mit.edu>
Message-ID: <CAKVJ-_6J6LPVggPojU2mutOVob=v6oVv9-Gx=E=R1fQEg5zVkg@mail.gmail.com>

On Tue, Apr 24, 2012 at 7:24 PM, Irwin Jungreis <ILJungr at csail.mit.edu> wrote:
> Hello Andrew and Peter.
>

Hi again Irwin,

> The size penalty of bgz versus gzip for .maf files is quite small. For
> example, compressing the 6-way C. elegans alignment .maf files is 108.9 MB
> with gzip and 112 MB with bgz, a difference of less than 3%. (Each is
> smaller than the uncompressed file by a factor of about 4 or 5.)

That's good - and given the nature of the MAF format in line with
what I was hoping for - see also the overheads I got for FASTA,
SwissProt and UniProt XML here:
http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html

> I am not very familiar with biopython, so I've been using my own utilities.
> To work with alignments I create an index file consisting of a 32-byte
> record for each maf block. Each record contains the block start on the
> reference species chromosome, the block length on the reference species, and
> the virtual offset of the block start in the .maf file. I then have a
> utility that will extract the alignment for a given set of spliced regions,
> e.g., chrX:11568015-11569059+chrX:11569364-11569395 on the '-' strand, and
> output it as a list of pairs (assembly name, base string).
>
> I'd be happy to share, but I have no idea how this would fit into the
> existing biopython infrastructure.
>
> Best,
> Irwin

Ah - I must have misinterpreted your earlier email (off list). I'd
assumed you were using Andrew's Biopython branch which
indexes MAF files using an SQLite database of offsets. But
in practice the principle is the same - BGZF lets you have
good compression of MAF files and random access. Thank
you for clarifying this.
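
For readers new to the "virtual offset" that both indexing schemes store,
a rough sketch follows. In BGZF the upper 48 bits of a virtual offset hold
the byte position of a compressed block in the file, and the lower 16 bits
hold the position inside that block once decompressed. The helper names
below are chosen for illustration and are defined locally rather than
imported from any library.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a compressed block start and an in-block offset into one int."""
    if not 0 <= within_block < 2 ** 16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < 2 ** 48:
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Reverse make_virtual_offset, giving (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# A record starting 10 bytes into the block found at file byte 100000.
voffset = make_virtual_offset(100000, 10)
```

Virtual offsets from the same file can be compared and stored in an index,
but adding or subtracting them does not give a meaningful position.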

If you use Python at all perhaps you'd have some feedback
on Andrew's indexing plans? That would be great - Andrew's
done a great job explaining the proposed code usage here:
http://biopython.org/wiki/Multiple_Alignment_Format

Regards,

Peter


From redmine at redmine.open-bio.org  Tue Apr 24 22:33:04 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 25 Apr 2012 02:33:04 +0000
Subject: [Biopython-dev] [Biopython - Feature #3344] (New) Bio.PDB.Entity
	classes need a __contains__ method
Message-ID: <redmine.issue-3344.20120425023304@redmine.open-bio.org>


Issue #3344 has been reported by Eric Talevich.

----------------------------------------
Feature #3344: Bio.PDB.Entity classes need a __contains__ method
https://redmine.open-bio.org/issues/3344

Author: Eric Talevich
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


The various objects constructed by Bio.PDB have list-like and dict-like behaviors, for the most part. However, not all of the relevant magic methods have been implemented. (E.g. `residue["CA"]` works, but `"CA" in residue` does not.)

We could do more to support the list-like and dict-like behaviors, but let's start with __contains__.
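
A minimal sketch of the idea, assuming the class keeps its children in a
dict keyed by ID (as child_dict does). SketchEntity is a hypothetical
stand-in written for this note, not the real Bio.PDB.Entity, and the atom
values are placeholder strings.

```python
class SketchEntity(object):
    """Toy stand-in for an Entity-like container keyed by child ID."""

    def __init__(self, children):
        # children: mapping of ID -> child object
        self.child_dict = dict(children)

    def __getitem__(self, id):
        return self.child_dict[id]

    def __contains__(self, id):
        # Delegate membership tests to the underlying dict, so that
        # `id in entity` agrees with `entity[id]`.
        return id in self.child_dict


residue = SketchEntity({"N": "atom N", "CA": "atom CA", "C": "atom C"})
has_ca = "CA" in residue
has_cb = "CB" in residue
```

With this in place, `"CA" in residue` is true exactly when `residue["CA"]`
would succeed.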


----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From redmine at redmine.open-bio.org  Wed Apr 25 23:36:04 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 26 Apr 2012 03:36:04 +0000
Subject: [Biopython-dev] [Biopython - Bug #3169] (Closed) to_one_letter_code
	in Bio.SCOP.Raf is old
References: <redmine.issue-3169.20110112094643@redmine.open-bio.org>
Message-ID: <redmine.journal-14828.20120426033604@redmine.open-bio.org>


Issue #3169 has been updated by Eric Talevich.

Status changed from New to Closed
% Done changed from 0 to 100

We've committed this fix now:
https://github.com/biopython/biopython/pull/35
----------------------------------------
Bug #3169: to_one_letter_code in Bio.SCOP.Raf is old
https://redmine.open-bio.org/issues/3169

Author: Hongbo Zhu
Status: Closed
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.56
URL: 


Hi, 

The dictionary to_one_letter_code in Bio.SCOP.Raf is a bit old now. The current dictionary is based on a table taken from the RAF release notes of ASTRAL. This is an old table and some new three-letter codes in the PDB are not found in it (e.g. M3L in 2X4W). ASTRAL does not use the table since v1.73. Rather, PDB Chemical Component Dictionary is used. See http://astral.berkeley.edu/seq.cgi?get=raf-edit-comments;ver=1.75

"Beginning with ASTRAL 1.73, the PDB's chemical dictionary is used to translate chemically modified residues, instead of the translation table from ASTRAL 1.55."

The PDB Chemical Component Dictionary can be obtained from: http://deposit.pdb.org/cc_dict_tut.html .

I have parsed the dictionary and there are 12054 three-letter codes (as of Jan 2011). Among them, most correspond to a one-letter code '?'. Still, there are 1245 three-letter codes corresponding to a one-letter code other than '?' (the list is attached at the end). Therefore, I suggest updating the to_one_letter_code dictionary in Bio.SCOP.Raf.

Best regards,
hongbo zhu

to_one_letter_code = {
    '00C':'C','01W':'X','0A0':'D','0A1':'Y','0A2':'K',
    '0A8':'C','0AA':'V','0AB':'V','0AC':'G','0AD':'G',
    '0AF':'W','0AG':'L','0AH':'S','0AK':'D','0AM':'A',
    '0AP':'C','0AU':'U','0AV':'A','0AZ':'P','0BN':'F',
    '0C ':'C','0CS':'A','0DC':'C','0DG':'G','0DT':'T',
    '0G ':'G','0NC':'A','0SP':'A','0U ':'U','0YG':'YG',
    '10C':'C','125':'U','126':'U','127':'U','128':'N',
    '12A':'A','143':'C','175':'ASG','193':'X','1AP':'A',
    '1MA':'A','1MG':'G','1PA':'F','1PI':'A','1PR':'N',
    '1SC':'C','1TQ':'W','1TY':'Y','200':'F','23F':'F',
    '23S':'X','26B':'T','2AD':'X','2AG':'G','2AO':'X',
    '2AR':'A','2AS':'X','2AT':'T','2AU':'U','2BD':'I',
    '2BT':'T','2BU':'A','2CO':'C','2DA':'A','2DF':'N',
    '2DM':'N','2DO':'X','2DT':'T','2EG':'G','2FE':'N',
    '2FI':'N','2FM':'M','2GT':'T','2HF':'H','2LU':'L',
    '2MA':'A','2MG':'G','2ML':'L','2MR':'R','2MT':'P',
    '2MU':'U','2NT':'T','2OM':'U','2OT':'T','2PI':'X',
    '2PR':'G','2SA':'N','2SI':'X','2ST':'T','2TL':'T',
    '2TY':'Y','2VA':'V','32S':'X','32T':'X','3AH':'H',
    '3AR':'X','3CF':'F','3DA':'A','3DR':'N','3GA':'A',
    '3MD':'D','3ME':'U','3NF':'Y','3TY':'X','3XH':'G',
    '4AC':'N','4BF':'Y','4CF':'F','4CY':'M','4DP':'W',
    '4F3':'GYG','4FB':'P','4FW':'W','4HT':'W','4IN':'X',
    '4MF':'N','4MM':'X','4OC':'C','4PC':'C','4PD':'C',
    '4PE':'C','4PH':'F','4SC':'C','4SU':'U','4TA':'N',
    '5AA':'A','5AT':'T','5BU':'U','5CG':'G','5CM':'C',
    '5CS':'C','5FA':'A','5FC':'C','5FU':'U','5HP':'E',
    '5HT':'T','5HU':'U','5IC':'C','5IT':'T','5IU':'U',
    '5MC':'C','5MD':'N','5MU':'U','5NC':'C','5PC':'C',
    '5PY':'T','5SE':'U','5ZA':'TWG','64T':'T','6CL':'K',
    '6CT':'T','6CW':'W','6HA':'A','6HC':'C','6HG':'G',
    '6HN':'K','6HT':'T','6IA':'A','6MA':'A','6MC':'A',
    '6MI':'N','6MT':'A','6MZ':'N','6OG':'G','70U':'U',
    '7DA':'A','7GU':'G','7JA':'I','7MG':'G','8AN':'A',
    '8FG':'G','8MG':'G','8OG':'G','9NE':'E','9NF':'F',
    '9NR':'R','9NV':'V','A  ':'A','A1P':'N','A23':'A',
    'A2L':'A','A2M':'A','A34':'A','A35':'A','A38':'A',
    'A39':'A','A3A':'A','A3P':'A','A40':'A','A43':'A',
    'A44':'A','A47':'A','A5L':'A','A5M':'C','A5O':'A',
    'A66':'X','AA3':'A','AA4':'A','AAR':'R','AB7':'X',
    'ABA':'A','ABR':'A','ABS':'A','ABT':'N','ACB':'D',
    'ACL':'R','AD2':'A','ADD':'X','ADX':'N','AEA':'X',
    'AEI':'D','AET':'A','AFA':'N','AFF':'N','AFG':'G',
    'AGM':'R','AGT':'X','AHB':'N','AHH':'X','AHO':'A',
    'AHP':'A','AHS':'X','AHT':'X','AIB':'A','AKL':'D',
    'ALA':'A','ALC':'A','ALG':'R','ALM':'A','ALN':'A',
    'ALO':'T','ALQ':'X','ALS':'A','ALT':'A','ALY':'K',
    'AP7':'A','APE':'X','APH':'A','API':'K','APK':'K',
    'APM':'X','APP':'X','AR2':'R','AR4':'E','ARG':'R',
    'ARM':'R','ARO':'R','ARV':'X','AS ':'A','AS2':'D',
    'AS9':'X','ASA':'D','ASB':'D','ASI':'D','ASK':'D',
    'ASL':'D','ASM':'X','ASN':'N','ASP':'D','ASQ':'D',
    'ASU':'N','ASX':'B','ATD':'T','ATL':'T','ATM':'T',
    'AVC':'A','AVN':'X','AYA':'A','AYG':'AYG','AZK':'K',
    'AZS':'S','AZY':'Y','B1F':'F','B1P':'N','B2A':'A',
    'B2F':'F','B2I':'I','B2V':'V','B3A':'A','B3D':'D',
    'B3E':'E','B3K':'K','B3L':'X','B3M':'X','B3Q':'X',
    'B3S':'S','B3T':'X','B3U':'H','B3X':'N','B3Y':'Y',
    'BB6':'C','BB7':'C','BB9':'C','BBC':'C','BCS':'C',
    'BCX':'C','BE2':'X','BFD':'D','BG1':'S','BGM':'G',
    'BHD':'D','BIF':'F','BIL':'X','BIU':'I','BJH':'X',
    'BLE':'L','BLY':'K','BMP':'N','BMT':'T','BNN':'A',
    'BNO':'X','BOE':'T','BOR':'R','BPE':'C','BRU':'U',
    'BSE':'S','BT5':'N','BTA':'L','BTC':'C','BTR':'W',
    'BUC':'C','BUG':'V','BVP':'U','BZG':'N','C  ':'C',
    'C12':'TYG','C1X':'K','C25':'C','C2L':'C','C2S':'C',
    'C31':'C','C32':'C','C34':'C','C36':'C','C37':'C',
    'C38':'C','C3Y':'C','C42':'C','C43':'C','C45':'C',
    'C46':'C','C49':'C','C4R':'C','C4S':'C','C5C':'C',
    'C66':'X','C6C':'C','C99':'TFG','CAF':'C','CAL':'X',
    'CAR':'C','CAS':'C','CAV':'X','CAY':'C','CB2':'C',
    'CBR':'C','CBV':'C','CCC':'C','CCL':'K','CCS':'C',
    'CCY':'CYG','CDE':'X','CDV':'X','CDW':'C','CEA':'C',
    'CFL':'C','CFY':'FCYG','CG1':'G','CGA':'E','CGU':'E',
    'CH ':'C','CH6':'MYG','CH7':'KYG','CHF':'X','CHG':'X',
    'CHP':'G','CHS':'X','CIR':'R','CJO':'GYG','CLE':'L',
    'CLG':'K','CLH':'K','CLV':'AFG','CM0':'N','CME':'C',
    'CMH':'C','CML':'C','CMR':'C','CMT':'C','CNU':'U',
    'CP1':'C','CPC':'X','CPI':'X','CQR':'GYG','CR0':'TLG',
    'CR2':'GYG','CR5':'G','CR7':'KYG','CR8':'HYG','CRF':'TWG',
    'CRG':'THG','CRK':'MYG','CRO':'GYG','CRQ':'QYG','CRU':'E',
    'CRW':'ASG','CRX':'ASG','CS0':'C','CS1':'C','CS3':'C',
    'CS4':'C','CS8':'N','CSA':'C','CSB':'C','CSD':'C',
    'CSE':'C','CSF':'C','CSH':'SHG','CSI':'G','CSJ':'C',
    'CSL':'C','CSO':'C','CSP':'C','CSR':'C','CSS':'C',
    'CSU':'C','CSW':'C','CSX':'C','CSY':'SYG','CSZ':'C',
    'CTE':'W','CTG':'T','CTH':'T','CUC':'X','CWR':'S',
    'CXM':'M','CY0':'C','CY1':'C','CY3':'C','CY4':'C',
    'CYA':'C','CYD':'C','CYF':'C','CYG':'C','CYJ':'X',
    'CYM':'C','CYQ':'C','CYR':'C','CYS':'C','CZ2':'C',
    'CZO':'GYG','CZZ':'C','D11':'T','D1P':'N','D3 ':'N',
    'D33':'N','D3P':'G','D3T':'T','D4M':'T','D4P':'X',
    'DA ':'A','DA2':'X','DAB':'A','DAH':'F','DAL':'A',
    'DAR':'R','DAS':'D','DBB':'T','DBM':'N','DBS':'S',
    'DBU':'T','DBY':'Y','DBZ':'A','DC ':'C','DC2':'C',
    'DCG':'G','DCI':'X','DCL':'X','DCT':'C','DCY':'C',
    'DDE':'H','DDG':'G','DDN':'U','DDX':'N','DFC':'C',
    'DFG':'G','DFI':'X','DFO':'X','DFT':'N','DG ':'G',
    'DGH':'G','DGI':'G','DGL':'E','DGN':'Q','DHA':'A',
    'DHI':'H','DHL':'X','DHN':'V','DHP':'X','DHU':'U',
    'DHV':'V','DI ':'I','DIL':'I','DIR':'R','DIV':'V',
    'DLE':'L','DLS':'K','DLY':'K','DM0':'K','DMH':'N',
    'DMK':'D','DMT':'X','DN ':'N','DNE':'L','DNG':'L',
    'DNL':'K','DNM':'L','DNP':'A','DNR':'C','DNS':'K',
    'DOA':'X','DOC':'C','DOH':'D','DON':'L','DPB':'T',
    'DPH':'F','DPL':'P','DPP':'A','DPQ':'Y','DPR':'P',
    'DPY':'N','DRM':'U','DRP':'N','DRT':'T','DRZ':'N',
    'DSE':'S','DSG':'N','DSN':'S','DSP':'D','DT ':'T',
    'DTH':'T','DTR':'W','DTY':'Y','DU ':'U','DVA':'V',
    'DXD':'N','DXN':'N','DYG':'DYG','DYS':'C','DZM':'A',
    'E  ':'A','E1X':'A','EDA':'A','EDC':'G','EFC':'C',
    'EHP':'F','EIT':'T','ENP':'N','ESB':'Y','ESC':'M',
    'EXY':'L','EY5':'N','EYS':'X','F2F':'F','FA2':'A',
    'FA5':'N','FAG':'N','FAI':'N','FCL':'F','FFD':'N',
    'FGL':'G','FGP':'S','FHL':'X','FHO':'K','FHU':'U',
    'FLA':'A','FLE':'L','FLT':'Y','FME':'M','FMG':'G',
    'FMU':'N','FOE':'C','FOX':'G','FP9':'P','FPA':'F',
    'FRD':'X','FT6':'W','FTR':'W','FTY':'Y','FZN':'K',
    'G  ':'G','G25':'G','G2L':'G','G2S':'G','G31':'G',
    'G32':'G','G33':'G','G36':'G','G38':'G','G42':'G',
    'G46':'G','G47':'G','G48':'G','G49':'G','G4P':'N',
    'G7M':'G','GAO':'G','GAU':'E','GCK':'C','GCM':'X',
    'GDP':'G','GDR':'G','GFL':'G','GGL':'E','GH3':'G',
    'GHG':'Q','GHP':'G','GL3':'G','GLH':'Q','GLM':'X',
    'GLN':'Q','GLQ':'E','GLU':'E','GLX':'Z','GLY':'G',
    'GLZ':'G','GMA':'E','GMS':'G','GMU':'U','GN7':'G',
    'GND':'X','GNE':'N','GOM':'G','GPL':'K','GS ':'G',
    'GSC':'G','GSR':'G','GSS':'G','GSU':'E','GT9':'C',
    'GTP':'G','GVL':'X','GYC':'CYG','GYS':'SYG','H2U':'U',
    'H5M':'P','HAC':'A','HAR':'R','HBN':'H','HCS':'X',
    'HDP':'U','HEU':'U','HFA':'X','HGL':'X','HHI':'H',
    'HHK':'AK','HIA':'H','HIC':'H','HIP':'H','HIQ':'H',
    'HIS':'H','HL2':'L','HLU':'L','HMF':'A','HMR':'R',
    'HOL':'N','HPC':'F','HPE':'F','HPQ':'F','HQA':'A',
    'HRG':'R','HRP':'W','HS8':'H','HS9':'H','HSE':'S',
    'HSL':'S','HSO':'H','HTI':'C','HTN':'N','HTR':'W',
    'HV5':'A','HVA':'V','HY3':'P','HYP':'P','HZP':'P',
    'I  ':'I','I2M':'I','I58':'K','I5C':'C','IAM':'A',
    'IAR':'R','IAS':'D','IC ':'C','IEL':'K','IEY':'HYG',
    'IG ':'G','IGL':'G','IGU':'G','IIC':'SHG','IIL':'I',
    'ILE':'I','ILG':'E','ILX':'I','IMC':'C','IML':'I',
    'IOY':'F','IPG':'G','IPN':'N','IRN':'N','IT1':'K',
    'IU ':'U','IYR':'Y','IYT':'T','JJJ':'C','JJK':'C',
    'JJL':'C','JW5':'N','K1R':'C','KAG':'G','KCX':'K',
    'KGC':'K','KOR':'M','KPI':'K','KST':'K','KYQ':'K',
    'L2A':'X','LA2':'K','LAA':'D','LAL':'A','LBY':'K',
    'LC ':'C','LCA':'A','LCC':'N','LCG':'G','LCH':'N',
    'LCK':'K','LCX':'K','LDH':'K','LED':'L','LEF':'L',
    'LEH':'L','LEI':'V','LEM':'L','LEN':'L','LET':'X',
    'LEU':'L','LG ':'G','LGP':'G','LHC':'X','LHU':'U',
    'LKC':'N','LLP':'K','LLY':'K','LME':'E','LMQ':'Q',
    'LMS':'N','LP6':'K','LPD':'P','LPG':'G','LPL':'X',
    'LPS':'S','LSO':'X','LTA':'X','LTR':'W','LVG':'G',
    'LVN':'V','LYM':'K','LYN':'K','LYR':'K','LYS':'K',
    'LYX':'K','LYZ':'K','M0H':'C','M1G':'G','M2G':'G',
    'M2L':'K','M2S':'M','M3L':'K','M5M':'C','MA ':'A',
    'MA6':'A','MA7':'A','MAA':'A','MAD':'A','MAI':'R',
    'MBQ':'Y','MBZ':'N','MC1':'S','MCG':'X','MCL':'K',
    'MCS':'C','MCY':'C','MDH':'X','MDO':'ASG','MDR':'N',
    'MEA':'F','MED':'M','MEG':'E','MEN':'N','MEP':'U',
    'MEQ':'Q','MET':'M','MEU':'G','MF3':'X','MFC':'GYG',
    'MG1':'G','MGG':'R','MGN':'Q','MGQ':'A','MGV':'G',
    'MGY':'G','MHL':'L','MHO':'M','MHS':'H','MIA':'A',
    'MIS':'S','MK8':'L','ML3':'K','MLE':'L','MLL':'L',
    'MLY':'K','MLZ':'K','MME':'M','MMT':'T','MND':'N',
    'MNL':'L','MNU':'U','MNV':'V','MOD':'X','MP8':'P',
    'MPH':'X','MPJ':'X','MPQ':'G','MRG':'G','MSA':'G',
    'MSE':'M','MSL':'M','MSO':'M','MSP':'X','MT2':'M',
    'MTR':'T','MTU':'A','MTY':'Y','MVA':'V','N  ':'N',
    'N10':'S','N2C':'X','N5I':'N','N5M':'C','N6G':'G',
    'N7P':'P','NA8':'A','NAL':'A','NAM':'A','NB8':'N',
    'NBQ':'Y','NC1':'S','NCB':'A','NCX':'N','NCY':'X',
    'NDF':'F','NDN':'U','NEM':'H','NEP':'H','NF2':'N',
    'NFA':'F','NHL':'E','NIT':'X','NIY':'Y','NLE':'L',
    'NLN':'L','NLO':'L','NLP':'L','NLQ':'Q','NMC':'G',
    'NMM':'R','NMS':'T','NMT':'T','NNH':'R','NP3':'N',
    'NPH':'C','NRP':'LYG','NRQ':'MYG','NSK':'X','NTY':'Y',
    'NVA':'V','NYC':'TWG','NYG':'NYG','NYM':'N','NYS':'C',
    'NZH':'H','O12':'X','O2C':'N','O2G':'G','OAD':'N',
    'OAS':'S','OBF':'X','OBS':'X','OCS':'C','OCY':'C',
    'ODP':'N','OHI':'H','OHS':'D','OIC':'X','OIP':'I',
    'OLE':'X','OLT':'T','OLZ':'S','OMC':'C','OMG':'G',
    'OMT':'M','OMU':'U','ONE':'U','ONL':'X','OPR':'R',
    'ORN':'A','ORQ':'R','OSE':'S','OTB':'X','OTH':'T',
    'OTY':'Y','OXX':'D','P  ':'G','P1L':'C','P1P':'N',
    'P2T':'T','P2U':'U','P2Y':'P','P5P':'A','PAQ':'Y',
    'PAS':'D','PAT':'W','PAU':'A','PBB':'C','PBF':'F',
    'PBT':'N','PCA':'E','PCC':'P','PCE':'X','PCS':'F',
    'PDL':'X','PDU':'U','PEC':'C','PF5':'F','PFF':'F',
    'PFX':'X','PG1':'S','PG7':'G','PG9':'G','PGL':'X',
    'PGN':'G','PGP':'G','PGY':'G','PHA':'F','PHD':'D',
    'PHE':'F','PHI':'F','PHL':'F','PHM':'F','PIV':'X',
    'PLE':'L','PM3':'F','PMT':'C','POM':'P','PPN':'F',
    'PPU':'A','PPW':'G','PQ1':'N','PR3':'C','PR5':'A',
    'PR9':'P','PRN':'A','PRO':'P','PRS':'P','PSA':'F',
    'PSH':'H','PST':'T','PSU':'U','PSW':'C','PTA':'X',
    'PTH':'Y','PTM':'Y','PTR':'Y','PU ':'A','PUY':'N',
    'PVH':'H','PVL':'X','PYA':'A','PYO':'U','PYX':'C',
    'PYY':'N','QLG':'QLG','QUO':'G','R  ':'A','R1A':'C',
    'R1B':'C','R1F':'C','R7A':'C','RC7':'HYG','RCY':'C',
    'RIA':'A','RMP':'A','RON':'X','RT ':'T','RTP':'N',
    'S1H':'S','S2C':'C','S2D':'A','S2M':'T','S2P':'A',
    'S4A':'A','S4C':'C','S4G':'G','S4U':'U','S6G':'G',
    'SAC':'S','SAH':'C','SAR':'G','SBL':'S','SC ':'C',
    'SCH':'C','SCS':'C','SCY':'C','SD2':'X','SDG':'G',
    'SDP':'S','SEB':'S','SEC':'A','SEG':'A','SEL':'S',
    'SEM':'X','SEN':'S','SEP':'S','SER':'S','SET':'S',
    'SGB':'S','SHC':'C','SHP':'G','SHR':'K','SIB':'C',
    'SIC':'DC','SLA':'P','SLR':'P','SLZ':'K','SMC':'C',
    'SME':'M','SMF':'F','SMP':'A','SMT':'T','SNC':'C',
    'SNN':'N','SOC':'C','SOS':'N','SOY':'S','SPT':'T',
    'SRA':'A','SSU':'U','STY':'Y','SUB':'X','SUI':'DG',
    'SUN':'S','SUR':'U','SVA':'S','SVX':'S','SVZ':'X',
    'SYS':'C','T  ':'T','T11':'F','T23':'T','T2S':'T',
    'T2T':'N','T31':'U','T32':'T','T36':'T','T37':'T',
    'T38':'T','T39':'T','T3P':'T','T41':'T','T48':'T',
    'T49':'T','T4S':'T','T5O':'U','T5S':'T','T66':'X',
    'T6A':'A','TA3':'T','TA4':'X','TAF':'T','TAL':'N',
    'TAV':'D','TBG':'V','TBM':'T','TC1':'C','TCP':'T',
    'TCQ':'X','TCR':'W','TCY':'A','TDD':'L','TDY':'T',
    'TFE':'T','TFO':'A','TFQ':'F','TFT':'T','TGP':'G',
    'TH6':'T','THC':'T','THO':'X','THR':'T','THX':'N',
    'THZ':'R','TIH':'A','TLB':'N','TLC':'T','TLN':'U',
    'TMB':'T','TMD':'T','TNB':'C','TNR':'S','TOX':'W',
    'TP1':'T','TPC':'C','TPG':'G','TPH':'X','TPL':'W',
    'TPO':'T','TPQ':'Y','TQQ':'W','TRF':'W','TRG':'K',
    'TRN':'W','TRO':'W','TRP':'W','TRQ':'W','TRW':'W',
    'TRX':'W','TS ':'N','TST':'X','TT ':'N','TTD':'T',
    'TTI':'U','TTM':'T','TTQ':'W','TTS':'Y','TY2':'Y',
    'TY3':'Y','TYB':'Y','TYI':'Y','TYN':'Y','TYO':'Y',
    'TYQ':'Y','TYR':'Y','TYS':'Y','TYT':'Y','TYU':'N',
    'TYX':'X','TYY':'Y','TZB':'X','TZO':'X','U  ':'U',
    'U25':'U','U2L':'U','U2N':'U','U2P':'U','U31':'U',
    'U33':'U','U34':'U','U36':'U','U37':'U','U8U':'U',
    'UAR':'U','UCL':'U','UD5':'U','UDP':'N','UFP':'N',
    'UFR':'U','UFT':'U','UMA':'A','UMP':'U','UMS':'U',
    'UN1':'X','UN2':'X','UNK':'X','UR3':'U','URD':'U',
    'US1':'U','US2':'U','US3':'T','US5':'U','USM':'U',
    'V1A':'C','VAD':'V','VAF':'V','VAL':'V','VB1':'K',
    'VDL':'X','VLL':'X','VLM':'X','VMS':'X','VOL':'X',
    'X  ':'G','X2W':'E','X4A':'N','X9Q':'AFG','XAD':'A',
    'XAE':'N','XAL':'A','XAR':'N','XCL':'C','XCP':'X',
    'XCR':'C','XCS':'N','XCT':'C','XCY':'C','XGA':'N',
    'XGL':'G','XGR':'G','XGU':'G','XTH':'T','XTL':'T',
    'XTR':'T','XTS':'G','XTY':'N','XUA':'A','XUG':'G',
    'XX1':'K','XXY':'THG','XYG':'DYG','Y  ':'A','YCM':'C',
    'YG ':'G','YOF':'Y','YRR':'N','YYG':'G','Z  ':'C',
    'ZAD':'A','ZAL':'A','ZBC':'C','ZCY':'C','ZDU':'U',
    'ZFB':'X','ZGU':'G','ZHP':'N','ZTH':'T','ZZJ':'A' }




From redmine at redmine.open-bio.org  Thu Apr 26 23:59:13 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 27 Apr 2012 03:59:13 +0000
Subject: [Biopython-dev] [Biopython - Bug #3346] (New) patch for legacy
	parser to support BLASTX 2.2.25+
Message-ID: <redmine.issue-3346.20120427035913@redmine.open-bio.org>


Issue #3346 has been reported by John Comeau.

----------------------------------------
Bug #3346: patch for legacy parser to support BLASTX 2.2.25+
https://redmine.open-bio.org/issues/3346

Author: John Comeau
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


It may also work with 2.2.26+; I have not tested that. The patched parser passes regression tests as per Peter Cock's instructions.


----------------------------------------


From andrew.sczesnak at med.nyu.edu  Fri Apr 27 15:57:19 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Fri, 27 Apr 2012 15:57:19 -0400
Subject: [Biopython-dev] BGZF support,
	was Re: Biopython 1.60 plans and beyond
In-Reply-To: <CAKVJ-_4k==uN0UYa17-xPV6OMjE-Wm5Yuohf=bzGKB5vwXmKVQ@mail.gmail.com>
References: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>	<4F91E4CF.8040602@med.nyu.edu>
	<CAKVJ-_4k==uN0UYa17-xPV6OMjE-Wm5Yuohf=bzGKB5vwXmKVQ@mail.gmail.com>
Message-ID: <4F9AFA1F.6030103@med.nyu.edu>

Peter,

> It should be easy enough to follow the BGZF changes to Bio/SeqIO/_index.py
> and I'm willing to do this myself for MAF (while going over your index work -
> something I want to do anyway). The only potential catch is avoiding offset
> arithmetic.

I have no problem with you doing this if you're willing. It would be 
great to have some code review of MafIndex as well.


Best,
Andrew

From MatatTHC at gmx.de  Sat Apr 28 03:15:35 2012
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Sat, 28 Apr 2012 09:15:35 +0200
Subject: [Biopython-dev] SeqIO circular
In-Reply-To: <CALNFT0hrc+T-0xWesCuK0E5X8=mcDCqXoRRJJ4ms2qAibWXhTg@mail.gmail.com>
References: <CALNFT0jq=VTwSDv-4x7ZrHoQRLajCUHY8NGPMw9cDuGnwwNiuw@mail.gmail.com>
	<CAKVJ-_7MpLRCModFfMdRPcVDjk42nVCJ--OwNBnAJv3wNcns_A@mail.gmail.com>
	<CALNFT0jTxFSbqn+f3hS-KZ2Z09xsgoKPFSow1BO3PdDGrJ7hag@mail.gmail.com>
	<CALNFT0hrc+T-0xWesCuK0E5X8=mcDCqXoRRJJ4ms2qAibWXhTg@mail.gmail.com>
Message-ID: <CALNFT0h1udmBBF+TZrXhv22q5SBNJE5RBmtr+bMfmAsQMabX2g@mail.gmail.com>

Dear developers,

I would like to suggest a quick "fix" for the problem. Currently the
parser just returns true per default for the circular property. This
is a wrong piece of information for all circular sequences.
Furthermore, it's not possible to detect whether the parser returned true
because that is its default value or because it really comes from the data.
So I suggest returning None if the parser does not parse the information.

What do you think? This should be possible with minimal effort.

The user could then implement a workaround on their own (like using the
old parser as fallback, or just searching the first line of t)
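
A rough sketch of the three-state idea, assuming the topology appears as
the word "circular" or "linear" on the GenBank LOCUS line. The function
name and the example LOCUS lines are invented for illustration; this is not
the SeqIO parser's code, and the token test is deliberately naive.

```python
def parse_topology(locus_line):
    """Return True (circular), False (linear) or None (not stated)."""
    tokens = locus_line.split()
    if "circular" in tokens:
        return True
    if "linear" in tokens:
        return False
    return None  # the LOCUS line did not state a topology


# Invented LOCUS lines, not real records.
circular = parse_topology(
    "LOCUS       EXAMPLE1                1000 bp    DNA     circular BCT 01-JAN-2000")
linear = parse_topology(
    "LOCUS       EXAMPLE2                1000 bp    DNA     linear   BCT 01-JAN-2000")
unknown = parse_topology("LOCUS       EXAMPLE3                1000 bp    DNA")
```

A caller can then test `is None` to tell "not stated" apart from an
explicit linear record.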

Regards,
Matthias

2012/4/22 Matthias Bernt <MatatTHC at gmx.de>:
> Hi,
>
> since this bug seems to be of low priority I decided to try my best to
> help a bit and search the web a bit.
> It seems that the property is stored in PrimarySeq or Seq in bioperl.
> See for instance:
>
> http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/Seq.pm
> http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/PrimarySeq.pm
>
> Or also:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2578
>
> This seems to be realised as a boolean variable or function.
>
> Regards,
> Matthias
>
> 2012/4/4 Matthias Bernt <MatatTHC at gmx.de>:
>> Hi,
>>
>> is there any news on this? May I help somehow? But I have to admit
>> that I barely speak perl and have no experience with bioperl. If
>> someone tells me where to look I might still try it.
>>
>> Matthias
>>
>> 2012/3/29 Peter Cock <p.j.a.cock at googlemail.com>:
>>> On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt <MatatTHC at gmx.de> wrote:
>>>> Hi,
>>>>
>>>> Is it possible to get the property if a genome is circular / linear
>>>> from SeqIO applied to genbank files? I could not find it.
>>>>
>>>> There is also a related bugreport:
>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2578
>>>>
>>>> I used the old parser before and switched to SeqIO which I really like
>>>> for the possibilities to parse different formats... but I really need
>>>> the information.
>>>
>>> Does anyone happen to have a BioPerl + BioSQL setup installed
>>> and working? IIRC checking that to make sure however we
>>> store the circular was compatible was the only real hurdle.
>>>
>>> Peter


From w.arindrarto at gmail.com  Sat Apr 28 08:08:35 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sat, 28 Apr 2012 14:08:35 +0200
Subject: [Biopython-dev] Google Summer of Code Project: SearchIO in Biopython
Message-ID: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>

Hello everyone,

This is Wibowo Arindrarto (or Bow, for short), one of the Google Summer of
Code students who will work on Biopython over this summer.

I will be working with Peter to add support for parsing search outputs from
programs like BLAST and HMMER to Biopython, so that it's easier to extract
information from their outputs. Having used some of these programs quite a
lot myself, I'm really looking forward to implementing the feature.
However, I do understand that it won't be just me who will use the module,
but also many other Biopython users. So for everyone who is interested in
giving a say, input, or critiques along the way, feel free to do so :).

The official coding period starts in about a month from now. Until then, I
will be doing all the preparatory work required so that coding will proceed
as smoothly as possible. This will include preparing the test cases and
preparing the SearchIO attribute / object naming convention as well as
discussing anything related to its proposed implementation.

Finally, here are some links related to the project that might interest you.

1. My main biopython branch for development:
https://github.com/bow/biopython/tree/searchio. Since I will be building on
top of Peter's SearchIO branch (
https://github.com/peterjc/biopython/tree/search-io-test), right now it
only contains Peter's branch rebased against the latest master.

2. My GSoC proposal, which outlines my plans and timeline for the project:
http://bit.ly/searchio-proposal

3. The proposed SearchIO naming convention (not 100% complete as of now,
but will be filled along the way): http://bit.ly/searchio-terms. One of the
main goals of the project is to implement a common interface for BLAST et
al, which requires SearchIO to have common attribute names that refer to
different search output attributes. The link contains my proposed naming
convention, which is still very open to change and discussion. Feel free to
comment on the document and add your own ideas.

4. My blog, in which I will write weekly posts about the project's
progress: http://bow.web.id/blog

5. An extra repo for all other auxiliary files and scripts that don't go
into Biopython's code: https://github.com/bow/gsoc.

That's it for now. Thanks for taking time to read it :). I'm looking
forward to a productive summer with Biopython.

Have a nice weekend,
Bow

From p.j.a.cock at googlemail.com  Sun Apr 29 07:00:42 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 29 Apr 2012 12:00:42 +0100
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project:
	SearchIO in Biopython
In-Reply-To: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>
References: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>
Message-ID: <CAKVJ-_6GmqyyEiqbhecAvEvRjPmvZbu_8Qrp1Tbe3KCcBfXsRQ@mail.gmail.com>

Hi Bow,

Thanks for updating the list. I'm replying just on the dev list
as I'm focusing on implementation discussion in this reply.

On Sat, Apr 28, 2012 at 1:08 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> 1. My main biopython branch for development:
> https://github.com/bow/biopython/tree/searchio. Since I will be building on
> top of Peter's SearchIO branch (
> https://github.com/peterjc/biopython/tree/search-io-test), right now it
> only contains Peter's branch rebased against the latest master.

Just to be clear - you don't have to start from that branch ;)
http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html

As I said before, that may not be the best approach. The idea
behind that code was to focus on the HSPs (in BLAST terms),
and for the low level parsers to iterate over each HSP. Higher
level wrappers can then batch these up by query/subject, or
into the larger grouping of all the results for one query -
which was the exposed high level Bio.SearchIO.parse
function.

That branch introduced a SearchResult object which was
essentially something like a list or dict (like an OrderedDict
in some ways), with some (unnecessary?) error checking for
consistent contents (all from the same query). It also introduced
a TopMatches object which was essentially a list (again,
with some error checking for consistent contents).

The advantage of using simple objects (OrderedDict
and list) is simplicity and hopefully performance. But
specific classes have the advantage of allowing more
user friendly str/repr etc.
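
To illustrate the trade-off, here is a small sketch of a dedicated class
wrapping an OrderedDict. SketchSearchResult, add_hsp and the string
placeholders used as HSPs are hypothetical; this is not the code from the
branch, only one way the consistency check and a friendlier repr could
look.

```python
from collections import OrderedDict


class SketchSearchResult(object):
    """Ordered mapping of match ID -> list of HSPs for a single query."""

    def __init__(self, query_id):
        self.query_id = query_id
        self._matches = OrderedDict()

    def add_hsp(self, query_id, match_id, hsp):
        # The consistency check: every HSP must belong to this query.
        if query_id != self.query_id:
            raise ValueError("expected query %r, got %r"
                             % (self.query_id, query_id))
        self._matches.setdefault(match_id, []).append(hsp)

    def __getitem__(self, match_id):
        return self._matches[match_id]

    def __iter__(self):
        return iter(self._matches)

    def __len__(self):
        return len(self._matches)

    def __repr__(self):
        return "SketchSearchResult(query_id=%r, %d matches)" % (
            self.query_id, len(self._matches))


result = SketchSearchResult("query1")
result.add_hsp("query1", "subjA", "hsp-1")
result.add_hsp("query1", "subjB", "hsp-2")
result.add_hsp("query1", "subjA", "hsp-3")
```

A plain OrderedDict would give the same lookups with less code; the class
only earns its keep through the check in add_hsp and the custom repr.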

The idea on this branch of focusing on iteration over the
HSPs at the low level was it allowed a lot of flexibility, and
the low level parser could be used in conjunction with
indexing to seek to a particular HSP and parse it, or go to
the results for a particular query+match and parse its
HSPs  (not implemented on my old branch, but that was
the plan).

However, while this makes perfect sense for say the BLAST
tabular output, it isn't quite such a good match for all the
possible datatypes.

For instance, BLAST plain text/html includes an e-value for
a query/subject combination which is calculated from all the
HSPs for that query/subject (taking into account order etc -
I'd have to check the O'Reilly BLAST book for the details).
This isn't in the tabular output, but the point is that it isn't a
property of the individual HSPs, but of the match (group of
HSPs).

I think we need to consider the other main formats, and if
all their important information lies at the HSP level or not.
Perhaps iteration at the query+match level (groups of
HSPs) would be best overall?
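
The layering described above (a low-level iterator over HSPs, with a
wrapper that batches them per query+match) can be sketched for tabular
output as follows. This assumes the standard 12-column tab-separated
layout with the query ID, subject ID and bit score in columns 1, 2 and 12,
and that HSPs for the same query+match pair are adjacent in the file. The
names HSP, iter_hsps and iter_matches and the example rows are invented for
illustration.

```python
from collections import namedtuple
from itertools import groupby

# Only three of the twelve tabular columns are kept in this sketch.
HSP = namedtuple("HSP", ["query_id", "subject_id", "bitscore"])


def iter_hsps(lines):
    """Low level: yield one HSP per non-comment line of tabular output."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        yield HSP(fields[0], fields[1], float(fields[11]))


def iter_matches(hsps):
    """Higher level: batch consecutive HSPs sharing a query and subject."""
    for key, group in groupby(hsps, key=lambda h: (h.query_id, h.subject_id)):
        yield key, list(group)


# Made-up rows: two HSPs for q1/s1 followed by one HSP for q1/s2.
example = [
    "q1\ts1\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t180.0\n",
    "q1\ts1\t90.0\t50\t5\t0\t120\t170\t200\t250\t1e-10\t60.0\n",
    "q1\ts2\t85.0\t80\t12\t0\t1\t80\t1\t80\t1e-20\t95.0\n",
]
matches = list(iter_matches(iter_hsps(example)))
```

A value that belongs to the whole group, such as the combined e-value
mentioned above, would have to be attached at the iter_matches level since
no single HSP carries it.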

Bow - If some of that doesn't make sense, I can try to clarify
by email on the list, and/or we can talk about it at our next
video chat. Also see if you can get the BLAST book from
your library - it will probably be quite useful in this project
even though it describes the 'legacy' BLAST suite:

"BLAST" by Ian Korf, Mark Yandell, Joseph Bedell
Publisher: O'Reilly Media, Released: July 2003

Regards,

Peter

From w.arindrarto at gmail.com  Sun Apr 29 12:42:14 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sun, 29 Apr 2012 18:42:14 +0200
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project:
	SearchIO in Biopython
In-Reply-To: <CAKVJ-_6GmqyyEiqbhecAvEvRjPmvZbu_8Qrp1Tbe3KCcBfXsRQ@mail.gmail.com>
References: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>
	<CAKVJ-_6GmqyyEiqbhecAvEvRjPmvZbu_8Qrp1Tbe3KCcBfXsRQ@mail.gmail.com>
Message-ID: <CADEGkF7-HBRpcwTXzpFMuJk5UGJ3dv=Vnxrw1DUouuswoF2h-Q@mail.gmail.com>

On Sun, Apr 29, 2012 at 13:00, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> Hi Bow,
>
> Thanks for updating the list. I'm replying just on the dev list
> as I'm focusing on implementation discussion in this reply.
>
> On Sat, Apr 28, 2012 at 1:08 PM, Wibowo Arindrarto
> <w.arindrarto at gmail.com> wrote:
> > 1. My main biopython branch for development:
> > https://github.com/bow/biopython/tree/searchio. Since I will be building
> > on
> > top of Peter's SearchIO branch (
> > https://github.com/peterjc/biopython/tree/search-io-test), right now it
> > only contains Peter's branch rebased against the latest master.
>
> Just to be clear - you don't have to start from that branch ;)
> http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html

Ok :). I wasn't sure how much code from your previous branch I would
end up using, so I decided to rebase everything and see later how much
of it can be used. But it's also easier to start clean :).

> As I said before, that may not be the best approach. The idea
> behind that code was to focus on the HSPs (in BLAST terms),
> and for the low level parsers to iterate over each HSP. Higher
> level wrappers can then batch these up by query/subject, or
> into the larger grouping of all the results for one query -
> which was the exposed high level Bio.SearchIO.parse
> function.
>
> That branch introduced a SearchResult object which was
> essentially something like a list or dict (like an OrderedDict
> in some ways), with some (unnecessary?) error checking for
> consistent contents (all from the same query). It also introduced
> a TopMatches object which was essentially a list (again,
> with some error checking for consistent contents).
>
> The advantage of using simple objects (OrderedDict
> and list) is simplicity and hopefully performance. But
> specific classes have the advantage of allowing more
> user friendly str/repr etc.
>
> The idea on this branch of focusing on iteration over the
> HSPs at the low level was it allowed a lot of flexibility, and
> the low level parser could be used in conjunction with
> indexing to seek to a particular HSP and parse it, or go to
> the results for a particular query+match and parse its
> HSPs (not implemented on my old branch, but that was
> the plan).
>
> However, while this makes perfect sense for say the BLAST
> tabular output, it isn't quite such a good match for all the
> possible datatypes.
>
> For instance, BLAST plain text/html includes an e-value for
> a query/subject combination which is calculated from all the
> HSPs for that query/subject (taking into account order etc -
> I'd have to check the O'Reilly BLAST book for the details).
> This isn't in the tabular output, but the point is that it isn't a
> property of the individual HSPs, but of the match (group of
> HSPs).
>
> I think we need to consider the other main formats, and if
> all their important information lies at the HSP level or not.
> Perhaps iteration at the query+match level (groups of
> HSPs) would be best overall?
>
> Bow - If some of that doesn't make sense, I can try to clarify
> by email on the list, and/or we can talk about it at our next
> video chat. Also see if you can get the BLAST book from
> your library - it will probably be quite useful in this project
> even though it describes the 'legacy' BLAST suite:
>
> "BLAST" by Ian Korf, Mark Yandell, Joseph Bedell
> Publisher: O'Reilly Media, Released: July 2003
>
> Regards,
>
> Peter

I think I got the gist of it (please correct me if I'm wrong). Some
information about the search, such as the sequence-wide e-value, may
not be present at the HSP level. Ignoring it could let us focus on a
perhaps simpler and more flexible implementation with better
performance, but at the cost of usefulness of the data itself since we
are throwing away information.

What I have in mind now is actually closer to iteration on the
query+subject level. To be clear first, the hierarchy of the objects
that I propose is this:

* Search object, to represent the entire search session.
* Result object, to represent a search with one query against the
database. Depending on the number of queries, we could have one to
several Result objects contained in a Search.
* Hit object, to represent a sequence hit. Depending on the search, we
could also have multiple Hits in one Result object.
* and finally, HSP object, to represent individual alignments.

Iteration is done on the Results level, so the information is parsed
at the search query level, not just for a single HSP (I also wrote a
very short description of what I'm planning the objects to be here:
http://bit.ly/searchio-terms). I suppose if we aim for maximum
information parsing over performance and simplicity of the
format-specific parsers, this is the way to go. There are other
formats, too, that contain sequence-level search information not
present in the alignment (e.g. HMMER text output). What do you think
about this?
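
Roughly, the hierarchy could be sketched like this. Nothing here is
implemented yet; the attribute names (evalue, bitscore, hsps, hits,
program) are placeholders picked for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HSP:
    """One alignment between the query and a hit sequence."""
    evalue: float
    bitscore: float


@dataclass
class Hit:
    """One database sequence matched by the query, holding its HSPs."""
    id: str
    hsps: List[HSP] = field(default_factory=list)
    # Hit-level value, for formats that report one separately from the HSPs.
    evalue: Optional[float] = None


@dataclass
class Result:
    """Everything found for a single query."""
    query: str
    hits: List[Hit] = field(default_factory=list)

    def __len__(self):
        return len(self.hits)


@dataclass
class Search:
    """The whole search session: shared metadata plus one Result per query."""
    program: str
    results: List[Result] = field(default_factory=list)


hit = Hit("s1", [HSP(1e-30, 120.0), HSP(2e-5, 40.0)], evalue=1e-32)
result = Result("q1", [hit])
search = Search("blastx", [result])
print(len(result), len(hit.hsps))
```

The point of the extra Hit layer is that a hit-level e-value has
somewhere to live without being copied onto every HSP.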

Thanks for the BLAST book suggestion. I'll see if I can find it in my
library in the meantime.

regards,
Bow


From p.j.a.cock at googlemail.com  Mon Apr 30 05:49:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 30 Apr 2012 10:49:27 +0100
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project:
	SearchIO in Biopython
In-Reply-To: <CADEGkF7-HBRpcwTXzpFMuJk5UGJ3dv=Vnxrw1DUouuswoF2h-Q@mail.gmail.com>
References: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>
	<CAKVJ-_6GmqyyEiqbhecAvEvRjPmvZbu_8Qrp1Tbe3KCcBfXsRQ@mail.gmail.com>
	<CADEGkF7-HBRpcwTXzpFMuJk5UGJ3dv=Vnxrw1DUouuswoF2h-Q@mail.gmail.com>
Message-ID: <CAKVJ-_4G8SedQn8jBB03OROcs-F6hj1T9=01V+NTZfPOVRgyrQ@mail.gmail.com>

On Sun, Apr 29, 2012 at 5:42 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
>
> I think I got the gist of it (please correct me if I'm wrong). Some
> information about the search, such as the sequence-wide e-value, may
> not be present at the HSP level. Ignoring it could let us focus on a
> perhaps simpler and more flexible implementation with better
> performance, but at the cost of usefulness of the data itself since we
> are throwing away information.

Yes.

> What I have in mind now is actually closer to iteration on the
> query+subject level. To be clear first, the hierarchy of the objects
> that I propose is this:
>
> * Search object, to represent the entire search session.
> * Result object, to represent a search with one query against the
> database. Depending on the number of queries, we could have one to
> several Result objects contained in a Search.
> * Hit object, to represent a sequence hit. Depending on the search, we
> could also have multiple Hits in one Result object.
> * and finally, HSP object, to represent individual alignments.
>
> Iteration is done on the Results level, so the information is parsed
> at the search query level, not just for a single HSP (I also wrote a
> very short description of what I'm planning the objects to be here:
> http://bit.ly/searchio-terms). I suppose if we aim for maximum
> information parsing over performance and simplicity of the
> format-specific parsers, this is the way to go. There are other
> formats, too, that contain sequence-level search information not
> present in the alignment (e.g. HMMER text output). What do you think
> about this?

That sounds good.

If iteration is done on the Results level, when/how would your
Search object be used?

Peter


From w.arindrarto at gmail.com  Mon Apr 30 06:08:52 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Mon, 30 Apr 2012 12:08:52 +0200
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project:
	SearchIO in Biopython
In-Reply-To: <CAKVJ-_4G8SedQn8jBB03OROcs-F6hj1T9=01V+NTZfPOVRgyrQ@mail.gmail.com>
References: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>
	<CAKVJ-_6GmqyyEiqbhecAvEvRjPmvZbu_8Qrp1Tbe3KCcBfXsRQ@mail.gmail.com>
	<CADEGkF7-HBRpcwTXzpFMuJk5UGJ3dv=Vnxrw1DUouuswoF2h-Q@mail.gmail.com>
	<CAKVJ-_4G8SedQn8jBB03OROcs-F6hj1T9=01V+NTZfPOVRgyrQ@mail.gmail.com>
Message-ID: <CADEGkF5C-dq7jd+JmcFzsY-X2Poiqs8vDZNt3zrUS5kaewA0tw@mail.gmail.com>

>> What I have in mind now is actually closer to iteration on the
>> query+subject level. To be clear first, the hierarchy of the objects
>> that I propose is this:
>>
>> * Search object, to represent the entire search session.
>> * Result object, to represent a search with one query against the
>> database. Depending on the number of queries, we could have one to
>> several Result objects contained in a Search.
>> * Hit object, to represent a sequence hit. Depending on the search, we
>> could also have multiple Hits in one Result object.
>> * and finally, HSP object, to represent individual alignments.
>>
>> Iteration is done on the Results level, so the information is parsed
>> at the search query level, not just for a single HSP (I also wrote a
>> very short description of what I'm planning the objects to be here:
>> http://bit.ly/searchio-terms). I suppose if we aim for maximum
>> information parsing over performance and simplicity of the
>> format-specific parsers, this is the way to go. There are other
>> formats, too, that contain sequence-level search information not
>> present in the alignment (e.g. HMMER text output). What do you think
>> about this?
>
> That sounds good.
>
> If iteration is done on the Results level, when/how would your
> Search object be used?
>
> Peter

I'm thinking of using the Search object as the object returned by
SearchIO.parse or SearchIO.read. That way, we can store attributes
common to the different search queries in it. For example:

>>> search = SearchIO.parse('blast_result.xml', 'blast-xml')
>>> search.format
'blast-xml'
>>> search.algorithm
'blastx'
>>> search.version
'2.2.26+'
>>> search.database
'refseq_protein'
>>> search.results
<generator object results at ....>

And iteration over the results would be done like this (for example):
>>> for result in search.results:
...     print(result.query, len(result))

Additionally, we can also define __iter__ and next for Search so we can
just do the following:
>>> for result in search:
...     print(result.query, len(result))
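
A minimal sketch of what I mean, with made-up names throughout
(fake_results stands in for a format-specific parser and meta is just a
dict of shared attributes; none of this exists in SearchIO yet):

```python
class Search(object):
    """Hypothetical container: shared metadata plus one lazy result stream."""

    def __init__(self, results, **meta):
        self._results = iter(results)
        self.meta = meta

    def __iter__(self):
        # The object is its own iterator, so it can only be walked once.
        return self

    def __next__(self):
        return next(self._results)

    next = __next__  # Python 2 spelling of the same method


def fake_results():
    """Stand-in for a real parser: yields (query id, number of hits)."""
    for query, n_hits in [("q1", 3), ("q2", 0)]:
        yield query, n_hits


search = Search(fake_results(), algorithm="blastx", version="2.2.26+")
seen = [query for query, n_hits in search]
print(seen, search.meta["algorithm"])
```

Note that in this form the Search object is single-pass: once the loop
finishes, iterating over it again yields nothing.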

What do you think?


Bow


From p.j.a.cock at googlemail.com  Mon Apr 30 06:57:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 30 Apr 2012 11:57:27 +0100
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project:
	SearchIO in Biopython
In-Reply-To: <CADEGkF5C-dq7jd+JmcFzsY-X2Poiqs8vDZNt3zrUS5kaewA0tw@mail.gmail.com>
References: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>
	<CAKVJ-_6GmqyyEiqbhecAvEvRjPmvZbu_8Qrp1Tbe3KCcBfXsRQ@mail.gmail.com>
	<CADEGkF7-HBRpcwTXzpFMuJk5UGJ3dv=Vnxrw1DUouuswoF2h-Q@mail.gmail.com>
	<CAKVJ-_4G8SedQn8jBB03OROcs-F6hj1T9=01V+NTZfPOVRgyrQ@mail.gmail.com>
	<CADEGkF5C-dq7jd+JmcFzsY-X2Poiqs8vDZNt3zrUS5kaewA0tw@mail.gmail.com>
Message-ID: <CAKVJ-_7R8DK7KtCKOEGLr_wMR5ci2ZiAXHZnXWvZuN3-=whv9w@mail.gmail.com>

On Mon, Apr 30, 2012 at 11:08 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
>
> I'm thinking of using the Search object as the object returned by
> SearchIO.parse or SearchIO.read. That way, we can store attributes
> common to the different search queries in it. For example:
>
> >>> search = SearchIO.parse('blast_result.xml', 'blast-xml')
> >>> search.format
> 'blast-xml'
> >>> search.algorithm
> 'blastx'
> >>> search.version
> '2.2.26+'
> >>> search.database
> 'refseq_protein'
> >>> search.results
> <generator object results at ....>
>
> And iteration over the results would be done like this (for example):
> >>> for result in search.results:
> ...     print(result.query, len(result))
>
> Additionally, we can also define __iter__ and next for Search so we can
> just do the following:
> >>> for result in search:
> ...     print(result.query, len(result))
>
> What do you think?

I think you'll get in a mess with multiple iterators all sharing the
same handle and competing over using it - but maybe I'm not
grasping what you have in mind.

Initially keep it simple: The primary public API would be

for result in Bio.SearchIO.parse(...):
    print(result.query, len(result))

where each iteration gives a complete result set for one query.
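
To illustrate the handle-sharing worry, here is a toy sketch (the
function iter_records and the fake data are invented for this example,
not SearchIO code). Two generators reading from one handle each see
only part of the records:

```python
import io


def iter_records(handle):
    """Yield stripped non-empty lines; stands in for a real parser."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


handle = io.StringIO("result1\nresult2\nresult3\nresult4\n")
first = iter_records(handle)
second = iter_records(handle)

# Each call advances the one shared file position, so the two
# iterators split the records between them.
interleaved = [next(first), next(second), next(first), next(second)]
print(interleaved)
```

Neither iterator on its own ever sees all four records, which is the
kind of mess I'd like to avoid.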

Peter

P.S. With SearchIO subject to name space discussions ;)


From chapmanb at 50mail.com  Sun Apr  1 19:13:56 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 01 Apr 2012 15:13:56 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
References: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
Message-ID: <87zkavtgcr.fsf@fastmail.fm>


Lenna;
Thanks for the introduction and glad to hear about your interest in the
variant project. I'm looking forward to seeing your proposal.

The workflow for the variant project involves a biologist querying a VCF
or GVF file with variants from an experiment. They should be able to
easily subset and filter by file components:

- Variant type: Homozygous/Heterozygous variants
- Metrics: depth, strand bias, allele frequency, etc.
- Variants annotated in coding regions causing amino acid changes

As well as rapid subsetting by chromosomal region.

My suggestion would be to leverage external tools as much as possible to
do file manipulation and focus on an API that lets users filter and
extract information pre-contained in the INFO field.
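
As a rough illustration of the kind of filtering I have in mind, here
is a sketch in plain Python. The three VCF lines are made up, the
helper names (parse_info, parse_line, is_het) are hypothetical, and a
real implementation would sit on top of an existing VCF parser rather
than splitting lines by hand:

```python
# Made-up single-sample VCF data lines (tab-separated), for illustration.
vcf_lines = [
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=35;AF=0.5\tGT\t0/1",
    "chr1\t200\t.\tC\tT\t20\tPASS\tDP=4;AF=1.0\tGT\t1/1",
    "chr2\t300\t.\tG\tA\t60\tPASS\tDP=40;AF=1.0\tGT\t1/1",
]


def parse_info(text):
    """Turn 'DP=35;AF=0.5' into a dict; bare flags map to True."""
    info = {}
    for item in text.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_line(line):
    """Parse one data line with exactly one sample column."""
    chrom, pos, _id, ref, alt, qual, flt, info, fmt, sample = line.split("\t")
    genotype = dict(zip(fmt.split(":"), sample.split(":")))["GT"]
    return {"chrom": chrom, "pos": int(pos),
            "info": parse_info(info), "gt": genotype}


def is_het(record):
    """True when the genotype carries more than one distinct allele."""
    alleles = record["gt"].replace("|", "/").split("/")
    return len(set(alleles)) > 1


records = [parse_line(line) for line in vcf_lines]
deep = [r for r in records if int(r["info"]["DP"]) >= 10]   # depth metric
het = [r for r in deep if is_het(r)]                        # variant type
on_chr1 = [r for r in records
           if r["chrom"] == "chr1" and 50 <= r["pos"] <= 150]  # region
print([r["pos"] for r in het])
```

For large files the region subsetting would be better delegated to an
indexed external tool; the list comprehension is only there to show the
shape of the query.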

Hope this is helpful as a place to get started. We can provide
additional feedback once you have your proposal ready. Thanks again,
Brad

> Hi all,
> 
> I realize time is short, but I am still in the planning phase of my
> GSoC proposal! I wanted to take a moment to formally introduce myself
> to the dev list.
> 
> I am affiliated with Purdue University, located in Indiana, USA and
> best known for engineering (Neil Armstrong is a famous graduate). I
> hold a bachelor of arts in biology from Mount Holyoke College in
> Massachusetts. I have extensive wet lab experience with genetics; I'm
> currently working in a lab genotyping mice (the research is intestinal
> lipid metabolism). In August, I begin a PhD in interdisciplinary life
> science at Purdue, and I anticipate that my research will fall
> somewhere in the field of bioinformatics/computational biology. I hope
> to use biopython extensively!
> 
> In my spare time, other than programming, I enjoy ballroom dance,
> science fiction novels, board games, and sailing.
> 
> I've been programming for about 6 years and using python for 4; other
> languages with which I'm familiar include Perl/CGI, HTML/CSS, PHP, SQL
> (primarily MySQL and SQLite), and C++/C. I place a high value on
> object oriented design and execution.
> 
> I understand the basics of formal grammar and have some experience
> with lex/flex as well as PLY (python lex/yacc). My work so far with
> biopython has been on the CIF parsing module. One of my primary goals
> for the genomic variants project would be to implement as much
> polymorphism and abstraction as possible, for the benefit of both
> users and future developers.
> 
> I'm working on a proposal for the genomic variants project, and while
> I understand the basics of molecular biology and genetics, I lack
> firsthand experience with the type of workflow that would occur in the
> context of genomic variants. If anyone can supply a few examples, it
> would be greatly appreciated.
> 
> I hope to have a proposal draft ready for feedback by Monday.
> 
> Regards,
> 
> Lenna Peterson
> github.com/lennax
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev


From chapmanb at 50mail.com  Sun Apr  1 19:28:32 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 01 Apr 2012 15:28:32 -0400
Subject: [Biopython-dev] GSoC Student Applicant
In-Reply-To: <CADEGkF6hKe6jPNn7dsKL8S2FBt0Ae96ziReN--KDHrEwu-FfaA@mail.gmail.com>
References: <CADEGkF7D=QTJcChbPE71HRiLi0VXiVZap-sJA2+W38TGPziYpA@mail.gmail.com>
	<874ntgtca7.fsf@fastmail.fm>
	<CADEGkF7fT0ExxfOMQgA8EKWZ-DfqS=K3qAXUmewUxYYeXZO6tg@mail.gmail.com>
	<87r4wa6fxx.fsf@fastmail.fm>
	<CADEGkF6hKe6jPNn7dsKL8S2FBt0Ae96ziReN--KDHrEwu-FfaA@mail.gmail.com>
Message-ID: <87wr5ztfof.fsf@fastmail.fm>


Bow;

> Thank you for the comments and suggestions. I've added a little bit
> more details to my personal profile and put it up front. My project
> details have also been broken down into single weeks. And I've edited
> the commenting permission.

Thanks for the updates, this is coming along well. My most general
suggestion is to spend more time expanding the week-by-week
timeline. As an example, take this weekly goal:

* Write iterator and random-access parser for EMBOSS water

It would be great to see more specific plans for what exactly you will
deliver and implement during the week. Something like:

- Write iterator for EMBOSS water, expanding test suite to ensure
  produced AlignIO objects are compatible with previous BLAST and HMMER
  iterators.

- Expand index functionality to handle EMBOSS water format for random
  access. Test edge cases: initial records, final records, empty
  records.

- Document 'water' parsing with a use case emphasizing differences from
  BLAST and HMMER searching.

Peter probably has more specific thoughts on the actual content but it's
important to think through things in this manner. This will make it
easier to approach weeks during the summer since you'll already have
tasks broken down, and will also demonstrate you've thought about
potential problems and roadblocks and have solutions to overcome them.

> As for my other obligations, I didn't mean to give that impression. I
> added a little bit more detail about the project itself, but I'm not
> sure about the time that I should write. I estimate that at most, for
> each week day, I spend 8 hours doing my Master's project in my lab's
> campus. Since the project started, I usually use the remainder of the
> time (~6 hours/day) for my own personal programming projects. I plan
> to use the personal programming time slot for my GSoC instead, if
> accepted. Should I be this thorough in the proposal?

This is exactly my worry. You're proposing working two full time jobs
all summer long. Not to denigrate your work ethic, but 80 hour weeks are
hard and leave you no time for important things like having a life
outside of work. My suggestion would be to see if you can scale back
your Master's commitments for the summer if accepted into GSoC. This
would definitely improve your proposal since reviewers will worry about
the time commitment.

Hope this all helps,
Brad


From chapmanb at 50mail.com  Sun Apr  1 20:30:26 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 01 Apr 2012 16:30:26 -0400
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <4F74855B.9000603@med.nyu.edu>
References: <4F74855B.9000603@med.nyu.edu>
Message-ID: <87obrbtct9.fsf@fastmail.fm>


Andrew;
Thanks for putting this together. It looks great, is well integrated
with AlignIO and it's awesome to see a test suite.

I dug through the code and my small suggestions would be:

- Could you refactor some of the larger functions into separate smaller
  components? A couple of these spread over a ton of lines and it can be
  a bit difficult to follow the logic throughout:

  https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L172
  https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L399

  As a practical example, here you have a large block which checks the
  SQLite index matches the MAF file and everything looks okay:

  https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L199

  This would be clearer if factored into something like:

  if os.path.isfile(sqlite_file):
     try:
        self._record_count = self._verify_record_count(con)
     except ...

- Would you be able to put together a small example for the
  Cookbook or Tutorial documentation? This would be a great way to help
  others get started with the functionality and advertise it.
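
For the refactoring point, a helper along these lines is what I had in
mind. This is only a sketch: the table name meta_data, the key
record_count and the function signature are assumptions for
illustration, not taken from MafIO.py.

```python
import sqlite3


def verify_record_count(con, expected):
    """Return the stored record count, raising ValueError on a mismatch."""
    row = con.execute(
        "SELECT value FROM meta_data WHERE key = ?", ("record_count",)
    ).fetchone()
    if row is None:
        raise ValueError("index has no record_count entry")
    count = int(row[0])
    if count != expected:
        raise ValueError(
            "index lists %i records, file has %i" % (count, expected)
        )
    return count


# Tiny in-memory index to exercise the helper.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE meta_data (key TEXT, value TEXT)")
con.execute("INSERT INTO meta_data VALUES ('record_count', '3')")
count = verify_record_count(con, 3)
print(count)
```

With each check in its own small function, the calling code reads as a
short list of named steps and each step can be unit tested separately.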

Thanks again for this,
Brad

> Hi all,
> 
> I would like to start a discussion about what is needed to make the 
> AlignIO.MafIO parser and indexer ready for the next release. If anyone 
> is unfamiliar with MAF (Multiple Alignment Format), it is the file 
> format that eukaryote genome-to-genome multiple alignments produced by 
> multiz are stored in.
> 
> The exact specs are here:
>    http://genome.ucsc.edu/FAQ/FAQformat.html#format5
> 
> Some use cases are discussed in this paper, which implements (I believe) 
> most of the same functionality of the MafIO class in Galaxy:
>    http://www.ncbi.nlm.nih.gov/pubmed/21775304
> 
> The branch of my biopython fork that contains the class:
>    https://github.com/polyatail/biopython/tree/alignio-maf
> 
> The class is implemented as a reader/writer compatible with the AlignIO 
> API, but implements its own indexer (MafIO.MafIndex) based on 
> SeqIO.index_db(). At the time, this seemed like the best way to 
> implement this, as MAF is explicitly designed for genome-to-genome 
> alignments while other formats are not. If we can assume a MAF file 
> contains such an alignment, we can index it by genome coordinates and 
> allow random access to intervals.
> 
> This is especially useful since it is often desirable to retrieve the 
> spliced multiple alignment of a multi-exonic transcript, which can be 
> used to determine sequence conservation, construct a phylogenetic tree 
> for a particular gene, or pull out orthologs of a large number of genes 
> at once.
> 
> The code consists of the reader, writer, and indexer classes in 
> AlignIO/MafIO.py, test files in Tests/MAF, and unit tests specific to 
> the indexer in Tests/test_MafIO_index.py. I would really appreciate any 
> feedback and suggestions, and if anyone has an opportunity to use this 
> feature it would be great to get some feedback on its operation.
> 
> 
> Thanks!
> Andrew


From redmine at redmine.open-bio.org  Mon Apr  2 01:40:27 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 2 Apr 2012 01:40:27 +0000
Subject: [Biopython-dev] [Biopython - Feature #3336] (New) Make Phylo.draw
	more customizable
Message-ID: <redmine.issue-3336.20120402014027@redmine.open-bio.org>


Issue #3336 has been reported by Eric Talevich.

----------------------------------------
Feature #3336: Make Phylo.draw more customizable
https://redmine.open-bio.org/issues/3336

Author: Eric Talevich
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


On and off the mailing lists, I've received requests to make the plots rendered by Phylo.draw more customizable. For example:
http://lists.open-bio.org/pipermail/biopython/2012-March/007851.html

Since Phylo.draw is based on matplotlib/pyplot, it should be possible for essentially everything about the plot to be customizable by the user using pyplot's standard mechanisms -- e.g. adjust the font sizes with rcParams["font.size"].

Other requested features:

* Accept **kwargs in Phylo.draw, and pass it along to pyplot -- but where?
* Format the confidence/support values differently (currently everything is treated as a float), including or perhaps with the addition of arbitrary branch labels (e.g. estimated number of mutations on a branch)
* Return a mapping of clade objects to a tuple or dict of pyplot elements (LineCollection, PatchCollection, etc.)




From arklenna at gmail.com  Mon Apr  2 02:10:45 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Sun, 1 Apr 2012 22:10:45 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <87zkavtgcr.fsf@fastmail.fm>
References: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
	<87zkavtgcr.fsf@fastmail.fm>
Message-ID: <CAK610_5UdpnxrO0ejQ4JxwgbNEOPvg+Yjouz14PHnPK_uRp1xg@mail.gmail.com>

Hi Brad,

Thank you so much for your suggestions. My initial evaluation of the
strengths of existing software has led me to strongly agree with your
recommendation to focus on the usability of the API.

I submit this draft of my proposal to the dev list for feedback:

https://docs.google.com/document/d/116FDQLtNnYWnm0kojad4YmQrM3cjOO8D2Vr82aW6xyA/edit


Lenna


On Sun, Apr 1, 2012 at 3:13 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Lenna;
> Thanks for the introduction and glad to hear about your interest in the
> variant project. I'm looking forward to seeing your proposal.
>
> The workflow for the variant project involves a biologist querying a VCF
> or GVF file with variants from an experiment. They should be able to
> easily subset and filter by file components:
>
> - Variant type: Homozygous/Heterozygous variants
> - Metrics: depth, strand bias, allele frequency, etc.
> - Variants annotated in coding regions causing amino acid changes
>
> As well as rapid subsetting by chromosomal region.
>
> My suggestion would be to leverage external tools as much as possible to
> do file manipulation and focus on an API that lets users filter and
> extract information pre-contained in the INFO field.
>
> Hope this is helpful as a place to get started. We can provide
> additional feedback once you have your proposal ready. Thanks again,
> Brad
>
>> Hi all,
>>
>> I realize time is short, but I am still in the planning phase of my
>> GSoC proposal! I wanted to take a moment to formally introduce myself
>> to the dev list.
>>
>> I am affiliated with Purdue University, located in Indiana, USA and
>> best known for engineering (Neil Armstrong is a famous graduate). I
>> hold a bachelor of arts in biology from Mount Holyoke College in
>> Massachusetts. I have extensive wet lab experience with genetics; I'm
>> currently working in a lab genotyping mice (the research is intestinal
>> lipid metabolism). In August, I begin a PhD in interdisciplinary life
>> science at Purdue, and I anticipate that my research will fall
>> somewhere in the field of bioinformatics/computational biology. I hope
>> to use biopython extensively!
>>
>> In my spare time, other than programming, I enjoy ballroom dance,
>> science fiction novels, board games, and sailing.
>>
>> I've been programming for about 6 years and using python for 4; other
>> languages with which I'm familiar include Perl/CGI, HTML/CSS, PHP, SQL
>> (primarily MySQL and SQLite), and C++/C. I place a high value on
>> object oriented design and execution.
>>
>> I understand the basics of formal grammar and have some experience
>> with lex/flex as well as PLY (python lex/yacc). My work so far with
>> biopython has been on the CIF parsing module. One of my primary goals
>> for the genomic variants project would be to implement as much
>> polymorphism and abstraction as possible, for the benefit of both
>> users and future developers.
>>
>> I'm working on a proposal for the genomic variants project, and while
>> I understand the basics of molecular biology and genetics, I lack
>> firsthand experience with the type of workflow that would occur in the
>> context of genomic variants. If anyone can supply a few examples, it
>> would be greatly appreciated.
>>
>> I hope to have a proposal draft ready for feedback by Monday.
>>
>> Regards,
>>
>> Lenna Peterson
>> github.com/lennax


From p.j.a.cock at googlemail.com  Mon Apr  2 08:26:16 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 2 Apr 2012 09:26:16 +0100
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <87obrbtct9.fsf@fastmail.fm>
References: <4F74855B.9000603@med.nyu.edu>
	<87obrbtct9.fsf@fastmail.fm>
Message-ID: <CAKVJ-_51Oku5hm+VTLccA2h2f=saz-4g79kVRdFTryNtUFK5SA@mail.gmail.com>

On Sun, Apr 1, 2012 at 9:30 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Andrew;
> Thanks for putting this together. It looks great, is well integrated
> with AlignIO and it's awesome to see a test suite.

Indeed, +1 on tests :)

Apologies for not replying earlier - this was flagged in my email
client all of last week.

> I dug through the code and my small suggestions would be:
>
> - Could you refactor some of the larger functions into separate smaller
>   components? A couple of these spread over a ton of lines and it can be
>   a bit difficult to follow the logic throughout:
>
> ...
>
>   As a practical example, here you have a large block which checks the
>   SQLite index matches the MAF file and everything looks okay:

Maybe I should do the same with the SeqIO SQLite code.

> - Would you be able to put together a small example for the
>   Cookbook or Tutorial documentation? This would be a great way to help
>   others get started with the functionality and advertise it.

He already has - very organised :)
http://biopython.org/wiki/Multiple_Alignment_Format

Is there any more documentation on reverse complemented sequences
and how they are handled, both in the simple iterators and, more
importantly, when indexing? What I'm getting at here is the non-typical
treatment of start and end being relative to the reverse
complemented sequence for minus strand alignments. Most
tools/formats always count from the first base on the
forward strand.
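
To spell out the conversion I'm thinking of, here is a small sketch.
The function name maf_to_forward is made up for this email; it assumes
zero-based starts and half-open intervals, with src_size being the full
length of the source sequence as given on the MAF "s" line.

```python
def maf_to_forward(start, size, strand, src_size):
    """Return (start, end) on the forward strand, zero-based, end-exclusive."""
    if strand == "+":
        return start, start + size
    if strand == "-":
        # Minus-strand starts count from the start of the
        # reverse-complemented sequence, so flip them around src_size.
        return src_size - (start + size), src_size - start
    raise ValueError("strand must be '+' or '-'")


print(maf_to_forward(10, 5, "+", 100))
print(maf_to_forward(10, 5, "-", 100))
```

An index keyed on forward-strand coordinates would need to apply this
flip for any species whose row is on the minus strand.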

Peter


From andrew.sczesnak at med.nyu.edu  Tue Apr  3 00:15:18 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Mon, 02 Apr 2012 20:15:18 -0400
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <87obrbtct9.fsf@fastmail.fm>
References: <4F74855B.9000603@med.nyu.edu> <87obrbtct9.fsf@fastmail.fm>
Message-ID: <4F7A4116.5000602@med.nyu.edu>

Hi Brad,

Thank you for the feedback. I've tried to work on some of your 
suggestions and will continue doing so.

> - Could you refactor some of the larger functions into separate smaller
>    components? A couple of these spread over a ton of lines and it can be
>    a bit difficult to follow the logic throughout:

Definitely--I see what you mean. I split __init__ into a couple 
functions. I'm still worried about the 100 lines of get_spliced(). It's 
big mostly because I overdid it on the comments, but hopefully that 
helps explain the logic enough that someone else could work on it 
without pulling their hair out.

> - Would you be able to put together a small example for the
>    Cookbook or Tutorial documentation? This would be a great way to help
>    others get started with the functionality and advertise it.

Absolutely. I have a few more ideas for cool demos that integrate with 
other parts of Biopython. What's the best place to put draft text for 
the tutorial?


Thanks,
Andrew


From andrew.sczesnak at med.nyu.edu  Tue Apr  3 00:33:51 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Mon, 02 Apr 2012 20:33:51 -0400
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <CAKVJ-_51Oku5hm+VTLccA2h2f=saz-4g79kVRdFTryNtUFK5SA@mail.gmail.com>
References: <4F74855B.9000603@med.nyu.edu>	<87obrbtct9.fsf@fastmail.fm>
	<CAKVJ-_51Oku5hm+VTLccA2h2f=saz-4g79kVRdFTryNtUFK5SA@mail.gmail.com>
Message-ID: <4F7A456F.3020306@med.nyu.edu>

Hi Peter,

Thank you for the feedback. I will try to make sure this code is well 
tested before the next release.

> Is there any more documentation on reverse complemented sequences
> and how they are handled, both in the simple iterators and, more
> importantly, when indexing? What I'm getting at here is the non-typical
> treatment of start and end being relative to the reverse
> complemented sequence for minus strand alignments. Most
> tools/formats always count from the first base on the
> forward strand.

I'm not sure I'm understanding you, but I hope I am. In theory it seems 
like strandedness would be an issue, however in practice the reference 
species in a multiz MAF file is always the plus strand. To make sure the 
user isn't trying to pass a MAF file containing blocks with mixed 
strands to MafIndex.get_spliced(), there's a check in there to make sure 
all strands for the reference species are the same. We also assume that 
coordinates specified in a block are always in the ascending direction 
(i.e. they are given as 'start' and 'size' and we assume the coordinates 
are [start, start + size]).
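
The coordinate convention and strand check described above can be
sketched roughly as follows. This is only an illustration, not the code
in MafIndex.get_spliced(): the function names and the (start, size,
strand) tuples are made up for the example, and the interval is treated
as half-open (Python slice style), which is an assumption.

```python
def block_interval(start, size):
    """Return (start, end) for a block given as 'start' and 'size'.

    Assumption: the interval is half-open, i.e. end = start + size is
    not itself part of the block.
    """
    return start, start + size


def check_reference_strands(blocks):
    """Return the single strand shared by all reference rows.

    `blocks` is a hypothetical list of (start, size, strand) tuples, one
    per alignment block. Raises ValueError if strands are mixed.
    """
    strands = {strand for _, _, strand in blocks}
    if len(strands) > 1:
        raise ValueError("mixed strands for reference species: %s" % sorted(strands))
    return strands.pop() if strands else None


blocks = [(100, 20, "+"), (150, 5, "+")]
strand = check_reference_strands(blocks)
intervals = [block_interval(start, size) for start, size, _ in blocks]
```

Doing the strand check up front means the splicing logic only ever has
to deal with ascending coordinates on one strand.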

There could still be an issue if the best alignment for a particular 
species swaps strands between alignment blocks and/or exons of a 
transcript. It might be safe to say, though, that the user is interested 
in the best alignment however it occurs, and not necessarily in strand 
consistency.

WRT MultipleSeqAlignment objects produced by get_spliced(), all 
annotation properties are lost upon slicing, so it is up to the user to 
keep track of what's what. I do remember we had talked about a way to 
maintain these annotations, even after slicing. Any thoughts?


Thanks,
Andrew


From p.j.a.cock at googlemail.com  Tue Apr  3 09:03:55 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 3 Apr 2012 10:03:55 +0100
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <CAKVJ-_4WZe3ETSaeH=ym7YQbVhEWMfJBo6=G=Tg-S-qJxQN80g@mail.gmail.com>
References: <CAKVJ-_4WZe3ETSaeH=ym7YQbVhEWMfJBo6=G=Tg-S-qJxQN80g@mail.gmail.com>
Message-ID: <CAKVJ-_5JLAwymdA-XgfucAA5hhr7yVqjh5De7Kwr0s4hcN+MRw@mail.gmail.com>

On Wed, Mar 21, 2012 at 3:27 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hello all,
>
> I'm pleased to see that the GSoC SearchIO project idea I put up
> has sparked some interest:
>
> http://biopython.org/wiki/Google_Summer_of_Code
>
> ...

Just a reminder that the GSoC application deadline is this Friday,
6 April. The application website has been open since 26 March,
so I would encourage you to upload your current proposal soon
in case there are server load problems on the last day (you will
still be able to revise the proposal after uploading it).
http://www.google-melange.com/gsoc/homepage/google/gsoc2012

Also, in particular for those of you interested in the SearchIO
project which I would mentor, I will be away Thursday 5 and
Friday 6 April, so you will not be able to ask me for any last
minute feedback.

Good luck,

Peter


From chapmanb at 50mail.com  Tue Apr  3 13:06:36 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 03 Apr 2012 09:06:36 -0400
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <4F7A4116.5000602@med.nyu.edu>
References: <4F74855B.9000603@med.nyu.edu> <87obrbtct9.fsf@fastmail.fm>
	<4F7A4116.5000602@med.nyu.edu>
Message-ID: <87hax1hsmb.fsf@fastmail.fm>


Andrew;

> Definitely--I see what you mean. I split __init__ into a couple 
> functions. I'm still worried about the 100 lines of get_spliced(). It's 
> big mostly because I overdid it on the comments, but hopefully that 
> helps explain the logic enough that someone else could work on it 
> without pulling their hair out.

Definitely agreed. It's well-commented which makes it much easier for
others to dig in. Thanks for taking a look at the refactoring.

> Absolutely. I have a few more ideas for cool demos that integrate with 
> other parts of Biopython. What's the best place to put draft text for 
> the tutorial?

Apologies that I'd totally missed your cookbook entry. That looks great,
but more documentation is always better. If you are okay with LaTeX, the
Tutorial is in Doc/Tutorial.tex so you can edit directly. The wiki is
also a good place for docs if you prefer to go that way.

Thanks again for all the work on this. Looking forward to having it in,
Brad


From chapmanb at 50mail.com  Tue Apr  3 14:53:33 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 03 Apr 2012 10:53:33 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <CAK610_5UdpnxrO0ejQ4JxwgbNEOPvg+Yjouz14PHnPK_uRp1xg@mail.gmail.com>
References: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
	<87zkavtgcr.fsf@fastmail.fm>
	<CAK610_5UdpnxrO0ejQ4JxwgbNEOPvg+Yjouz14PHnPK_uRp1xg@mail.gmail.com>
Message-ID: <87r4w4hno2.fsf@fastmail.fm>


Lenna;
Thanks for getting this together, that's a great start. I left some
specific comments but my general suggestion is to get more detailed
about the code specifics. During the summer, you use the weekly timeline
as a todo list, so having lots of detail makes the process so much
easier. Instead of seeing a general item like: "Implement X" you want
"Implement X by extending API from last week to support get_Y using
sqlite3 index table. Test cases A, B, C and D to avoid...".

Having this kind of checklist todo helps make it easy to get started
each week and ensure everything is on track. The additional benefit for
selection is that it helps convince reviewers you've thought about the
technical details and foreseen any potential problems.

Hope this helps,
Brad

> Hi Brad,
> 
> Thank you so much for your suggestions. My initial evaluation of the
> strengths of existing software has led me to strongly agree with your
> recommendation to focus on the usability of the API.
> 
> I submit this draft of my proposal to the dev list for feedback:
> 
> https://docs.google.com/document/d/116FDQLtNnYWnm0kojad4YmQrM3cjOO8D2Vr82aW6xyA/edit
> 
> 
> Lenna
> 
> 
> On Sun, Apr 1, 2012 at 3:13 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> >
> > Lenna;
> > Thanks for the introduction and glad to hear about your interest in the
> > variant project. I'm looking forward to seeing your proposal.
> >
> > The workflow for the variant project involves a biologist querying a VCF
> > or GVF file with variants from an experiment. They should be able to
> > easily subset and filter by file components:
> >
> > - Variant type: Homozygous/Heterozygous variants
> > - Metrics: depth, strand bias, allele frequency..
> > - Variants annotated in coding regions causing amino acid changes
> >
> > As well as rapid subsetting by chromosomal region.
> >
> > My suggestion would be to leverage external tools as much as possible to
> > do file manipulation and focus on an API that lets users filter and
> > extract information pre-contained in the INFO file.
> >
> > Hope this is helpful as a place to get started. We can provide
> > additional feedback once you have your proposal ready. Thanks again,
> > Brad
> >
> >> Hi all,
> >>
> >> I realize time is short, but I am still in the planning phase of my
> >> GSoC proposal! I wanted to take a moment to formally introduce myself
> >> to the dev list.
> >>
> >> I am affiliated with Purdue University, located in Indiana, USA and
> >> best known for engineering (Neil Armstrong is a famous graduate). I
> >> hold a bachelor of arts in biology from Mount Holyoke College in
> >> Massachusetts. I have extensive wet lab experience with genetics; I'm
> >> currently working in a lab genotyping mice (the research is intestinal
> >> lipid metabolism). In August, I begin a PhD in interdisciplinary life
> >> science at Purdue, and I anticipate that my research will fall
> >> somewhere in the field of bioinformatics/computational biology. I hope
> >> to use biopython extensively!
> >>
> >> In my spare time, other than programming, I enjoy ballroom dance,
> >> science fiction novels, board games, and sailing.
> >>
> >> I've been programming for about 6 years and using python for 4; other
> >> languages with which I'm familiar include Perl/CGI, HTML/CSS, PHP, SQL
> >> (primarily MySQL and SQLite), and C++/C. I place a high value on
> >> object oriented design and execution.
> >>
> >> I understand the basics of formal grammar and have some experience
> >> with lex/flex as well as PLY (python lex/yacc). My work so far with
> >> biopython has been on the CIF parsing module. One of my primary goals
> >> for the genomic variants project would be to implement as much
> >> polymorphism and abstraction as possible, for the benefit of both
> >> users and future developers.
> >>
> >> I'm working on a proposal for the genomic variants project, and while
> >> I understand the basics of molecular biology and genetics, I lack
> >> firsthand experience with the type of workflow that would occur in the
> >> context of genomic variants. If anyone can supply a few examples, it
> >> would be greatly appreciated.
> >>
> >> I hope to have a proposal draft ready for feedback by Monday.
> >>
> >> Regards,
> >>
> >> Lenna Peterson
> >> github.com/lennax
> >> _______________________________________________
> >> Biopython-dev mailing list
> >> Biopython-dev at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biopython-dev


From w.arindrarto at gmail.com  Tue Apr  3 15:22:04 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 3 Apr 2012 17:22:04 +0200
Subject: [Biopython-dev] GSoC Student Applicant
In-Reply-To: <87wr5ztfof.fsf@fastmail.fm>
References: <CADEGkF7D=QTJcChbPE71HRiLi0VXiVZap-sJA2+W38TGPziYpA@mail.gmail.com>
	<874ntgtca7.fsf@fastmail.fm>
	<CADEGkF7fT0ExxfOMQgA8EKWZ-DfqS=K3qAXUmewUxYYeXZO6tg@mail.gmail.com>
	<87r4wa6fxx.fsf@fastmail.fm>
	<CADEGkF6hKe6jPNn7dsKL8S2FBt0Ae96ziReN--KDHrEwu-FfaA@mail.gmail.com>
	<87wr5ztfof.fsf@fastmail.fm>
Message-ID: <CADEGkF5MetS62j2Vf4ReiKMKo_gt=S94jU7huNraVnWFwERRXg@mail.gmail.com>

On Sun, Apr 1, 2012 at 21:28, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Bow;
>
>> Thank you for the comments and suggestions. I've added a little bit
>> more details to my personal profile and put it up front. My project
>> details have also been broken down into single weeks. And I've edited
>> the commenting permission.
>
> Thanks for the updates, this is coming along well. My most general
> suggestion is to spend more time expanding the week-by-week
> timeline. As an example, take this weekly goal:
>
> * Write iterator and random-access parser for EMBOSS water
>
> It would be great to see more specific plans for what exactly you
> deliver and implement during the week. Something like:
>
> - Write iterator for EMBOSS water, expanding test suite to ensure
>   produced AlignIO objects are compatible with previous BLAST and HMMER
>   iterators.
>
> - Expand index functionality to handle EMBOSS water format for random
>   access. Test edge cases: initial records, final records, empty
>   records.
>
> - Document 'water' parsing with a use case emphasizing differences from
>   BLAST and HMMER searching.
>
> Peter probably has more specific thoughts on the actual content but it's
> important to think through things in this manner. This will make it
> easier to approach weeks during the summer since you'll already have
> tasks broken down, and will also demonstrate you've thought about
> potential problems and roadblocks and have solutions to overcome them.

Thanks for the further feedback, Brad. I am in the process of adding more
detailed descriptions of my weekly tasks.

>> As for my other obligations, I didn't mean to give that impression. I
>> added a little bit more detail about the project itself, but I'm not
>> sure about the time that I should write. I estimate that at most, for
>> each week day, I spend 8 hours doing my Master's project in my lab's
>> campus. Since the project started, I usually use the remainder of the
>> time (~6 hours/day) for my own personal programming projects. I plan
>> to use the personal programming time slot for my GSoC instead, if
>> accepted. Should I be this thorough in the proposal?
>
> This is exactly my worry. You're proposing working two full time jobs
> all summer long. Not to denigrate your work ethic, but 80 hour weeks are
> hard and leave you no time for important things like having a life
> outside of work. My suggestion would be to see if you can scale back
> your Master's commitments for the summer if accepted into GSoC. This
> would definitely improve your proposal since reviewers will worry about
> the time commitment.
>
> Hope this all helps,
> Brad

Ah, that's ok, I understand your concern :). I talked with my
supervisor yesterday regarding this and he understood that I can scale
back the time spent for my current project if accepted. I've revised
this detail as well in the proposal.

Thanks again,
Bow


From p.j.a.cock at googlemail.com  Tue Apr  3 15:32:08 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 3 Apr 2012 16:32:08 +0100
Subject: [Biopython-dev] GSoC Student Applicant
In-Reply-To: <CADEGkF5MetS62j2Vf4ReiKMKo_gt=S94jU7huNraVnWFwERRXg@mail.gmail.com>
References: <CADEGkF7D=QTJcChbPE71HRiLi0VXiVZap-sJA2+W38TGPziYpA@mail.gmail.com>
	<874ntgtca7.fsf@fastmail.fm>
	<CADEGkF7fT0ExxfOMQgA8EKWZ-DfqS=K3qAXUmewUxYYeXZO6tg@mail.gmail.com>
	<87r4wa6fxx.fsf@fastmail.fm>
	<CADEGkF6hKe6jPNn7dsKL8S2FBt0Ae96ziReN--KDHrEwu-FfaA@mail.gmail.com>
	<87wr5ztfof.fsf@fastmail.fm>
	<CADEGkF5MetS62j2Vf4ReiKMKo_gt=S94jU7huNraVnWFwERRXg@mail.gmail.com>
Message-ID: <CAKVJ-_7q=7bLoMS7S_gx=uQbPLx2dTWpRJzPejM_4zrV6Wetsg@mail.gmail.com>

On Tue, Apr 3, 2012 at 4:22 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> On Sun, Apr 1, 2012 at 21:28, Brad Chapman <chapmanb at 50mail.com> wrote:
>>
>> This is exactly my worry. You're proposing working two full time jobs
>> all summer long. Not to denigrate your work ethic, but 80 hour weeks are
>> hard and leave you no time for important things like having a life
>> outside of work. My suggestion would be to see if you can scale back
>> your Master's commitments for the summer if accepted into GSoC. This
>> would definitely improve your proposal since reviewers will worry about
>> the time commitment.
>>
>> Hope this all helps,
>> Brad
>
> Ah, that's ok, I understand your concern :). I talked with my
> supervisor yesterday regarding this and he understood that I can scale
> back the time spent for my current project if accepted. I've revised
> this detail as well in the proposal.
>
> Thanks again,
> Bow

Excellent - I'm pleased your supervisor is being supportive. That
should help address this concern :)

Peter


From mjldehoon at yahoo.com  Tue Apr  3 18:27:26 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 3 Apr 2012 11:27:26 -0700 (PDT)
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <CAKVJ-_5JLAwymdA-XgfucAA5hhr7yVqjh5De7Kwr0s4hcN+MRw@mail.gmail.com>
Message-ID: <1333477646.41091.YahooMailClassic@web161201.mail.bf1.yahoo.com>

While I think that the SearchIO module is a good idea, you may want to consider choosing a different name for this module. For Bio.Seq/Bio.SeqIO and Bio.Align/Bio.AlignIO, roughly speaking the class definitions are in the former and the parser is in the latter module. I don't quite understand why these two are separated into distinct modules, as to me conceptually the two belong together. Bio.SearchIO in my understanding will combine both the parsers and the class definitions, which is a good thing, but then I would prefer a name without "IO" in it.

Best,
-Michiel.


--- On Tue, 4/3/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> From: Peter Cock <p.j.a.cock at googlemail.com>
> Subject: Re: [Biopython-dev] GSoC SearchIO project
> To: "Biopython-Dev Mailing List" <biopython-dev at lists.open-bio.org>
> Date: Tuesday, April 3, 2012, 5:03 AM
> On Wed, Mar 21, 2012 at 3:27 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > Hello all,
> >
> > I'm pleased to see that the GSoC SearchIO project idea I put up
> > has sparked some interest:
> >
> > http://biopython.org/wiki/Google_Summer_of_Code
> >
> > ...
>
> Just a reminder that the GSoC application deadline is this Friday,
> 6 April. The application website has been open since 26 March,
> so I would encourage you to upload your current proposal soon
> in case there are server load problems on the last day (you will
> still be able to revise the proposal after uploading it).
> http://www.google-melange.com/gsoc/homepage/google/gsoc2012
>
> Also, in particular for those of you interested in the SearchIO
> project which I would mentor, I will be away Thursday 5 and
> Friday 6 April, so you will not be able to ask me for any last
> minute feedback.
>
> Good luck,
>
> Peter


From p.j.a.cock at googlemail.com  Tue Apr  3 19:44:48 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 3 Apr 2012 20:44:48 +0100
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <1333477646.41091.YahooMailClassic@web161201.mail.bf1.yahoo.com>
References: <CAKVJ-_5JLAwymdA-XgfucAA5hhr7yVqjh5De7Kwr0s4hcN+MRw@mail.gmail.com>
	<1333477646.41091.YahooMailClassic@web161201.mail.bf1.yahoo.com>
Message-ID: <CAKVJ-_6mtOmmHapwmLT+8yc-8ADKWfWsp1yWN_HavZ59KeR71Q@mail.gmail.com>

On Tue, Apr 3, 2012 at 7:27 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> While I think that the SearchIO module is a good idea, you
> may want to consider choosing a different name for this
> module. For Bio.Seq/Bio.SeqIO and Bio.Align/Bio.AlignIO,
> roughly speaking the class definitions are in the former and
> the parser is in the latter module. I don't quite understand
> why these two are separated into distinct modules, as to
> me conceptually the two belong together. Bio.SearchIO in
> my understanding will combine both the parsers and the
> class definitions, which is a good thing, but then I would
> prefer a name without "IO" in it.
>
> Best,
> -Michiel.

Yes, I was thinking of having both the parsers and the new
objects under the same module namespace.

The reason for using SearchIO (despite not being PEP8
compatible - something I regret in the naming of SeqIO
and the pattern it set) is to match SeqIO and AlignIO and
BioPerl. Anyone familiar with BioPerl will immediately see
what it is for - and some of the student applicants have
already used BioPerl's SearchIO. Personally I find this
quite a compelling argument.

That said, the name SearchIO isn't the clearest in the
world for a newcomer - however I haven't come up
with anything significantly better myself. Perhaps there
is a better name out there, which would justify breaking
the pattern? I've considered pairwise and palign, but
neither feels right.

Given a clean slate (Biopython 2?), then yes, I would
agree with consolidating Bio.Align and Bio.AlignIO as
one namespace, probably "align" (lower case). The
situation with Bio.Seq, Bio.SeqRecord and Bio.SeqIO
isn't quite so simple - perhaps "seq" (lower case)?
Then (in the absence of any other ideas), SearchIO
would become "search" (lower case).

Peter


From redmine at redmine.open-bio.org  Tue Apr  3 21:13:13 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 3 Apr 2012 21:13:13 +0000
Subject: [Biopython-dev] [Biopython - Bug #3337] (New) 'Bio.trie.trie' is
	not picklable
Message-ID: <redmine.issue-3337.20120403211313@redmine.open-bio.org>


Issue #3337 has been reported by Sergei Lebedev.

----------------------------------------
Bug #3337: 'Bio.trie.trie' is not picklable
https://redmine.open-bio.org/issues/3337

Author: Sergei Lebedev
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


Is there any reason for this, or has nobody simply had the need (or time) to implement the pickle interface?
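
For reference, a minimal sketch of what a pickle interface could look like. This uses a pure-Python stand-in (ToyTrie, a made-up class), not the C-backed Bio.trie.trie, so it only illustrates the general __reduce__/__setstate__ pattern an extension type might follow, not how the real module would implement it.

```python
import pickle


class ToyTrie(object):
    """Hypothetical pure-Python stand-in for a trie; NOT Bio.trie.trie."""

    def __init__(self):
        self._data = {}

    def __setitem__(self, key, value):
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def keys(self):
        return sorted(self._data)

    def __reduce__(self):
        # Tell pickle: call ToyTrie() with no arguments, then hand the
        # plain dict below to __setstate__ on the new object.
        return (self.__class__, (), dict(self._data))

    def __setstate__(self, state):
        # Rebuild the contents from the plain dict saved by __reduce__.
        self._data = dict(state)


original = ToyTrie()
original["ACGT"] = 1
original["ACGA"] = 2
restored = pickle.loads(pickle.dumps(original))
```

The point of the pattern is that the object exports its contents as ordinary picklable Python data and rebuilds itself from that, rather than trying to serialize internal pointers.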




From MatatTHC at gmx.de  Wed Apr  4 08:46:47 2012
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Wed, 4 Apr 2012 10:46:47 +0200
Subject: [Biopython-dev] SeqIO circular
In-Reply-To: <CAKVJ-_7MpLRCModFfMdRPcVDjk42nVCJ--OwNBnAJv3wNcns_A@mail.gmail.com>
References: <CALNFT0jq=VTwSDv-4x7ZrHoQRLajCUHY8NGPMw9cDuGnwwNiuw@mail.gmail.com>
	<CAKVJ-_7MpLRCModFfMdRPcVDjk42nVCJ--OwNBnAJv3wNcns_A@mail.gmail.com>
Message-ID: <CALNFT0jTxFSbqn+f3hS-KZ2Z09xsgoKPFSow1BO3PdDGrJ7hag@mail.gmail.com>

Hi,

Is there any news on this? May I help somehow? I have to admit, though,
that I barely speak Perl and have no experience with BioPerl. If
someone tells me where to look I might still try it.

Matthias

2012/3/29 Peter Cock <p.j.a.cock at googlemail.com>:
> On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt <MatatTHC at gmx.de> wrote:
>> Hi,
>>
>> Is it possible to get the property if a genome is circular / linear
>> from SeqIO applied to genbank files? I could not find it.
>>
>> There is also a related bugreport:
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2578
>>
>> I used the old parser before and switched to SeqIO which I really like
>> for the possibilities to parse different formats... but I really need
>> the information.
>
> Does anyone happen to have a BioPerl + BioSQL setup installed
> and working? IIRC checking that to make sure however we
> store the circular was compatible was the only real hurdle.
>
> Peter


From arklenna at gmail.com  Thu Apr  5 00:04:30 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Wed, 4 Apr 2012 20:04:30 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <87r4w4hno2.fsf@fastmail.fm>
References: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
	<87zkavtgcr.fsf@fastmail.fm>
	<CAK610_5UdpnxrO0ejQ4JxwgbNEOPvg+Yjouz14PHnPK_uRp1xg@mail.gmail.com>
	<87r4w4hno2.fsf@fastmail.fm>
Message-ID: <CAK610_76xH9q2TcyP0CdRjSZSM9aokiiWkkX8r1uzzCFscxPcA@mail.gmail.com>

On Tue, Apr 3, 2012 at 10:53 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Lenna;
> Thanks for getting this together, that's a great start. I left some
> specific comments but my general suggestion is to get more detailed
> about the code specifics. During the summer, you use the weekly timeline
> as a todo list, so having lots of detail makes the process so much
> easier. Instead of seeing a general item like: "Implement X" you want
> "Implement X by extending API from last week to support get_Y using
> sqlite3 index table. Test cases A, B, C and D to avoid...".
>
> Having this kind of checklist todo helps make it easy to get started
> each week and ensure everything is on track. The additional benefit for
> selection is that it helps convince reviewers you've thought about the
> technical details and foreseen any potential problems.
>
> Hope this helps,
> Brad
>

Hi all,

I'm linking to a revision of my GSoC proposal:

https://docs.google.com/document/d/1txW9bwMYC6avlJxqs7x-mB-M09gxy2Uc6XqhR04y6ac/edit

Thank you to everyone for your feedback.


Peter,

I didn't realize Biopython has never been tested on IronPython. As I
have no familiarity with .NET or Windows, I'll have to rescind my
offer to test it. Sorry to get your hopes up!


Reece,

I've revised the prose sections and almost completely rewritten the
timeline. This version provides more information about my background,
a more detailed description of the overall project, and more specific
goals.


Brad,

I've tried to go into as much detail as my knowledge of VCF and GVF
structure allows. I laid out a more specific design for both the
backend and frontend representations of the data. I've revised the unit
tests to be more specific and less dependent on interaction with other
modules and I've tried to anticipate some cases that may produce
unexpected behavior. I also highlighted specific places where the
design should be generalizable.


James,

I hope my revised project description is more focused. Regarding CNV
etc., I did not mean to specifically exclude them by mentioning SNPs,
and I've reworded that paragraph to be more general. I get the
impression that CNV and other structural variants are considerably
more complex to represent and manipulate. I'd be more than happy to
read more about breakpoint theory etc. and to prototype any specific
workflows you might suggest.


Lenna


From eric.talevich at gmail.com  Thu Apr  5 02:53:10 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 4 Apr 2012 22:53:10 -0400
Subject: [Biopython-dev] Enhancements to Phylo.draw; pyplot best practices
Message-ID: <CAMC681nyVV6mNNSgT1zZeek+NwE3UwxprRArgGNFMeX-b3yPpA@mail.gmail.com>

Hi all,

I'm considering some enhancements to the Phylo.draw function to make it
more customizable for power users. Since the function is based on
matplotlib/pylab/pyplot, it's possible for quite a bit to be left to the
user; however, I'm not fully versed in what pyplot is capable of.

Relevant feature request in Redmine:
https://redmine.open-bio.org/issues/3336

Ideas:

1. Make the draw function return a mapping of clades to a collection of
pyplot graphical elements -- the objects emitted by pyplot during each step
of rendering the plot. Each clade in the tree is mapped to a horizontal
line, a vertical line, a text label (taxon name, normally), and another
text label for the branch (confidence/support, normally). The user can then
set the attributes of these objects as they wish, minimizing the need for
further extensions to Phylo.draw.

Example:
{<Bio.Phylo.PhyloXML.Clade>: {
        "hline": <matplotlib.collections.LineCollection>,
        "vline": <matplotlib.collections.LineCollection>,
        "taxon_label": <matplotlib.text.Text>,
        "branch_label": <matplotlib.text.Text> },
 ...
}

If the user needs access to the figure or axis object as well, it's already
easy enough to create these beforehand and pass the 'axis' object to
Phylo.draw.


2. Add an argument 'branch_labels' to Phylo.draw. This will accept either
(a) a dict which maps the tree's Clade objects to string labels, or (b) a
function which accepts a Clade object and returns a string. Default: a
function that formats the clade's 'confidence' or 'confidences' attribute,
matching the current behavior.

Examples:
>>> draw(mytree, branch_labels={mytree.root: "Root", ...})
>>> draw(mytree, branch_labels=lambda clade: "%d" % clade.confidence)
>>> draw(mytree, branch_labels=lambda clade: clade.taxonomy.rank)
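
A rough sketch of how the dict-or-function argument in idea 2 could be
normalized inside draw. Everything here is hypothetical: make_label_func
and FakeClade are made-up names for illustration, and the default only
looks at a 'confidence' attribute rather than reproducing the current
formatting.

```python
def make_label_func(branch_labels=None):
    """Return a function mapping a clade to its branch label string.

    Accepts a dict (clade -> label), a callable (clade -> label), or
    None for a simple default based on the 'confidence' attribute.
    """
    if branch_labels is None:
        def default(clade):
            conf = getattr(clade, "confidence", None)
            return "" if conf is None else str(conf)
        return default
    if isinstance(branch_labels, dict):
        # Clades missing from the dict simply get no label.
        return lambda clade: branch_labels.get(clade, "")
    if callable(branch_labels):
        return branch_labels
    raise TypeError("branch_labels must be a dict, a callable or None")


class FakeClade(object):
    """Hypothetical stand-in for a Bio.Phylo clade object."""

    def __init__(self, confidence=None):
        self.confidence = confidence


root = FakeClade()
child = FakeClade(confidence=95)
from_dict = make_label_func({root: "Root"})
from_func = make_label_func(lambda clade: "%d" % clade.confidence)
default_labels = make_label_func()
```

Converting both forms to a single callable up front would let the
rendering loop call one function per clade without caring which form
the user passed in.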


3. Accept **kwargs in Phylo.draw; pass it right along to pyplot at some
point.

Question: What basic pyplot function accepts **kwargs? pyplot.figure and
pyplot.set_subplot don't seem appropriate. An alternative is to use
pyplot.rcParams, either leaving it all to the user or treating the **kwargs
keys as the corresponding entries in rcParams. Syntax gets a little tricky.

(Not a top priority for me, actually, since rcParams works.)


Thoughts? All clear?

Thanks,
Eric


From chapmanb at 50mail.com  Thu Apr  5 10:47:09 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 05 Apr 2012 06:47:09 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <CAK610_76xH9q2TcyP0CdRjSZSM9aokiiWkkX8r1uzzCFscxPcA@mail.gmail.com>
References: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
	<87zkavtgcr.fsf@fastmail.fm>
	<CAK610_5UdpnxrO0ejQ4JxwgbNEOPvg+Yjouz14PHnPK_uRp1xg@mail.gmail.com>
	<87r4w4hno2.fsf@fastmail.fm>
	<CAK610_76xH9q2TcyP0CdRjSZSM9aokiiWkkX8r1uzzCFscxPcA@mail.gmail.com>
Message-ID: <871uo2cv6a.fsf@fastmail.fm>


Lenna;

> I'm linking to a revision of my GSoC proposal:
> 
> https://docs.google.com/document/d/1txW9bwMYC6avlJxqs7x-mB-M09gxy2Uc6XqhR04y6ac/edit
> 
> Thank you to everyone for your feedback.

This is coming along great, thanks for all the work on it. I've added a
couple of specific suggestions about iterative parsing, which PyVCF
does, and using external tools to make the coding region evaluation work
easier.
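
To illustrate what is meant by iterative parsing, here is a minimal
sketch using a plain generator. It is not PyVCF's API; the function
names are made up, and it only handles the eight fixed VCF columns
(no genotype columns, no type conversion of INFO values).

```python
import io


def parse_info(text):
    """Split an INFO string such as 'DP=14;AF=0.5;DB' into a dict.

    Keys without '=' are flags and are stored as True; values stay strings.
    """
    info = {}
    if text == ".":
        return info
    for entry in text.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def iter_vcf_records(handle):
    """Yield one dict per VCF data line, reading lazily from `handle`."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        chrom, pos, vid, ref, alt, qual, filt, info = line.split("\t")[:8]
        yield {
            "CHROM": chrom,
            "POS": int(pos),
            "ID": vid,
            "REF": ref,
            "ALT": alt.split(","),
            "QUAL": qual,
            "FILTER": filt,
            "INFO": parse_info(info),
        }


example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\n"
    "20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017\n"
)
# Filter on a metric while streaming; only one record is held at a time.
deep = [rec for rec in iter_vcf_records(example) if int(rec["INFO"]["DP"]) >= 12]
```

Because records are yielded one at a time, filters like the depth
cutoff above can run over large files without loading them into memory.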

One other practical suggestion: you should add a link to the latest
version of your google doc at the top of your proposal on the GSoC
Melange site. You won't be able to edit there after Friday but can
update your google document in case of reviewer suggestions.

Thanks again and best of luck during the review process,
Brad

> 
> 
> Peter,
> 
> I didn't realize Biopython has never been tested on IronPython. As I
> have no familiarity with .NET or Windows, I'll have to rescind my
> offer to test it. Sorry to get your hopes up!
> 
> 
> Reece,
> 
> I've revised the prose sections and almost completely rewritten the
> timeline. This version provides more information about my background,
> a more detailed description of the overall project, and more specific
> goals.
> 
> 
> Brad,
> 
> I've tried to go into as much detail as my knowledge of VCF and GVF
> structure allows. I laid out a more specific structure for both the
> backend and frontend structures for the data. I've revised the unit
> tests to be more specific and less dependent on interaction with other
> modules and I've tried to anticipate some cases that may produce
> unexpected behavior. I also highlighted specific places where the
> design should be generalizable.
> 
> 
> James,
> 
> I hope my revised project description is more focused. Regarding CNV
> etc., I did not mean to specifically exclude them by mentioning SNPs,
> and I've reworded that paragraph to be more general. I get the
> impression that CNV and other structural variants are considerably
> more complex to represent and manipulate. I'd be more than happy to
> read more about breakpoint theory etc. and to prototype any specific
> workflows you might suggest.
> 
> 
> Lenna


From arklenna at gmail.com  Fri Apr  6 02:50:52 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Thu, 5 Apr 2012 22:50:52 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <871uo2cv6a.fsf@fastmail.fm>
References: <CAK610_7KnGD3ZAdbhegUjGgHkGOM3uUobM_hZKfPow-Fu05s2Q@mail.gmail.com>
	<87zkavtgcr.fsf@fastmail.fm>
	<CAK610_5UdpnxrO0ejQ4JxwgbNEOPvg+Yjouz14PHnPK_uRp1xg@mail.gmail.com>
	<87r4w4hno2.fsf@fastmail.fm>
	<CAK610_76xH9q2TcyP0CdRjSZSM9aokiiWkkX8r1uzzCFscxPcA@mail.gmail.com>
	<871uo2cv6a.fsf@fastmail.fm>
Message-ID: <CAK610_6PNyQVwhbF7HcTL0k9=cAhLL1t-jhu=KRULWT+DuvO7A@mail.gmail.com>

On Thu, Apr 5, 2012 at 6:47 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Lenna;
>
>> I'm linking to a revision of my GSoC proposal:
>>
>> https://docs.google.com/document/d/1txW9bwMYC6avlJxqs7x-mB-M09gxy2Uc6XqhR04y6ac/edit
>>
>> Thank you to everyone for your feedback.
>
> This is coming along great, thanks for all the work on it. I've added a
> couple of specific suggestions about iterative parsing, which PyVCF
> does, and using external tools to make the coding region evaluation work
> easier.
>
> One other practical suggestion: you should add a link to the latest
> version of your google doc at the top of your proposal on the GSoC
> Melange site. You won't be able to edit there after Friday but can
> update your google document in case of reviewer suggestions.
>
> Thanks again and best of luck during the review process,
> Brad
>


Brad -

Thank you again for your detailed feedback. As per your suggestion, I
have updated my proposal on GSoC Melange to include a link to the
latest version of my proposal.

Lenna


From mjldehoon at yahoo.com  Sat Apr  7 04:43:56 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 6 Apr 2012 21:43:56 -0700 (PDT)
Subject: [Biopython-dev] GSoC SearchIO project
Message-ID: <1333773836.9513.YahooMailClassic@web161204.mail.bf1.yahoo.com>

--- On Tue, 4/3/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> The reason for using SearchIO (despite not being PEP8
> compatible - something I regret in the naming of SeqIO
> and the pattern it set) is to match SeqIO and AlignIO and
> BioPerl. Anyone familiar with BioPerl will immediately see
> what it is for - and some of the student applicants have
> already used BioPerl's SearchIO. Personally I find this
> quite a compelling argument.

Sorry but I am not convinced. I doubt that somebody familiar with BioPerl's Align and AlignIO modules will have trouble finding the parser in Biopython if in Biopython there is only a Bio.Align module. Also this means that some modules in Biopython are split up in Module and ModuleIO, whereas most others are not. In this particular case, for consistency you would have to create a Bio.Search and a Bio.SearchIO module. I'd rather have a clean module organization in Biopython instead of strictly following what BioPerl did.

> That said, the name SearchIO isn't the clearest in the
> world for a newcomer - however I haven't come up
> with anything significantly better myself. Perhaps there
> is a better name out there, which would justify breaking
> the pattern? I've considered pairwise and palign, but
> neither feels right.

How about including this module as a submodule in Bio.Align? If we think of Bio.Align as a general module for alignments, then pairwise alignments fit in it too. It depends a bit on the exact API, but I expect that we can come up with something elegant.

> Given a clean slate (Biopython 2?), then yes, I would
> agree with consolidating Bio.Align and Bio.AlignIO as
> one namespace, probably "align" (lower case). The
> situation with Bio.Seq, Bio.SeqRecord and Bio.SeqIO
> isn't quite so simple - perhaps "seq" (lower case)?

There are two steps here: consolidation of some modules, and changing the names of modules to comply with PEP8. The consolidation can happen without waiting for a Biopython 2, as long as there are clear deprecation warnings in the modules that will be removed. Compliance with PEP8 is a bit trickier, since it means relearning all module names, and some systems (Windows?) may not distinguish between lower and upper case.
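
As a rough illustration of the deprecation step (not Biopython's actual mechanism; the helper and the module names below are hypothetical placeholders), a module slated for removal could emit a warning when it is used, along these lines:

```python
import warnings


def warn_module_moved(old_name, new_name):
    """Warn that a module is deprecated in favour of its replacement.

    Hypothetical helper for illustration; both names are placeholders.
    """
    warnings.warn(
        "%s is deprecated and will be removed in a future release; "
        "please use %s instead." % (old_name, new_name),
        DeprecationWarning,
        stacklevel=2,
    )


# Example: capture the warning a consolidated-away module might raise.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_module_moved("Bio.OldModule", "Bio.NewModule")
```

Using stacklevel=2 makes the warning point at the caller's line rather than at the helper itself, which is usually what the user needs to see.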

> Then (in the absence of any other ideas), SearchIO
> would become "search" (lower case).

If we already know now that we will drop the IO from SearchIO at some point, then SearchIO doesn't seem to be a good name.

Best,
-Michiel.


From eric.talevich at gmail.com  Sat Apr  7 16:13:16 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 7 Apr 2012 12:13:16 -0400
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <1333773836.9513.YahooMailClassic@web161204.mail.bf1.yahoo.com>
References: <1333773836.9513.YahooMailClassic@web161204.mail.bf1.yahoo.com>
Message-ID: <CAMC681kbybwpcd96PVV=y34nY6jSdnHMqS2XG+_BuoScy42q9A@mail.gmail.com>

On Sat, Apr 7, 2012 at 12:43 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> --- On Tue, 4/3/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > The reason for using SearchIO (despite not being PEP8
> > compatible - something I regret in the naming of SeqIO
> > and the pattern it set) is to match SeqIO and AlignIO and
> > BioPerl. Anyone familiar with BioPerl will immediately see
> > what it is for - and some of the student applicants have
> > already used BioPerl's SearchIO. Personally I find this
> > quite a compelling argument.
>
> Sorry but I am not convinced. I doubt that somebody familiar with
> BioPerl's Align and AlignIO modules will have trouble finding the parser in
> Biopython if in Biopython there is only a Bio.Align module. Also this means
> that some modules in Biopython are split up in Module and ModuleIO, whereas
> most others are not. In this particular case, for consistency you would
> have to create a Bio.Search and a Bio.SearchIO module. I'd rather have a
> clean module organization in Biopython instead of strictly following what
> BioPerl did.
>

How about Bio.Search, for now?

We had a similar discussion at the end of GSoC 2009, when we decided to
merge Tree and TreeIO (names inspired by BioPerl) to create Phylo (because
not all trees are phylogenies, although there is also a Perl module called
Bio::Phylo). Since the *IO namespaces have only 4 public functions, plus a
<Format>IO.py module for each supported I/O format, it's not too cluttered.

Likewise, at the end of this GSoC it may be more clear whether the new
sub-package should have a different name. (SearchIO seems to have been
plenty effective at drawing attention to the project.) But in any case, I
support putting all the new work under one sub-package, rather than two.


> > That said, the name SearchIO isn't the clearest in the
> > world for a newcomer - however I haven't come up
> > with anything significantly better myself. Perhaps there
> > is a better name out there, which would justify breaking
> > the pattern? I've considered pairwise and palign, but
> > neither feels right.
>
> How about including this module as a submodule in Bio.Align? If we think
> of Bio.Align as a general module for alignments, then pairwise alignments
> fit in it too. It depends a bit on the exact API, but I expect that we can
> come up with something elegant.
>
>
Does anything in Bio.Align already operate on SeqFeature objects?

Given that BLAST or HMMer output could be interpreted as (1) a series of
annotated features/regions on target sequences, or (2) a series of pairwise
alignments [*], perhaps it would be most effective to support those aspects
separately, through (1) Bio.Search or Bio.Feature [**], and (2) Bio.Align
or Bio.AlignIO.

[*] The multiple sequence alignment produced by HMMer is in a format we
already handle (Stockholm). Some people want to convert BLAST output to a
multiple sequence alignment, too, and while I suppose we could support that
in a literal sense, the result would be worse than the output of pretty
much any other alignment program so I don't think we should.

[**] A Bio.Feature module could involve GFF parsing and the variant
parsers, too. It would contain I/O functions that emit SeqFeatures, of
course.


From redmine at redmine.open-bio.org  Sat Apr  7 17:31:37 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 7 Apr 2012 17:31:37 +0000
Subject: [Biopython-dev] [Biopython - Feature #3338] (New) Convert a protein
	alignment and nucleotide sequences to codon alignment
Message-ID: <redmine.issue-3338.20120407173137@redmine.open-bio.org>


Issue #3338 has been reported by Eric Talevich.

----------------------------------------
Feature #3338: Convert a protein alignment and nucleotide sequences to codon alignment
https://redmine.open-bio.org/issues/3338

Author: Eric Talevich
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


As discussed on the mailing list:
http://lists.open-bio.org/pipermail/biopython/2012-April/007913.html

This could be implemented in two ways:
1. Wrap PAL2NAL (pal2nal.pl) under Bio.Align.Applications
2. Implement this functionality directly in Python

While PAL2NAL has some convenience features like aligning protein sequences to CDS sequences that don't exactly match, it would be straightforward (and simpler for the user, in most cases) to implement a fussier version of it from scratch somewhere in Biopython.

So, where would we put this function?

Related:
* From a codon alignment, it would again be straightforward to calculate dN/dS ratios for pairs of sequences, much like PAML's yn00 (although that program does more stuff, too). Do we want to do that? Where?
* Are there ways Biopython could support codon alignments better, as distinct from nucleotide alignments?
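
The "fussier" pure-Python conversion described above could be sketched roughly as follows. This is only an assumption about what such a function might look like: the name protein_to_codon_alignment is hypothetical, it works on plain strings rather than Biopython objects, and it checks only that the lengths agree (it does not translate the CDS to confirm it encodes the protein, and it rejects a trailing stop codon).

```python
def protein_to_codon_alignment(aligned_protein, cds, gap="-"):
    """Thread an ungapped CDS onto one gapped protein sequence.

    Each residue consumes the next three nucleotides; each protein gap
    becomes a three-character gap. Raises ValueError unless the CDS is
    exactly three times the number of ungapped residues.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError(
            "CDS length %d does not match %d residues"
            % (len(cds), n_residues)
        )
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# One row of a protein alignment and its matching coding sequence.
codon_row = protein_to_codon_alignment("M-K", "ATGAAA")
```

Applying this to every row of a protein alignment, each with its own CDS, would give the codon alignment; being strict about length is what keeps the function simple compared with PAL2NAL.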


----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From eric.talevich at gmail.com  Sat Apr  7 18:42:02 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 7 Apr 2012 14:42:02 -0400
Subject: [Biopython-dev] Enhancements to Phylo.draw;
	pyplot best practices
In-Reply-To: <CAMC681nyVV6mNNSgT1zZeek+NwE3UwxprRArgGNFMeX-b3yPpA@mail.gmail.com>
References: <CAMC681nyVV6mNNSgT1zZeek+NwE3UwxprRArgGNFMeX-b3yPpA@mail.gmail.com>
Message-ID: <CAMC681k3p-mRzUcdFmQW0_wsx64ENOgDdGtWtpzVROedax1EXg@mail.gmail.com>

On Wed, Apr 4, 2012 at 10:53 PM, Eric Talevich <eric.talevich at gmail.com> wrote:

> Hi all,
>
> I'm considering some enhancements to the Phylo.draw function to make it
> more customizable for power users. Since the function is based on
> matplotlib/pylab/pyplot, it's possible for quite a bit to be left to the
> user; however, I'm not fully versed in what pyplot is capable of.
>
> Relevant feature request in Redmine:
> https://redmine.open-bio.org/issues/3336
>
> Ideas:

[...]
> 2. Add an argument 'branch_labels' to Phylo.draw. This will accept either
> (a) a dict which maps the tree's Clade objects to string labels, or (b) a
> function which accepts a Clade object and returns a string. Default: a
> function that formats the clade's 'confidence' or 'confidences' attribute,
> matching the current behavior.
>
> Examples:
> >>> draw(mytree, branch_labels={mytree.root: "Root", ...})
> >>> draw(mytree, branch_labels=lambda clade: "%d" % clade.confidence)
> >>> draw(mytree, branch_labels=lambda clade: clade.taxonomy.rank)
>
>
Just committed this feature:
https://github.com/biopython/biopython/commit/72990549a1b769ab19ab0bd33a8c35fdf031ac2d
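
For readers curious how a single argument can accept either a dict or a function, here is a minimal sketch of the idea. It is not the committed code: resolve_branch_labels is a hypothetical helper, and the default shown handles only a 'confidence' attribute for brevity.

```python
def resolve_branch_labels(branch_labels=None):
    """Turn a dict, a callable, or None into a clade -> label function."""
    if branch_labels is None:
        # Default: format the clade's confidence value, if it has one.
        def format_confidence(clade):
            conf = getattr(clade, "confidence", None)
            return "" if conf is None else str(conf)
        return format_confidence
    if isinstance(branch_labels, dict):
        # Clades missing from the dict simply get no label.
        return lambda clade: branch_labels.get(clade, "")
    if callable(branch_labels):
        return branch_labels
    raise TypeError("branch_labels must be a dict or a callable")
```

Normalizing everything to a function up front means the drawing code only ever has to call label_func(clade), whichever form the user passed in.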


From lgautier at gmail.com  Sun Apr  8 17:16:31 2012
From: lgautier at gmail.com (Laurent Gautier)
Date: Sun, 08 Apr 2012 19:16:31 +0200
Subject: [Biopython-dev] Sphinx documentation online ?
Message-ID: <4F81C7EF.7030505@gmail.com>

Hi,

I have seen email exchanges and issues on the tracker regarding moving
the documentation to Sphinx, but I could not find an instance of the 
documentation for biopython online (I was looking for one to 
cross-reference it with documentation I am writing).

Is this still work-in-progress, or is there an instance online and I 
missed it ?

Best,


Laurent


From eric.talevich at gmail.com  Sun Apr  8 19:25:00 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 8 Apr 2012 15:25:00 -0400
Subject: [Biopython-dev] Sphinx documentation online ?
In-Reply-To: <4F81C7EF.7030505@gmail.com>
References: <4F81C7EF.7030505@gmail.com>
Message-ID: <CAMC681=pQCtiicFy882D1FSc71XYQrWD-5ouU1pzAo47G0gJgQ@mail.gmail.com>

On Sun, Apr 8, 2012 at 1:16 PM, Laurent Gautier <lgautier at gmail.com> wrote:

> Hi,
>
> I have seen emails exchanges and issues on the tracker regarding moving
> the documentation to Sphinx, but I could not find an instance of the
> documentation for biopython online (I was looking for one to
> cross-reference it with documentation I am writing).
>
> Is this still work-in-progress, or is there an instance online and I
> missed it ?
>
>
Hi Laurent,

I proposed this a while ago and played with Sphinx a little bit, but didn't
get very far. We're still using Epydoc for our generated API documentation:
http://biopython.org/DIST/docs/api/

I do hope to get back to this at some point, or perhaps assist someone else
with migrating Biopython to Sphinx.

-Eric


From lgautier at gmail.com  Sun Apr  8 20:46:45 2012
From: lgautier at gmail.com (Laurent Gautier)
Date: Sun, 08 Apr 2012 22:46:45 +0200
Subject: [Biopython-dev] Sphinx documentation online ?
In-Reply-To: <CAMC681=pQCtiicFy882D1FSc71XYQrWD-5ouU1pzAo47G0gJgQ@mail.gmail.com>
References: <4F81C7EF.7030505@gmail.com>
	<CAMC681=pQCtiicFy882D1FSc71XYQrWD-5ouU1pzAo47G0gJgQ@mail.gmail.com>
Message-ID: <4F81F935.9030702@gmail.com>

On 2012-04-08 21:25, Eric Talevich wrote:
> On Sun, Apr 8, 2012 at 1:16 PM, Laurent Gautier <lgautier at gmail.com 
> <mailto:lgautier at gmail.com>> wrote:
>
>     Hi,
>
>     I have seen emails exchanges and issues on the tracker regarding
>     moving the documentation to Sphinx, but I could not find an
>     instance of the documentation for biopython online (I was looking
>     for one to cross-reference it with documentation I am writing).
>
>     Is this still work-in-progress, or is there an instance online and
>     I missed it ?
>
>
> Hi Laurent,
>
> I proposed this a while ago and played with Sphinx a little bit, but 
> didn't get very far. We're still using Epydoc for our generated API 
> documentation:
> http://biopython.org/DIST/docs/api/
>
> I do hope to get back to this at some point, or perhaps assist someone 
> else with migrating Biopython to Sphinx.
>
> -Eric
>
>

Hi Eric,

Thanks for the answer. I did see the Epydoc, but I was after Sphinx to 
be able to cross-reference documentations (see 
http://sphinx.pocoo.org/ext/intersphinx.html ).
I'll make do with it for the time being.
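
For context, intersphinx is set up in the conf.py of the project doing the linking, roughly as below. The Python documentation is used as the target here purely as an example, since there is no Sphinx-built Biopython documentation to point at yet.

```python
# conf.py of the project that wants to link out to other documentation
extensions = ["sphinx.ext.intersphinx"]

# Map a short name to the base URL of a Sphinx-built documentation set.
# None means "fetch objects.inv from that base URL".
intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
}
```

Cross-references then resolve against the target's objects.inv inventory, which is why the target documentation has to be built with Sphinx (Epydoc output does not provide one).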

Best,


Laurent


From eric.talevich at gmail.com  Mon Apr  9 18:25:04 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 9 Apr 2012 14:25:04 -0400
Subject: [Biopython-dev] Method to weight sequences in an alignment
Message-ID: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>

Folks,

I've written a function to weight sequences according to the simple scheme
used in PSI-BLAST [*]. It operates on Bio.Align.MultipleSeqAlignment
objects or lists of plain strings, and could be added as a method with
minimal changes (for Python 2.5 compatibility, mainly). Any interest in
adding it to Biopython?

The code is below.

Cheers,
Eric

[*] Henikoff & Henikoff (1994): Position-based sequence weights.
http://www.ncbi.nlm.nih.gov/pubmed/7966282

----

from collections import Counter  # requires Python 2.7+


def sequence_weights(aln):
    """Weight aligned sequences to emphasize more divergent members.

    Returns a list of floating-point numbers between 0 and 1, corresponding to
    the proportional weight of each sequence in the alignment. The first number
    in the list is the weight of the first sequence in the alignment, and so
    on. Weights sum to 1.0.

    Method: At each column position, award each different residue an equal
    share of the weight, and then divide that weight equally among the
    sequences sharing the same residue.  For each sequence, sum the
    contributions from each position to give a sequence weight.

    See Henikoff & Henikoff (1994): Position-based sequence weights.
    """
    def col_weight(column):
        """Represent the diversity at a position.

        Award each different residue an equal share of the weight, and then
        divide that weight equally among the sequences sharing the same
        residue.

        So, if in a position of a multiple alignment, r different residues
        are represented, a residue represented in only one sequence
        contributes a score of 1/r to that sequence, whereas a residue
        represented in s sequences contributes a score of 1/rs to each of
        the s sequences.
        """
        # Skip columns with all gaps or unique inserts
        if len([c for c in column if c not in '-.']) < 2:
            return [0] * len(column)
        # Count the number of occurrences of each residue type
        # (Treat gaps as a separate, 21st character)
        counts = Counter(column)
        # Get residue weights: 1/rs, where
        # r = nb. residue types, s = count of a particular residue type
        n_residues = len(counts)    # r
        freqs = dict((aa, 1.0 / (n_residues * count))
                     for aa, count in counts.items())
        weights = [freqs[aa] for aa in column]
        return weights

    seq_weights = [0] * len(aln)
    col_weights = map(col_weight, zip(*aln))
    # Sum the contributions from each position along each sequence
    # -> total weight
    for col in col_weights:
        for idx, row_val in enumerate(col):
            seq_weights[idx] += row_val
    # Normalize
    scale = 1.0 / sum(seq_weights)
    seq_weights = [scale * wt for wt in seq_weights]
    return seq_weights
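
To make the scheme concrete, here is a small worked example on three plain strings. It uses a compact restatement of the same per-column rule (the name position_based_weights is only for this illustration), so it can be run on its own.

```python
from collections import Counter


def position_based_weights(seqs):
    """Compact restatement of the 1/(r*s) per-column weighting rule."""
    totals = [0.0] * len(seqs)
    for column in zip(*seqs):
        # Skip columns with all gaps or unique inserts
        if len([c for c in column if c not in "-."]) < 2:
            continue
        counts = Counter(column)
        r = len(counts)  # number of distinct characters in the column
        for idx, residue in enumerate(column):
            totals[idx] += 1.0 / (r * counts[residue])
    scale = 1.0 / sum(totals)
    return [scale * t for t in totals]


# Column 1 (A,A,A): r=1, s=3   -> 1/3 each
# Column 2 (A,A,C): r=2        -> 1/4, 1/4, 1/2
# Column 3 (C,T,T): r=2        -> 1/2, 1/4, 1/4
# Raw totals: 13/12, 10/12, 13/12; normalized: 13/36, 10/36, 13/36
weights = position_based_weights(["AAC", "AAT", "ACT"])
```

The middle sequence agrees with a neighbour at every position, so it ends up with the lowest weight, while the two sequences that each carry a unique residue are weighted equally and more heavily.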


From mjldehoon at yahoo.com  Mon Apr  9 23:27:31 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 9 Apr 2012 16:27:31 -0700 (PDT)
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <CAMC681kbybwpcd96PVV=y34nY6jSdnHMqS2XG+_BuoScy42q9A@mail.gmail.com>
Message-ID: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com>

Hi Eric, Peter,

> How about Bio.Search, for now?


I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells users something about what the module is for. Bio.Search could be anything (search PubMed? search the Entrez databases? search Google? anyway Bio.Search does not suggest that this module is about pairwise alignments). But Peter previously mentioned that he doesn't like Bio.Pairwise; can we convince you?

>> How about including this module as a submodule in Bio.Align?
> Does anything in Bio.Align already operate on SeqFeature objects?

I was more thinking to have this module as a submodule in Bio.Align for the purpose of module organization rather than reusing or integrating it with Bio.Align. However, if we can make use of Bio.Align, then that could be a good thing.

Best,
-Michiel.


From chapmanb at 50mail.com  Tue Apr 10 00:58:19 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 09 Apr 2012 20:58:19 -0400
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com>
References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com>
Message-ID: <87lim4h07o.fsf@fastmail.fm>


Michiel;

> Hi Eric, Peter,
> 
> > How about Bio.Search, for now?
> 
> I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells
> users something about what the module is for. Bio.Search could be
> anything (search PubMed? search the Entrez databases? search Google?
> anyway Bio.Search does not suggest that this module is about pairwise
> alignments). But Peter previously mentioned that he doesn't like
> Bio.Pairwise; can we convince you?

I agree with Peter on this one. The module is primarily about searching
a sequence database with an input via multiple methods, not about
pairwise alignment of two sequences, which is what Bio.Align.Pairwise
suggests to me.

Brad


From redmine at redmine.open-bio.org  Tue Apr 10 20:29:09 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 10 Apr 2012 20:29:09 +0000
Subject: [Biopython-dev] [Biopython - Bug #3340] (New) Example using
	Bio.Clustalw in Tutorial
Message-ID: <redmine.issue-3340.20120410202908@redmine.open-bio.org>


Issue #3340 has been reported by Peter Cock.

----------------------------------------
Bug #3340: Example using Bio.Clustalw in Tutorial
https://redmine.open-bio.org/issues/3340

Author: Peter Cock
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Documentation
Target version: 
URL: 


The module Bio.Clustalw was deprecated and removed, yet is still used in the Tutorial's 'Creating your own substitution matrix from an alignment' example.





From p.j.a.cock at googlemail.com  Thu Apr 12 16:01:47 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 12 Apr 2012 17:01:47 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
Message-ID: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>

Hello all,

The BOSC abstract deadline (tomorrow) has rather crept up on me,
despite Nomi's reminder emails (My excuse is I've been thinking
more about GSoC!). For anyone thinking of submitting a talk, the
abstract limit is just a page - see:
http://www.open-bio.org/wiki/BOSC_2012

I'm hoping to attend BOSC, but will probably not be at ISMB 2012.
I'd be delighted for another Biopython developer to give the project
update talk (and as in previous years, we'll help out with the abstract,
slides, etc). Anyone interested? Giving a talk can be very helpful in
getting travel funding ;)

I know Eric might be a candidate as he will be in Long Beach
(congratulations on getting your ISMB poster accepted Eric!).

Note that dedicated "Bioinformatics Open Source Project Updates"
track is new this year. The talks are likely to be at the shorter end of
the talk length range specified (i.e. closer to 5 minutes than 20 mins)
but that will partly depend on quite how full the final schedule turns
out to be.

The idea (speaking with my BOSC hat on) with the update talks is
to try to highlight what is new and exciting, with only a minimal
introduction for the higher profile projects - most of the audience
will know roughly what BioPerl etc are, and won't be interested
to hear it again ;)

So for the Biopython talk we'd probably want to cover things like
GSoC, work with PyPy and Python3, major new functionality, any
Biopython papers, etc, and a bit on future plans. The talk should be
short but sweet :)

Regards,

Peter


From redmine at redmine.open-bio.org  Thu Apr 12 18:52:35 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 12 Apr 2012 18:52:35 +0000
Subject: [Biopython-dev] [Biopython - Feature #3341] (New) Improve SffIO to
	parse 3 extra lines present in some SFF files ("Run Name:,
	Analysis Name:, Full Path:)"
Message-ID: <redmine.issue-3341.20120412185235@redmine.open-bio.org>


Issue #3341 has been reported by Martin Mokrejš.

----------------------------------------
Feature #3341: Improve SffIO to parse 3 extra lines present in some SFF files ("Run Name:, Analysis Name:, Full Path:)"
https://redmine.open-bio.org/issues/3341

Author: Martin Mokrejš
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


Some files have three extra lines per record in the SFF file. Examples are already in the Biopython test data:
biopython/Tests/Roche/E3MFGYR02_random_10_reads.sff
biopython/Tests/Roche/paired.sff

The three lines "Run Name:, Analysis Name:, Full Path:" are not parsed into the object and later on, are not written out. Hence, sff round trip read in -> write out breaks (biopython-1.58). These three lines somehow do not appear in every SFF file, and so far I haven't seen these in files extracted from SRA. Seems these only appear in original Roche SFF files.


>E3MFGYR02JWQ7T
  Run Prefix:   R_2008_01_09_16_16_00_
  Region #:     2
  XY Location:  3946_2103

  Run Name:       R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331
  Analysis Name:  /data/2008_02_08/R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331/D_2008_02_08_23_45_24_d41_AnalysisPipe
  Full Path:      /data/R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331/D_2008_02_08_23_45_24_d41_AnalysisPipe

  Read Header Len:  32
  Name Length:      14
  # of Bases:       265
  Clip Qual Left:   5
  Clip Qual Right:  264
  Clip Adap Left:   0
  Clip Adap Right:  0




From eric.talevich at gmail.com  Thu Apr 12 22:37:12 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 12 Apr 2012 18:37:12 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
Message-ID: <CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>

On Thu, Apr 12, 2012 at 12:01 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hello all,
>
> The BOSC abstract deadline (tomorrow) has rather crept up on me,
> despite Nomi's reminder emails (My excuse is I've been thinking
> more about GSoC!). For anyone thinking of submitting a talk, the
> abstract limit is just a page - see:
> http://www.open-bio.org/wiki/BOSC_2012
>
> I'm hoping to attend BOSC, but will probably not be at ISMB 2012.
> I'd be delighted for another Biopython developer to give the project
> update talk (and as in previous years, we'll help out with the abstract,
> slides, etc). Anyone interested? Giving a talk can be very helpful in
> getting travel funding ;)
>
> I know Eric might be a candidate as he will be in Long Beach
> (congratulations on getting your ISMB poster accepted Eric!).
>
> Note that dedicated "Bioinformatics Open Source Project Updates"
> track is new this year. The talks are likely to be at the shorter end of
> the talk length range specified (i.e. closer to 5 minutes than 20 mins)
> but that will partly depend on quite how full the final schedule turns
> out to be.
>
> The idea (speaking with my BOSC hat on) with the update talks is
> to try to highlight what is new and exciting, with only a minimal
> introduction for the higher profile projects - most of the audience
> will know roughly what BioPerl etc are, and won't be interested
> to hear it again ;)
>
> So for the Biopython talk we'd probably want to cover things like
> GSoC, work with PyPy and Python3, major new functionality, any
> Biopython papers, etc, and a bit on future plans. The talk should be
> short but sweet :)
>
> Regards,
>
> Peter


OK, here are some potential talking points I scraped from past announcements:

* SeqIO.index_db:
Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to
carry the index_db concept to other modules.

* Installation improvements:
pip support (v.1.57); easy_install will automatically handle the numpy
dependency (v.1.59, Feb '12)

* Portability:
Python 3 compatibility (except for a couple C extension modules);
still supporting Jython; now mostly supporting Pypy (except for
modules that use numpy or C extensions)

* Merged Brandon Invergo's independent project pypaml under
Bio.Phylo.PAML (v.1.58, Aug '11). With SeqIO's new sequential Phylip
support (v.1.59) and the existing support for phylogeny I/O under
Phylo, we can now easily assemble and run complete workflows involving
PAML.
(Similarly for PhyML, with SeqIO's "phylip-relaxed" and
Bio.Phylo.Applications.PhymlCommandline.)

* GenomeDiagram improvements:
New, pretty features. Eye candy for the slides.

* TogoWS

* Next release & future plans:
- Restored mmCIF support, via Lenna Peterson, a prospective GSoC student
- Brad's GFF parser
- Deeper future: see the other mailing list thread

* GSoC 2011 results:
- Mikael Trellet -- Interface
- Michele Silva -- Mocapy++ Python module; also ported two
applications to Biopython
- Justinas D. -- Python-based extension system for Mocapy++

* Summer of Struct:
João and Eric are working to refactor and merge the vast amount of
Bio.PDB-related code produced during previous GSoCs. (Includes a
planned SeqIO-style API for structures in PDB, mmCIF and PDBML
formats.) Improvements have been trickling in since the last BOSC;
here comes the flood.


From chapmanb at 50mail.com  Fri Apr 13 00:23:03 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 12 Apr 2012 20:23:03 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
Message-ID: <877gxkh448.fsf@fastmail.fm>


Eric and Peter;
Eric -- I'm glad you're taking this on. It'll be great to have a
Biopython presentation at BOSC. The points you mentioned all sound
great, although I would drop some of the more boring ones like the
installation stuff (I can pick on that, since it's mine).

My only other suggestion is to focus the talk around the people who've
provided the improvements. One of the awesome things about Biopython is
the wide contributor base and we still manage to pull everything into a
coherent package thanks to Peter's guiding hand. It would be cool to
emphasize this community as part of the update.

Thanks again for doing this,
Brad

> > Hello all,
> >
> > The BOSC abstract deadline (tomorrow) has rather crept up on me,
> > despite Nomi's reminder emails (My excuse is I've been thinking
> > more about GSoC!). For anyone thinking of submitting a talk, the
> > abstract limit is just a page - see:
> > http://www.open-bio.org/wiki/BOSC_2012
> >
> > I'm hoping to attend BOSC, but will probably not be at ISMB 2012.
> > I'd be delighted for another Biopython developer to give the project
> > update talk (and as in previous years, we'll help out with the abstract,
> > slides, etc). Anyone interested? Giving a talk can be very helpful in
> > getting travel funding ;)
> >
> > I know Eric might be a candidate as he will be in Long Beach
> > (congratulations on getting your ISMB poster accepted Eric!).
> >
> > Note that dedicated "Bioinformatics Open Source Project Updates"
> > track is new this year. The talks are likely to be at the shorter end of
> > the talk length range specified (i.e. closer to 5 minutes than 20 mins)
> > but that will partly depend on quite how full the final schedule turns
> > out to be.
> >
> > The idea (speaking with my BOSC hat on) with the update talks is
> > to try to highlight what is new and exciting, with only a minimal
> > introduction for the higher profile projects - most of the audience
> > will know roughly what BioPerl etc are, and won't be interested
> > to hear it again ;)
> >
> > So for the Biopython talk we'd probably want to cover things like
> > GSoC, work with PyPy and Python3, major new functionality, any
> > Biopython papers, etc, and a bit on future plans. The talk should be
> > short but sweet :)
> >
> > Regards,
> >
> > Peter
> 
> 
> OK, here are some potential talking points I scraped from past announcements:
> 
> * SeqIO.index_db:
> Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to
> carry the index_db concept to other modules.
> 
> * Installation improvements:
> pip support (v.1.57); easy_install will automatically handle the numpy
> dependency (v.1.59, Feb '12)
> 
> * Portability:
> Python 3 compatibility (except for a couple C extension modules);
> still supporting Jython; now mostly supporting Pypy (except for
> modules that use numpy or C extensions)
> 
> * Merged Brandon Invergo's independent project pypaml under
> Bio.Phylo.PAML (v.1.58, Aug '11). With SeqIO's new sequential Phylip
> support (v.1.59) and the existing support for phylogeny I/O under
> Phylo, we can now easily assemble and run complete workflows involving
> PAML.
> (Similarly for PhyML, with SeqIO's "phylip-relaxed" and
> Bio.Phylo.Applications.PhymlCommandline.)
> 
> * GenomeDiagram improvements:
> New, pretty features. Eye candy for the slides.
> 
> * TogoWS
> 
> * Next release & future plans:
> - Restored mmCIF support, via Lenna Peterson, a prospective GSoC student
> - Brad's GFF parser
> - Deeper future: see the other mailing list thread
> 
> * GSoC 2011 results:
> - Mikael Trellet -- Interface
> - Michele Silva -- Mocapy++ Python module; also ported two
> applications to Biopython
> - Justinas D. -- Python-based extension system for Mocapy++
> 
> * Summer of Struct:
> João and Eric are working to refactor and merge the vast amount of
> Bio.PDB-related code produced during previous GSoCs. (Includes a
> planned SeqIO-style API for structures in PDB, mmCIF and PDBML
> formats.) Improvements have been trickling in since the last BOSC;
> here comes the flood.
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev


From arklenna at gmail.com  Fri Apr 13 03:26:35 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Thu, 12 Apr 2012 23:26:35 -0400
Subject: [Biopython-dev] [biopython] Fix flex library dependency of
	MMCIFlex; closes 2619 (#31)
In-Reply-To: <CAKVJ-_6VAMv6CAT2_An-syJ2ezOC4vT6kwj2vt0wHeCc07KHaw@mail.gmail.com>
References: <biopython/biopython/pull/31@github.com>
	<CAKVJ-_5FqKjoWqKd9SRD0=Sz3=_4BGDWpr1qomBxnpy6NaLnsw@mail.gmail.com>
	<CAK610_6ENA5W=4AUQv6XU8dd0pC0AszhGubN4JPAuLo65ec5oQ@mail.gmail.com>
	<CAKVJ-_647+L7j1TanQKhbDgL-2++hahxH5MQXEdEMyiiJv+VxQ@mail.gmail.com>
	<CAKVJ-_7sPgGaR8q9YF6+Ng2JeCXfJ4D05GDajObvjXHHwQ53Fg@mail.gmail.com>
	<CAK610_7TU99wF7NNeh5ukpdDVv8mhK+hDCtT1N-UOUb72=nPSg@mail.gmail.com>
	<CAKVJ-_4-xL4ZmWpMPfD7Xf1Et4yLmDHTL1az5ALVz9nJ-8hvgg@mail.gmail.com>
	<CAK610_7CHv88EjZcqZEdqo4Z_51FYJcZmGD_vhZ-iTDU-ULVuA@mail.gmail.com>
	<CAKVJ-_4hyUBpckiBQ4wUy_Ow9QT7pMy2tOhCbfoPeWVEbAfQwQ@mail.gmail.com>
	<CAKVJ-_6VAMv6CAT2_An-syJ2ezOC4vT6kwj2vt0wHeCc07KHaw@mail.gmail.com>
Message-ID: <CAK610_4kx7=v_AkTrg-DDsP7OhB+C6XGnBt+BSjMqMkNLeJDrA@mail.gmail.com>

On Thu, Mar 29, 2012 at 10:05 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hi Lenna,
>
> Have you tried your branch on Windows yet?
>
> It worked for me under my Python 2.5 setup using mingw32,
>
> C:\repositories\biopython>c:\python26\python setup.py install
> ...
> building 'Bio.PDB.mmCIF.MMCIFlex' extension
> creating build\temp.win32-2.5\Release\bio\pdb
> creating build\temp.win32-2.5\Release\bio\pdb\mmcif
> C:\cygwin\usr\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IBio
> -Ic:\python25\include -Ic:\python25\PC -c Bio/PDB/mmCIF/lex.yy.c -o
> build\temp.win32-2.5\Release\bio\pdb\mmcif\lex.yy.o
> lex.yy.c:1046: warning: 'yyunput' defined but not used
> C:\cygwin\usr\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IBio
> -Ic:\python25\include -Ic:\python25\PC -c
> Bio/PDB/mmCIF/MMCIFlexmodule.c -o
> build\temp.win32-2.5\Release\bio\pdb\mmcif\mmciflexmodule.o
> writing build\temp.win32-2.5\Release\bio\pdb\mmcif\MMCIFlex.def
> C:\cygwin\usr\bin\gcc.exe -mno-cygwin -shared -s
> build\temp.win32-2.5\Release\bio\pdb\mmcif\lex.yy.o
> build\temp.win32-2.5\Release\bio\pdb\mmcif\mmciflexmodule.o
> build\temp.win32-2.5\Release\bio\pdb\mmcif\MMCIFlex.def
> -Lc:\python25\libs -Lc:\python25\PCBuild -lpython25 -lmsvcr71 -o
> build\lib.win32-2.5\Bio\PDB\mmCIF\MMCIFlex.pyd
> ...
>
> That worked fine and test_MMCIF.py is happy. However, MSVC v9 is not:
>
> C:\repositories\biopython>c:\python26\python setup.py install
> ...
> building 'Bio.PDB.mmCIF.MMCIFlex' extension
> C:\Program Files\Microsoft Visual Studio 9.0\VC\BIN\cl.exe /c /nologo
> /Ox /MD /W3 /GS- /DNDEBUG -IBio -Ic:\python26\include -Ic:\python26\PC
> /TcBio/PDB/mmCIF/lex.yy.c
> /Fobuild\temp.win32-2.6\Release\Bio/PDB/mmCIF/lex.yy.obj
> lex.yy.c
> Bio/PDB/mmCIF/lex.yy.c(12) : fatal error C1083: Cannot open include
> file: 'unistd.h': No such file or directory
> error: command '"C:\Program Files\Microsoft Visual Studio
> 9.0\VC\BIN\cl.exe"' failed with exit status 2
>
> The same with Python 2.7 and the Microsoft compiler. Switching
> from this in Bio/PDB/mmCIF/lex.yy.c:
>
> #include <unistd.h>
>
> to this:
>
> #include <io.h>
>
> lets it compile (although with some warnings) and test_MMCIF.py passes.
> It should be conditional of course, but I'm unclear whether that is the
> appropriate fix.
>
> Peter


Hi Peter,

I installed flex on my Windows VM and used it to generate lex.yy.c. It
puts #include <unistd.h> inside an #ifdef so it may work with MSVC. It
produces a working module for both Debian and Mac OS X (I do get
"defined but not used" warnings for generated functions). I've
cherry-picked it into my pull request.

I know you're quite busy right now with BOSC and GSoC, but let me know
if you get a chance to test it on MSVC.

Lenna


From p.j.a.cock at googlemail.com  Fri Apr 13 11:31:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 13 Apr 2012 12:31:30 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
Message-ID: <CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>

On Thu, Apr 12, 2012 at 11:37 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> OK, here are some potential talking points I scraped from past announcements:
>
> * SeqIO.index_db:
> Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to
> carry the index_db concept to other modules.

Biopython 1.57 was already covered at BOSC 2011.

> * Installation improvements:
> pip support (v.1.57); easy_install will automatically handle the numpy
> dependency (v.1.59, Feb '12)

Brad commented on this, perhaps a line in the abstract?

> * Portability:
> Python 3 compatibility (except for a couple C extension modules);
> still supporting Jython; now mostly supporting PyPy (except for
> modules that use numpy or C extensions)

This is something I would want to cover.

> * Merged Brandon Invergo's independent project pypaml under
> Bio.Phylo.PAML (v.1.58, Aug '11). With SeqIO's new sequential Phylip
> support (v.1.59) and the existing support for phylogeny I/O under
> Phylo, we can now easily assemble and run complete workflows involving
> PAML. (Similarly for PhyML, with SeqIO's "phylip-relaxed" and
> Bio.Phylo.Applications.PhymlCommandline.)

Yep.

> * GenomeDiagram improvements:
> New, pretty features. Eye candy for the slides.

Yep. Maybe even an example in the abstract?

> * TogoWS

Yep.

> * Next release & future plans:
> - Restored mmCIF support, via Lenna Peterson, a prospective GSoC student
> - Brad's GFF parser
> - Deeper future: see the other mailing list thread

Good points - although I don't want to over promise ;)

> * GSoC 2011 results:
> - Mikael Trellet -- Interface
> - Michele Silva -- Mocapy++ Python module; also ported two
> applications to Biopython
> - Justinas D. -- Python-based extension system for Mocapy++

We should have a summary of what they did somewhere, perhaps
as an OBF blog post? I'm hoping to get this year's GSoC students
to write weekly progress reports on a blog or at least by email to
the mailing list.

> * Summer of Struct:
> João and Eric are working to refactor and merge the vast amount of
> Bio.PDB-related code produced during previous GSoCs. (Includes a
> planned SeqIO-style API for structures in PDB, mmCIF and PDBML
> formats.) Improvements have been trickling in since the last BOSC;
> here comes the flood.

:)

Here's a draft abstract - note we have to fit in a page. Having a logo
or some eye catching image is very effective for standing out in the
abstract book (on screen or on paper).

Comments welcome - but keep in mind the one page limit.

Eric - feel free to turn this into a Google Doc if you prefer.

Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Biopython_BOSC_abstract_2012_draft.pdf
Type: application/pdf
Size: 199737 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120413/9ebaac7d/attachment-0002.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Biopython_BOSC_abstract_2012_draft.tex
Type: application/x-tex
Size: 5037 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120413/9ebaac7d/attachment-0002.tex>

From eric.talevich at gmail.com  Fri Apr 13 14:31:08 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 13 Apr 2012 10:31:08 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
Message-ID: <CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>

Thanks for this. I'll keep it as LaTeX, since it already looks nice.

1. Several parts say "[to be revised prior to BOSC]" -- I take it we
have the option of updating our abstract shortly before BOSC, and this
is a note to the conference organizers that we intend to do so? To
save space and reduce distraction, should this be a footnote instead?

2. To save space: Do we need the line "Bioinformatics Open Source
Conference (BOSC) ..." after the author names?

3. Again to save space, and make room to cite the Phylo paper: can we
drop the citation for TogoWS, and add a few words of description in
the main text where it's mentioned? (We don't cite PAML, HMMer, etc.)

4. How do you feel about dropping inline citations, and just have a
list of \nocite references at the bottom? In a one-page abstract, it
should be easy enough for readers to figure out what's what.

-E

On Fri, Apr 13, 2012 at 7:31 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Apr 12, 2012 at 11:37 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> OK, here are some potential talking points I scraped from past announcements:
>>
>> * SeqIO.index_db:
>> Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to
>> carry the index_db concept to other modules.
>
> Biopython 1.57 was already covered at BOSC 2011.
>
>> * Installation improvements:
>> pip support (v.1.57); easy_install will automatically handle the numpy
>> dependency (v.1.59, Feb '12)
>
> Brad commented on this, perhaps a line in the abstract?
>
>> * Portability:
>> Python 3 compatibility (except for a couple C extension modules);
>> still supporting Jython; now mostly supporting PyPy (except for
>> modules that use numpy or C extensions)
>
> This is something I would want to cover.
>
>> * Merged Brandon Invergo's independent project pypaml under
>> Bio.Phylo.PAML (v.1.58, Aug '11). With SeqIO's new sequential Phylip
>> support (v.1.59) and the existing support for phylogeny I/O under
>> Phylo, we can now easily assemble and run complete workflows involving
>> PAML. (Similarly for PhyML, with SeqIO's "phylip-relaxed" and
>> Bio.Phylo.Applications.PhymlCommandline.)
>
> Yep.
>
>> * GenomeDiagram improvements:
>> New, pretty features. Eye candy for the slides.
>
> Yep. Maybe even an example in the abstract?
>
>> * TogoWS
>
> Yep.
>
>> * Next release & future plans:
>> - Restored mmCIF support, via Lenna Peterson, a prospective GSoC student
>> - Brad's GFF parser
>> - Deeper future: see the other mailing list thread
>
> Good points - although I don't want to over promise ;)
>
>> * GSoC 2011 results:
>> - Mikael Trellet -- Interface
>> - Michele Silva -- Mocapy++ Python module; also ported two
>> applications to Biopython
>> - Justinas D. -- Python-based extension system for Mocapy++
>
> We should have a summary of what they did somewhere, perhaps
> as an OBF blog post? I'm hoping to get this year's GSoC students
> to write weekly progress reports on a blog or at least by email to
> the mailing list.
>
>> * Summer of Struct:
>> João and Eric are working to refactor and merge the vast amount of
>> Bio.PDB-related code produced during previous GSoCs. (Includes a
>> planned SeqIO-style API for structures in PDB, mmCIF and PDBML
>> formats.) Improvements have been trickling in since the last BOSC;
>> here comes the flood.
>
> :)
>
> Here's a draft abstract - note we have to fit in a page. Having a logo
> or some eye catching image is very effective for standing out in the
> abstract book (on screen or on paper).
>
> Comments welcome - but keep in mind the one page limit.
>
> Eric - feel free to turn this into a Google Doc if you prefer.
>
> Peter


From p.j.a.cock at googlemail.com  Fri Apr 13 14:42:37 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 13 Apr 2012 15:42:37 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
Message-ID: <CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>

On Fri, Apr 13, 2012 at 3:31 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> Thanks for this. I'll keep it as LaTeX, since it already looks nice.
>
> 1. Several parts say "[to be revised prior to BOSC]" -- I take it we
> have the option of updating our abstract shortly before BOSC, and this
> is a note to the conference organizers that we intend to do so? To
> save space and reduce distraction, should this be a footnote instead?

It is common for BOSC abstracts to be revised following review prior to
acceptance (almost like a tiny paper), and yes, that was my intention.
Do you think something like [to be revised during abstract review]
might be clearer? I think this makes a lot of sense for the project
update talks in particular - by that stage, for example, we'll have the
GSoC students selected.

> 2. To save space: Do we need the line "Bioinformatics Open Source
> Conference (BOSC) ..." after the author names?

I like it to make the page self contained, useful if we post it as a lone
PDF file. The text could be smaller certainly if required - likewise the
logo could be shrunk a little.

> 3. Again to save space, and make room to cite the Phylo paper: can we
> drop the citation for TogoWS, and add a few words of description in
> the main text where it's mentioned? (We don't cite PAML, HMMer, etc.)

Fair point, I was thinking in terms of audience recognition. PAML
and HMMer are quite well known and relatively old/mature.

If the Phylo paper is accepted in time to be added to abstract then
of course we'd want to include it. But right now using a couple of
lines for a 'submitted' citation seemed overkill to me. But if you can
get it to fit nicely, please go ahead.

> 4. How do you feel about dropping inline citations, and just have a
> list of \nocite references at the bottom? In a one-page abstract, it
> should be easy enough for readers to figure out what's what.

If you prefer, or use the [1] style?

Peter


From eric.talevich at gmail.com  Fri Apr 13 15:40:06 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 13 Apr 2012 11:40:06 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
Message-ID: <CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>

On Fri, Apr 13, 2012 at 10:42 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Fri, Apr 13, 2012 at 3:31 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> Thanks for this. I'll keep it as LaTeX, since it already looks nice.
>>
>> 1. Several parts say "[to be revised prior to BOSC]" -- I take it we
>> have the option of updating our abstract shortly before BOSC, and this
>> is a note to the conference organizers that we intend to do so? To
>> save space and reduce distraction, should this be a footnote instead?
>
> It is common for BOSC abstracts to be revised following review prior to
> acceptance (almost like a tiny paper), and yes, that was my intention.
> Do you think something like [to be revised during abstract review]
> might be clearer? I think this makes a lot of sense for the project
> update talks in particular - by that stage, for example, we'll have the
> GSoC students selected.
>
>> 2. To save space: Do we need the line "Bioinformatics Open Source
>> Conference (BOSC) ..." after the author names?
>
> I like it to make the page self contained, useful if we post it as a lone
> PDF file. The text could be smaller certainly if required - likewise the
> logo could be shrunk a little.
>
>> 3. Again to save space, and make room to cite the Phylo paper: can we
>> drop the citation for TogoWS, and add a few words of description in
>> the main text where it's mentioned? (We don't cite PAML, HMMer, etc.)
>
> Fair point, I was thinking in terms of audience recognition. PAML
> and HMMer are quite well known and relatively old/mature.
>
> If the Phylo paper is accepted in time to be added to abstract then
> of course we'd want to include it. But right now using a couple of
> lines for a 'submitted' citation seemed overkill to me. But if you can
> get it to fit nicely, please go ahead.
>
>> 4. How do you feel about dropping inline citations, and just have a
>> list of \nocite references at the bottom? In a one-page abstract, it
>> should be easy enough for readers to figure out what's what.
>
> If you prefer, or use the [1] style?
>
> Peter

Here's an updated draft. How does it look?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Biopython_BOSC_abstract_2012_draft.pdf
Type: application/pdf
Size: 262728 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120413/7c3bda8f/attachment-0002.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Biopython_BOSC_abstract_2012_draft.tex
Type: application/x-tex
Size: 5573 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120413/7c3bda8f/attachment-0002.tex>

From p.j.a.cock at googlemail.com  Fri Apr 13 15:57:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 13 Apr 2012 16:57:27 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
	<CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
Message-ID: <CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>

On Fri, Apr 13, 2012 at 4:40 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> Here's an updated draft. How does it look?

Looks fine to me - anyone else? A fresh pair of eyes would be good.

Also does anyone else want to be named as a talk co-author (and
promise to contribute with slides/figures/help for preparing the talk)?
Or should we just put "Eric et al" since he'll be the one on stage?

Peter


From anaryin at gmail.com  Fri Apr 13 16:02:04 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Fri, 13 Apr 2012 18:02:04 +0200
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
	<CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
	<CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
Message-ID: <CAJ9sUYM84rSr+F7VmF3nk83RiGz9dcE-sYZyuV5Ne1qSqaVNzQ@mail.gmail.com>

Third paragraph: 'summer' should read 'Summer'.

Good to me! I can help with the slides/figures/help, particularly on the
refactoring part of Bio.PDB to Bio.Struct. Let me know when and I can
easily get on Skype.

cheers!

João


From zhigang.wu at email.ucr.edu  Fri Apr 13 16:25:34 2012
From: zhigang.wu at email.ucr.edu (Zhigang Wu)
Date: Fri, 13 Apr 2012 09:25:34 -0700
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
	<CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
	<CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
Message-ID: <CADhJE9sNOM4+5UOfLgF1VxhkyvT9NfT7ZAWNW6v+4ropGa0tHw@mail.gmail.com>

Probably I caught a grammar mistake.

Should we correct  "Biopython 1.60 is expected *to have been* released by
BOSC 2012"  to "Biopython 1.60 is expected *to be* released by BOSC 2012"?

Probably I was wrong. I am not a native speaker. :-)

Zhigang


On Fri, Apr 13, 2012 at 8:57 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Fri, Apr 13, 2012 at 4:40 PM, Eric Talevich <eric.talevich at gmail.com>
> wrote:
> >
> > Here's an updated draft. How does it look?
>
> Looks fine to me - anyone else? A fresh pair of eyes would be good.
>
> Also does anyone else want to be named as a talk co-author (and
> promise to contribute with slides/figures/help for preparing the talk)?
> Or should we just put "Eric et al" since he'll be the one on stage?
>
> Peter


From arklenna at gmail.com  Fri Apr 13 16:31:53 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Fri, 13 Apr 2012 12:31:53 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CADhJE9sNOM4+5UOfLgF1VxhkyvT9NfT7ZAWNW6v+4ropGa0tHw@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
	<CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
	<CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
	<CADhJE9sNOM4+5UOfLgF1VxhkyvT9NfT7ZAWNW6v+4ropGa0tHw@mail.gmail.com>
Message-ID: <CAK610_5h6HdVdaNBVb8bQo8Z_oSkzJV2ooGjw+_qhDmLYVKBQw@mail.gmail.com>

On Fri, Apr 13, 2012 at 12:25 PM, Zhigang Wu <zhigang.wu at email.ucr.edu> wrote:
> Probably I caught a grammar mistake.
>
> Should we correct "Biopython 1.60 is expected *to have been* released by
> BOSC 2012" to "Biopython 1.60 is expected *to be* released by BOSC 2012"?
>
> Probably I was wrong. I am not a native speaker. :-)
>
> Zhigang
>

Hi Zhigang,

Actually, either way is correct - the original way is called the
future perfect tense.

Here's a description of the grammar if you are interested:
http://www.englishpage.com/verbpage/futureperfect.html

Lenna


From eric.talevich at gmail.com  Fri Apr 13 17:17:31 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 13 Apr 2012 13:17:31 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
	<CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
	<CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
Message-ID: <CAMC681=A9SDrjpmM3sWfwBXnidMa5q9Qac617-tkXJ5-1Vs_YA@mail.gmail.com>

On Fri, Apr 13, 2012 at 11:57 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Fri, Apr 13, 2012 at 4:40 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> Here's an updated draft. How does it look?
>
> Looks fine to me - anyone else? A fresh pair of eyes would be good.
>
> Also does anyone else want to be named as a talk co-author (and
> promise to contribute with slides/figures/help for preparing the talk)?
> Or should we just put "Eric et al" since he'll be the one on stage?
>
> Peter

I added João as the fourth author and submitted it.

Cheers,
Eric


From p.j.a.cock at googlemail.com  Fri Apr 13 19:32:32 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 13 Apr 2012 20:32:32 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To: <CAMC681=A9SDrjpmM3sWfwBXnidMa5q9Qac617-tkXJ5-1Vs_YA@mail.gmail.com>
References: <CAKVJ-_6_UnkTNk4pW6fs3fMgixLH-Q48O8KMmky3xc5X_xYicg@mail.gmail.com>
	<CAMC681=X7SazZQ6csOvzDGbqP4YaurJGegSY36=EZ7_nPCgkRQ@mail.gmail.com>
	<CAKVJ-_5daA5Vgk0dyUzAbYKTBVoHVUkRiReAFftQU3pZukruUQ@mail.gmail.com>
	<CAMC681=ZT6_Y9TUtRHmGR2zm=spaFZyr+gVZ+NrftBdhnPJ=nA@mail.gmail.com>
	<CAKVJ-_6WdWWQVADkVHLNOU5mSiUt-Bi+ctcvLov6AqWri1Vm6A@mail.gmail.com>
	<CAMC681msf5R6eOqoe8qMa3TxrpVyiNHjQ7=XQj7UmoDYFHG8jQ@mail.gmail.com>
	<CAKVJ-_5eYWqV+AVbcQeBRu3ECheWKMp1NxLWVWpcu1EBmAWjzA@mail.gmail.com>
	<CAMC681=A9SDrjpmM3sWfwBXnidMa5q9Qac617-tkXJ5-1Vs_YA@mail.gmail.com>
Message-ID: <CAKVJ-_5eXqj=0xaNkCjYa+n8m6CmWN5nDLW-7+1LbvdhkHMpiQ@mail.gmail.com>

On Fri, Apr 13, 2012 at 6:17 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> I added João as the fourth author and submitted it.
>
> Cheers,
> Eric

Thanks Eric,

If there are any other comments or changes, we'll try to integrate
them along with any reviewers' comments.

Peter


From tiagoantao at gmail.com  Mon Apr 16 09:35:21 2012
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 16 Apr 2012 10:35:21 +0100
Subject: [Biopython-dev] plink phasing and others
Message-ID: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>

Hi,

During the last few months I have been in a hell hole writing code
like mad. Maybe some of this code is of interest to share.

I currently have:

1. Code to parse plink output. Pretty trivial stuff, but I bet lots of
people are doing this
2. Code to process admixture results. Admixture is far less used than STRUCTURE
3. Code to deal with phasing formats. Beagle, PHASE and shapeit
4. PCA
5. Some gene ontology stuff

My GO stuff is pretty specific, so I guess it might not be of interest.
All the other components cover fairly widely used tools.
Admixture and PCA are standard popgen analysis. Admixture code could
probably be changed to also support STRUCTURE. I am not sure but PCA
might only work on linux.
Plink and phasing are of more general interest. These would be out of
Bio.PopGen.

None of this code has unusual requirements, with one
exception: admixture and PCA require matplotlib.

So that people have an understanding of the impact of these things, I
put the number of scholar citations:
plink - 3315
smartpca - 1673
admixture - 57
structure - 7448
beagle - >300
fastphase - 1935

Unfortunately there is little code to do automated analysis using these tools.

I could start migrating some of this code to biopython (would have to
write documentation, and comment the code better ;) )

-- 
"Liberty for wolves is death to the lambs" - Isaiah Berlin


From p.j.a.cock at googlemail.com  Mon Apr 16 10:26:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 11:26:30 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
References: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
Message-ID: <CAKVJ-_4u6ZEbz8hWYcC_ZG71+XOwtjqgyO+E4rViQYcpAWARTQ@mail.gmail.com>

2012/4/16 Tiago Antão <tiagoantao at gmail.com>:
> Hi,
>
> During the last few months I have been in a hell hole writing code
> like mad. Maybe some of this code is of interest to share.
>
> I currently have:
>
> 1. Code to parse plink output. Pretty trivial stuff, but I bet lots of
> people are doing this
> 2. Code to process admixture results. Admixture is far less used than STRUCTURE
> 3. Code to deal with phasing formats. Beagle, PHASE and shapeit
> 4. PCA
> 5. Some gene ontology stuff
>
> My GO stuff is pretty specific, so I guess it might not be of interest.
> All the other components cover fairly widely used tools.
> Admixture and PCA are standard popgen analysis. Admixture code could
> probably be changed to also support STRUCTURE. I am not sure but PCA
> might only work on linux.
> Plink and phasing are of more general interest. These would be out of
> Bio.PopGen.
>
> None of this code has unusual requirements, with one
> exception: admixture and PCA require matplotlib.
>
> So that people have an understanding of the impact of these things, I
> put the number of scholar citations:
> plink - 3315
> smartpca - 1673
> admixture - 57
> structure - 7448
> beagle - >300
> fastphase - 1935
>
> Unfortunately there is little code to do automated analysis using these tools.
>
> I could start migrating some of this code to biopython (would have to
> write documentation, and comment the code better ;) )

Sounds good. The GO stuff would/should be more general than just
PopGen, and I know other people are looking at this on branches.

When you said PCA, that was principal component analysis, right?
What are you adding on top of NumPy/SciPy/matplotlib?

Peter


From tiagoantao at gmail.com  Mon Apr 16 12:05:34 2012
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 16 Apr 2012 13:05:34 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To: <CAKVJ-_4u6ZEbz8hWYcC_ZG71+XOwtjqgyO+E4rViQYcpAWARTQ@mail.gmail.com>
References: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
	<CAKVJ-_4u6ZEbz8hWYcC_ZG71+XOwtjqgyO+E4rViQYcpAWARTQ@mail.gmail.com>
Message-ID: <CAA9RGEPn+xfg=ZY2nwL9bXYABDDjJQN+0R3-wWbYv02sYLo79A@mail.gmail.com>

2012/4/16 Peter Cock <p.j.a.cock at googlemail.com>:
> Sounds good. The GO stuff would/should be more general than just
> PopGen, and I know other people are looking at this on branches.

What I do here is things like tree traversing (e.g. find all parent
nodes) and stuff like that. After that I do enrichment analysis
(fisher exact test, fdr, that stuff). Nothing of real interest for
now. I think we can ignore my code here (for now).
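
The three steps just mentioned (collecting all parent nodes, a Fisher
exact test, FDR correction) can be sketched with the standard library
alone. This is an illustration under stated assumptions, not the code
being discussed: the ontology is assumed to be a plain dict mapping each
term to its direct parents, the test is the one-sided (enrichment)
variant, FDR is taken to mean Benjamini-Hochberg, and all function names
are hypothetical. It needs Python 3.8+ for math.comb.

```python
from math import comb


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def fisher_enrichment(k, n, K, N):
    """One-sided Fisher exact p-value, P(X >= k) for hypergeometric X.

    N genes in the background, K of them annotated with the term;
    n genes in the study set, k of them annotated with the term.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, idx in enumerate(order):
        rank = m - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In a typical workflow each gene's annotations would first be propagated
to all ancestors of its terms, then one p-value computed per term and the
whole list corrected together.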

> When you said PCA, that was principal component analysis, right?

Yep, I am using eigenstrat/smartpca.

> What are you adding on top of NumPy/SciPy/matplotlib?

PCA plots and admixture plots.
Here is an example of both:
http://2.bp.blogspot.com/-6J6Gsas4uIs/TuELU3Gf4ZI/AAAAAAAAEWQ/CymvlzkX6hQ/s1600/PIIS0002929711004885.gr2_lrg.hi.jpg
TOP: PCA
Bottom: admixture


-- 
"Liberty for wolves is death to the lambs" - Isaiah Berlin


From p.j.a.cock at googlemail.com  Mon Apr 16 13:50:18 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 14:50:18 +0100
Subject: [Biopython-dev] [biopython] Fix flex library dependency of
 MMCIFlex; closes 2619 (#31)
In-Reply-To: <CAK610_4kx7=v_AkTrg-DDsP7OhB+C6XGnBt+BSjMqMkNLeJDrA@mail.gmail.com>
References: <biopython/biopython/pull/31@github.com>
	<CAKVJ-_5FqKjoWqKd9SRD0=Sz3=_4BGDWpr1qomBxnpy6NaLnsw@mail.gmail.com>
	<CAK610_6ENA5W=4AUQv6XU8dd0pC0AszhGubN4JPAuLo65ec5oQ@mail.gmail.com>
	<CAKVJ-_647+L7j1TanQKhbDgL-2++hahxH5MQXEdEMyiiJv+VxQ@mail.gmail.com>
	<CAKVJ-_7sPgGaR8q9YF6+Ng2JeCXfJ4D05GDajObvjXHHwQ53Fg@mail.gmail.com>
	<CAK610_7TU99wF7NNeh5ukpdDVv8mhK+hDCtT1N-UOUb72=nPSg@mail.gmail.com>
	<CAKVJ-_4-xL4ZmWpMPfD7Xf1Et4yLmDHTL1az5ALVz9nJ-8hvgg@mail.gmail.com>
	<CAK610_7CHv88EjZcqZEdqo4Z_51FYJcZmGD_vhZ-iTDU-ULVuA@mail.gmail.com>
	<CAKVJ-_4hyUBpckiBQ4wUy_Ow9QT7pMy2tOhCbfoPeWVEbAfQwQ@mail.gmail.com>
	<CAKVJ-_6VAMv6CAT2_An-syJ2ezOC4vT6kwj2vt0wHeCc07KHaw@mail.gmail.com>
	<CAK610_4kx7=v_AkTrg-DDsP7OhB+C6XGnBt+BSjMqMkNLeJDrA@mail.gmail.com>
Message-ID: <CAKVJ-_79nCQ15EBn7g0cirRsLUodZCCjy3=E2-8xEgt4VUmniQ@mail.gmail.com>

On Fri, Apr 13, 2012 at 4:26 AM, Lenna Peterson <arklenna at gmail.com> wrote:
>
> Hi Peter,
>
> I installed flex on my Windows VM and used it to generate lex.yy.c. It
> puts #include <unistd.h> inside an #ifdef so it may work with MSVC. It
> produces a working module for both Debian and Mac OS X (I do get
> "defined but not used" warnings for generated functions). I've
> cherry-picked it into my pull request.
>

I've now tested that on my Windows machine (and Mac and Linux),
and applied the changes to the master branch. Thanks!

We must remember to drop an email to the Debian and RedHat
packaging teams since their old patch to setup.py isn't needed
now (they could control the flex problem by declaring it a build
time dependency).

Peter


From tiagoantao at gmail.com  Mon Apr 16 15:00:13 2012
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 16 Apr 2012 16:00:13 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
References: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
Message-ID: <CAA9RGEPKpK4Ti9_odipquk6qq1g0jhy3ARDvjZgv9BS-OTstpA@mail.gmail.com>

Just a few practical things:

1. we still do not allow matplotlib dependencies, correct?
2. to what part of the name space should plink and phasing be added?
3. Are we on epydoc or sphinx? Or moving from one to the other?
doctest is acceptable, right?
4. What is the current best way to run external applications? There
was an application wrapper class in the past...


From p.j.a.cock at googlemail.com  Mon Apr 16 15:18:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 16:18:10 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To: <CAA9RGEPKpK4Ti9_odipquk6qq1g0jhy3ARDvjZgv9BS-OTstpA@mail.gmail.com>
References: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
	<CAA9RGEPKpK4Ti9_odipquk6qq1g0jhy3ARDvjZgv9BS-OTstpA@mail.gmail.com>
Message-ID: <CAKVJ-_78-0YsHOSZ7J0SXcBdmQysdDx0EC7HTLdtEwi9nq2mYw@mail.gmail.com>

2012/4/16 Tiago Antão <tiagoantao at gmail.com>:
> Just a few practical things:
>
> 1. we still do not allow matplotlib dependencies, correct?

They would be run time dependencies, right? Not compile/build time?
We already have things like 'soft' dependencies on ReportLab and
NetworkX, and even matplotlib. It does complicate the unit tests a
bit to skip anything gracefully.
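
One way to keep a soft dependency from breaking the test run is to skip
the affected tests when the import fails. A minimal sketch using the
standard library's unittest skip decorators (available from Python 2.7
onwards); this is not the mechanism run_tests.py itself uses, and the
class and test names are made up for illustration:

```python
import unittest

try:
    import matplotlib  # optional ("soft") dependency
except ImportError:
    matplotlib = None


class PlotTests(unittest.TestCase):
    @unittest.skipIf(matplotlib is None, "matplotlib not installed")
    def test_needs_matplotlib(self):
        # Only runs when the optional dependency is importable.
        self.assertTrue(hasattr(matplotlib, "__version__"))

    def test_always_runs(self):
        # Tests without the dependency are unaffected.
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PlotTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A skipped test is reported as skipped rather than failed, so the overall
run still counts as successful on machines without matplotlib.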

>
> 2. to what part of the name space should plink and phasing be added?

Unclear to me right now.

> 3. Are we on epydoc or sphinx? Or moving from one to the other?
> doctest is acceptable, right?

We're still using LaTeX for the tutorial, and epydoc for the API docs.

Using doctest is acceptable and encouraged for documentation,
but be wary of cross platform differences. If you have a doctest
which has dependencies, see test_wise.py rather than adding it
to run_tests.py.

> 4. What is the current best way to run external applications? There
> was an application wrapper class in the past...

For simple Unix style applications controlled via the command line,
use the Bio.Application framework as in Bio.Align.Applications or
Bio.Sequencing.Applications, Bio.Phylo.Applications, or
Bio.Emboss.Applications (etc?).

Peter


From p.j.a.cock at googlemail.com  Mon Apr 16 15:20:59 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 16:20:59 +0100
Subject: [Biopython-dev] Enhancements to Phylo.draw;
	pyplot best practices
In-Reply-To: <CAMC681k3p-mRzUcdFmQW0_wsx64ENOgDdGtWtpzVROedax1EXg@mail.gmail.com>
References: <CAMC681nyVV6mNNSgT1zZeek+NwE3UwxprRArgGNFMeX-b3yPpA@mail.gmail.com>
	<CAMC681k3p-mRzUcdFmQW0_wsx64ENOgDdGtWtpzVROedax1EXg@mail.gmail.com>
Message-ID: <CAKVJ-_5QdK=eTdnvyvyPFkCHJ+2X+tCUSTotZcWGw5p5k-k3GA@mail.gmail.com>

On Sat, Apr 7, 2012 at 7:42 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> On Wed, Apr 4, 2012 at 10:53 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
>> Hi all,
>>
>> I'm considering some enhancements to the Phylo.draw function to make it
>> more customizable for power users. Since the function is based on
>> matplotlib/pylab/pyplot, it's possible for quite a bit to be left to the
>> user; however, I'm not fully versed in what pyplot is capable of.
>>
>> Relevant feature request in Redmine:
>> https://redmine.open-bio.org/issues/3336
>>
>> Ideas:
>
> [...]
>
> Just committed this feature:
> https://github.com/biopython/biopython/commit/72990549a1b769ab19ab0bd33a8c35fdf031ac2d

Hi Eric,

That seems to have caused a test failure on one of our buildslaves:

======================================================================
ERROR: Run the tree layout algorithm, but don't display it.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildslave/BuildBot/lin2664/build/Tests/test_Phylo_depend.py",
line 51, in test_draw
    Phylo.draw(dollo, do_show=False)
  File "/home/buildslave/BuildBot/lin2664/build/build/lib.linux-x86_64-2.6/Bio/Phylo/_utils.py",
line 366, in draw
    fig = plt.figure()
  File "/usr/local/lib/python2.6/site-packages/matplotlib/pyplot.py",
line 270, in figure
    **kwargs)
  File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wxagg.py",
line 120, in new_figure_manager
    backend_wx._create_wx_app()
  File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wx.py",
line 1377, in _create_wx_app
    wxapp = wx.PySimpleApp()
  File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py",
line 8078, in __init__
    wx.App.__init__(self, redirect, filename, useBestVisual, clearSigInt)
  File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py",
line 7946, in __init__
    raise SystemExit(msg)
SystemExit: Unable to access the X Display, is $DISPLAY set properly?

----------------------------------------------------------------------

http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/534/steps/shell/logs/stdio
http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/535/steps/shell/logs/stdio

Interestingly the same machine is passing the tests under other Python versions.
That would seem to rule out the $DISPLAY environment variable being the cause.
My hunch would be this is something about the Python 2.6 install, perhaps it
is missing some library (wxPython maybe).

Logged in as the buildslave on this machine I can see that both Python 2.6 & 2.7
have the same version of matplotlib installed, but only one is failing the test:

$ python2.5
Python 2.5.5 (r255:77872, Jan 14 2011, 17:09:55)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named matplotlib

$ python2.6
Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib
>>> matplotlib.__version__
'1.0.0'

$ python2.7
Python 2.7 (r27:82500, Jul 13 2010, 14:02:41)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib
>>> matplotlib.__version__
'1.0.0'


Peter


From tiagoantao at gmail.com  Mon Apr 16 15:31:50 2012
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 16 Apr 2012 16:31:50 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To: <CAKVJ-_78-0YsHOSZ7J0SXcBdmQysdDx0EC7HTLdtEwi9nq2mYw@mail.gmail.com>
References: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
	<CAA9RGEPKpK4Ti9_odipquk6qq1g0jhy3ARDvjZgv9BS-OTstpA@mail.gmail.com>
	<CAKVJ-_78-0YsHOSZ7J0SXcBdmQysdDx0EC7HTLdtEwi9nq2mYw@mail.gmail.com>
Message-ID: <CAA9RGEPBVfr=fvG4REOk0ZNrWPvS14ycJ-1mTkrYBMGNfZUbPw@mail.gmail.com>

2012/4/16 Peter Cock <p.j.a.cock at googlemail.com>:
> For simple Unix style applications controlled via the command line,
> use the Bio.Application framework as in Bio.Align.Applications or
> Bio.Sequencing.Applications, Bio.Phylo.Applications, or
> Bio.Emboss.Applications (etc?).

I wonder whether anyone has needed to abstract the computing
infrastructure? The current code does local (blocking) execution, but
we see environments with BAS or grids where other models are used. I
am not suggesting any specific solution, but the current approach
does not seem very scalable to me. No?


-- 
"Liberty for wolves is death to the lambs" - Isaiah Berlin


From p.j.a.cock at googlemail.com  Mon Apr 16 16:08:20 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 17:08:20 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To: <CAA9RGEPBVfr=fvG4REOk0ZNrWPvS14ycJ-1mTkrYBMGNfZUbPw@mail.gmail.com>
References: <CAA9RGEOD1BiaRAF5zYSoad_s5fZ3Jq6mQTRbbvuY7bNooz_E0g@mail.gmail.com>
	<CAA9RGEPKpK4Ti9_odipquk6qq1g0jhy3ARDvjZgv9BS-OTstpA@mail.gmail.com>
	<CAKVJ-_78-0YsHOSZ7J0SXcBdmQysdDx0EC7HTLdtEwi9nq2mYw@mail.gmail.com>
	<CAA9RGEPBVfr=fvG4REOk0ZNrWPvS14ycJ-1mTkrYBMGNfZUbPw@mail.gmail.com>
Message-ID: <CAKVJ-_4u+V9+HXoUo0fGTvnS8pi3QQ6wHbsF8KiHPSjOioL90g@mail.gmail.com>

2012/4/16 Tiago Antão <tiagoantao at gmail.com>:
> 2012/4/16 Peter Cock <p.j.a.cock at googlemail.com>:
>> For simple Unix style applications controlled via the command line,
>> use the Bio.Application framework as in Bio.Align.Applications or
>> Bio.Sequencing.Applications, Bio.Phylo.Applications, or
>> Bio.Emboss.Applications (etc?).
>
> I wonder whether anyone has needed to abstract the computing
> infrastructure? The current code does local (blocking) execution, but
> we see environments with BAS or grids where other models are used. I
> am not suggesting any specific solution, but the current approach
> does not seem very scalable to me. No?

I use the current framework with an SGE cluster: str(cline_object)
gives the command line string to submit as the job.
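
As a sketch of that approach, the helper below turns a command line
string (such as the one str(cline_object) would give) into an argument
list for an SGE-style qsub. The helper name, the job name and the example
command are hypothetical, and the qsub options shown (-N, -cwd, -b y) are
an assumption about the local scheduler setup; nothing is submitted here.

```python
import shlex


def build_qsub_command(command_line, job_name="biopython_job"):
    """Return an argv list that would submit `command_line` via qsub.

    Assumes an SGE-style qsub accepting -N (job name), -cwd (run in the
    current directory) and -b y (treat the command as a binary, not a script).
    """
    return ["qsub", "-N", job_name, "-cwd", "-b", "y"] + shlex.split(command_line)


# Hypothetical command line string, standing in for str(cline_object).
EXAMPLE = build_qsub_command("muscle -in input.fasta -out aligned.fasta", "aln1")
```

The resulting list could be passed to subprocess.call on a submit host.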

It would be nice to have some documented examples using
this in combination with multiprocessing or something... but
I find most of the tools I call are already multi-threaded.
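
A minimal sketch of such an example, assuming the external tools are
single-threaded and independent. It uses a thread pool from
concurrent.futures rather than multiprocessing, since the real work
happens in the child processes anyway. The commands here just call the
Python interpreter as a stand-in for real tools.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(argv):
    """Run one command, returning (return code, captured stdout)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def run_all(commands, max_workers=2):
    """Run several commands concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_command, commands))


# Stand-in commands: the interpreter itself, so no external tool is needed.
COMMANDS = [[sys.executable, "-c", "print(%d * 2)" % n] for n in range(3)]
```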

Peter


From andrew.sczesnak at med.nyu.edu  Mon Apr 16 16:48:41 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Mon, 16 Apr 2012 12:48:41 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>
References: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>
Message-ID: <4F8C4D69.4040009@med.nyu.edu>

Hi Eric,

I was playing with Bio.Cluster recently and noticed that trees generated 
by that module are not compatible with Bio.Phylo. I think it would be 
useful if output from Cluster could be manipulated with Phylo.

At first glance, it doesn't seem like it would be that tricky to add a 
method of converting Bio.Cluster tree objects to Bio.Phylo tree objects, 
and I would be happy to work on this. Before making an attempt, I wanted 
to get your feedback on whether you think this would be useful and if 
you had anything similar in the works already.


Best,
Andrew


From eric.talevich at gmail.com  Mon Apr 16 22:15:14 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 16 Apr 2012 18:15:14 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To: <4F8C4D69.4040009@med.nyu.edu>
References: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>
	<4F8C4D69.4040009@med.nyu.edu>
Message-ID: <CAMC681mAkQDmqDmxBeHRdj_rjS7E1mJMLFiyX00AA9pUR78auQ@mail.gmail.com>

On Mon, Apr 16, 2012 at 12:48 PM, Andrew Sczesnak
<andrew.sczesnak at med.nyu.edu> wrote:
> Hi Eric,
>
> I was playing with Bio.Cluster recently and noticed that trees generated by
> that module are not compatible with Bio.Phylo. I think it would be useful if
> output from Cluster could be manipulated with Phylo.
>
> At first glance, it doesn't seem like it would be that tricky to add a
> method of converting Bio.Cluster tree objects to Bio.Phylo tree objects, and
> I would be happy to work on this. Before making an attempt, I wanted to get
> your feedback on whether you think this would be useful and if you had
> anything similar in the works already.
>
>
> Best,
> Andrew

Hi Andrew,

Interesting idea. It would be simple enough to add a "from_cluster"
function or class method to either Phylo/BaseTree.py or
Phylo/_utils.py. But as every scientist knows, just because we can
doesn't necessarily mean we should. Do you have a specific use case in
mind?

If the main idea is to use Bio.Cluster to generate trees based on a
measure of sequence distance, we could probably do more to support
that. This code might also be worth posting on wiki "Phylo cookbook"
page (http://www.biopython.org/wiki/Phylo_cookbook) to get more eyes
on it while we consider merging it into the main distribution.

-Eric


From andrew.sczesnak at med.nyu.edu  Mon Apr 16 22:47:25 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Mon, 16 Apr 2012 18:47:25 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To: <CAMC681mAkQDmqDmxBeHRdj_rjS7E1mJMLFiyX00AA9pUR78auQ@mail.gmail.com>
References: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>
	<4F8C4D69.4040009@med.nyu.edu>
	<CAMC681mAkQDmqDmxBeHRdj_rjS7E1mJMLFiyX00AA9pUR78auQ@mail.gmail.com>
Message-ID: <4F8CA17D.4080907@med.nyu.edu>

Eric,

I can describe two use cases from my own experience. First, the MAF 
parser I've been working on can pull the multiple alignment of some gene 
between a bunch of genomes. Thinking of recipes for the cookbook, I 
thought it would be neat to walk the user through constructing a 
distance matrix by hand (though you're right--more could be done to 
support this), clustering with Bio.Cluster and visualizing the result 
with Bio.Phylo. I like this example because it integrates several 
different parts of BioPython along with a lesson about inferring 
distances between sequences.
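
For the cookbook idea, the hand-built distance matrix could be as simple
as a p-distance (fraction of differing sites) over aligned sequences. A
minimal sketch, assuming gapped columns are ignored pairwise; the toy
alignment and function names are made up for illustration:

```python
def p_distance(seq_a, seq_b):
    """Fraction of differing positions, skipping columns with a gap in either."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Return (sorted names, square matrix of pairwise p-distances)."""
    names = sorted(alignment)
    matrix = [[p_distance(alignment[x], alignment[y]) for y in names]
              for x in names]
    return names, matrix


# Toy alignment: name -> aligned sequence.
ALIGNMENT = {"human": "ACGTACGT", "mouse": "ACGTTCGT", "chick": "AC-TTCGA"}
```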

Second, for another project, I've been generating distance matrices 
based on the shared gene content of bacterial genomes and the 
presence-or-absence of orthologous groups in each. Presently, I ferry 
the matrices to a clustering program and then visualize the resulting 
trees in yet another tool. Looking into ways of streamlining this 
brought me back to Bio.Cluster, Bio.Phylo and the incompatibility of 
their tree objects.

I wonder, what would be the most elegant way of bridging the gap?
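
One possible bridge is a Newick string, which Bio.Phylo can already
parse. The sketch below works on plain (left, right, distance) tuples
rather than real Bio.Cluster objects. It assumes the convention that
non-negative indices are leaves and index -(i + 1) refers to the cluster
created by record i, and it assumes an ultrametric tree in which each
cluster sits at half its merge distance. Non-monotonic merge distances
would give negative branch lengths.

```python
def cluster_nodes_to_newick(nodes, labels):
    """Convert (left, right, distance) merge records to a Newick string.

    Assumed convention: indices >= 0 are leaves; index -(i + 1) refers to
    the cluster created by nodes[i]. The last record is taken as the root.
    """
    def height(index):
        # Leaves sit at height 0; a cluster sits at half its merge distance.
        return 0.0 if index >= 0 else nodes[-index - 1][2] / 2.0

    def render(index):
        if index >= 0:
            return labels[index]
        left, right, distance = nodes[-index - 1]
        h = distance / 2.0
        return "(%s:%g,%s:%g)" % (render(left), h - height(left),
                                  render(right), h - height(right))

    return render(-len(nodes)) + ";"


# Toy input: a and b merge at distance 0.2, then that cluster joins c at 0.6.
NEWICK = cluster_nodes_to_newick([(0, 1, 0.2), (-1, 2, 0.6)], ["a", "b", "c"])
```

The recursion is fine for small trees; very deep trees would need an
iterative version.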


Best,
Andrew

On 04/16/2012 06:15 PM, Eric Talevich wrote:
> On Mon, Apr 16, 2012 at 12:48 PM, Andrew Sczesnak
> <andrew.sczesnak at med.nyu.edu>  wrote:
>> Hi Eric,
>>
>> I was playing with Bio.Cluster recently and noticed that trees generated by
>> that module are not compatible with Bio.Phylo. I think it would be useful if
>> output from Cluster could be manipulated with Phylo.
>>
>> At first glance, it doesn't seem like it would be that tricky to add a
>> method of converting Bio.Cluster tree objects to Bio.Phylo tree objects, and
>> I would be happy to work on this. Before making an attempt, I wanted to get
>> your feedback on whether you think this would be useful and if you had
>> anything similar in the works already.
>>
>>
>> Best,
>> Andrew
>
> Hi Andrew,
>
> Interesting idea. It would be simple enough to add a "from_cluster"
> function or class method to either Phylo/BaseTree.py or
> Phylo/_utils.py. But as every scientist knows, just because we can
> doesn't necessarily mean we should. Do you have a specific use case in
> mind?
>
> If the main idea is to use Bio.Cluster to generate trees based on a
> measure of sequence distance, we could probably do more to support
> that. This code might also be worth posting on wiki "Phylo cookbook"
> page (http://www.biopython.org/wiki/Phylo_cookbook) to get more eyes
> on it while we consider merging it into the main distribution.
>
> -Eric


From eric.talevich at gmail.com  Tue Apr 17 04:17:26 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 17 Apr 2012 00:17:26 -0400
Subject: [Biopython-dev] Enhancements to Phylo.draw;
	pyplot best practices
In-Reply-To: <CAKVJ-_5QdK=eTdnvyvyPFkCHJ+2X+tCUSTotZcWGw5p5k-k3GA@mail.gmail.com>
References: <CAMC681nyVV6mNNSgT1zZeek+NwE3UwxprRArgGNFMeX-b3yPpA@mail.gmail.com>
	<CAMC681k3p-mRzUcdFmQW0_wsx64ENOgDdGtWtpzVROedax1EXg@mail.gmail.com>
	<CAKVJ-_5QdK=eTdnvyvyPFkCHJ+2X+tCUSTotZcWGw5p5k-k3GA@mail.gmail.com>
Message-ID: <CAMC681mQpjncc-wBctPPDgqRQEbO5RVJFeZ+=ky_dZinTr939g@mail.gmail.com>

On Mon, Apr 16, 2012 at 11:20 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hi Eric,
>
> That seems to have caused a test failure on one of our buildslaves:
>
> ======================================================================
> ERROR: Run the tree layout algorithm, but don't display it.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/home/buildslave/BuildBot/lin2664/build/Tests/test_Phylo_depend.py",
> line 51, in test_draw
>     Phylo.draw(dollo, do_show=False)
>   File "/home/buildslave/BuildBot/lin2664/build/build/lib.linux-x86_64-2.6/Bio/Phylo/_utils.py",
> line 366, in draw
>     fig = plt.figure()
>   File "/usr/local/lib/python2.6/site-packages/matplotlib/pyplot.py",
> line 270, in figure
>     **kwargs)
>   File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wxagg.py",
> line 120, in new_figure_manager
>     backend_wx._create_wx_app()
>   File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wx.py",
> line 1377, in _create_wx_app
>     wxapp = wx.PySimpleApp()
>   File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py",
> line 8078, in __init__
>     wx.App.__init__(self, redirect, filename, useBestVisual, clearSigInt)
>   File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py",
> line 7946, in __init__
>     raise SystemExit(msg)
> SystemExit: Unable to access the X Display, is $DISPLAY set properly?
>
> ----------------------------------------------------------------------
>
> http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/534/steps/shell/logs/stdio
> http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/535/steps/shell/logs/stdio
>
> Interestingly the same machine is passing the tests under other Python versions.
> That would seem to rule out the $DISPLAY environment variable being the cause.
> My hunch would be this is something about the Python 2.6 install, perhaps it
> is missing some library (wxPython maybe).
>
> Logged in as the buildslave on this machine I can see that both Python 2.6 & 2.7
> have the same version of matplotlib installed, but only one is failing the test:
>
> $ python2.5
> Python 2.5.5 (r255:77872, Jan 14 2011, 17:09:55)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import matplotlib
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named matplotlib
>
> $ python2.6
> Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import matplotlib
> >>> matplotlib.__version__
> '1.0.0'
>
> $ python2.7
> Python 2.7 (r27:82500, Jul 13 2010, 14:02:41)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import matplotlib
> >>> matplotlib.__version__
> '1.0.0'
>
>
> Peter


Actually, it was this commit which added new unit tests:
https://github.com/biopython/biopython/commit/a5f995bc1d9e113d88195cc7ae6f389984d762d8

On my machine with Python 2.6 and Ubuntu, the test passes, so I'm not
sure how to debug this, exactly. Do you know a way to prevent
matplotlib from attempting to launch the Wx app, beyond turning off
interactive mode as the test already does?

One idea is to specify a matplotlib backend other than wx. For
example, using this import approach in test_Phylo_depend.py might do
the trick:

try:
    import matplotlib
except ImportError:
    raise MissingExternalDependencyError(
            "Install matplotlib if you want to use Bio.Phylo._utils.")
else:
    # Don't use the Wx backend for matplotlib, b/c that depends on Wx being
    # properly set up on the build machine. Instead, use the simpler postscript
    # backend -- we're not going to display or save the plot anyway, so it
    # doesn't matter much, as long as it's not Wx. I guess.
    matplotlib.use("ps")
    from matplotlib import pyplot


Would you be able to test this on the errant buildbot machine without
having to commit this to the trunk?


Thanks,
Eric


From p.j.a.cock at googlemail.com  Tue Apr 17 09:31:05 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 17 Apr 2012 10:31:05 +0100
Subject: [Biopython-dev] Enhancements to Phylo.draw;
	pyplot best practices
In-Reply-To: <CAMC681mQpjncc-wBctPPDgqRQEbO5RVJFeZ+=ky_dZinTr939g@mail.gmail.com>
References: <CAMC681nyVV6mNNSgT1zZeek+NwE3UwxprRArgGNFMeX-b3yPpA@mail.gmail.com>
	<CAMC681k3p-mRzUcdFmQW0_wsx64ENOgDdGtWtpzVROedax1EXg@mail.gmail.com>
	<CAKVJ-_5QdK=eTdnvyvyPFkCHJ+2X+tCUSTotZcWGw5p5k-k3GA@mail.gmail.com>
	<CAMC681mQpjncc-wBctPPDgqRQEbO5RVJFeZ+=ky_dZinTr939g@mail.gmail.com>
Message-ID: <CAKVJ-_74uXOPHy=heMi7=PMc_TrJt7UFiwHsTORpHK_TO86-TQ@mail.gmail.com>

On Tue, Apr 17, 2012 at 5:17 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> Actually, it was this commit which added new unit tests:
> https://github.com/biopython/biopython/commit/a5f995bc1d9e113d88195cc7ae6f389984d762d8
>

OK - thanks for checking.

> On my machine with Python 2.6 and Ubuntu, the test passes, so I'm not
> sure how to debug this, exactly. Do you know a way to prevent
> matplotlib from attempting to launch the Wx app, beyond turning off
> interactive mode as the test already does?

Not sure.

> One idea is to specify a matplotlib backend other than wx. For
> example, using this import approach in test_Phylo_depend.py might do
> the trick:
>
> try:
>     import matplotlib
> except ImportError:
>     raise MissingExternalDependencyError(
>             "Install matplotlib if you want to use Bio.Phylo._utils.")
> else:
>     # Don't use the Wx backend for matplotlib, b/c that depends on Wx being
>     # properly set up on the build machine. Instead, use the simpler postscript
>     # backend -- we're not going to display or save the plot anyway, so it
>     # doesn't matter much, as long as it's not Wx. I guess.
>     matplotlib.use("ps")
>     from matplotlib import pyplot
>
>
> Would you be able to test this on the errant buildbot machine without
> having to commit this to the trunk?

Yes, that works (this buildbot is one of 'my' servers so I can run this
directly). Please check that fix in.

Thanks,

Peter


From p.j.a.cock at googlemail.com  Tue Apr 17 15:23:22 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 17 Apr 2012 16:23:22 +0100
Subject: [Biopython-dev] BGZF support,
	was Re: Biopython 1.60 plans and beyond
Message-ID: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>

On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> Here are some things that I think are strong
> candidates for 1.60 (not an exclusive list!)
>
> ...
>
> BGZF support: Low level module like Python's gzip,
> support in SeqIO for indexing BGZF compressed files,
> ...

I've just rebased my bgzf branch, which I think is ready to apply to the
trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
https://github.com/peterjc/biopython/tree/bgzf2

Would anyone like to review this please? There are unittests and
plenty of docstrings - but so far nothing in the Tutorial though.

I wrote a blog post late last year explaining what this allows, and
this branch includes the changes to Bio.SeqIO for indexing BGZF
compressed sequence files that the post discussed:
http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

The probable next step after this is combining it with Andrew Sczesnak's
work on indexing MAF files (they can get pretty big) as explored by 'I.J.'
(who as far as I know hasn't signed up to the biopython-dev list, BCC'd).

Also it would be interesting to explore doing the (de)compression of
blocks on worker threads to take advantage of multiple cores.
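
A rough sketch of that idea, using plain zlib on fixed-size chunks rather
than real BGZF blocks (no BGZF headers or virtual offsets here, and the
block size and function names are illustrative). It relies on each block
being compressed independently, which is what makes the work easy to
farm out to a thread pool.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024  # illustrative chunk size, not the BGZF block layout


def split_blocks(data, size=BLOCK_SIZE):
    """Cut the data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def compress_blocks(data, workers=4):
    """Compress each block independently on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, split_blocks(data)))


def decompress_blocks(blocks, workers=4):
    """Decompress the blocks on worker threads and join them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))
```

Whether this gives a real speed-up depends on the interpreter releasing
its global lock during (de)compression, so it would need measuring on
each platform.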

Another idea would be to switch from a plain dictionary to an
ordered dictionary for holding cached decompressed blocks,
giving a way to drop the oldest block (although perhaps not as
clever as dropping the least recently used block, the overhead is
lower). That would require including our own OrderedDict class
on the older Python platforms.
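
A minimal sketch of such a cache, assuming blocks are keyed by their
start offset. The class name and methods are hypothetical and not taken
from the bgzf branch.

```python
from collections import OrderedDict


class BlockCache:
    """Hold at most `max_blocks` entries, dropping the oldest inserted first."""

    def __init__(self, max_blocks):
        self.max_blocks = max_blocks
        self._blocks = OrderedDict()

    def get(self, offset):
        """Return the cached block, or None if it is not cached."""
        return self._blocks.get(offset)

    def put(self, offset, data):
        if offset in self._blocks:
            self._blocks[offset] = data  # keeps its original position
            return
        if len(self._blocks) >= self.max_blocks:
            self._blocks.popitem(last=False)  # evict the oldest block
        self._blocks[offset] = data

    def __len__(self):
        return len(self._blocks)
```

With this first-in, first-out policy a block that was read a moment ago
can still be evicted. Moving an entry to the end on every get would turn
it into a least-recently-used cache at the cost of a little bookkeeping.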

Peter

[*] PyPy testing is complicated by running out of file handles,
an existing issue not something directly from this work. Part
of this is down to different GC under PyPy.


From eric.talevich at gmail.com  Tue Apr 17 15:25:35 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 17 Apr 2012 11:25:35 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To: <4F8CA17D.4080907@med.nyu.edu>
References: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>
	<4F8C4D69.4040009@med.nyu.edu>
	<CAMC681mAkQDmqDmxBeHRdj_rjS7E1mJMLFiyX00AA9pUR78auQ@mail.gmail.com>
	<4F8CA17D.4080907@med.nyu.edu>
Message-ID: <CAMC681m9Vd3J-CVfJE8aMq0Pmd66OXoThdJD1LaFguWagNTqqQ@mail.gmail.com>

Andrew,

It would be useful to have a quick and portable function for
distance-based tree estimation in Bio.Phylo, since otherwise it's
necessary to use one of the wrappers for external programs in
Bio.Phylo.Applications. (And currently, only PhyML is wrapped.) Does
the hierarchical clustering algorithm in Bio.Cluster correspond to any
common tree-estimation algorithm, e.g. UPGMA? If so, then it would
make a lot of sense to provide the glue for using it that way. If you
have done some work in this direction, I would be happy to see it.
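
For reference, UPGMA is average-linkage hierarchical clustering on a
distance matrix. The pure-Python sketch below shows the procedure on a
small square matrix; it is an illustration only, not the Bio.Cluster
implementation, and the record format (leaves numbered from 0, the
cluster made by record i referred to as -(i + 1)) is an assumption
chosen for the example.

```python
def upgma(matrix):
    """Return merge records (left, right, distance) by average linkage."""
    n = len(matrix)
    sizes = {i: 1 for i in range(n)}  # active cluster id -> number of leaves
    dist = {frozenset((i, j)): matrix[i][j]
            for i in range(n) for j in range(i + 1, n)}
    records = []
    while len(sizes) > 1:
        pair = min(dist, key=dist.get)  # closest pair of active clusters
        a, b = sorted(pair)
        d = dist.pop(pair)
        new_id = -(len(records) + 1)
        records.append((a, b, d))
        for other in sizes:
            if other in (a, b):
                continue
            da = dist.pop(frozenset((a, other)))
            db = dist.pop(frozenset((b, other)))
            # Size-weighted mean equals the average over all leaf pairs.
            dist[frozenset((new_id, other))] = (
                (sizes[a] * da + sizes[b] * db) / (sizes[a] + sizes[b]))
        sizes[new_id] = sizes.pop(a) + sizes.pop(b)
    return records


# Toy distances between three taxa (indices 0, 1, 2).
RECORDS = upgma([[0, 2, 4],
                 [2, 0, 8],
                 [4, 8, 0]])
```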

-Eric


On Mon, Apr 16, 2012 at 6:47 PM, Andrew Sczesnak
<andrew.sczesnak at med.nyu.edu> wrote:
> Eric,
>
> I can describe two use cases from my own experience. First, the MAF parser
> I've been working on can pull the multiple alignment of some gene between a
> bunch of genomes. Thinking of recipes for the cookbook, I thought it would
> be neat to walk the user through constructing a distance matrix by hand
> (though you're right--more could be done to support this), clustering with
> Bio.Cluster and visualizing the result with Bio.Phylo. I like this example
> because it integrates several different parts of BioPython along with a
> lesson about inferring distances between sequences.
>
> Second, for another project, I've been generating distance matrices based on
> the shared gene content of bacterial genomes and the presence-or-absence of
> orthologous groups in each. Presently, I ferry the matrices to a clustering
> program and then visualize the resulting trees in yet another tool. Looking
> into ways of streamlining this brought me back to Bio.Cluster, Bio.Phylo and
> the incompatibility of their tree objects.
>
> I wonder, what would be the most elegant way of bridging the gap?
>
>
> Best,
> Andrew
>


From bioinformed at gmail.com  Tue Apr 17 16:11:37 2012
From: bioinformed at gmail.com (Kevin Jacobs <jacobs@bioinformed.com>)
Date: Tue, 17 Apr 2012 12:11:37 -0400
Subject: [Biopython-dev] BGZF support,
	was Re: Biopython 1.60 plans and beyond
In-Reply-To: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>
References: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>
Message-ID: <CAD=vDiqJLx=D_t0RVt1nPCTwxjgwpTXsvQprmd_hX5ffrR7PZQ@mail.gmail.com>

On Tue, Apr 17, 2012 at 11:23 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> >
> > Here are some things that I think are strong
> > candidates for 1.60 (not an exclusive list!)
> >
> > ...
> >
> > BGZF support: Low level module like Python's gzip,
> > support in SeqIO for indexing BGZF compressed files,
> > ...
>
> I've just rebased my bgzf branch, which I think is ready to apply to the
> trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
> https://github.com/peterjc/biopython/tree/bgzf2
>
> Would anyone like to review this please? There are unittests and
> plenty of docstrings - but so far nothing in the Tutorial though.
>
>
Hi Peter,

I've implemented code to create BAM/tabix style index files and perform
lookups, so it has been high on my list to test and validate your BGZF code
(rather than having to write my own).  I'm notoriously short on time, but this
is in the critical path for several projects and I'm going to work on it
over the next week or so.

-Kevin


From redmine at redmine.open-bio.org  Wed Apr 18 01:29:29 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 18 Apr 2012 01:29:29 +0000
Subject: [Biopython-dev] [Biopython - Bug #3333] PhyloXML writer fails to
	include is_aligned attribute with mol_seq elements
References: <redmine.issue-3333.20120326222124@redmine.open-bio.org>
Message-ID: <redmine.journal-14810.20120418012929@redmine.open-bio.org>


Issue #3333 has been updated by Eric Talevich.


The answer is: I'm an idiot. The mol_seq attribute was first defined as a complex attribute in the writer (via _handle_complex), but then further down redefined as a simple attribute.
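
This class of bug is easy to reproduce in miniature: Python raises no
error when a name in a class body is bound twice, so the later binding
silently wins. The toy writer below is purely illustrative (the helper
and class names are made up, and it is not the PhyloXML writer code).

```python
def make_handler(tag, attributes=()):
    """Build a toy serializer that writes `tag` with the listed attributes."""
    def handler(value, **attrs):
        parts = ['%s="%s"' % (k, attrs[k]) for k in attributes if k in attrs]
        opening = " ".join([tag] + parts)
        return "<%s>%s</%s>" % (opening, value, tag)
    return handler


class ToyWriter:
    # First definition: knows about the is_aligned attribute.
    mol_seq = staticmethod(make_handler("mol_seq", attributes=("is_aligned",)))
    # ... further down, a second definition silently replaces the first.
    mol_seq = staticmethod(make_handler("mol_seq"))


# The attribute is dropped because only the second handler survives.
OUTPUT = ToyWriter.mol_seq("AAA", is_aligned="false")
```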

Fix:
https://github.com/biopython/biopython/commit/a93c9892268274c4969131a1d401bb8ee235524a
----------------------------------------
Bug #3333: PhyloXML writer fails to include is_aligned attribute with mol_seq elements
https://redmine.open-bio.org/issues/3333

Author: Eric Talevich
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 
URL: 


First reported here:
http://lists.open-bio.org/pipermail/biopython/2012-March/007847.html

Steps to reproduce:

1. Load a tree, convert to PhyloXML
<pre>
from Bio import Phylo
from StringIO import StringIO
tree = Phylo.read(StringIO('(a,(b,c));'), 'newick').as_phyloxml()
</pre>

2. Add a sequence
<pre>
from Bio.Phylo import PhyloXML as PX
tree.clade[0].sequences = [PX.Sequence(type='dna', mol_seq=PX.MolSeq('AAA', is_aligned=False))]
</pre>

3. Verify that the sequence information has been set -- mol_seq has is_aligned set
<pre>
print tree
</pre>

<pre>
Clade(branch_length=1.0, name='a')
    Sequence(type='dna')
        MolSeq(value='AAA', is_aligned=False)
</pre>

4. View the PhyloXML representation -- mol_seq is missing the is_aligned attribute!
<pre>
print tree.format('phyloxml')
</pre>

<pre>
...
<phy:clade>
  <phy:name>c</phy:name>
  <phy:branch_length>1.0</phy:branch_length>
  <phy:sequence type="dna">
    <phy:mol_seq>AAA</phy:mol_seq>
  </phy:sequence>
</phy:clade>
...
</pre>


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From redmine at redmine.open-bio.org  Wed Apr 18 01:52:03 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 18 Apr 2012 01:52:03 +0000
Subject: [Biopython-dev] [Biopython - Bug #3333] (Closed) PhyloXML writer
	fails to include is_aligned attribute with mol_seq elements
References: <redmine.issue-3333.20120326222124@redmine.open-bio.org>
Message-ID: <redmine.journal-14811.20120418015203@redmine.open-bio.org>


Issue #3333 has been updated by Eric Talevich.

Status changed from New to Closed
% Done changed from 0 to 100


----------------------------------------
Bug #3333: PhyloXML writer fails to include is_aligned attribute with mol_seq elements
https://redmine.open-bio.org/issues/3333

Author: Eric Talevich
Status: Closed
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 
URL: 





From redmine at redmine.open-bio.org  Thu Apr 19 04:27:49 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 19 Apr 2012 04:27:49 +0000
Subject: [Biopython-dev] [Biopython - Feature #3342] (New)
	Phylo.root_with_outgroup: set the length of the outgroup branch
Message-ID: <redmine.issue-3342.20120419042749@redmine.open-bio.org>


Issue #3342 has been reported by Eric Talevich.

----------------------------------------
Feature #3342: Phylo.root_with_outgroup: set the length of the outgroup branch
https://redmine.open-bio.org/issues/3342

Author: Eric Talevich
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


Add an option to the root_with_outgroup method to specify the length of the branch leading from the new root to the outgroup. This should not change the total tree length, i.e. this length is subtracted from the branch on the other side of the root.

This option makes it possible to root the tree in other ways that split the outgroup branch, leaving a bifurcating rather than trifurcating root.
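
The arithmetic behind this is small enough to sketch. The helper below is hypothetical (it is not part of the attached patch) and assumes the requested length must lie between zero and the length of the branch being split, so that neither side ends up negative.

```python
def split_outgroup_branch(original_length, outgroup_branch_length):
    """Split one branch at the new root.

    Returns (outgroup_side, ingroup_side). The two parts always sum to the
    original length, so the total tree length is unchanged.
    """
    # Assumption: a root placed outside the branch makes no sense here.
    if not 0 <= outgroup_branch_length <= original_length:
        raise ValueError("outgroup_branch_length must be within the branch")
    ingroup_side = original_length - outgroup_branch_length
    return outgroup_branch_length, ingroup_side


outgroup_side, ingroup_side = split_outgroup_branch(1.0, 0.4)
```

With a branch of length 1.0 and a requested outgroup length of 0.4, the remainder on the other side of the root is 0.6 (up to floating-point rounding).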

I've attached a patch that implements this feature, plus unit tests for it.

HOWEVER:

A sane API for this method would look like:

>>> tree.root_with_outgroup("apple", "orange", outgroup_branch_length=0.4)

The original function definition included *args for specifying the outgroup taxa in one shot (instead of requiring a separate call to common_ancestor). But while Python 3 permits keyword-only arguments (a defined keyword argument after *args or just *), Python 2 does not. So I made the function calling style shown above work in a very weird way: the function definition has **kwargs instead of outgroup_branch_length=None, and the necessary keyword argument is pulled out of kwargs inside the body of the function. The name of this argument is given in the docstring, so it's still partly discoverable.
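
For anyone who has not seen the pattern, here is a minimal sketch of the **kwargs workaround. It is a standalone toy, not the patched method: it only returns what it was given, and the strict check for unexpected keywords is my own assumption about what a careful version would do.

```python
def root_with_outgroup(*outgroup_targets, **kwargs):
    """Toy stand-in for the method discussed above.

    Keyword arguments:
        outgroup_branch_length -- length of the branch leading from the new
        root to the outgroup (default None).
    """
    # Emulate a keyword-only argument on Python 2: pull it out of kwargs.
    outgroup_branch_length = kwargs.pop("outgroup_branch_length", None)
    # Anything left over is a typo or an unsupported option, so fail loudly
    # the same way a normal signature would.
    if kwargs:
        raise TypeError(
            "unexpected keyword argument(s): %s" % ", ".join(sorted(kwargs))
        )
    return list(outgroup_targets), outgroup_branch_length


targets, length = root_with_outgroup(
    "apple", "orange", outgroup_branch_length=0.4
)
```

Without the leftover check, a misspelled keyword would be swallowed silently, which is the main hazard of this style. On Python 3 the same interface can be written directly with a keyword-only parameter after *outgroup_targets.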

Are we cool with this? Or, can anyone think of a better way to handle this?


----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From p.j.a.cock at googlemail.com  Fri Apr 20 08:39:02 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Apr 2012 09:39:02 +0100
Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF parser
	(#33)
In-Reply-To: <biopython/biopython/pull/33@github.com>
References: <biopython/biopython/pull/33@github.com>
Message-ID: <CAKVJ-_4hv_MCb+Cmxqt8F0E09OeP+uOAX=Pj5Ajixyr8asKqjg@mail.gmail.com>

I've had a quick look on GitHub and it isn't obvious to me how to get
pull request emails CC'd to our dev mailing list... but anyway, Lenna
has been busy:

Peter

---------- Forwarded message ----------
From: Lenna Peterson
<reply+i-4201999-d8628b2a34f52e923e8471a792110c2edfbe13a8-63959 at reply.github.com>
Date: Thu, Apr 19, 2012 at 11:35 PM
Subject: [biopython] Feature: Python implementation of MMCIF parser (#33)
To: Peter Cock <p.j.a.cock at googlemail.com>


I've written a PLY (Python lex-yacc) module that is interchangeable
with the C MMCIF module.

I've also partially rewritten the C MMCIF module to be object-oriented.

### Changed files ###

* MMCIFlexmodule.c: Now object-oriented (open file in constructor,
close file in destructor, etc). Docstrings! Added file IO exception.
* MMCIF2Dict.py: Minor changes for new object oriented API
* MMCIFParser: Changed all uses of map() to list comprehensions (more
compatible with 3)
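
A short illustration of why the map() change matters (the sample values are made up): on Python 2 map() returns a list, while on Python 3 it returns a lazy iterator, so code that indexes or reuses the result behaves differently. A list comprehension gives a list on both.

```python
fields = ["1.5", "2.0", "3.25"]  # made-up coordinate strings

# A list on Python 2, a one-shot iterator on Python 3.
lazy = map(float, fields)

# A real list on both, so indexing and len() are safe.
coords = [float(x) for x in fields]
first = coords[0]
```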

### New files ###

* MMCIFlex.py: PLY-based module for tokenizing input.

### What it needs ###
Addition of PLY dependency to setup.py.
I'm not quite sure how to handle this, as PLY wouldn't be necessary on
a platform with C Python. Thoughts? Which non-CPython implementations
are worth testing?
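
One common way to keep PLY optional is to try the compiled lexer first and fall back to the pure Python one only when the import fails. The sketch below is an assumption about how that could look, not the code in the pull request; import_first_available is a hypothetical helper, and the demonstration uses a deliberately missing module name plus a standard library module as stand-ins for the two lexers.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that imports cleanly."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as err:
            # Remember why each candidate failed for the final message.
            errors.append("%s: %s" % (name, err))
    raise ImportError("no lexer backend available (%s)" % "; ".join(errors))


# Stand-ins: the first name does not exist, so the second one is used.
backend = import_first_available(["_no_such_compiled_lexer_demo", "json"])
```

With this arrangement PLY would only need to be installed where the C extension is unavailable, for example under Jython or PyPy.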


New C module tested on Python 2.6 on Mac OS X and Debian. I hope it
still works on Windows.
On my machine, the C module processes a 30,000 line test file in 10-15
ms; the Python module takes ~150 ms.

You can merge this Pull Request by running:

    git pull https://github.com/lennax/biopython MMCIF2

Or you can view, comment on it, or merge it online at:

    https://github.com/biopython/biopython/pull/33

-- Commit Summary --

* Ply test in progress.
* Quoted values with spaces are being broken.
* Removed hard inclusion of ply.
* Fixed quoted strings with spaces.
* Changed Parser call to 2Dict. Semicolons break.
* Changed Parser call to 2Dict. Semicolons break.
* Lexes full file w/o error, FIXME loops
* Tweak: comment handling
* Changed token "NAME" to "TAG"
* Using IUCr grammar. FIXME quote/semi
* Fixed quoted strings.
* Semicolon text field fixed, FIXME included \n
* Fixed semi newlines.
* non-eol temp fix, doesn't match single chars
* Lexes full CIF file with no noticed errors.
* Added timing.
* Added states to lexer.
* Lex loops into [header, [items], ...]; \d hacks.
* Enforced semicolon rule.
* Yacc works.
* Re-added values to lexer state 'loop'
* FIXME syntax error/hangs on full file.
* Lexer gathers values, added parse precedence.
* Minor lex cleanup.
* Testing exclusionary lex redo.
* Streamlined rules, no loop yet.
* Still won't yacc 30k line file.
* Merge branch 'master' of git://github.com/biopython/biopython into ply2
* Added __name__ __main__ check.
* Parser redo, still doesn't parse 30k line file.
* Added comments to tokenizer.
* Fixed lex module's callability from yacc.
* Fixed DATA token failure.
* Multiple improvements, still no 30k.
* Moved lexer arguments to constructor.
* Moved data input to constructor, added docs
* Validated to pep8.
* Merge branch 'master' of git://github.com/biopython/biopython into ply2
* Add MMCIF2Dict from ply branch.
* Remove flex header dependency of CIF parser.
* Update MMCIFParser call of MMCIF2Dict.
* PLY lexer works with MMCIF2Dict.
* Cleanup.
* Cleaned up import.
* Updated docstring.
* Subclassed dict.
* Restored MMCIFParser call to MMCIF2Dict.
* Removed main() from lex input.
* Restored newline.
* Fix C prototype warnings.
* Modifying python lexer to be substitutable w/ C.
* Make header for generated C.
* Import C lexer or Python lexer.
* Improvements and documentation.
* Uncomment GLOBAL token definition.
* PLY lexer and C lexer should be interchangeable.
* Improve error reporting of import.
* Turn on ply lex optimize.
* Call instance of Python lexer.
* Working on implementing class in C module.
* Start unit test for MMCIF.
* Minimal unit test for MMCIFParser.
* Revert to old generated C; manually added noyywrap
* Manually added function prototypes to generated C.
* Merge branch 'ply2' into dev
* Merge branch 'ply' into dev
* Merge branch 'c-dev' into dev
* Merge branch 'master' of git://github.com/biopython/biopython into dev
* Cleaning up old files.
* More cleanup.
* Merging Parser from MMCIFlex branch.
* Parser and unit test for PyCIFRW
* Python and C lexer APIs are now identical.
* Add copyright and license notices.
* Merge branch 'master' of git://github.com/biopython/biopython into dev
* Trying GnuWin32 flex-generated C.
* Win flex generated with new mmcif.lex
* GnuWin32 flex generated C, used dos2unix for CRLF
* Added correct author to flex C module.
* Merge branch 'master' of git://github.com/biopython/biopython into dev
* Merge branch 'master' of git://github.com/biopython/biopython into dev
* Change map() to list comprehensions for 3 compat.
* Renamed python lexer to match C module.
* Added file IO exception to C module.
* Tweak lexer module import.
* Prep Python CIF lexer for pull request.
* Whitespace tweaks.

-- File Changes --

M Bio/PDB/MMCIF2Dict.py (20)
M Bio/PDB/MMCIFParser.py (8)
A Bio/PDB/mmCIF/MMCIFlex.py (253)
M Bio/PDB/mmCIF/MMCIFlexmodule.c (122)

-- Patch Links --

    https://github.com/biopython/biopython/pull/33.patch
    https://github.com/biopython/biopython/pull/33.diff

---
Reply to this email directly or view it on GitHub:
https://github.com/biopython/biopython/pull/33


From andrew.sczesnak at med.nyu.edu  Fri Apr 20 22:28:43 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Fri, 20 Apr 2012 18:28:43 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To: <CAMC681m9Vd3J-CVfJE8aMq0Pmd66OXoThdJD1LaFguWagNTqqQ@mail.gmail.com>
References: <CAMC681m=rPvg90yamvmM=oJ_KSQRVog+q5fq0ezQdcoQSz+GxQ@mail.gmail.com>
	<4F8C4D69.4040009@med.nyu.edu>
	<CAMC681mAkQDmqDmxBeHRdj_rjS7E1mJMLFiyX00AA9pUR78auQ@mail.gmail.com>
	<4F8CA17D.4080907@med.nyu.edu>
	<CAMC681m9Vd3J-CVfJE8aMq0Pmd66OXoThdJD1LaFguWagNTqqQ@mail.gmail.com>
Message-ID: <4F91E31B.9030101@med.nyu.edu>

Eric,

If my understanding is correct, UPGMA is slang for agglomerative 
average-linkage hierarchical clustering which is implemented along with 
single- and complete-linkage in the module. There's no equivalent of 
neighbor-joining or maximum-likelihood and Bio.Cluster probably isn't 
that fast with large numbers of nodes so wrappers are still useful. We 
could probably add an NJ implementation for small matrices pretty easily 
if you think it's worthwhile.

Either way, the glue could be useful for visualizing relationships 
between genes/samples in microarrays (what I gather Bio.Cluster is 
intended for).
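
To make the UPGMA correspondence concrete, here is a small pure Python sketch of average-linkage clustering that emits a Newick string. It is not Bio.Cluster code and makes no attempt to be fast; the function name and the three-taxon distance matrix are made up for illustration.

```python
def upgma(names, dist):
    """Average-linkage (UPGMA) clustering of a symmetric distance matrix.

    names: leaf labels; dist: list of lists. Returns a Newick string.
    """
    # Active clusters: id -> (newick text, number of leaves, node height).
    clusters = dict((i, (name, 1, 0.0)) for i, name in enumerate(names))
    # Pairwise distances keyed by (smaller id, larger id).
    d = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d[(i, j)] = float(dist[i][j])
    next_id = len(names)
    while len(clusters) > 1:
        # Join the closest pair of clusters.
        (a, b), dab = min(d.items(), key=lambda item: item[1])
        text_a, size_a, height_a = clusters.pop(a)
        text_b, size_b, height_b = clusters.pop(b)
        height = dab / 2.0
        merged = "(%s:%.3f,%s:%.3f)" % (
            text_a, height - height_a, text_b, height - height_b)
        # Distance to the merged cluster is the size-weighted average.
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (
                (size_a * dac + size_b * dbc) / (size_a + size_b))
        d = dict((pair, v) for pair, v in d.items()
                 if a not in pair and b not in pair)
        d.update(new_d)
        clusters[next_id] = (merged, size_a + size_b, height)
        next_id += 1
    (text, _, _), = clusters.values()
    return text + ";"


newick = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
```

The resulting string could then be handed to Phylo.read() through a StringIO handle with the 'newick' format, which is one possible shape for the glue being discussed.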


Andrew

On 04/17/2012 11:25 AM, Eric Talevich wrote:
> Andrew,
>
> It would be useful to have a quick and portable function for
> distance-based tree estimation in Bio.Phylo, since otherwise it's
> necessary to use one of the wrappers for external programs in
> Bio.Phylo.Applications. (And currently, only PhyML is wrapped.) Does
> the hierarchical clustering algorithm in Bio.Cluster correspond to any
> common tree-estimation algorithm, e.g. UPGMA? If so, then it would
> make a lot of sense to provide the glue for using it that way. If you
> have done some work in this direction, I would be happy to see it.
>
> -Eric
>
>
> On Mon, Apr 16, 2012 at 6:47 PM, Andrew Sczesnak
> <andrew.sczesnak at med.nyu.edu>  wrote:
>> Eric,
>>
>> I can describe two use cases from my own experience. First, the MAF parser
>> I've been working on can pull the multiple alignment of some gene between a
>> bunch of genomes. Thinking of recipes for the cookbook, I thought it would
>> be neat to walk the user through constructing a distance matrix by hand
>> (though you're right--more could be done to support this), clustering with
>> Bio.Cluster and visualizing the result with Bio.Phylo. I like this example
>> because it integrates several different parts of BioPython along with a
>> lesson about inferring distances between sequences.
>>
>> Second, for another project, I've been generating distance matrices based on
>> the shared gene content of bacterial genomes and the presence-or-absence of
>> orthologous groups in each. Presently, I ferry the matrices to a clustering
>> program and then visualize the resulting trees in yet another tool. Looking
>> into ways of streamlining this brought me back to Bio.Cluster, Bio.Phylo and
>> the incompatibility of their tree objects.
>>
>> I wonder, what would be the most elegant way of bridging the gap?
>>
>>
>> Best,
>> Andrew
>>


From andrew.sczesnak at med.nyu.edu  Fri Apr 20 22:35:59 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Fri, 20 Apr 2012 18:35:59 -0400
Subject: [Biopython-dev] BGZF support,
	was Re: Biopython 1.60 plans and beyond
In-Reply-To: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>
References: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>
Message-ID: <4F91E4CF.8040602@med.nyu.edu>

Peter,

My colleague was writing some code using MafIndex and commented how long 
it took her to download, decompress and index the human multiz 
alignments from UCSC. It seems like it'd be great to keep the files 
compressed... perhaps if the code works well enough we can convince UCSC 
to host bgzip'd copies (or maybe make them available on one of our 
institutions' servers).

Is I.J. interested in joining the community? I'd like to look into 
adding BGZF to MafIO and wouldn't want to duplicate I.J.'s effort. If 
not, could you put me in touch?


Andrew

On 04/17/2012 11:23 AM, Peter Cock wrote:
> On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock<p.j.a.cock at googlemail.com>  wrote:
>>
>> Here are some things that I think are strong
>> candidates for 1.60 (not an exclusive list!)
>>
>> ...
>>
>> BGZF support: Low level module like Python's gzip,
>> support in SeqIO for indexing BGZF compressed files,
>> ...
>
> I've just rebased my bgzf branch, which I think is ready to apply to the
> trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
> https://github.com/peterjc/biopython/tree/bgzf2
>
> Would anyone like to review this please? There are unittests and
> plenty of docstrings - but so far nothing in the Tutorial though.
>
> I wrote a blog post late last year explaining what this allows, and
> this branch includes the changes to Bio.SeqIO to index BGZF
> compressed sequence files that it discussed:
> http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
>
> The probable next step after this is combining it with Andrew Sczesnak's
> work on indexing MAF files (they can get pretty big) as explored by 'I.J.'
> (who as far as I know hasn't signed up to the biopython-dev list, BCC'd).
>
> Also it would be interesting to explore doing the (de)compression of
> blocks on worker threads to take advantage of multiple cores.
>
> Another idea would be to switch from a plain dictionary to an
> ordered dictionary for holding cached decompressed blocks,
> giving a way to drop the oldest block (although not perhaps as
> clever as dropping the least recently used block, the overhead is
> lower). That would require including our own OrderedDict class
> on the older Python platforms.
>
> Peter
>
> [*] PyPy testing is complicated by running out of file handles,
> an existing issue not something directly from this work. Part
> of this is down to different GC under PyPy.
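
The ordered dictionary idea quoted above can be sketched in a few lines. This is an illustration only, not code from the bgzf branch: BlockCache and its methods are hypothetical names, and the block offsets are arbitrary. It drops the oldest inserted block once the cache is full.

```python
from collections import OrderedDict


class BlockCache(object):
    """Bounded cache that evicts the oldest inserted block first."""

    def __init__(self, max_blocks):
        if max_blocks < 1:
            raise ValueError("max_blocks must be at least 1")
        self.max_blocks = max_blocks
        self._blocks = OrderedDict()

    def get(self, offset):
        """Return the cached data for a block offset, or None if absent."""
        return self._blocks.get(offset)

    def put(self, offset, data):
        """Store a decompressed block, evicting the oldest one if full."""
        if offset in self._blocks:
            # Updating an existing key keeps its original position.
            self._blocks[offset] = data
            return
        if len(self._blocks) >= self.max_blocks:
            self._blocks.popitem(last=False)  # drop the oldest entry
        self._blocks[offset] = data

    def offsets(self):
        """Return cached offsets from oldest to newest."""
        return list(self._blocks)


cache = BlockCache(max_blocks=2)
cache.put(0, b"first block")
cache.put(65536, b"second block")
cache.put(131072, b"third block")  # evicts the block at offset 0
```

Turning this into a least recently used cache would mean deleting and re-inserting a key on every get(), which is the extra overhead mentioned in the quoted message.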


From arklenna at gmail.com  Sat Apr 21 00:57:21 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Fri, 20 Apr 2012 20:57:21 -0400
Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF
 parser (#33)
In-Reply-To: <CAKVJ-_4hv_MCb+Cmxqt8F0E09OeP+uOAX=Pj5Ajixyr8asKqjg@mail.gmail.com>
References: <biopython/biopython/pull/33@github.com>
	<CAKVJ-_4hv_MCb+Cmxqt8F0E09OeP+uOAX=Pj5Ajixyr8asKqjg@mail.gmail.com>
Message-ID: <CAK610_6FCExTXK6k8p8k35mRvhRpdARyVXNu8pMVs0b4kicXRw@mail.gmail.com>

On Fri, Apr 20, 2012 at 4:39 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> I've had a quick look on GitHub and it isn't obvious to me how to get
> pull request emails CC'd to our dev mailing list... but anyway, Lenna
> has been busy:
>
> Peter
>
> ---------- Forwarded message ----------
> From: Lenna Peterson
> <reply+i-4201999-d8628b2a34f52e923e8471a792110c2edfbe13a8-63959 at reply.github.com>
> Date: Thu, Apr 19, 2012 at 11:35 PM
> Subject: [biopython] Feature: Python implementation of MMCIF parser (#33)
> To: Peter Cock <p.j.a.cock at googlemail.com>
>
>
> I've written a PLY (Python lex-yacc) module that is superimposable
> with the C MMCIF module.
>
> I've also partially rewritten the C MMCIF module to be object-oriented.
>
> ### Changed files ###
>
> * MMCIFlexmodule.c: Now object-oriented (open file in constructor,
> close file in destructor, etc). Docstrings! Added file IO exception.
> * MMCIF2Dict.py: Minor changes for new object oriented API
> * MMCIFParser: Changed all uses of map() to list comprehensions (more
> compatible with 3)
>
> ### New files ###
>
> * MMCIFlex.py: PLY-based module for tokenizing input.
>
> ### What it needs ###
> Addition of PLY dependency to setup.py.
> I'm not quite sure how to handle this, as PLY wouldn't be necessary on
> a platform with C Python. Thoughts? Which non-CPython implementations
> are worth testing?
>
>
> New C module tested on Python 2.6 on Mac OS X and Debian. I hope it
> still works on Windows.
> On my machine, the C module processes a 30,000 line test file in 10-15
> ms; the Python module takes ~150 ms.


I've started testing the PLY lexer on PyPy. NumPyPy now implements
more functions needed by PDB; the only things I found to be missing
are random and linalg. This eliminates Superimposer, FragmentMapper,
and Vector.

I played around with trying to spoof "import numpy" to automatically
import numpypy (code here: https://gist.github.com/2432815) but I
don't think that's wise yet.
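
For readers unfamiliar with the trick, the general mechanism is to register one module under another name in sys.modules before the import happens. The sketch below is not the code from the gist; alias_module is a hypothetical helper and the demonstration aliases a made-up name to the standard library math module rather than touching numpy.

```python
import importlib
import sys


def alias_module(alias, real_name):
    """Make a later `import alias` return the module called real_name."""
    module = importlib.import_module(real_name)
    sys.modules[alias] = module
    return module


# Demonstration with a stand-in: `fastmath_alias_demo` does not exist on
# disk, but after aliasing it resolves to the standard library math module.
alias_module("fastmath_alias_demo", "math")
import fastmath_alias_demo

root = fastmath_alias_demo.sqrt(9.0)
```

The catch is that every later import of the alias anywhere in the process gets the substitute, including code that relies on functions the substitute does not provide, which is a good reason for caution.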

My last commit to this branch was a few changes to allow the MMCIF
parser to work on NumPy. PyPy won't run `setup.py test` due to global
numpy failure, but if I install this branch and `pypy test_MMCIF.py`,
it passes.

Anybody with more PyPy and/or package structuring experience have thoughts?

Lenna


From p.j.a.cock at googlemail.com  Sat Apr 21 10:32:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 21 Apr 2012 11:32:33 +0100
Subject: [Biopython-dev] [biopython] Feature: Python implementation of
	MMCIF parser (#33)
In-Reply-To: <CAK610_6FCExTXK6k8p8k35mRvhRpdARyVXNu8pMVs0b4kicXRw@mail.gmail.com>
References: <biopython/biopython/pull/33@github.com>
	<CAKVJ-_4hv_MCb+Cmxqt8F0E09OeP+uOAX=Pj5Ajixyr8asKqjg@mail.gmail.com>
	<CAK610_6FCExTXK6k8p8k35mRvhRpdARyVXNu8pMVs0b4kicXRw@mail.gmail.com>
Message-ID: <CAKVJ-_48hRJ-8+h4=Mo8e-7Sp2TP+XBFqnywBFmJV-V+cPjryA@mail.gmail.com>

On Saturday, April 21, 2012, Lenna Peterson wrote:

>
> > ### What it needs ###
> > Addition of PLY dependency to setup.py.
> > I'm not quite sure how to handle this, as PLY wouldn't be necessary on
> > a platform with C Python. Thoughts? Which non-CPython implementations
> > are worth testing?


Basically Jython (which we've tried to support for a while) and PyPy
(which I would like to officially support in future). Although a pure
python setup can be useful in other settings, e.g. Windows
development without the compilers otherwise needed.

However, neither of those have NumPy (yet), which we need for
the PDB module that would use the MMCIF parser.

>
> > New C module tested on Python 2.6 on Mac OS X and Debian. I hope it
> > still works on Windows.
> > On my machine, the C module processes a 30,000 line test file in 10-15
> > ms; the Python module takes ~150 ms.


That's a factor of ten slower, but still sounds fast enough perhaps
that we don't really need the C code for usability.

>
> I've started testing the PLY lexer on PyPy. NumPyPy now implements
> more functions needed by PDB; the only things I found to be missing
> are random and linalg. This eliminates Superimposer, FragmentMapper,
> and Vector.
>
> I played around with trying to spoof "import numpy" to automatically
> import numpypy (code here: https://gist.github.com/2432815) but I
> don't think that's wise yet.
>
> My last commit to this branch was a few changes to allow the MMCIF
> parser to work on NumPy. PyPy won't run `setup.py test` due to global
> numpy failure, but if I install this branch and `pypy test_MMCIF.py`,
> it passes.
>
> Anybody with more PyPy and/or package structuring experience
> have thoughts?


I filed a few bugs on missing code in PyPy's NumPy re-implementation
(now called numpypy), good to hear they are getting closer to being
enough for us to run Bio.PDB on it. Thank you for exploring this.

Right now, if I were in your shoes for MMCIF parsing, I would focus on
the parser failures with certain input files - there is an open bug
on RedMine, https://redmine.open-bio.org/issues/2626 - and on the
issue of multiple models (Eric can probably advise here),
https://redmine.open-bio.org/issues/2943

And I must close this bug now that your earlier work has been
checked in - https://redmine.open-bio.org/issues/2619

Thanks!

Peter

>


From redmine at redmine.open-bio.org  Sat Apr 21 10:39:15 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 21 Apr 2012 10:39:15 +0000
Subject: [Biopython-dev] [Biopython - Bug #2619] (Closed)
	Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
References: <redmine.issue-2619.20081018153139@redmine.open-bio.org>
Message-ID: <redmine.journal-14814.20120421103915@redmine.open-bio.org>


Issue #2619 has been updated by Peter Cock.

Status changed from New to Closed
% Done changed from 0 to 100

Fixed with Lenna's work - see this commit and its parents:
https://github.com/biopython/biopython/commit/e5ebb85d0614a34e59e7c2118a366512dc4d1320
----------------------------------------
Bug #2619: Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
https://redmine.open-bio.org/issues/2619

Author: Chris Oldfield
Status: Closed
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.48
URL: 


MMCIFParser is a documented feature of Bio.PDB, but it is broken by default because the MMCIFlex build is commented out in the distribution setup.py.  According to  

http://osdir.com/ml/python.bio.devel/2006-02/msg00038.html

this is because it doesn't compile on Windows.  Though the function is documented, the changes needed to enable it are not, so this seems like an installation bug to me.

The fix on linux is to uncomment setup.py lines 486 on.  A general work around might be to condition the compile on the os.sys.platform variable. I'd offer a diff, but I'm new to biopython and python in general, so please forgive my ignorance.
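
A platform-conditional build along these lines might look like the sketch below. It is only an illustration of the idea: optional_extensions is a hypothetical helper, the extension name is assumed, and the source path is taken from the file list of the later pull request rather than from the 1.48 setup.py.

```python
import sys


def optional_extensions(platform=None):
    """Return (name, sources) pairs for extensions to build on a platform."""
    platform = sys.platform if platform is None else platform
    extensions = []
    # Skip the flex-generated lexer where it is reported not to compile.
    if not platform.startswith("win"):
        extensions.append(
            ("Bio.PDB.mmCIF.MMCIFlex",  # assumed extension name
             ["Bio/PDB/mmCIF/MMCIFlexmodule.c"])
        )
    return extensions


windows_build = optional_extensions("win32")
linux_build = optional_extensions("linux2")
```

In a real setup.py the pairs would be turned into Extension objects before being passed to setup().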

Source install of version 1.48, gentoo linux 2008, x86_64.




From redmine at redmine.open-bio.org  Sat Apr 21 18:05:01 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 21 Apr 2012 18:05:01 +0000
Subject: [Biopython-dev] [Biopython - Bug #2626] Bio.PDB mmCIFParser parse
	exceptions
References: <redmine.issue-2626.20081023230309@redmine.open-bio.org>
Message-ID: <redmine.journal-14816.20120421180501@redmine.open-bio.org>


Issue #2626 has been updated by Lenna Peterson.

File mmCifParseCheck.py added

I've attempted to rescue this code from overzealous "text formatting".

Attached version appeared to work on one test file; haven't tested the example broken files yet. 
----------------------------------------
Bug #2626: Bio.PDB mmCIFParser parse exceptions
https://redmine.open-bio.org/issues/2626

Author: Chris Oldfield
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Other
Target version: 1.48
URL: 


I recently ran the mmCIFParser object over all of PDB's mmCIF files and found a large number of files failed to parse correctly (a short script at the end to demonstrate).  Of ~50k mmCIF files, 3891 files failed to parse and another 1980 were missing fields in the mmCIF dictionary.  

A few examples of files that failed to parse: 
http://www.rcsb.org/pdb/files/1alw.cif.gz
http://www.rcsb.org/pdb/files/1det.cif.gz
http://www.rcsb.org/pdb/files/1tmy.cif.gz

A few with missing fields:
http://www.rcsb.org/pdb/files/1mfl.cif.gz
http://www.rcsb.org/pdb/files/1tfj.cif.gz
http://www.rcsb.org/pdb/files/1zn8.cif.gz

The problem seems to be that an error in one mmCIF table, such as an extra field, propagates through the rest of the parse.

x86_64 gentoo linux 2008, src BioPython install

__CODE__
import sys

from Bio.PDB.MMCIF2Dict import MMCIF2Dict
from Bio.PDB.MMCIFParser import MMCIFParser

if len(sys.argv) != 2:
    print("usage: mmCifParseCheck.py <structFile>")
    sys.exit(1)
structFile = sys.argv[1]

results = []

# Parse to a structure object and count the non-het residues.
numRes = 0
parser = MMCIFParser()
try:
    structure = parser.get_structure('test', structFile)
    for model in structure:
        for chain in model:
            for residue in chain:
                # Skip hetero residues, whose id flag starts with "H_".
                if residue.id[0][:2] != "H_":
                    numRes += 1
except Exception:
    results.append("parse to structure object failed")
else:
    results.append("parse to structure object succeeded")

# Parse the whole mmCIF file to a dict.
mmcif_dict = None
try:
    mmcif_dict = MMCIF2Dict(structFile)
except Exception:
    results.append("parse to dict failed")
else:
    results.append("parse to dict succeeded")

# Look up a required entry (this also fails if the dict parse failed).
try:
    entry_id = mmcif_dict['_entry.id']
except Exception:
    results.append("key lookup failed")
else:
    results.append("key lookup succeeded")

print("\n".join(results))
print("number of non-het residues " + str(numRes))




From redmine at redmine.open-bio.org  Sat Apr 21 18:16:07 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 21 Apr 2012 18:16:07 +0000
Subject: [Biopython-dev] [Biopython - Bug #2950] Bio.PDBIO.save writes MODEL
	records without model id
References: <redmine.issue-2950.20091117003545@redmine.open-bio.org>
Message-ID: <redmine.journal-14817.20120421181607@redmine.open-bio.org>


Issue #2950 has been updated by Lenna Peterson.


Did this commit close this bug?  https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
----------------------------------------
Bug #2950: Bio.PDBIO.save writes MODEL records without model id
https://redmine.open-bio.org/issues/2950

Author: Barry Finzel
Status: In Progress
Priority: Normal
Assignee: Konstantin Okonechnikov
Category: Main Distribution
Target version: Not Applicable
URL: 


The MODEL record format for PDB files has an integer model identifier
(e.g., "MODEL        1") not currently written to output.
Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored.
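
For reference, the expected record can be produced with a single format string. This is a sketch of the record layout only, not the PDBIO fix; format_model_record is a hypothetical helper, and it assumes the usual PDB convention of a right-justified integer serial in columns 11-14.

```python
def format_model_record(serial):
    """Return a PDB MODEL line with the serial in columns 11-14."""
    if not 0 <= serial <= 9999:
        raise ValueError("model serial must fit in four columns")
    # "MODEL" occupies columns 1-5, columns 6-10 are blank.
    return "MODEL     %4d" % serial


record = format_model_record(1)
```

For serial 1 this gives the same text as the example in the report above.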




From arklenna at gmail.com  Sun Apr 22 06:48:10 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Sun, 22 Apr 2012 02:48:10 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
	(closes 2943)
Message-ID: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>

I've implemented the parser changes (written by Paul Bathen; see bug
report) to allow the MMCIF parser to handle multiple models.

Models are now accessed by a string key of their model number, rather
than an arbitrary index (structure['1'] versus structure[0]).

I updated the MMCIF unit test for the new model access method and
added a test file with multiple models.

I'm not sure if there is documentation to be updated re: accessing the models.

issue: https://redmine.open-bio.org/issues/2943
pull request: https://github.com/biopython/biopython/pull/34

- Lenna


From MatatTHC at gmx.de  Sun Apr 22 10:06:28 2012
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Sun, 22 Apr 2012 12:06:28 +0200
Subject: [Biopython-dev] SeqIO circular
In-Reply-To: <CALNFT0jTxFSbqn+f3hS-KZ2Z09xsgoKPFSow1BO3PdDGrJ7hag@mail.gmail.com>
References: <CALNFT0jq=VTwSDv-4x7ZrHoQRLajCUHY8NGPMw9cDuGnwwNiuw@mail.gmail.com>
	<CAKVJ-_7MpLRCModFfMdRPcVDjk42nVCJ--OwNBnAJv3wNcns_A@mail.gmail.com>
	<CALNFT0jTxFSbqn+f3hS-KZ2Z09xsgoKPFSow1BO3PdDGrJ7hag@mail.gmail.com>
Message-ID: <CALNFT0hrc+T-0xWesCuK0E5X8=mcDCqXoRRJJ4ms2qAibWXhTg@mail.gmail.com>

Hi,

Since this bug seems to be of low priority, I decided to try my best to
help and searched the web a bit.
It seems that the property is stored in PrimarySeq or Seq in BioPerl.
See for instance:

http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/Seq.pm
http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/PrimarySeq.pm

Or also:
http://bugzilla.open-bio.org/show_bug.cgi?id=2578

This seems to be realised as a boolean variable or function.
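
On the GenBank side, the topology is a token on the LOCUS line, so it can be read without any library support. The sketch below is a stopgap illustration and not Biopython or BioPerl API: locus_topology is a hypothetical helper and the LOCUS line is made up.

```python
def locus_topology(locus_line):
    """Return 'circular', 'linear', or None if the LOCUS line does not say."""
    if not locus_line.startswith("LOCUS"):
        raise ValueError("not a LOCUS line")
    # Skip the keyword and the locus name, then look for the topology token.
    for token in locus_line.split()[2:]:
        if token.lower() in ("circular", "linear"):
            return token.lower()
    return None


# Made-up example line in the usual GenBank layout.
line = "LOCUS       DEMO0001                 120 bp    DNA     circular SYN 01-JAN-2000"
topology = locus_topology(line)
```

A boolean such as is_circular, as in BioPerl, could then be derived by comparing the result with 'circular'.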

Regards,
Matthias

2012/4/4 Matthias Bernt <MatatTHC at gmx.de>:
> Hi,
>
> are there any news on this? May I help somehow? But I have to admit
> that I barely speak perl and have no experience with bioperl. If
> someone tells me where to look I might still try it.
>
> Matthias
>
> 2012/3/29 Peter Cock <p.j.a.cock at googlemail.com>:
>> On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt <MatatTHC at gmx.de> wrote:
>>> Hi,
>>>
>>> Is it possible to get the property if a genome is circular / linear
>>> from SeqIO applied to genbank files? I could not find it.
>>>
>>> There is also a related bugreport:
>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2578
>>>
>>> I used the old parser before and switched to SeqIO which I really like
>>> for the possibilities to parse different formats... but I really need
>>> the information.
>>
>> Does anyone happen to have a BioPerl + BioSQL setup installed
>> and working? IIRC checking that to make sure however we
>> store the circular was compatible was the only real hurdle.
>>
>> Peter


From redmine at redmine.open-bio.org  Sun Apr 22 18:46:35 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 22 Apr 2012 18:46:35 +0000
Subject: [Biopython-dev] [Biopython - Bug #2950] Bio.PDBIO.save writes MODEL
	records without model id
References: <redmine.issue-2950.20091117003545@redmine.open-bio.org>
Message-ID: <redmine.journal-14818.20120422184635@redmine.open-bio.org>


Issue #2950 has been updated by Eric Talevich.

Assignee deleted (Konstantin Okonechnikov)

Yes it did, thanks. I'll close this bug now.
----------------------------------------
Bug #2950: Bio.PDBIO.save writes MODEL records without model id
https://redmine.open-bio.org/issues/2950

Author: Barry Finzel
Status: In Progress
Priority: Normal
Assignee: 
Category: Main Distribution
Target version: Not Applicable
URL: 






From redmine at redmine.open-bio.org  Sun Apr 22 18:48:39 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 22 Apr 2012 18:48:39 +0000
Subject: [Biopython-dev] [Biopython - Bug #2951] (Closed) PDBParser assigns
	model 0 to first model no matter what...
References: <redmine.issue-2951.20091117011131@redmine.open-bio.org>
Message-ID: <redmine.journal-14819.20120422184839@redmine.open-bio.org>


Issue #2951 has been updated by Eric Talevich.

Status changed from New to Closed
% Done changed from 0 to 100

Closed with this commit, as pointed out just now by Lenna Peterson:
https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9

----------------------------------------
Bug #2951: PDBParser assigns model 0 to first model no matter what...
https://redmine.open-bio.org/issues/2951

Author: TallPaul empty
Status: Closed
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.52
URL: 


I'm not sure if this is a bug or a feature, but PDBParser assigns the first model it sees as model 0 and then increments from there. This means someone thinking they are studying model X is actually studying X+1, and even that assumes that authors always use sequential model numbers without skips. If authors CAN skip model numbers, i.e., MODEL 2, then MODEL 4, then MODEL 5... then in Biopython these would be models 0, 1, and 2 in the structure... yuck.

If this needs to be maintained for posterity, I would suggest adding another field to capture the TRUE model number if it exists.

See lines 106 and 122 here:
http://github.com/biopython/biopython/blob/master/Bio/PDB/PDBParser.py#106

Paul




From redmine at redmine.open-bio.org  Sun Apr 22 18:49:43 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 22 Apr 2012 18:49:43 +0000
Subject: [Biopython-dev] [Biopython - Bug #2950] (Closed) Bio.PDBIO.save
	writes MODEL records without model id
References: <redmine.issue-2950.20091117003545@redmine.open-bio.org>
Message-ID: <redmine.journal-14820.20120422184943@redmine.open-bio.org>


Issue #2950 has been updated by Eric Talevich.

Status changed from In Progress to Closed
% Done changed from 20 to 100

Closed the blocker, too. Thanks again to Konstantin.
----------------------------------------
Bug #2950: Bio.PDBIO.save writes MODEL records without model id
https://redmine.open-bio.org/issues/2950

Author: Barry Finzel
Status: Closed
Priority: Normal
Assignee: 
Category: Main Distribution
Target version: Not Applicable
URL: 


The MODEL record format for PDB files has an integer model identifier
(e.g., "MODEL        1") not currently written to output.
Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored.




From arklenna at gmail.com  Mon Apr 23 05:35:23 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Mon, 23 Apr 2012 01:35:23 -0400
Subject: [Biopython-dev] pull request: Bio.SCOP.Raf chem dict updater
Message-ID: <CAK610_4RDX1DFf0YQ4pXW5EfUwaXG8vKUj37kco+58qn9=017w@mail.gmail.com>

I've adapted Hongbo Zhu's code to extract the three to one letter
codes directly from the PDB Chemical Component dictionary.

Existing calls of `from Raf import to_one_letter_code` should work as expected.

pull request: https://github.com/biopython/biopython/pull/35
issue: https://redmine.open-bio.org/issues/3169

Lenna


From redmine at redmine.open-bio.org  Mon Apr 23 17:00:15 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 23 Apr 2012 17:00:15 +0000
Subject: [Biopython-dev] [Biopython - Bug #2943] (Closed) MMCIFParser only
	handling a single model.
References: <redmine.issue-2943.20091103125836@redmine.open-bio.org>
Message-ID: <redmine.journal-14823.20120423170015@redmine.open-bio.org>


Issue #2943 has been updated by Peter Cock.

Status changed from New to Closed
% Done changed from 0 to 100

This should be working on the trunk now ready for Biopython 1.60 - thanks Lenna. See this commit and those preceding it:
https://github.com/biopython/biopython/commit/2ac67cd14682a4bbad9e09654485914f9495138d

If we've missed anything please reopen this bug. Thanks Paul!
----------------------------------------
Bug #2943: MMCIFParser only handling a single model.
https://redmine.open-bio.org/issues/2943

Author: TallPaul empty
Status: Closed
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.52
URL: 


MMCIFParser as written only handles a single model in a protein. Any protein that has multiple models with repeating chains and residues will raise an exception, since the residue ID will already exist. Please make the following changes in MMCIFParser.py:

Change the __doc__ setting:
#Optional __DOC__ change if the new MMCIFlex is not used nor the changes
#to MMCIF2Dict based on the new MMCIFlex.
#Mod by Paul T. Bathen to reflect MMCIFlex built solely in Python
__doc__="mmCIF parser (implemented solely in Python, no lex/flex/C code needed)" 

Regardless of the DOC changes,
insert the following model_list line:
        occupancy_list=mmcif_dict["_atom_site.occupancy"]
        fieldname_list=mmcif_dict["_atom_site.group_PDB"]
        #Added by Paul T. Bathen Nov 2009
        model_list=mmcif_dict["_atom_site.pdbx_PDB_model_num"]
        try:
 
Make the following changes:
        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #current_model_id=0
        structure_builder=self._structure_builder
        structure_builder.init_structure(structure_id)
        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #structure_builder.init_model(current_model_id)
        structure_builder.init_seg(" ")
        #Added by Paul T. Bathen Nov 2009
        current_model_id = -1

Make the following changes in the for loop:
            #Note by Paul T. Bathen: should MMCIFParser include 
            #the HOH and WAT stmts in PDBParser immediately below?
            #if fieldname=="HETATM":
            #    if resname=="HOH" or resname=="WAT":
            #        hetero_flag="W"
            #    else:
            #        hetero_flag="H"

            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "
 
            #Added by Paul T. Bathen Nov 2009
            model_id = model_list[i]
            if current_model_id != model_id:
                current_model_id = model_id
                structure_builder.init_model(current_model_id)
            #end of addition

After these changes took place, and with the new MMCIFlex and MMCIF2Dict in place, I was able to parse and test 2beg.cif and pdb2beg.ent and both parsed with the same number of models, chains, and residues. 
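
The model-boundary logic in the patch above can be sketched without
Biopython. The dictionary below is toy data standing in for the output
of MMCIF2Dict, and split_by_model is a hypothetical helper name; only
the two _atom_site item names come from the mmCIF format:

```python
from collections import OrderedDict


def split_by_model(mmcif_dict):
    """Group atom names by model number, in file order."""
    model_list = mmcif_dict["_atom_site.pdbx_PDB_model_num"]
    atom_names = mmcif_dict["_atom_site.label_atom_id"]
    models = OrderedDict()
    current_model_id = None
    for i, model_id in enumerate(model_list):
        if model_id != current_model_id:
            # A change in the model column marks the start of a new model.
            current_model_id = model_id
            models.setdefault(current_model_id, [])
        models[current_model_id].append(atom_names[i])
    return models


# Toy data: two models with the same two atoms each (values are strings).
toy = {
    "_atom_site.pdbx_PDB_model_num": ["1", "1", "2", "2"],
    "_atom_site.label_atom_id": ["N", "CA", "N", "CA"],
}
for model_id, atoms in split_by_model(toy).items():
    print(model_id, atoms)
```

Because each model gets its own container, the repeated atoms in the
second model no longer collide with those in the first.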

Paul




From p.j.a.cock at googlemail.com  Mon Apr 23 17:02:01 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 23 Apr 2012 18:02:01 +0100
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
Message-ID: <CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>

On Sun, Apr 22, 2012 at 7:48 AM, Lenna Peterson <arklenna at gmail.com> wrote:
> I've implemented the parser changes (written by Paul Bathen; see bug
> report) to allow the MMCIF parser to handle multiple models.
>
> Models are now accessed by a string key of their model number, rather
> than an arbitrary index (structure['1'] versus structure[0]).
>
> I updated the MMCIF unit test for the new model access method and
> added a test file with multiple models.
>
> I'm not sure if there is documentation to be updated re: accessing the models.
>
> issue: https://redmine.open-bio.org/issues/2943
> pull request: https://github.com/biopython/biopython/pull/34

I've applied that to the trunk, thank you, but on reading this, why are the
model keys strings and not integers? Does MMCIF allow odd keys or
something?

Peter


From eric.talevich at gmail.com  Mon Apr 23 20:10:27 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 23 Apr 2012 16:10:27 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
Message-ID: <CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>

On Mon, Apr 23, 2012 at 1:02 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Sun, Apr 22, 2012 at 7:48 AM, Lenna Peterson <arklenna at gmail.com> wrote:
>> I've implemented the parser changes (written by Paul Bathen; see bug
>> report) to allow the MMCIF parser to handle multiple models.
>>
>> Models are now accessed by a string key of their model number, rather
>> than an arbitrary index (structure['1'] versus structure[0]).
>>
>> I updated the MMCIF unit test for the new model access method and
>> added a test file with multiple models.
>>
>> I'm not sure if there is documentation to be updated re: accessing the models.
>>
>> issue: https://redmine.open-bio.org/issues/2943
>> pull request: https://github.com/biopython/biopython/pull/34
>
> I've applied that to the trunk, thank you, but on reading this, why are the
> model keys strings and not integers? Does MMCIF allow odd keys or
> something?
>

Ack, I didn't look at that closely enough. Check out this patch to see
the current situation:
https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9

The models associated with a structure are numbered with a sequential
integer id, starting from 0. It's always been like that in our PDB
parser and we haven't changed it. To ensure that model numbers
specified in the PDB file are preserved when writing the PDB back to
file, the above patch introduced a new attribute on the Model object
called serial_num (also an integer, equal to model.id unless specified
otherwise). That attribute is only used when writing a new PDB file;
Model.__getitem__ still uses Model.id as before.

Perhaps that's surprising now that we read the serial numbers, but it
kept backward compatibility. Plus, it preserves list-like behavior
(item access via integers), even though the models are actually stored
in a dict.

So!

In the mmCIF parser, the calls to structure_builder.init_model should
be given two arguments instead of one: an integer id counting from 0,
and then another integer (probably) containing the model "serial
number" specified in the mmCIF file. In the event that an mmCIF file
doesn't specify the model number, the serial number should be the same
as the sequential id.
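
A rough sketch of that numbering scheme, assuming the model column
arrives as a list of strings (or None when the file has no such
column). The function name number_models is made up for illustration;
each pair it returns corresponds to the two arguments suggested for
structure_builder.init_model:

```python
def number_models(model_column):
    """Return (sequential_id, serial_num) pairs, one per model."""
    if model_column is None:
        # No model column: a single model whose serial number
        # equals its sequential id.
        return [(0, 0)]
    pairs = []
    current_serial = None
    for value in model_column:
        serial_num = int(value)
        if serial_num != current_serial:
            current_serial = serial_num
            # The sequential id counts from 0 regardless of the serial.
            pairs.append((len(pairs), serial_num))
    return pairs


print(number_models(["2", "2", "4", "5", "5"]))  # [(0, 2), (1, 4), (2, 5)]
```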

Cool? This will also help us convert between PDB and mmCIF formats in
the future.

As for accessing the models by their serial number, using string keys
seems like an effective workaround, but still obviously a workaround
rather than an ideal situation. Let's discuss that a little more,
perhaps file another bug when we've reached some consensus.

Best,
Eric


From eric.talevich at gmail.com  Mon Apr 23 20:32:11 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 23 Apr 2012 16:32:11 -0400
Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF
 parser (#33)
In-Reply-To: <CAK610_6FCExTXK6k8p8k35mRvhRpdARyVXNu8pMVs0b4kicXRw@mail.gmail.com>
References: <biopython/biopython/pull/33@github.com>
	<CAKVJ-_4hv_MCb+Cmxqt8F0E09OeP+uOAX=Pj5Ajixyr8asKqjg@mail.gmail.com>
	<CAK610_6FCExTXK6k8p8k35mRvhRpdARyVXNu8pMVs0b4kicXRw@mail.gmail.com>
Message-ID: <CAMC681mRNgjVNMSWBANMh8Ztjj=cS-jBVAL6ntjXVJYZZekk3w@mail.gmail.com>

On Fri, Apr 20, 2012 at 8:57 PM, Lenna Peterson <arklenna at gmail.com> wrote:
>
> I've started testing the PLY lexer on PyPy. NumPyPy now implements
> more functions needed by PDB; the only things I found to be missing
> are random and linalg. This eliminates Superimposer, FragmentMapper,
> and Vector.
>
> I played around with trying to spoof "import numpy" to automatically
> import numpypy (code here: https://gist.github.com/2432815) but I
> don't think that's wise yet.
>
> My last commit to this branch was a few changes to allow the MMCIF
> parser to work on NumPy. PyPy won't run `setup.py test` due to global
> numpy failure, but if I install this branch and `pypy test_MMCIF.py`,
> it passes.
>
> Anybody with more PyPy and/or package structuring experience have thoughts?
>
> Lenna

Would it be more or less error-prone to simply replace every numpy
import with this (after testing each module on PyPy):

try:
    import numpy
except:
    import numpypy as numpy

Or similarly, use this as one of our compatibility utilities:

from Bio import numpy
# Some conditional junk in Bio/__init__.py or setup.py to reveal this
# module to PyPy and CPython as needed
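
One possible shape for such a compatibility utility, sketched with the
standard library only. The helper name import_first is hypothetical,
and the numpy/numpypy usage is left as a comment because neither
module is assumed to be installed; unlike the bare "except:" above, it
catches only ImportError so that unrelated failures still surface:

```python
import importlib


def import_first(*names):
    """Import and return the first module in names that is available."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of these modules could be imported: %s"
                      % ", ".join(names))


# Hypothetical usage mirroring the proposal:
#     numpy = import_first("numpy", "numpypy")
# Demonstrated here with a module that is always present.
math_mod = import_first("no_such_module_for_this_sketch", "math")
print(math_mod.sqrt(16.0))
```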


In either case, here's the relatively short list of modules that would
need to be modified:

Bio/Affy/CelFile.py
Bio/Cluster/__init__.py
Bio/KDTree/KDTree.py
Bio/LogisticRegression.py
Bio/MarkovModel.py
Bio/MaxEntropy.py
Bio/NaiveBayes.py
Bio/PDB/Atom.py
Bio/PDB/FragmentMapper.py
Bio/PDB/MMCIFParser.py
Bio/PDB/NeighborSearch.py
Bio/PDB/PDBParser.py
Bio/PDB/ResidueDepth.py
Bio/PDB/Superimposer.py
Bio/PDB/Vector.py
Bio/SVDSuperimposer/SVDSuperimposer.py
Bio/Statistics/lowess.py
Bio/SubsMat/__init__.py
Bio/kNN.py


From p.j.a.cock at googlemail.com  Mon Apr 23 20:47:02 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 23 Apr 2012 21:47:02 +0100
Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF
 parser (#33)
In-Reply-To: <CAMC681mRNgjVNMSWBANMh8Ztjj=cS-jBVAL6ntjXVJYZZekk3w@mail.gmail.com>
References: <biopython/biopython/pull/33@github.com>
	<CAKVJ-_4hv_MCb+Cmxqt8F0E09OeP+uOAX=Pj5Ajixyr8asKqjg@mail.gmail.com>
	<CAK610_6FCExTXK6k8p8k35mRvhRpdARyVXNu8pMVs0b4kicXRw@mail.gmail.com>
	<CAMC681mRNgjVNMSWBANMh8Ztjj=cS-jBVAL6ntjXVJYZZekk3w@mail.gmail.com>
Message-ID: <CAKVJ-_4Vh306dDL-9uTag=RO238SeMs3B_CZwWxkA7oX=9rHkw@mail.gmail.com>

On Mon, Apr 23, 2012 at 9:32 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> Would it be more or less error-prone to simply replace every numpy
> import with this (after testing each module on PyPy):
>
> try:
>     import numpy
> except:
>     import numpypy as numpy
>

Maybe, but right now do any of our NumPy using modules
pass under PyPy? I don't believe so... but I haven't tried
a PyPy nightly build lately.

It was unfortunate that originally PyPy's micronumpy
pretended to be numpy, so that you'd write "import numpy"
and think it worked but be surprised later when something
fundamental like the dot function was missing, or 2D arrays.
That led to a few nasty try/import lines in our unit tests.

Let's wait and see how PyPy's numpy support improves
before rushing to change any of our numpy imports. I am
hopeful that Bio.PDB will be fine in their next release,
whereas things using the NumPy C API will probably not be.

Peter


From arklenna at gmail.com  Mon Apr 23 23:05:03 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Mon, 23 Apr 2012 19:05:03 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
Message-ID: <CAK610_69XxiUEaTcLaK_RqrLCG5CmXMy-ytgjSdgbZy6c-VHhw@mail.gmail.com>

On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> Ack, I didn't look at that closely enough. Check out this patch to see
> the current situation:
> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
>
> The models associated with a structure are numbered with a sequential
> integer id, starting from 0. It's always been like that in our PDB
> parser and we haven't changed it. To ensure that model numbers
> specified in the PDB file are preserved when writing the PDB back to
> file, the above patch introduced a new attribute on the Model object
> called serial_num (also an integer, equal to model.id unless specified
> otherwise). That attribute is only used when writing a new PDB file;
> Model.__getitem__ still uses Model.id as before.
>
> Perhaps that's surprising now that we read the serial numbers, but it
> kept backward compatibility. Plus, it preserves list-like behavior
> (item access via integers), even though the models are actually stored
> in a dict.
>
> So!
>
> In the mmCIF parser, the calls to structure_builder.init_model should
> be given two arguments instead of one: an integer id counting from 0,
> and then another integer (probably) containing the model "serial
> number" specified in the mmCIF file. In the event that an mmCIF file
> doesn't specify the model number, the serial number should be the same
> as the sequential id.
>
> Cool? This will also help us convert between PDB and mmCIF formats in
> the future.


Got it. I'm working on implementing the serial_number/model_number
dichotomy for MMCIF.


> As for accessing the models by their serial number, using string keys
> seems like an effective workaround, but still obviously a workaround
> rather than an ideal situation. Let's discuss that a little more,
> perhaps file another bug when we've reached some consensus.


Er, I made and then lost (still haven't *quite* gotten the hang of git
rebase) a patch that applied int() to the MMCIF model numbers. I'll
add that back so both model and serial numbers are ints.


Lenna


From arklenna at gmail.com  Tue Apr 24 04:25:12 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 24 Apr 2012 00:25:12 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
Message-ID: <CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>

On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> Ack, I didn't look at that closely enough. Check out this patch to see
> the current situation:
> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
>
> The models associated with a structure are numbered with a sequential
> integer id, starting from 0. It's always been like that in our PDB
> parser and we haven't changed it. To ensure that model numbers
> specified in the PDB file are preserved when writing the PDB back to
> file, the above patch introduced a new attribute on the Model object
> called serial_num (also an integer, equal to model.id unless specified
> otherwise). That attribute is only used when writing a new PDB file;
> Model.__getitem__ still uses Model.id as before.
>
> Perhaps that's surprising now that we read the serial numbers, but it
> kept backward compatibility. Plus, it preserves list-like behavior
> (item access via integers), even though the models are actually stored
> in a dict.
>
> So!
>
> In the mmCIF parser, the calls to structure_builder.init_model should
> be given two arguments instead of one: an integer id counting from 0,
> and then another integer (probably) containing the model "serial
> number" specified in the mmCIF file. In the event that an mmCIF file
> doesn't specify the model number, the serial number should be the same
> as the sequential id.
>
> Cool? This will also help us convert between PDB and mmCIF formats in
> the future.
>
> As for accessing the models by their serial number, using string keys
> seems like an effective workaround, but still obviously a workaround
> rather than an ideal situation. Let's discuss that a little more,
> perhaps file another bug when we've reached some consensus.
>
> Best,
> Eric


Hi Eric,

I believe I've implemented the model_id/serial_id system found in PDB:

https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d

Please let me know if you think that looks right. I couldn't find an
mmCIF file without a model column to test, but I believe in that case
it will assign model_id and serial_id to 0. Would that be the correct
behavior?

I also modified the unit test to check the model serial_num.
https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6

Currently serial_num is int() of the CIF model column. Regarding
access by string serial_num, I am concerned that the int/string access
would be too subtle (structure[0] == structure['1']; structure[1] ==
structure['2']?). Perhaps an accessor function? i.e.
structure.get_model('1')

Let me know if you think I should write get_model() or something along
those lines.
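
A sketch of what such an accessor might look like. The Model
namedtuple is a stand-in carrying only the id and serial_num
attributes discussed in this thread, and get_model here is a free
function over a list, not an existing Bio.PDB method:

```python
from collections import namedtuple

# Stand-in for Bio.PDB's Model; only the two attributes discussed here.
Model = namedtuple("Model", ["id", "serial_num"])


def get_model(models, serial_num):
    """Return the model whose serial number matches (int or str accepted)."""
    wanted = int(serial_num)
    for model in models:
        if model.serial_num == wanted:
            return model
    raise KeyError("no model with serial number %d" % wanted)


# Serial numbers 1, 2 and 4 stored under sequential ids 0, 1 and 2.
structure = [Model(0, 1), Model(1, 2), Model(2, 4)]
print(get_model(structure, "4").id)  # 2
```

A named accessor keeps structure[0] meaning "first model" while making
lookup by serial number explicit.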

Lenna


From eric.talevich at gmail.com  Tue Apr 24 15:38:50 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 24 Apr 2012 11:38:50 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
	<CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>
Message-ID: <CAMC681n8ZbOy80xiPRaY9zzCw3V3Au-oWiftoS1B36zEfNqoKA@mail.gmail.com>

On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson <arklenna at gmail.com> wrote:
> On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> Ack, I didn't look at that closely enough. Check out this patch to see
>> the current situation:
>> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
>>
>> The models associated with a structure are numbered with a sequential
>> integer id, starting from 0. It's always been like that in our PDB
>> parser and we haven't changed it. To ensure that model numbers
>> specified in the PDB file are preserved when writing the PDB back to
>> file, the above patch introduced a new attribute on the Model object
>> called serial_num (also an integer, equal to model.id unless specified
>> otherwise). That attribute is only used when writing a new PDB file;
>> Model.__getitem__ still uses Model.id as before.
>>
>> Perhaps that's surprising now that we read the serial numbers, but it
>> kept backward compatibility. Plus, it preserves list-like behavior
>> (item access via integers), even though the models are actually stored
>> in a dict.
>>
>> So!
>>
>> In the mmCIF parser, the calls to structure_builder.init_model should
>> be given two arguments instead of one: an integer id counting from 0,
>> and then another integer (probably) containing the model "serial
>> number" specified in the mmCIF file. In the event that an mmCIF file
>> doesn't specify the model number, the serial number should be the same
>> as the sequential id.
>>
>> Cool? This will also help us convert between PDB and mmCIF formats in
>> the future.
>>
>> As for accessing the models by their serial number, using string keys
>> seems like an effective workaround, but still obviously a workaround
>> rather than an ideal situation. Let's discuss that a little more,
>> perhaps file another bug when we've reached some consensus.
>>
>> Best,
>> Eric
>
>
> Hi Eric,
>
> I believe I've implemented the model_id/serial_id system found in PDB:
>
> https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d
>
> Please let me know if you think that looks right. I couldn't find an
> mmCIF file without a model column to test, but I believe in that case
> it will assign model_id and serial_id to 0. Would that be the correct
> behavior?
>
> I also modified the unit test to check the model serial_num.
> https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6
>
> Currently serial_num is int() of the CIF model column. Regarding
> access by string serial_num, I am concerned that the int/string access
> would be too subtle (structure[0] == structure['1']; structure[1] ==
> structure['2']?). Perhaps an accessor function? i.e.
> structure.get_model('1')
>
> Let me know if you think I should write get_model() or something along
> those lines.
>
> Lenna

I left another nitpick on b453a, but besides that it looks exactly right to me.

The string/int distinction would indeed be weird, especially for newer
Python users coming from Perl or Javascript. I don't see a direct
analogue for get_model(serial_num) in the other Entities (Residue,
Chain, Model, Structure), so I'm inclined to put off the decision for
now (i.e. leave it out of this patch set).

-Eric


From p.j.a.cock at googlemail.com  Tue Apr 24 15:58:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 Apr 2012 16:58:10 +0100
Subject: [Biopython-dev] BGZF support,
	was Re: Biopython 1.60 plans and beyond
In-Reply-To: <4F91E4CF.8040602@med.nyu.edu>
References: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>
	<4F91E4CF.8040602@med.nyu.edu>
Message-ID: <CAKVJ-_4k==uN0UYa17-xPV6OMjE-Wm5Yuohf=bzGKB5vwXmKVQ@mail.gmail.com>

On Fri, Apr 20, 2012 at 11:35 PM, Andrew Sczesnak
<andrew.sczesnak at med.nyu.edu> wrote:
> Peter,
>
> My colleague was writing some code using MafIndex and commented how long it
> took her to download, decompress and index the human multiz alignments from
> UCSC. It seems like it'd be great to keep the files compressed... perhaps if
> the code works well enough we can convince UCSC to host bgzip'd copies (or
> maybe make them available on one of our institutions' servers).

That does sound good - it is a perfect example of where BGZF is a more
useful alternative to standard GZIP. Some numbers on how much of a
size penalty it imposes would help though...

> Is I.J. interested in joining the community? I'd like to look into adding
> BGZF to MafIO and wouldn't want to duplicate I.J.'s effort. If not, could
> you put me in touch?

Perhaps he's just busy at the moment (BCC'd again)?

It should be easy enough to follow the BGZF changes to Bio/SeqIO/_index.py
and I'm willing to do this myself for MAF (while going over your index work -
something I want to do anyway). The only potential catch is avoiding offset
arithmetic.
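
To illustrate why offset arithmetic is the catch: a BGZF virtual
offset packs the start of the compressed block into the upper 48 bits
and the position within the decompressed block into the lower 16 bits.
A minimal sketch, with function names chosen for illustration:

```python
def make_virtual_offset(block_start, within_block):
    """Pack a block start and within-block offset into one integer."""
    if not 0 <= within_block < 2 ** 16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < 2 ** 48:
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Unpack a virtual offset into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


v = make_virtual_offset(100000, 10)
print(split_virtual_offset(v))          # (100000, 10)
# Adding a byte count to the packed value overflows the low 16 bits
# into the block start, which no longer points at a real block.
print(split_virtual_offset(v + 65530))  # (100001, 4)
```

Code that computes "offset + length" on plain file offsets therefore
has to be rewritten to ask the file handle for its position instead.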

Peter


From arklenna at gmail.com  Tue Apr 24 17:56:37 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 24 Apr 2012 13:56:37 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAMC681n8ZbOy80xiPRaY9zzCw3V3Au-oWiftoS1B36zEfNqoKA@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
	<CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>
	<CAMC681n8ZbOy80xiPRaY9zzCw3V3Au-oWiftoS1B36zEfNqoKA@mail.gmail.com>
Message-ID: <CAK610_635VJmQobsLU4ViueGuvsKbFG=_+SDdiDFq_Ogvqz3gg@mail.gmail.com>

On Tue, Apr 24, 2012 at 11:38 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson <arklenna at gmail.com> wrote:
> > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> >>
> >> Ack, I didn't look at that closely enough. Check out this patch to see
> >> the current situation:
> >> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
> >>
> >> The models associated with a structure are numbered with a sequential
> >> integer id, starting from 0. It's always been like that in our PDB
> >> parser and we haven't changed it. To ensure that model numbers
> >> specified in the PDB file are preserved when writing the PDB back to
> >> file, the above patch introduced a new attribute on the Model object
> >> called serial_num (also an integer, equal to model.id unless specified
> >> otherwise). That attribute is only used when writing a new PDB file;
> >> Model.__getitem__ still uses Model.id as before.
> >>
> >> Perhaps that's surprising now that we read the serial numbers, but it
> >> kept backward compatibility. Plus, it preserves list-like behavior
> >> (item access via integers), even though the models are actually stored
> >> in a dict.
> >>
> >> So!
> >>
> >> In the mmCIF parser, the calls to structure_builder.init_model should
> >> be given two arguments instead of one: an integer id counting from 0,
> >> and then another integer (probably) containing the model "serial
> >> number" specified in the mmCIF file. In the event that an mmCIF file
> >> doesn't specify the model number, the serial number should be the same
> >> as the sequential id.
> >>
> >> Cool? This will also help us convert between PDB and mmCIF formats in
> >> the future.
> >>
> >> As for accessing the models by their serial number, using string keys
> >> seems like an effective workaround, but still obviously a workaround
> >> rather than an ideal situation. Let's discuss that a little more,
> >> perhaps file another bug when we've reached some consensus.
> >>
> >> Best,
> >> Eric
> >
> >
> > Hi Eric,
> >
> > I believe I've implemented the model_id/serial_id system found in PDB:
> >
> > https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d
> >
> > Please let me know if you think that looks right. I couldn't find an
> > mmCIF file without a model column to test, but I believe in that case
> > it will assign model_id and serial_id to 0. Would that be the correct
> > behavior?
> >
> > I also modified the unit test to check the model serial_num.
> > https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6
> >
> > Currently serial_num is int() of the CIF model column. Regarding
> > access by string serial_num, I am concerned that the int/string access
> > would be too subtle (structure[0] == structure['1']; structure[1] ==
> > structure['2']?). Perhaps an accessor function? i.e.
> > structure.get_model('1')
> >
> > Let me know if you think I should write get_model() or something along
> > those lines.
> >
> > Lenna
>
> I left another nitpick on b453a, but besides that it looks exactly right to me.
>
> The string/int distinction would indeed be weird, especially for newer
> Python users coming from Perl or Javascript. I don't see a direct
> analogue for get_model(serial_num) in the other Entities (Residue,
> Chain, Model, Structure), so I'm inclined to put off the decision for
> now (i.e. leave it out of this patch set).
>
> -Eric


Eric,

Okay, I've changed the bad model num generic warning to a
PDBConstructionException.

New pull request to get MMCIF to the same state as PDB:
https://github.com/biopython/biopython/pull/36

So are chains accessed by 0, 1, 2 or by A, B, C?

Lenna


From anaryin at gmail.com  Tue Apr 24 17:59:10 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 24 Apr 2012 19:59:10 +0200
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAK610_635VJmQobsLU4ViueGuvsKbFG=_+SDdiDFq_Ogvqz3gg@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
	<CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>
	<CAMC681n8ZbOy80xiPRaY9zzCw3V3Au-oWiftoS1B36zEfNqoKA@mail.gmail.com>
	<CAK610_635VJmQobsLU4ViueGuvsKbFG=_+SDdiDFq_Ogvqz3gg@mail.gmail.com>
Message-ID: <CAJ9sUYPzMhcfwB2DPQwiSTn=146S=2aThzeqbj1sfEJUNOgG3w@mail.gmail.com>

Hi Lenna,

IMO, chains should be accessed by A, B, C; accessing them numerically
doesn't make sense.

Congrats on the GSOC application and on the good work so far!

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao


On 24 April 2012 19:56, Lenna Peterson <arklenna at gmail.com> wrote:

> On Tue, Apr 24, 2012 at 11:38 AM, Eric Talevich <eric.talevich at gmail.com>
> wrote:
> >
> > On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson <arklenna at gmail.com>
> wrote:
> > > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich <
> eric.talevich at gmail.com> wrote:
> > >>
> > >> Ack, I didn't look at that closely enough. Check out this patch to see
> > >> the current situation:
> > >>
> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
> > >>
> > >> The models associated with a structure are numbered with a sequential
> > >> integer id, starting from 0. It's always been like that in our PDB
> > >> parser and we haven't changed it. To ensure that model numbers
> > >> specified in the PDB file are preserved when writing the PDB back to
> > >> file, the above patch introduced a new attribute on the Model object
> > >> called serial_num (also an integer, equal to model.id unless
> specified
> > >> otherwise). That attribute is only used when writing a new PDB file;
> > >> Model.__getitem__ still uses Model.id as before.
> > >>
> > >> Perhaps that's surprising now that we read the serial numbers, but it
> > >> kept backward compatibility. Plus, it preserves list-like behavior
> > >> (item access via integers), even though the models are actually stored
> > >> in a dict.
> > >>
> > >> So!
> > >>
> > >> In the mmCIF parser, the calls to structure_builder.init_model should
> > >> be given two arguments instead of one: an integer id counting from 0,
> > >> and then another integer (probably) containing the model "serial
> > >> number" specified in the mmCIF file. In the event that an mmCIF file
> > >> doesn't specify the model number, the serial number should be the same
> > >> as the sequential id.
> > >>
> > >> Cool? This will also help us convert between PDB and mmCIF formats in
> > >> the future.
> > >>
> > >> As for accessing the models by their serial number, using string keys
> > >> seems like an effective workaround, but still obviously a workaround
> > >> rather than an ideal situation. Let's discuss that a little more,
> > >> perhaps file another bug when we've reached some consensus.
> > >>
> > >> Best,
> > >> Eric
> > >
> > >
> > > Hi Eric,
> > >
> > > I believe I've implemented the model_id/serial_id system found in PDB:
> > >
> > >
> https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d
> > >
> > > Please let me know if you think that looks right. I couldn't find an
> > > mmCIF file without a model column to test, but I believe in that case
> > > it will assign model_id and serial_id to 0. Would that be the correct
> > > behavior?
> > >
> > > I also modified the unit test to check the model serial_num.
> > >
> https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6
> > >
> > > Currently serial_num is int() of the CIF model column. Regarding
> > > access by string serial_num, I am concerned that the int/string access
> > > would be too subtle (structure[0] == structure['1']; structure[1] ==
> > > structure['2']?). Perhaps an accessor function? i.e.
> > > structure.get_model('1')
> > >
> > > Let me know if you think I should write get_model() or something along
> > > those lines.
> > >
> > > Lenna
> >
> > I left another nitpick on b453a, but besides that it looks exactly right
> to me.
> >
> > The string/int distinction would indeed be weird, especially for newer
> > Python users coming from Perl or Javascript. I don't see a direct
> > analogue for get_model(serial_num) in the other Entities (Residue,
> > Chain, Model, Structure), so I'm inclined to put off the decision for
> > now (i.e. leave it out of this patch set).
> >
> > -Eric
>
>
> Eric,
>
> Okay, I've changed the bad model num generic warning to a
> PDBConstructionException.
>
> New pull request to get MMCIF to the same state as PDB:
> https://github.com/biopython/biopython/pull/36
>
> So are chains accessed by 0, 1, 2 or by A, B, C?
>
> Lenna


From eric.talevich at gmail.com  Tue Apr 24 18:20:16 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 24 Apr 2012 14:20:16 -0400
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAK610_635VJmQobsLU4ViueGuvsKbFG=_+SDdiDFq_Ogvqz3gg@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
	<CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>
	<CAMC681n8ZbOy80xiPRaY9zzCw3V3Au-oWiftoS1B36zEfNqoKA@mail.gmail.com>
	<CAK610_635VJmQobsLU4ViueGuvsKbFG=_+SDdiDFq_Ogvqz3gg@mail.gmail.com>
Message-ID: <CAMC681nrx7iX7tdSr3F0L+Mm6g5LWw20gSQOj18Qga1dvh2T0w@mail.gmail.com>

On Tue, Apr 24, 2012 at 1:56 PM, Lenna Peterson <arklenna at gmail.com> wrote:
> On Tue, Apr 24, 2012 at 11:38 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson <arklenna at gmail.com> wrote:
>> > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> >>
>> >> Ack, I didn't look at that closely enough. Check out this patch to see
>> >> the current situation:
>> >> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
>> >>
>> >> The models associated with a structure are numbered with a sequential
>> >> integer id, starting from 0. It's always been like that in our PDB
>> >> parser and we haven't changed it. To ensure that model numbers
>> >> specified in the PDB file are preserved when writing the PDB back to
>> >> file, the above patch introduced a new attribute on the Model object
>> >> called serial_num (also an integer, equal to model.id unless specified
>> >> otherwise). That attribute is only used when writing a new PDB file;
>> >> Model.__getitem__ still uses Model.id as before.
>> >>
>> >> Perhaps that's surprising now that we read the serial numbers, but it
>> >> kept backward compatibility. Plus, it preserves list-like behavior
>> >> (item access via integers), even though the models are actually stored
>> >> in a dict.
>> >>
>> >> So!
>> >>
>> >> In the mmCIF parser, the calls to structure_builder.init_model should
>> >> be given two arguments instead of one: an integer id counting from 0,
>> >> and then another integer (probably) containing the model "serial
>> >> number" specified in the mmCIF file. In the event that an mmCIF file
>> >> doesn't specify the model number, the serial number should be the same
>> >> as the sequential id.
>> >>
>> >> Cool? This will also help us convert between PDB and mmCIF formats in
>> >> the future.
>> >>
>> >> As for accessing the models by their serial number, using string keys
>> >> seems like an effective workaround, but still obviously a workaround
>> >> rather than an ideal situation. Let's discuss that a little more,
>> >> perhaps file another bug when we've reached some consensus.
>> >>
>> >> Best,
>> >> Eric
>> >
>> >
>> > Hi Eric,
>> >
>> > I believe I've implemented the model_id/serial_id system found in PDB:
>> >
>> > https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d
>> >
>> > Please let me know if you think that looks right. I couldn't find an
>> > mmCIF file without a model column to test, but I believe in that case
>> > it will assign model_id and serial_id to 0. Would that be the correct
>> > behavior?
>> >
>> > I also modified the unit test to check the model serial_num.
>> > https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6
>> >
>> > Currently serial_num is int() of the CIF model column. Regarding
>> > access by string serial_num, I am concerned that the int/string access
>> > would be too subtle (structure[0] == structure['1']; structure[1] ==
>> > structure['2']?). Perhaps an accessor function? i.e.
>> > structure.get_model('1')
>> >
>> > Let me know if you think I should write get_model() or something along
>> > those lines.
>> >
>> > Lenna
>>
>> I left another nitpick on b453a, but besides that it looks exactly right to me.
>>
>> The string/int distinction would indeed be weird, especially for newer
>> Python users coming from Perl or Javascript. I don't see a direct
>> analogue for get_model(serial_num) in the other Entities (Residue,
>> Chain, Model, Structure), so I'm inclined to put off the decision for
>> now (i.e. leave it out of this patch set).
>>
>> -Eric
>
>
> Eric,
>
> Okay, I've changed the bad model num generic warning to a
> PDBConstructionException.
>
> New pull request to get MMCIF to the same state as PDB:
> https://github.com/biopython/biopython/pull/36
>
> So are chains accessed by 0, 1, 2 or by A, B, C?
>
> Lenna

Cool, I just merged the pull request. Thanks!

As João said, chains are accessed by the letter ID via __getitem__
(implemented in Bio.PDB.Entity). You can get at them either way
through the child_list and child_dict attributes, too. Kind of a
thrill. I suppose we could eventually refactor the Entity-based
classes to use a single data structure (OrderedDict, namedtuple, numpy
array with named columns/rows?) in place of child_dict and child_list,
and clean up some of the redundant accessors.
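
A rough sketch of the two access paths, using a simplified stand-in class
rather than the real Bio.PDB.Entity (only the attribute names child_list and
child_dict are taken from the library; everything else here is illustrative):

```python
class Entity(object):
    """Simplified stand-in for a Bio.PDB Entity-style container."""

    def __init__(self, id):
        self.id = id
        self.child_list = []   # children in insertion order
        self.child_dict = {}   # the same children, keyed by their id

    def add(self, child):
        if child.id in self.child_dict:
            raise KeyError("%r defined twice" % (child.id,))
        self.child_list.append(child)
        self.child_dict[child.id] = child

    def __getitem__(self, id):
        # Item access goes through the id, not the position.
        return self.child_dict[id]

    def __iter__(self):
        return iter(self.child_list)

    def __len__(self):
        return len(self.child_list)


model = Entity(0)
for letter in "ABC":
    model.add(Entity(letter))

chain_by_letter = model["B"]             # access by chain letter
chain_by_position = model.child_list[1]  # access by position
```

Both lookups return the same object; keeping the list and the dict in sync is
the redundancy that a single ordered mapping would remove.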

-E


From anaryin at gmail.com  Tue Apr 24 18:25:15 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 24 Apr 2012 20:25:15 +0200
Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models
 (closes 2943)
In-Reply-To: <CAMC681nrx7iX7tdSr3F0L+Mm6g5LWw20gSQOj18Qga1dvh2T0w@mail.gmail.com>
References: <CAK610_7Ek1p9CrpBqv2euQhJV0_vLSmJ2+wdC2UO3w2V-wsMtg@mail.gmail.com>
	<CAKVJ-_4A6k-f2DVHXrDF-+K-QjJ+Me-sM0+ZC+dQTw7PYXS2XQ@mail.gmail.com>
	<CAMC681m+PmBr3v3ihpbmRL0_CHNQn=C7HKmC2098=ECuG2D1CA@mail.gmail.com>
	<CAK610_4NyNhQSmC8gm7w2iy5r36b9WTi-p0Y-cNLfOJ=ehbRnw@mail.gmail.com>
	<CAMC681n8ZbOy80xiPRaY9zzCw3V3Au-oWiftoS1B36zEfNqoKA@mail.gmail.com>
	<CAK610_635VJmQobsLU4ViueGuvsKbFG=_+SDdiDFq_Ogvqz3gg@mail.gmail.com>
	<CAMC681nrx7iX7tdSr3F0L+Mm6g5LWw20gSQOj18Qga1dvh2T0w@mail.gmail.com>
Message-ID: <CAJ9sUYPG57JbOq-=ax5aqquVOHagsxQNQVbf=z8UOg5LsuJ0hQ@mail.gmail.com>

I couldn't agree more with Eric on this. child_dict and child_list should
definitely be refactored into a single structure that is easier to understand
(and use). Also, we should take care of that memory leak... (try running
the parser over a lot of PDBs and you will see memory usage going up).

Cheers,

João


From p.j.a.cock at googlemail.com  Tue Apr 24 20:07:03 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 Apr 2012 21:07:03 +0100
Subject: [Biopython-dev] BGZF support,
	was Re: Biopython 1.60 plans and beyond
In-Reply-To: <67433BFC-673B-4F49-A582-6F419FD6E0B7@csail.mit.edu>
References: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>
	<4F91E4CF.8040602@med.nyu.edu>
	<CAKVJ-_4k==uN0UYa17-xPV6OMjE-Wm5Yuohf=bzGKB5vwXmKVQ@mail.gmail.com>
	<67433BFC-673B-4F49-A582-6F419FD6E0B7@csail.mit.edu>
Message-ID: <CAKVJ-_6J6LPVggPojU2mutOVob=v6oVv9-Gx=E=R1fQEg5zVkg@mail.gmail.com>

On Tue, Apr 24, 2012 at 7:24 PM, Irwin Jungreis <ILJungr at csail.mit.edu> wrote:
> Hello Andrew and Peter.
>

Hi again Irwin,

> The size penalty of bgz versus gzip for .maf files is quite small. For
> example, compressing the 6-way C. elegans alignment .maf files is 108.9 MB
> with gzip and 112 MB with bgz, a difference of less than 3%. (Each is
> smaller than the uncompressed file by a factor of about 4 or 5.)

That's good - and given the nature of the MAF format in line with
what I was hoping for - see also the overheads I got for FASTA,
SwissProt and UniProt XML here:
http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html

> I am not very familiar with biopython, so I've been using my own utilities.
> To work with alignments I create an index file consisting of a 32-byte
> record for each maf block. Each record contains the block start on the
> reference species chromosome, the block length on the reference species, and
> the virtual offset of the block start in the .maf file. I then have a
> utility that will extract the alignment for a given set of spliced regions,
> e.g., chrX:11568015-11569059+chrX:11569364-11569395 on the '-' strand, and
> output it as a list of pairs (assembly name, base string).
>
> I'd be happy to share, but I have no idea how this would fit into the
> existing biopython infrastructure.
>
> Best,
> Irwin

Ah - I must have misinterpreted your earlier email (off list). I'd
assumed you were using Andrew's Biopython branch which
indexes MAF files using an SQLite database of offsets. But
in practice the principle is the same - BGZF lets you have
good compression of MAF files and random access. Thank
you for clarifying this.
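
As a rough sketch of the idea: a BGZF virtual offset packs the start of the
compressed block and the position within the decompressed block into one
64-bit integer, and an index only needs to store that number per MAF block.
The 32-byte record layout below (three 8-byte integers plus 8 bytes of
padding) is purely an assumption for illustration, not Irwin's actual format,
and the helper names are hypothetical.

```python
import struct

# Assumed layout: reference start, reference length, virtual offset,
# then 8 padding bytes to reach 32 bytes per record.
RECORD = struct.Struct("<qqQ8x")


def make_virtual_offset(block_start, within_block):
    """Combine a compressed block start and an in-block offset."""
    if not 0 <= within_block < 2 ** 16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < 2 ** 48:
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Return (block_start, within_block) for a virtual offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


def pack_record(ref_start, ref_length, virtual_offset):
    """Serialise one index entry into a fixed-size record."""
    return RECORD.pack(ref_start, ref_length, virtual_offset)


def unpack_record(data):
    """Recover (ref_start, ref_length, virtual_offset) from a record."""
    return RECORD.unpack(data)
```

Because every record has the same size, an index like this can be searched
by seeking to record_number * 32 without reading the whole file.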

If you use Python at all perhaps you'd have some feedback
on Andrew's indexing plans? That would be great - Andrew's
done a great job explaining the proposed code usage here:
http://biopython.org/wiki/Multiple_Alignment_Format

Regards,

Peter


From redmine at redmine.open-bio.org  Wed Apr 25 02:33:04 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 25 Apr 2012 02:33:04 +0000
Subject: [Biopython-dev] [Biopython - Feature #3344] (New) Bio.PDB.Entity
	classes need a __contains__ method
Message-ID: <redmine.issue-3344.20120425023304@redmine.open-bio.org>


Issue #3344 has been reported by Eric Talevich.

----------------------------------------
Feature #3344: Bio.PDB.Entity classes need a __contains__ method
https://redmine.open-bio.org/issues/3344

Author: Eric Talevich
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


The various objects constructed by Bio.PDB have list-like and dict-like behaviors, for the most part. However, not all of the relevant magic methods have been implemented. (E.g. `residue["CA"]` works, but `"CA" in residue` does not.)

We could do more to support the list-like and dict-like behaviors, but let's start with __contains__.
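
A minimal sketch of what is being asked for, using simplified stand-in classes (not the actual Bio.PDB code; the class names are made up for illustration). Without `__contains__`, Python falls back to iterating over the children and comparing each one to the key, so the membership test does not find the id:

```python
class LegacyEntity(object):
    """Stand-in with __getitem__ and __iter__ but no __contains__."""

    def __init__(self, id):
        self.id = id
        self.child_list = []
        self.child_dict = {}

    def add(self, child):
        self.child_list.append(child)
        self.child_dict[child.id] = child

    def __getitem__(self, id):
        return self.child_dict[id]

    def __iter__(self):
        return iter(self.child_list)


class Entity(LegacyEntity):
    """Same container with the proposed __contains__ added."""

    def __contains__(self, id):
        # Membership is decided by child id, matching __getitem__.
        return id in self.child_dict


legacy = LegacyEntity("GLY")
legacy.add(LegacyEntity("CA"))

residue = Entity("GLY")
residue.add(Entity("CA"))
```

Here `"CA" in residue` is true and `"CB" in residue` is false, while `"CA" in legacy` is false even though `legacy["CA"]` succeeds.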


----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From redmine at redmine.open-bio.org  Thu Apr 26 03:36:04 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 26 Apr 2012 03:36:04 +0000
Subject: [Biopython-dev] [Biopython - Bug #3169] (Closed) to_one_letter_code
	in Bio.SCOP.Raf is old
References: <redmine.issue-3169.20110112094643@redmine.open-bio.org>
Message-ID: <redmine.journal-14828.20120426033604@redmine.open-bio.org>


Issue #3169 has been updated by Eric Talevich.

Status changed from New to Closed
% Done changed from 0 to 100

We've committed this fix now:
https://github.com/biopython/biopython/pull/35
----------------------------------------
Bug #3169: to_one_letter_code in Bio.SCOP.Raf is old
https://redmine.open-bio.org/issues/3169

Author: Hongbo Zhu
Status: Closed
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.56
URL: 


Hi, 

The dictionary to_one_letter_code in Bio.SCOP.Raf is a bit old now. The current dictionary is based on a table taken from the RAF release notes of ASTRAL. This is an old table and some new three-letter codes in the PDB are not found in it (e.g. M3L in 2X4W). ASTRAL does not use the table since v1.73. Rather, PDB Chemical Component Dictionary is used. See http://astral.berkeley.edu/seq.cgi?get=raf-edit-comments;ver=1.75

"Beginning with ASTRAL 1.73, the PDB's chemical dictionary is used to translate chemically modified residues, instead of the translation table from ASTRAL 1.55."

The PDB Chemical Component Dictionary can be obtained from: http://deposit.pdb.org/cc_dict_tut.html .

I have parsed the dictionary and there are 12054 three-letter codes (as of Jan 2011). Among them, most correspond to a one-letter code '?'. Still, there are 1245 three-letter codes corresponding to a one-letter code other than '?' (the list is attached at the end). Therefore, I suggest updating the to_one_letter_code dictionary in Bio.SCOP.Raf.

Best regards,
hongbo zhu

to_one_letter_code = {
    '00C':'C','01W':'X','0A0':'D','0A1':'Y','0A2':'K',
    '0A8':'C','0AA':'V','0AB':'V','0AC':'G','0AD':'G',
    '0AF':'W','0AG':'L','0AH':'S','0AK':'D','0AM':'A',
    '0AP':'C','0AU':'U','0AV':'A','0AZ':'P','0BN':'F',
    '0C ':'C','0CS':'A','0DC':'C','0DG':'G','0DT':'T',
    '0G ':'G','0NC':'A','0SP':'A','0U ':'U','0YG':'YG',
    '10C':'C','125':'U','126':'U','127':'U','128':'N',
    '12A':'A','143':'C','175':'ASG','193':'X','1AP':'A',
    '1MA':'A','1MG':'G','1PA':'F','1PI':'A','1PR':'N',
    '1SC':'C','1TQ':'W','1TY':'Y','200':'F','23F':'F',
    '23S':'X','26B':'T','2AD':'X','2AG':'G','2AO':'X',
    '2AR':'A','2AS':'X','2AT':'T','2AU':'U','2BD':'I',
    '2BT':'T','2BU':'A','2CO':'C','2DA':'A','2DF':'N',
    '2DM':'N','2DO':'X','2DT':'T','2EG':'G','2FE':'N',
    '2FI':'N','2FM':'M','2GT':'T','2HF':'H','2LU':'L',
    '2MA':'A','2MG':'G','2ML':'L','2MR':'R','2MT':'P',
    '2MU':'U','2NT':'T','2OM':'U','2OT':'T','2PI':'X',
    '2PR':'G','2SA':'N','2SI':'X','2ST':'T','2TL':'T',
    '2TY':'Y','2VA':'V','32S':'X','32T':'X','3AH':'H',
    '3AR':'X','3CF':'F','3DA':'A','3DR':'N','3GA':'A',
    '3MD':'D','3ME':'U','3NF':'Y','3TY':'X','3XH':'G',
    '4AC':'N','4BF':'Y','4CF':'F','4CY':'M','4DP':'W',
    '4F3':'GYG','4FB':'P','4FW':'W','4HT':'W','4IN':'X',
    '4MF':'N','4MM':'X','4OC':'C','4PC':'C','4PD':'C',
    '4PE':'C','4PH':'F','4SC':'C','4SU':'U','4TA':'N',
    '5AA':'A','5AT':'T','5BU':'U','5CG':'G','5CM':'C',
    '5CS':'C','5FA':'A','5FC':'C','5FU':'U','5HP':'E',
    '5HT':'T','5HU':'U','5IC':'C','5IT':'T','5IU':'U',
    '5MC':'C','5MD':'N','5MU':'U','5NC':'C','5PC':'C',
    '5PY':'T','5SE':'U','5ZA':'TWG','64T':'T','6CL':'K',
    '6CT':'T','6CW':'W','6HA':'A','6HC':'C','6HG':'G',
    '6HN':'K','6HT':'T','6IA':'A','6MA':'A','6MC':'A',
    '6MI':'N','6MT':'A','6MZ':'N','6OG':'G','70U':'U',
    '7DA':'A','7GU':'G','7JA':'I','7MG':'G','8AN':'A',
    '8FG':'G','8MG':'G','8OG':'G','9NE':'E','9NF':'F',
    '9NR':'R','9NV':'V','A  ':'A','A1P':'N','A23':'A',
    'A2L':'A','A2M':'A','A34':'A','A35':'A','A38':'A',
    'A39':'A','A3A':'A','A3P':'A','A40':'A','A43':'A',
    'A44':'A','A47':'A','A5L':'A','A5M':'C','A5O':'A',
    'A66':'X','AA3':'A','AA4':'A','AAR':'R','AB7':'X',
    'ABA':'A','ABR':'A','ABS':'A','ABT':'N','ACB':'D',
    'ACL':'R','AD2':'A','ADD':'X','ADX':'N','AEA':'X',
    'AEI':'D','AET':'A','AFA':'N','AFF':'N','AFG':'G',
    'AGM':'R','AGT':'X','AHB':'N','AHH':'X','AHO':'A',
    'AHP':'A','AHS':'X','AHT':'X','AIB':'A','AKL':'D',
    'ALA':'A','ALC':'A','ALG':'R','ALM':'A','ALN':'A',
    'ALO':'T','ALQ':'X','ALS':'A','ALT':'A','ALY':'K',
    'AP7':'A','APE':'X','APH':'A','API':'K','APK':'K',
    'APM':'X','APP':'X','AR2':'R','AR4':'E','ARG':'R',
    'ARM':'R','ARO':'R','ARV':'X','AS ':'A','AS2':'D',
    'AS9':'X','ASA':'D','ASB':'D','ASI':'D','ASK':'D',
    'ASL':'D','ASM':'X','ASN':'N','ASP':'D','ASQ':'D',
    'ASU':'N','ASX':'B','ATD':'T','ATL':'T','ATM':'T',
    'AVC':'A','AVN':'X','AYA':'A','AYG':'AYG','AZK':'K',
    'AZS':'S','AZY':'Y','B1F':'F','B1P':'N','B2A':'A',
    'B2F':'F','B2I':'I','B2V':'V','B3A':'A','B3D':'D',
    'B3E':'E','B3K':'K','B3L':'X','B3M':'X','B3Q':'X',
    'B3S':'S','B3T':'X','B3U':'H','B3X':'N','B3Y':'Y',
    'BB6':'C','BB7':'C','BB9':'C','BBC':'C','BCS':'C',
    'BCX':'C','BE2':'X','BFD':'D','BG1':'S','BGM':'G',
    'BHD':'D','BIF':'F','BIL':'X','BIU':'I','BJH':'X',
    'BLE':'L','BLY':'K','BMP':'N','BMT':'T','BNN':'A',
    'BNO':'X','BOE':'T','BOR':'R','BPE':'C','BRU':'U',
    'BSE':'S','BT5':'N','BTA':'L','BTC':'C','BTR':'W',
    'BUC':'C','BUG':'V','BVP':'U','BZG':'N','C  ':'C',
    'C12':'TYG','C1X':'K','C25':'C','C2L':'C','C2S':'C',
    'C31':'C','C32':'C','C34':'C','C36':'C','C37':'C',
    'C38':'C','C3Y':'C','C42':'C','C43':'C','C45':'C',
    'C46':'C','C49':'C','C4R':'C','C4S':'C','C5C':'C',
    'C66':'X','C6C':'C','C99':'TFG','CAF':'C','CAL':'X',
    'CAR':'C','CAS':'C','CAV':'X','CAY':'C','CB2':'C',
    'CBR':'C','CBV':'C','CCC':'C','CCL':'K','CCS':'C',
    'CCY':'CYG','CDE':'X','CDV':'X','CDW':'C','CEA':'C',
    'CFL':'C','CFY':'FCYG','CG1':'G','CGA':'E','CGU':'E',
    'CH ':'C','CH6':'MYG','CH7':'KYG','CHF':'X','CHG':'X',
    'CHP':'G','CHS':'X','CIR':'R','CJO':'GYG','CLE':'L',
    'CLG':'K','CLH':'K','CLV':'AFG','CM0':'N','CME':'C',
    'CMH':'C','CML':'C','CMR':'C','CMT':'C','CNU':'U',
    'CP1':'C','CPC':'X','CPI':'X','CQR':'GYG','CR0':'TLG',
    'CR2':'GYG','CR5':'G','CR7':'KYG','CR8':'HYG','CRF':'TWG',
    'CRG':'THG','CRK':'MYG','CRO':'GYG','CRQ':'QYG','CRU':'E',
    'CRW':'ASG','CRX':'ASG','CS0':'C','CS1':'C','CS3':'C',
    'CS4':'C','CS8':'N','CSA':'C','CSB':'C','CSD':'C',
    'CSE':'C','CSF':'C','CSH':'SHG','CSI':'G','CSJ':'C',
    'CSL':'C','CSO':'C','CSP':'C','CSR':'C','CSS':'C',
    'CSU':'C','CSW':'C','CSX':'C','CSY':'SYG','CSZ':'C',
    'CTE':'W','CTG':'T','CTH':'T','CUC':'X','CWR':'S',
    'CXM':'M','CY0':'C','CY1':'C','CY3':'C','CY4':'C',
    'CYA':'C','CYD':'C','CYF':'C','CYG':'C','CYJ':'X',
    'CYM':'C','CYQ':'C','CYR':'C','CYS':'C','CZ2':'C',
    'CZO':'GYG','CZZ':'C','D11':'T','D1P':'N','D3 ':'N',
    'D33':'N','D3P':'G','D3T':'T','D4M':'T','D4P':'X',
    'DA ':'A','DA2':'X','DAB':'A','DAH':'F','DAL':'A',
    'DAR':'R','DAS':'D','DBB':'T','DBM':'N','DBS':'S',
    'DBU':'T','DBY':'Y','DBZ':'A','DC ':'C','DC2':'C',
    'DCG':'G','DCI':'X','DCL':'X','DCT':'C','DCY':'C',
    'DDE':'H','DDG':'G','DDN':'U','DDX':'N','DFC':'C',
    'DFG':'G','DFI':'X','DFO':'X','DFT':'N','DG ':'G',
    'DGH':'G','DGI':'G','DGL':'E','DGN':'Q','DHA':'A',
    'DHI':'H','DHL':'X','DHN':'V','DHP':'X','DHU':'U',
    'DHV':'V','DI ':'I','DIL':'I','DIR':'R','DIV':'V',
    'DLE':'L','DLS':'K','DLY':'K','DM0':'K','DMH':'N',
    'DMK':'D','DMT':'X','DN ':'N','DNE':'L','DNG':'L',
    'DNL':'K','DNM':'L','DNP':'A','DNR':'C','DNS':'K',
    'DOA':'X','DOC':'C','DOH':'D','DON':'L','DPB':'T',
    'DPH':'F','DPL':'P','DPP':'A','DPQ':'Y','DPR':'P',
    'DPY':'N','DRM':'U','DRP':'N','DRT':'T','DRZ':'N',
    'DSE':'S','DSG':'N','DSN':'S','DSP':'D','DT ':'T',
    'DTH':'T','DTR':'W','DTY':'Y','DU ':'U','DVA':'V',
    'DXD':'N','DXN':'N','DYG':'DYG','DYS':'C','DZM':'A',
    'E  ':'A','E1X':'A','EDA':'A','EDC':'G','EFC':'C',
    'EHP':'F','EIT':'T','ENP':'N','ESB':'Y','ESC':'M',
    'EXY':'L','EY5':'N','EYS':'X','F2F':'F','FA2':'A',
    'FA5':'N','FAG':'N','FAI':'N','FCL':'F','FFD':'N',
    'FGL':'G','FGP':'S','FHL':'X','FHO':'K','FHU':'U',
    'FLA':'A','FLE':'L','FLT':'Y','FME':'M','FMG':'G',
    'FMU':'N','FOE':'C','FOX':'G','FP9':'P','FPA':'F',
    'FRD':'X','FT6':'W','FTR':'W','FTY':'Y','FZN':'K',
    'G  ':'G','G25':'G','G2L':'G','G2S':'G','G31':'G',
    'G32':'G','G33':'G','G36':'G','G38':'G','G42':'G',
    'G46':'G','G47':'G','G48':'G','G49':'G','G4P':'N',
    'G7M':'G','GAO':'G','GAU':'E','GCK':'C','GCM':'X',
    'GDP':'G','GDR':'G','GFL':'G','GGL':'E','GH3':'G',
    'GHG':'Q','GHP':'G','GL3':'G','GLH':'Q','GLM':'X',
    'GLN':'Q','GLQ':'E','GLU':'E','GLX':'Z','GLY':'G',
    'GLZ':'G','GMA':'E','GMS':'G','GMU':'U','GN7':'G',
    'GND':'X','GNE':'N','GOM':'G','GPL':'K','GS ':'G',
    'GSC':'G','GSR':'G','GSS':'G','GSU':'E','GT9':'C',
    'GTP':'G','GVL':'X','GYC':'CYG','GYS':'SYG','H2U':'U',
    'H5M':'P','HAC':'A','HAR':'R','HBN':'H','HCS':'X',
    'HDP':'U','HEU':'U','HFA':'X','HGL':'X','HHI':'H',
    'HHK':'AK','HIA':'H','HIC':'H','HIP':'H','HIQ':'H',
    'HIS':'H','HL2':'L','HLU':'L','HMF':'A','HMR':'R',
    'HOL':'N','HPC':'F','HPE':'F','HPQ':'F','HQA':'A',
    'HRG':'R','HRP':'W','HS8':'H','HS9':'H','HSE':'S',
    'HSL':'S','HSO':'H','HTI':'C','HTN':'N','HTR':'W',
    'HV5':'A','HVA':'V','HY3':'P','HYP':'P','HZP':'P',
    'I  ':'I','I2M':'I','I58':'K','I5C':'C','IAM':'A',
    'IAR':'R','IAS':'D','IC ':'C','IEL':'K','IEY':'HYG',
    'IG ':'G','IGL':'G','IGU':'G','IIC':'SHG','IIL':'I',
    'ILE':'I','ILG':'E','ILX':'I','IMC':'C','IML':'I',
    'IOY':'F','IPG':'G','IPN':'N','IRN':'N','IT1':'K',
    'IU ':'U','IYR':'Y','IYT':'T','JJJ':'C','JJK':'C',
    'JJL':'C','JW5':'N','K1R':'C','KAG':'G','KCX':'K',
    'KGC':'K','KOR':'M','KPI':'K','KST':'K','KYQ':'K',
    'L2A':'X','LA2':'K','LAA':'D','LAL':'A','LBY':'K',
    'LC ':'C','LCA':'A','LCC':'N','LCG':'G','LCH':'N',
    'LCK':'K','LCX':'K','LDH':'K','LED':'L','LEF':'L',
    'LEH':'L','LEI':'V','LEM':'L','LEN':'L','LET':'X',
    'LEU':'L','LG ':'G','LGP':'G','LHC':'X','LHU':'U',
    'LKC':'N','LLP':'K','LLY':'K','LME':'E','LMQ':'Q',
    'LMS':'N','LP6':'K','LPD':'P','LPG':'G','LPL':'X',
    'LPS':'S','LSO':'X','LTA':'X','LTR':'W','LVG':'G',
    'LVN':'V','LYM':'K','LYN':'K','LYR':'K','LYS':'K',
    'LYX':'K','LYZ':'K','M0H':'C','M1G':'G','M2G':'G',
    'M2L':'K','M2S':'M','M3L':'K','M5M':'C','MA ':'A',
    'MA6':'A','MA7':'A','MAA':'A','MAD':'A','MAI':'R',
    'MBQ':'Y','MBZ':'N','MC1':'S','MCG':'X','MCL':'K',
    'MCS':'C','MCY':'C','MDH':'X','MDO':'ASG','MDR':'N',
    'MEA':'F','MED':'M','MEG':'E','MEN':'N','MEP':'U',
    'MEQ':'Q','MET':'M','MEU':'G','MF3':'X','MFC':'GYG',
    'MG1':'G','MGG':'R','MGN':'Q','MGQ':'A','MGV':'G',
    'MGY':'G','MHL':'L','MHO':'M','MHS':'H','MIA':'A',
    'MIS':'S','MK8':'L','ML3':'K','MLE':'L','MLL':'L',
    'MLY':'K','MLZ':'K','MME':'M','MMT':'T','MND':'N',
    'MNL':'L','MNU':'U','MNV':'V','MOD':'X','MP8':'P',
    'MPH':'X','MPJ':'X','MPQ':'G','MRG':'G','MSA':'G',
    'MSE':'M','MSL':'M','MSO':'M','MSP':'X','MT2':'M',
    'MTR':'T','MTU':'A','MTY':'Y','MVA':'V','N  ':'N',
    'N10':'S','N2C':'X','N5I':'N','N5M':'C','N6G':'G',
    'N7P':'P','NA8':'A','NAL':'A','NAM':'A','NB8':'N',
    'NBQ':'Y','NC1':'S','NCB':'A','NCX':'N','NCY':'X',
    'NDF':'F','NDN':'U','NEM':'H','NEP':'H','NF2':'N',
    'NFA':'F','NHL':'E','NIT':'X','NIY':'Y','NLE':'L',
    'NLN':'L','NLO':'L','NLP':'L','NLQ':'Q','NMC':'G',
    'NMM':'R','NMS':'T','NMT':'T','NNH':'R','NP3':'N',
    'NPH':'C','NRP':'LYG','NRQ':'MYG','NSK':'X','NTY':'Y',
    'NVA':'V','NYC':'TWG','NYG':'NYG','NYM':'N','NYS':'C',
    'NZH':'H','O12':'X','O2C':'N','O2G':'G','OAD':'N',
    'OAS':'S','OBF':'X','OBS':'X','OCS':'C','OCY':'C',
    'ODP':'N','OHI':'H','OHS':'D','OIC':'X','OIP':'I',
    'OLE':'X','OLT':'T','OLZ':'S','OMC':'C','OMG':'G',
    'OMT':'M','OMU':'U','ONE':'U','ONL':'X','OPR':'R',
    'ORN':'A','ORQ':'R','OSE':'S','OTB':'X','OTH':'T',
    'OTY':'Y','OXX':'D','P  ':'G','P1L':'C','P1P':'N',
    'P2T':'T','P2U':'U','P2Y':'P','P5P':'A','PAQ':'Y',
    'PAS':'D','PAT':'W','PAU':'A','PBB':'C','PBF':'F',
    'PBT':'N','PCA':'E','PCC':'P','PCE':'X','PCS':'F',
    'PDL':'X','PDU':'U','PEC':'C','PF5':'F','PFF':'F',
    'PFX':'X','PG1':'S','PG7':'G','PG9':'G','PGL':'X',
    'PGN':'G','PGP':'G','PGY':'G','PHA':'F','PHD':'D',
    'PHE':'F','PHI':'F','PHL':'F','PHM':'F','PIV':'X',
    'PLE':'L','PM3':'F','PMT':'C','POM':'P','PPN':'F',
    'PPU':'A','PPW':'G','PQ1':'N','PR3':'C','PR5':'A',
    'PR9':'P','PRN':'A','PRO':'P','PRS':'P','PSA':'F',
    'PSH':'H','PST':'T','PSU':'U','PSW':'C','PTA':'X',
    'PTH':'Y','PTM':'Y','PTR':'Y','PU ':'A','PUY':'N',
    'PVH':'H','PVL':'X','PYA':'A','PYO':'U','PYX':'C',
    'PYY':'N','QLG':'QLG','QUO':'G','R  ':'A','R1A':'C',
    'R1B':'C','R1F':'C','R7A':'C','RC7':'HYG','RCY':'C',
    'RIA':'A','RMP':'A','RON':'X','RT ':'T','RTP':'N',
    'S1H':'S','S2C':'C','S2D':'A','S2M':'T','S2P':'A',
    'S4A':'A','S4C':'C','S4G':'G','S4U':'U','S6G':'G',
    'SAC':'S','SAH':'C','SAR':'G','SBL':'S','SC ':'C',
    'SCH':'C','SCS':'C','SCY':'C','SD2':'X','SDG':'G',
    'SDP':'S','SEB':'S','SEC':'A','SEG':'A','SEL':'S',
    'SEM':'X','SEN':'S','SEP':'S','SER':'S','SET':'S',
    'SGB':'S','SHC':'C','SHP':'G','SHR':'K','SIB':'C',
    'SIC':'DC','SLA':'P','SLR':'P','SLZ':'K','SMC':'C',
    'SME':'M','SMF':'F','SMP':'A','SMT':'T','SNC':'C',
    'SNN':'N','SOC':'C','SOS':'N','SOY':'S','SPT':'T',
    'SRA':'A','SSU':'U','STY':'Y','SUB':'X','SUI':'DG',
    'SUN':'S','SUR':'U','SVA':'S','SVX':'S','SVZ':'X',
    'SYS':'C','T  ':'T','T11':'F','T23':'T','T2S':'T',
    'T2T':'N','T31':'U','T32':'T','T36':'T','T37':'T',
    'T38':'T','T39':'T','T3P':'T','T41':'T','T48':'T',
    'T49':'T','T4S':'T','T5O':'U','T5S':'T','T66':'X',
    'T6A':'A','TA3':'T','TA4':'X','TAF':'T','TAL':'N',
    'TAV':'D','TBG':'V','TBM':'T','TC1':'C','TCP':'T',
    'TCQ':'X','TCR':'W','TCY':'A','TDD':'L','TDY':'T',
    'TFE':'T','TFO':'A','TFQ':'F','TFT':'T','TGP':'G',
    'TH6':'T','THC':'T','THO':'X','THR':'T','THX':'N',
    'THZ':'R','TIH':'A','TLB':'N','TLC':'T','TLN':'U',
    'TMB':'T','TMD':'T','TNB':'C','TNR':'S','TOX':'W',
    'TP1':'T','TPC':'C','TPG':'G','TPH':'X','TPL':'W',
    'TPO':'T','TPQ':'Y','TQQ':'W','TRF':'W','TRG':'K',
    'TRN':'W','TRO':'W','TRP':'W','TRQ':'W','TRW':'W',
    'TRX':'W','TS ':'N','TST':'X','TT ':'N','TTD':'T',
    'TTI':'U','TTM':'T','TTQ':'W','TTS':'Y','TY2':'Y',
    'TY3':'Y','TYB':'Y','TYI':'Y','TYN':'Y','TYO':'Y',
    'TYQ':'Y','TYR':'Y','TYS':'Y','TYT':'Y','TYU':'N',
    'TYX':'X','TYY':'Y','TZB':'X','TZO':'X','U  ':'U',
    'U25':'U','U2L':'U','U2N':'U','U2P':'U','U31':'U',
    'U33':'U','U34':'U','U36':'U','U37':'U','U8U':'U',
    'UAR':'U','UCL':'U','UD5':'U','UDP':'N','UFP':'N',
    'UFR':'U','UFT':'U','UMA':'A','UMP':'U','UMS':'U',
    'UN1':'X','UN2':'X','UNK':'X','UR3':'U','URD':'U',
    'US1':'U','US2':'U','US3':'T','US5':'U','USM':'U',
    'V1A':'C','VAD':'V','VAF':'V','VAL':'V','VB1':'K',
    'VDL':'X','VLL':'X','VLM':'X','VMS':'X','VOL':'X',
    'X  ':'G','X2W':'E','X4A':'N','X9Q':'AFG','XAD':'A',
    'XAE':'N','XAL':'A','XAR':'N','XCL':'C','XCP':'X',
    'XCR':'C','XCS':'N','XCT':'C','XCY':'C','XGA':'N',
    'XGL':'G','XGR':'G','XGU':'G','XTH':'T','XTL':'T',
    'XTR':'T','XTS':'G','XTY':'N','XUA':'A','XUG':'G',
    'XX1':'K','XXY':'THG','XYG':'DYG','Y  ':'A','YCM':'C',
    'YG ':'G','YOF':'Y','YRR':'N','YYG':'G','Z  ':'C',
    'ZAD':'A','ZAL':'A','ZBC':'C','ZCY':'C','ZDU':'U',
    'ZFB':'X','ZGU':'G','ZHP':'N','ZTH':'T','ZZJ':'A' }
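
A short sketch of how such a table might be used, assuming a hypothetical helper residues_to_sequence() and a five-entry excerpt of the dictionary above. Codes missing from the table fall back to '?', and a few entries (such as the chromophore CRO) expand to more than one letter:

```python
# Small excerpt of the table above, for illustration only.
to_one_letter_code_excerpt = {
    "ALA": "A", "GLY": "G", "MSE": "M", "M3L": "K", "CRO": "GYG",
}


def residues_to_sequence(resnames, table=to_one_letter_code_excerpt, unknown="?"):
    """Translate three-letter residue names into a sequence string.

    Names that are absent from the table are rendered as `unknown`.
    """
    return "".join(table.get(name, unknown) for name in resnames)


sequence = residues_to_sequence(["ALA", "M3L", "MSE", "GLY", "ZZZ"])
```

With the excerpt above, M3L (the example from PDB entry 2X4W) is rendered as K and the unlisted code ZZZ as '?'.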




From redmine at redmine.open-bio.org  Fri Apr 27 03:59:13 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 27 Apr 2012 03:59:13 +0000
Subject: [Biopython-dev] [Biopython - Bug #3346] (New) patch for legacy
	parser to support BLASTX 2.2.25+
Message-ID: <redmine.issue-3346.20120427035913@redmine.open-bio.org>


Issue #3346 has been reported by John Comeau.

----------------------------------------
Bug #3346: patch for legacy parser to support BLASTX 2.2.25+
https://redmine.open-bio.org/issues/3346

Author: John Comeau
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


It may also work with 2.2.26+; I have not tested it. The patched parser passes the regression tests as per Peter Cock's instructions.


----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin




From andrew.sczesnak at med.nyu.edu  Fri Apr 27 19:57:19 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Fri, 27 Apr 2012 15:57:19 -0400
Subject: [Biopython-dev] BGZF support,
	was Re: Biopython 1.60 plans and beyond
In-Reply-To: <CAKVJ-_4k==uN0UYa17-xPV6OMjE-Wm5Yuohf=bzGKB5vwXmKVQ@mail.gmail.com>
References: <CAKVJ-_6xDOnV4YiGuYKo8xFi=1WeL0oX+RqRD5QKFw14VKKYbQ@mail.gmail.com>	<4F91E4CF.8040602@med.nyu.edu>
	<CAKVJ-_4k==uN0UYa17-xPV6OMjE-Wm5Yuohf=bzGKB5vwXmKVQ@mail.gmail.com>
Message-ID: <4F9AFA1F.6030103@med.nyu.edu>

Peter,

> It should be easy enough to follow the BGZF changes to Bio/SeqIO/_index.py
> and I'm willing to do this myself for MAF (while going over your index work -
> something I want to do anyway). The only potential catch is avoiding offset
> arithmetic.

I have no problem with you doing this if you're willing. It would be 
great to have some code review of MafIndex as well.


Best,
Andrew


From MatatTHC at gmx.de  Sat Apr 28 07:15:35 2012
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Sat, 28 Apr 2012 09:15:35 +0200
Subject: [Biopython-dev] SeqIO circular
In-Reply-To: <CALNFT0hrc+T-0xWesCuK0E5X8=mcDCqXoRRJJ4ms2qAibWXhTg@mail.gmail.com>
References: <CALNFT0jq=VTwSDv-4x7ZrHoQRLajCUHY8NGPMw9cDuGnwwNiuw@mail.gmail.com>
	<CAKVJ-_7MpLRCModFfMdRPcVDjk42nVCJ--OwNBnAJv3wNcns_A@mail.gmail.com>
	<CALNFT0jTxFSbqn+f3hS-KZ2Z09xsgoKPFSow1BO3PdDGrJ7hag@mail.gmail.com>
	<CALNFT0hrc+T-0xWesCuK0E5X8=mcDCqXoRRJJ4ms2qAibWXhTg@mail.gmail.com>
Message-ID: <CALNFT0h1udmBBF+TZrXhv22q5SBNJE5RBmtr+bMfmAsQMabX2g@mail.gmail.com>

Dear developers,

I would like to suggest a quick "fix" for the problem. Currently the
parser just returns true by default for the circular property. This
is a wrong piece of information for all circular sequences.
Furthermore it's not possible to detect if the parser returned true
because it is its default value or if it's really from the data. So I
suggest returning None if the parser does not parse the information.

What do you think? This should be possible with minimal effort.

The user could then implement a workaround on their own (like using the
old parser as fallback, or just searching the first line of t)
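
A minimal sketch of the three-valued idea, assuming the topology is read
from the GenBank LOCUS line. The helper name parse_topology is
hypothetical and is not an existing Biopython function:

```python
def parse_topology(locus_line):
    """Return True (circular), False (linear), or None (not stated).

    Hypothetical helper: looks for the topology token on a GenBank
    LOCUS line instead of silently assuming a default.
    """
    tokens = locus_line.lower().split()
    if "circular" in tokens:
        return True
    if "linear" in tokens:
        return False
    return None  # the file does not say; let the caller decide
```

With None as the "unknown" value, a caller can tell a parsed answer apart
from a missing one and fall back to another method only in the latter case.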

Regards,
Matthias

2012/4/22 Matthias Bernt <MatatTHC at gmx.de>:
> Hi,
>
> since this bug seems to be of low priority I decided to try my best to
> help a bit and search the web a bit.
> It seems that the property is stored in PrimarySeq or Seq in bioperl.
> See for instance:
>
> http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/Seq.pm
> http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/PrimarySeq.pm
>
> Or also:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2578
>
> This seems to be realised as boolean variable or function.
>
> Regards,
> Matthias
>
> 2012/4/4 Matthias Bernt <MatatTHC at gmx.de>:
>> Hi,
>>
>> are there any news on this? May I help somehow? But I have to admit
>> that I barely speak perl and have no experience with bioperl. If
>> someone tells me where to look I might still try it.
>>
>> Matthias
>>
>> 2012/3/29 Peter Cock <p.j.a.cock at googlemail.com>:
>>> On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt <MatatTHC at gmx.de> wrote:
>>>> Hi,
>>>>
>>>> Is it possible to get the property if a genome is circular / linear
>>>> from SeqIO applied to genbank files? I could not find it.
>>>>
>>>> There is also a related bugreport:
>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2578
>>>>
>>>> I used the old parser before and switched to SeqIO which I really like
>>>> for the possibilities to parse different formats... but I really need
>>>> the information.
>>>
>>> Does anyone happen to have a BioPerl + BioSQL setup installed
>>> and working? IIRC checking that to make sure however we
>>> store the circular was compatible was the only real hurdle.
>>>
>>> Peter


From w.arindrarto at gmail.com  Sat Apr 28 12:08:35 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sat, 28 Apr 2012 14:08:35 +0200
Subject: [Biopython-dev] Google Summer of Code Project: SearchIO in Biopython
Message-ID: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>

Hello everyone,

This is Wibowo Arindrarto (or Bow, for short), one of the Google Summer of
Code students who will work on Biopython over this summer.

I will be working with Peter to add support for parsing search outputs from
programs like BLAST and HMMER to Biopython, so that it's easier to extract
information from their outputs. Having used some of these programs quite a
lot myself, I'm really looking forward to implementing the feature.
However, I do understand that it won't be just me who will use the module,
but also many other Biopython users. So for everyone who is interested in
giving a say, input, or critiques along the way, feel free to do so :).

The official coding period starts in about a month from now. Until then, I
will be doing all the preparatory work required so that coding will proceed
as smoothly as possible. These will include preparing the test cases and
preparing the SearchIO attribute / object naming convention as well as
discussing anything related to its proposed implementation.

Finally, here are some links related to the project that might interest you.

1. My main biopython branch for development:
https://github.com/bow/biopython/tree/searchio. Since I will be building on
top of Peter's SearchIO branch (
https://github.com/peterjc/biopython/tree/search-io-test), right now it
only contains Peter's branch rebased against the latest master.

2. My GSoC proposal, which outlines my plans and timeline for the project:
http://bit.ly/searchio-proposal

3. The proposed SearchIO naming convention (not 100% complete as of now,
but will be filled along the way): http://bit.ly/searchio-terms. One of the
main goals of the project is to implement a common interface for BLAST et
al, which requires SearchIO to have common attribute names that refer to
different search output attributes. The link contains my proposed naming
convention, which is still very open to change and discussion. Feel free to
comment on the document and add your own ideas.

4. My blog, in which I will write weekly posts about the project's
progress: http://bow.web.id/blog

5. An extra repo for all other auxiliary files and scripts that don't go
into Biopython's code: https://github.com/bow/gsoc.

That's it for now. Thanks for taking time to read it :). I'm looking
forward to a productive summer with Biopython.

Have a nice weekend,
Bow


From p.j.a.cock at googlemail.com  Sun Apr 29 11:00:42 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 29 Apr 2012 12:00:42 +0100
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project:
	SearchIO in Biopython
In-Reply-To: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>
References: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>
Message-ID: <CAKVJ-_6GmqyyEiqbhecAvEvRjPmvZbu_8Qrp1Tbe3KCcBfXsRQ@mail.gmail.com>

Hi Bow,

Thanks for updating the list. I'm replying just on the dev list
as I'm focusing on implementation discussion in this reply.

On Sat, Apr 28, 2012 at 1:08 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> 1. My main biopython branch for development:
> https://github.com/bow/biopython/tree/searchio. Since I will be building on
> top of Peter's SearchIO branch (
> https://github.com/peterjc/biopython/tree/search-io-test), right now it
> only contains Peter's branch rebased against the latest master.

Just to be clear - you don't have to start from that branch ;)
http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html

As I said before, that may not be the best approach. The idea
behind that code was to focus on the HSPs (in BLAST terms),
and for the low level parsers to iterate over each HSP. Higher
level wrappers can then batch these up by query/subject, or
into the larger grouping of all the results for one query -
which was the exposed high level Bio.SearchIO.parse
function.
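
As a rough sketch of that layering (not the code on the branch): assume
the low level parser yields (query_id, subject_id, hsp) tuples in file
order, and a wrapper batches them. The function name batch_hsps and the
tuple layout are assumptions made for illustration only:

```python
from itertools import groupby
from operator import itemgetter


def batch_hsps(hsps):
    """Group a stream of (query_id, subject_id, hsp) tuples by query.

    Assumes the stream is already ordered by query and then by
    subject, as it would be when reading a search output top to bottom.
    """
    for query_id, by_query in groupby(hsps, key=itemgetter(0)):
        matches = []
        for subject_id, by_subject in groupby(by_query, key=itemgetter(1)):
            # Keep only the HSP payload for this query+subject pair.
            matches.append((subject_id, [h[2] for h in by_subject]))
        yield query_id, matches
```

Each yielded item then holds all matches for one query, while the
underlying stream is still read one HSP at a time.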

That branch introduced a SearchResult object which was
essentially something like a list or dict (like an OrderedDict
in some ways), with some (unnecessary?) error checking for
consistent contents (all from the same query). It also introduced
a TopMatches object which was essentially a list (again,
with some error checking for consistent contents).

The advantage of using simple objects (OrderedDict
and list) is simplicity and hopefully performance. But
specific classes have the advantage of allowing more
user friendly str/repr etc.
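
To make the trade-off concrete, here is a small sketch of a specific
class built on OrderedDict. The name QueryResults and its add method are
hypothetical, chosen so as not to suggest this is the SearchResult class
from the branch:

```python
from collections import OrderedDict


class QueryResults(OrderedDict):
    """Hypothetical container: match ID -> list of HSPs for one query."""

    def __init__(self, query_id):
        super().__init__()
        self.query_id = query_id

    def add(self, query_id, match_id, hsp):
        # The consistency check: everything must come from one query.
        if query_id != self.query_id:
            raise ValueError(
                "expected query %r, got %r" % (self.query_id, query_id)
            )
        self.setdefault(match_id, []).append(hsp)

    def __repr__(self):
        # A friendlier summary than the default OrderedDict repr.
        return "<QueryResults query=%r matches=%d>" % (self.query_id, len(self))
```

The check and the custom repr are what the plain OrderedDict would not
give; the price is an extra method call per HSP.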

The idea on this branch of focusing on iteration over the
HSPs at the low level was it allowed a lot of flexibility, and
the low level parser could be used in conjunction with
indexing to seek to a particular HSP and parse it, or go to
the results for a particular query+match and parse its
HSPs  (not implemented on my old branch, but that was
the plan).

However, while this makes perfect sense for say the BLAST
tabular output, it isn't quite such a good match for all the
possible datatypes.

For instance, BLAST plain text/html includes an e-value for
a query/subject combination which is calculated from all the
HSPs for that query/subject (taking into account order etc -
I'd have to check the O'Reilly BLAST book for the details).
This isn't in the tabular output, but the point is that it isn't a
property of the individual HSPs, but of the match (group of
HSPs).

I think we need to consider the other main formats, and if
all their important information lies at the HSP level or not.
Perhaps iteration at the query+match level (groups of
HSPs) would be best overall?

Bow - If some of that doesn't make sense, I can try to clarify
by email on the list, and/or we can talk about it at our next
video chat. Also see if you can get the BLAST book from
your library - it will probably be quite useful in this project
even though it describes the 'legacy' BLAST suite:

"BLAST" by Ian Korf, Mark Yandell, Joseph Bedell
Publisher: O'Reilly Media, Released: July 2003

Regards,

Peter


From w.arindrarto at gmail.com  Sun Apr 29 16:42:14 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sun, 29 Apr 2012 18:42:14 +0200
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project:
	SearchIO in Biopython
In-Reply-To: <CAKVJ-_6GmqyyEiqbhecAvEvRjPmvZbu_8Qrp1Tbe3KCcBfXsRQ@mail.gmail.com>
References: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>
	<CAKVJ-_6GmqyyEiqbhecAvEvRjPmvZbu_8Qrp1Tbe3KCcBfXsRQ@mail.gmail.com>
Message-ID: <CADEGkF7-HBRpcwTXzpFMuJk5UGJ3dv=Vnxrw1DUouuswoF2h-Q@mail.gmail.com>

On Sun, Apr 29, 2012 at 13:00, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> Hi Bow,
>
> Thanks for updating the list. I'm replying just on the dev list
> as I'm focusing on implementation discussion in this reply.
>
> On Sat, Apr 28, 2012 at 1:08 PM, Wibowo Arindrarto
> <w.arindrarto at gmail.com> wrote:
> > 1. My main biopython branch for development:
> > https://github.com/bow/biopython/tree/searchio. Since I will be building
> > on
> > top of Peter's SearchIO branch (
> > https://github.com/peterjc/biopython/tree/search-io-test), right now it
> > only contains Peter's branch rebased against the latest master.
>
> Just to be clear - you don't have to start from that branch ;)
> http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html

Ok :). I wasn't so sure about how much code from your previous branch
that I will end up using, so I decided to rebase everything and then
see later how much of it can be used. But it's also easier to start clean :).

> As I said before, that may not be the best approach. The idea
> behind that code was to focus on the HSPs (in BLAST terms),
> and for the low level parsers to iterate over each HSP. Higher
> level wrappers can then batch these up by query/subject, or
> into the larger grouping of all the results for one query -
> which was the exposed high level Bio.SearchIO.parse
> function.
>
> That branch introduced a SearchResult object which was
> essentially something like a list or dict (like an OrderedDict
> in some ways), with some (unnecessary?) error checking for
> consistent contents (all from the same query). It also introduced
> a TopMatches object which was essentially list list (again,
> with some error checking for consistent contents).
>
> The advantage of using simple objects (OrderedDict
> and list) is simplicity and hopefully performance. But
> specific classes have the advantage of allowing more
> user friendly str/repr etc.
>
> The idea on this branch of focusing on iteration over the
> HSPs at the low level was it allowed a lot of flexibility, and
> the low level parser could be used in conjunction with
> indexing to see to a particular HSP and parse it, or goto
> the results for a particular query+match and parse its
> HSPs (not implemented on my old branch, but that was
> the plan).
>
> However, while this makes perfect sense for say the BLAST
> tabular output, it isn't quite such a good match for all the
> possible datatypes.
>
> For instance, BLAST plain text/html includes an e-value for
> a query/subject combination which is calculated from all the
> HSPs for that query/subject (taking into account order etc -
> I'd have to check the O'Reilly BLAST book for the details).
> This isn't in the tabular output, but the point is that it isn't a
> property of the individual HSPs, but of the match (group of
> HSPs).
>
> I think we need to consider the other main formats, and if
> all their important information lies at the HSP level or not.
> Perhaps iteration at the query+match level (groups of
> HSPs) would be best overall?
>
> Bow - If some of that doesn't make sense, I can try to clarify
> by email on the list, and/or we can talk about it at our next
> video chat. Also see if you can get the BLAST book from
> your library - it will probably be quite useful in this project
> even though it describes the 'legacy' BLAST suite:
>
> "BLAST" by Ian Korf, Mark Yandell, Joseph Bedell
> Publisher: O'Reilly Media, Released: July 2003
>
> Regards,
>
> Peter

I think I got the gist of it (please correct me if I'm wrong). Some
information about the search, such as the sequence-wide e-value, may
not be present in the HSP level. Ignoring them could let us focus on a
perhaps simpler and more flexible implementation with better
performance, but at the cost of usefulness of the data itself since we
are throwing away information.

What I have in mind now is actually closer to iteration on the
query+subject level. To be clear first, the hierarchy of the objects
that I propose is this:

* Search object, to represent the entire search session.
* Result object, to represent a search with one query against the
database. Depending on the number of queries, we could have one to
several Result objects contained in a Search.
* Hit object, to represent a sequence hit. Depending on the search, we
could also have multiple Hits in one Result object.
* and finally, HSP object, to represent individual alignments.
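
The hierarchy above could be sketched roughly as follows. All attribute
names here (query_start, evalue, hit_id, program and so on) are
placeholders for illustration, not the proposed naming convention:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HSP:
    query_start: int
    query_end: int
    evalue: float


@dataclass
class Hit:
    hit_id: str
    # Hit-wide value, for formats that report one above the HSP level.
    evalue: Optional[float] = None
    hsps: List[HSP] = field(default_factory=list)


@dataclass
class Result:
    query_id: str
    hits: List[Hit] = field(default_factory=list)

    def __len__(self):
        # Number of hits found for this query.
        return len(self.hits)


@dataclass
class Search:
    program: str
    results: List[Result] = field(default_factory=list)
```

The optional evalue on Hit is where information that belongs to a group
of HSPs, rather than to any single HSP, would live.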

Iteration is done on the Results level, so the information is parsed
on the search query level, not just for a single HSP (I wrote a very
short description about what I'm planning the objects to be in here as
well: http://bit.ly/searchio-terms). I suppose if we aim for maximum
information parsing over performance and simplicity of the
format-specific parsers, this is the way to go. There are other
formats, too, that contain sequence-level search information not
present in the alignment (e.g. HMMER text output). What do you think
about this?

Thanks for the BLAST book suggestion. I'll see if I can find it in my
library in the meantime.

regards,
Bow


From p.j.a.cock at googlemail.com  Mon Apr 30 09:49:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 30 Apr 2012 10:49:27 +0100
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project:
	SearchIO in Biopython
In-Reply-To: <CADEGkF7-HBRpcwTXzpFMuJk5UGJ3dv=Vnxrw1DUouuswoF2h-Q@mail.gmail.com>
References: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>
	<CAKVJ-_6GmqyyEiqbhecAvEvRjPmvZbu_8Qrp1Tbe3KCcBfXsRQ@mail.gmail.com>
	<CADEGkF7-HBRpcwTXzpFMuJk5UGJ3dv=Vnxrw1DUouuswoF2h-Q@mail.gmail.com>
Message-ID: <CAKVJ-_4G8SedQn8jBB03OROcs-F6hj1T9=01V+NTZfPOVRgyrQ@mail.gmail.com>

On Sun, Apr 29, 2012 at 5:42 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
>
> I think I got the gist of it (please correct me if I'm wrong). Some
> information about the search, such as the sequence-wide e-value, may
> not be present in the HSP level. Ignoring them could let us focus on a
> perhaps simpler and more flexible implementation with better
> performance, but at the cost of usefulness of the data itself since we
> are throwing away information.

Yes.

> What I have in mind now is actually closer to iteration on the
> query+subject level. To be clear first, the hierarchy of the objects
> that I propose is this:
>
> * Search object, to represent the entire search session.
> * Result object, to represent a search with one query against the
> database. Depending on the number of queries, we could have one to
> several Result objects contained in a Search.
> * Hit object, to represent a sequence hit. Depending on the search, we
> could also have multiple Hits in one Result object.
> * and finally, HSP object, to represent individual alignments.
>
> Iteration is done on the Results level, so the information is parsed
> on the search query level, not just a single HSPs (I wrote a very
> short description about what I'm planning the objects to be in here as
> well: http://bit.ly/searchio-terms). I suppose if we aim for maximum
> information parsing over performance and simplicity of the
> format-specific parsers, this is the way to go. There are other
> formats, too, that contains sequence-level search information not
> present in the alignment (e.g. HMMER text output). What do you think
> about this?

That sounds good.

If iteration is done on the Results level, when/how would your
Search object be used?

Peter


From w.arindrarto at gmail.com  Mon Apr 30 10:08:52 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Mon, 30 Apr 2012 12:08:52 +0200
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project:
	SearchIO in Biopython
In-Reply-To: <CAKVJ-_4G8SedQn8jBB03OROcs-F6hj1T9=01V+NTZfPOVRgyrQ@mail.gmail.com>
References: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>
	<CAKVJ-_6GmqyyEiqbhecAvEvRjPmvZbu_8Qrp1Tbe3KCcBfXsRQ@mail.gmail.com>
	<CADEGkF7-HBRpcwTXzpFMuJk5UGJ3dv=Vnxrw1DUouuswoF2h-Q@mail.gmail.com>
	<CAKVJ-_4G8SedQn8jBB03OROcs-F6hj1T9=01V+NTZfPOVRgyrQ@mail.gmail.com>
Message-ID: <CADEGkF5C-dq7jd+JmcFzsY-X2Poiqs8vDZNt3zrUS5kaewA0tw@mail.gmail.com>

>> What I have in mind now is actually closer to iteration on the
>> query+subject level. To be clear first, the hierarchy of the objects
>> that I propose is this:
>>
>> * Search object, to represent the entire search session.
>> * Result object, to represent a search with one query against the
>> database. Depending on the number of queries, we could have one to
>> several Result objects contained in a Search.
>> * Hit object, to represent a sequence hit. Depending on the search, we
>> could also have multiple Hits in one Result object.
>> * and finally, HSP object, to represent individual alignments.
>>
>> Iteration is done on the Results level, so the information is parsed
>> on the search query level, not just a single HSPs (I wrote a very
>> short description about what I'm planning the objects to be in here as
>> well: http://bit.ly/searchio-terms). I suppose if we aim for maximum
>> information parsing over performance and simplicity of the
>> format-specific parsers, this is the way to go. There are other
>> formats, too, that contains sequence-level search information not
>> present in the alignment (e.g. HMMER text output). What do you think
>> about this?
>
> That sounds good .
>
> If iteration is done on the Results level, when/how would your
> Search object be used?
>
> Peter

I'm thinking of using the Search object as the object returned by
SearchIO.parse or SearchIO.read. That way, we can store attributes
common to the different search queries in it. For example:

>>> search = SearchIO.parse('blast_result.xml', 'blast-xml')
>>> search.format
'blast-xml'
>>> search.algorithm
'blastx'
>>> search.version
'2.2.26+'
>>> search.database
'refseq_protein'
>>> search.results
<generator object results at ....>

And iteration over the results would be done like this (for example):
>>> for result in search.results:
...     print result.query, len(result)

Additionally, we can also define __iter__ and next for Search so we can
just do the following:
>>> for result in search:
...     print result.query, len(result)
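
A bare-bones sketch of such a wrapper, assuming the results come from
some generator supplied by a format-specific parser. The constructor
signature is made up for illustration:

```python
class Search(object):
    """Hypothetical wrapper: shared metadata plus a stream of results."""

    def __init__(self, fmt, algorithm, results):
        self.format = fmt
        self.algorithm = algorithm
        # Stored as an iterator, so it is consumed once, like a file handle.
        self._results = iter(results)

    def __iter__(self):
        return self._results
```

Note that in this sketch a second loop over the same Search gives
nothing, because the underlying iterator is already exhausted.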

What do you think?


Bow


From p.j.a.cock at googlemail.com  Mon Apr 30 10:57:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 30 Apr 2012 11:57:27 +0100
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project:
	SearchIO in Biopython
In-Reply-To: <CADEGkF5C-dq7jd+JmcFzsY-X2Poiqs8vDZNt3zrUS5kaewA0tw@mail.gmail.com>
References: <CADEGkF5rcZ+w32zc8Uz0fiTUVUDprY-ot5j=wtYZO0tsyvsu8Q@mail.gmail.com>
	<CAKVJ-_6GmqyyEiqbhecAvEvRjPmvZbu_8Qrp1Tbe3KCcBfXsRQ@mail.gmail.com>
	<CADEGkF7-HBRpcwTXzpFMuJk5UGJ3dv=Vnxrw1DUouuswoF2h-Q@mail.gmail.com>
	<CAKVJ-_4G8SedQn8jBB03OROcs-F6hj1T9=01V+NTZfPOVRgyrQ@mail.gmail.com>
	<CADEGkF5C-dq7jd+JmcFzsY-X2Poiqs8vDZNt3zrUS5kaewA0tw@mail.gmail.com>
Message-ID: <CAKVJ-_7R8DK7KtCKOEGLr_wMR5ci2ZiAXHZnXWvZuN3-=whv9w@mail.gmail.com>

On Mon, Apr 30, 2012 at 11:08 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
>
> I'm thinking of using the Search object as the object returned by
> SearchIO.parse or SearchIO.read. That way, we can store attributes
> common to the different search queries in it. For example:
>
>>>> search = SearchIO.parse('blast_result.xml', 'blast-xml')
>>>> search.format
> 'blast-xml'
>>>> search.algorithm
> 'blastx'
>>>> search.version
> '2.2.26+'
>>>> search.database
> 'refseq_protein'
>>>> search.results
> <generator object results at ....>
>
> And iteration over the results would be done like this (for example):
>>>> for result in search.results:
> ... print result.query, print len(result)
>
> Additionaly, we can also define __iter__ and next for Search so we can
> just do the following:
>>>> for result in search:
> ... print result.query, print len(result)
>
> What do you think?

I think you'll get in a mess with multiple iterators all sharing the
same handle and competing over using it - but maybe I'm not
grasping what you have in mind.
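
A toy illustration of the concern, using an in-memory handle and a
stand-in parser that treats each line as one "result" (both are
assumptions for the example, nothing here is SearchIO code):

```python
import io

handle = io.StringIO("q1\nq2\nq3\nq4\n")


def results(handle):
    # Stand-in parser: one "result" per line of the shared handle.
    for line in handle:
        yield line.strip()


first = results(handle)
second = results(handle)

# Interleave the two iterators; both advance the same file position.
seen_by_first = [next(first), next(first)]
seen_by_second = [next(second)]
seen_by_first.append(next(first))
```

Here first ends up with q1, q2 and q4 while second gets q3, so neither
iterator sees the complete file.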

Initially keep it simple: The primary public API would be

for result in Bio.SearchIO.parse(...):
     print result.query, len(result)

where each iteration gives a complete result set for one query.
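
For example, a single generator over a simplified input could look like
this. The three-column tab-separated layout (query, subject, e-value) is
an assumption for the sketch and not the real BLAST tabular format:

```python
import io


def parse(handle):
    """Yield (query_id, rows) once per query from tab-separated lines.

    Simplified stand-in: each line is "query<TAB>subject<TAB>evalue",
    with all lines for one query adjacent.
    """
    current, rows = None, []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        query_id, subject_id, evalue = line.rstrip("\n").split("\t")
        if current is not None and query_id != current:
            # A new query starts: hand back the finished result set.
            yield current, rows
            rows = []
        current = query_id
        rows.append((subject_id, float(evalue)))
    if current is not None:
        yield current, rows
```

Only one iterator ever touches the handle, and each step gives the
complete set of rows for one query.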

Peter

P.S. With SearchIO subject to name space discussions ;)