From p.j.a.cock at googlemail.com Thu Mar 1 07:02:58 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 1 Mar 2012 12:02:58 +0000
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To: <4F4BAE23.7070402@gmail.com>
References: <4F4BAE23.7070402@gmail.com>
Message-ID:

On Mon, Feb 27, 2012 at 4:24 PM, Robert Buels wrote:
> Hi all,
>
> As kindly pointed out by Reece Hart, the previous email I sent out calling
> for Google Summer of Code project ideas, had the wrong due date for project
> ideas in it.
>
> I actually want them to all be in place by Friday, March 2, which is this
> coming Friday.
>

See http://lists.open-bio.org/pipermail/biopython/2012-February/007726.html
for the original complete email.

That deadline is upon us (tomorrow), so where are we with GSoC 2012 ideas?
http://biopython.org/wiki/Google_Summer_of_Code

Are any of the areas touched on in the "Biopython 1.60 plans and beyond"
thread suitable?

Python 3?
---------

In terms of 'software engineering' we might be able to put together
something for Python 3 support (there are still some C extensions to do),
but I'm not sure if there is enough work there.

SearchIO?
---------

I'm wondering if a Biopython SearchIO would make a good project,
that I might supervise. This name is obviously based on BioPerl. I
would be aiming for iterator based parser/writer framework (like SeqIO
and AlignIO) for pairwise 'sequence' searches initially, but have also
been thinking about indexing - at least by query, ideally also by match,
to allow random access akin to what Bio.SeqIO.index offers.

In some cases the results would also be pairwise sequence alignments,
in which case some code can be shared/linked with AlignIO. In other
cases all you get is co-ordinates of the query and match plus some
kind of score. Therefore this could include a hierarchical SearchIO
result object structure for minimal matches up to full pairwise alignments.
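As a rough illustration of the iterator-based idea, here is a minimal sketch, assuming BLAST's standard 12-column tabular output with the rows for each query grouped together. The names Match, QueryResult and parse_blast_tabular are hypothetical, chosen only for this example; they are not an existing Biopython API, and a real design would also need the full-alignment level of the hierarchy.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import Iterable, Iterator, List


@dataclass
class Match:
    """Minimal match (hypothetical): coordinates on query and target plus scores."""
    target_id: str
    query_start: int
    query_end: int
    target_start: int
    target_end: int
    evalue: float
    bitscore: float


@dataclass
class QueryResult:
    """All matches reported for one query (hypothetical container)."""
    query_id: str
    matches: List[Match] = field(default_factory=list)


def parse_blast_tabular(lines: Iterable[str]) -> Iterator[QueryResult]:
    """Yield one QueryResult per query from 12-column BLAST tabular lines.

    Assumes the standard column order: query id, subject id, % identity,
    alignment length, mismatches, gap opens, query start, query end,
    subject start, subject end, e-value, bit score.
    """
    rows = (
        line.rstrip("\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("#")  # skip blanks and comments
    )
    # Lazily group consecutive rows that share the same query id.
    for query_id, group in groupby(rows, key=lambda cols: cols[0]):
        result = QueryResult(query_id)
        for cols in group:
            result.matches.append(
                Match(
                    target_id=cols[1],
                    query_start=int(cols[6]),
                    query_end=int(cols[7]),
                    target_start=int(cols[8]),
                    target_end=int(cols[9]),
                    evalue=float(cols[10]),
                    bitscore=float(cols[11]),
                )
            )
        yield result


# Made-up example rows, for illustration only.
example_lines = [
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-100\t360\n",
    "q1\ts2\t90.0\t150\t15\t0\t10\t159\t5\t154\t1e-50\t200\n",
    "q2\ts1\t100.0\t50\t0\t0\t1\t50\t300\t349\t2e-20\t95.1\n",
]
for result in parse_blast_tabular(example_lines):
    print(result.query_id, [m.target_id for m in result.matches])
```

Because the function is a generator, only one query's matches are held in memory at a time, which is the same trade-off SeqIO and AlignIO make. Indexing by query would need a separate pass that records file offsets.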

I'd hope to cover BLAST XML, BLAST tabular, HMMER tabular (not
really sequence vs sequence, but HMM vs sequence), RPS-BLAST
(again not really sequence vs sequence). Perhaps this could also tie
into the Bio.Motif code as well (if we consider things like PSSM vs
sequence in the same framework).

You can already do some of this in Biopython (e.g. BLAST XML
parsing, and there is some HMMER work on branches), but I'm
hoping for a unified API here.

Peter

From daniel at treparel.com Thu Mar 1 07:21:42 2012
From: daniel at treparel.com (Daniël van Adrichem)
Date: Thu, 1 Mar 2012 13:21:42 +0100
Subject: [Biopython-dev] Two issues on Bio.Entrez (DTD download fallback, recent change in id lists)
Message-ID:

Hello list,

Firstly I want to report a bug plus suggested fix.

Today I noticed a bug which got triggered by missing local DTDs. I was
still using 1.58 which does not have the new DTDs.

Missing the DTDs locally should be handled by downloading them. This
worked for the first DTD, but then on the second one (which is a
dependency of the first one) I got an HTTP 404.

After investigating I found that the module was making a request for
"http://www.ncbi.nlm.nih.gov/corehtml/query/DTD\nlmmedlinecitationset_120101.dtd"
Note the backslash right after DTD. It gets turned into a %5C and
causes the 404.

The cause of this is usage of os.path.join to concatenate the URL. I
am running this on Windows; on a platform where the file system uses a
forward slash this would work just fine.

Please find attached a patch to fix this issue.

Secondly I want to comment on the recent change in Bio.Entrez.efetch
(commit 01b091cd4679b58d7e478734324528dd9d52f3ed). While this change
did fix the problem, I think this might be achieved in a cleaner way.

Please see the code that is used to format the options on the URL (in
Bio.Entrez._open):

    options = urllib.urlencode(params, doseq=True)

the doseq argument specifically.
Its documentation states:
"If any values in the query arg are sequences and doseq is true, each
sequence element is converted to a separate parameter."

So this was the reason for the "id=1&id=2&id=3" formatting. Without
doseq set this would turn into: "id=1,2,3"

If this doseq functionality is not needed for other params (I am
unsure of this), I suggest reverting the change in efetch() and using
doseq=False (which is the default argument).

Thanks!

--
Daniël van Adrichem
Treparel Information Solutions b.v.
Delftechpark 26
2628XH Delft
The Netherlands
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Parser.py.diff
Type: application/octet-stream
Size: 592 bytes
Desc: not available
URL:

From p.j.a.cock at googlemail.com Thu Mar 1 08:34:39 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 1 Mar 2012 13:34:39 +0000
Subject: [Biopython-dev] Two issues on Bio.Entrez (DTD download fallback, recent change in id lists)
In-Reply-To:
References:
Message-ID:

2012/3/1 Daniël van Adrichem :
> Hello list,
>
> Firstly I want to report a bug plus suggested fix.
>
> Today I noticed a bug which got triggered by missing local DTDs. I was
> still using 1.58 which does not have the new DTDs.
>
> Missing the DTDs locally should be handled by downloading them. This
> worked for the first DTD, but then on the second one (which is a
> dependency of the first one) I got an HTTP 404.
>
> After investigating I found that the module was making a request for
> "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD\nlmmedlinecitationset_120101.dtd"
> Note the backslash right after DTD. It gets turned into a %5C and
> causes the 404.

That DTD should be in Biopython 1.59 - and hopefully the other
DTD you mentioned but did not name. Please let us know if there
are any more we've missed.

https://github.com/biopython/biopython/commit/5f08ccdfe0706f9073bce441609aa86b1ea9d0f4

> The cause of this is usage of os.path.join to concatenate the URL.
> I am running this on Windows; on a platform where the file system uses a
> forward slash this would work just fine.
>
> Please find attached a patch to fix this issue.

That makes perfect sense, although as written your patch could
result in too many slashes being used - thus:

https://github.com/biopython/biopython/commit/c93b32bab5526a830e2cb14f0db782ee1b687715

Would you like to be thanked in the NEWS file and listed as a contributor
(in the CONTRIB file)?

> Secondly I want to comment on the recent change in Bio.Entrez.efetch
> (commit 01b091cd4679b58d7e478734324528dd9d52f3ed). While this change
> did fix the problem, I think this might be achieved in a cleaner way.
>
> Please see the code that is used to format the options on the URL (in
> Bio.Entrez._open):
>
> options = urllib.urlencode(params, doseq=True)
>
> the doseq argument specifically. Its documentation states:
> "If any values in the query arg are sequences and doseq is true, each
> sequence element is converted to a separate parameter."
>
> So this was the reason for the "id=1&id=2&id=3" formatting. Without
> doseq set this would turn into: "id=1,2,3"
>
> If this doseq functionality is not needed for other params (I am
> unsure of this), I suggest to revert the change in efetch() and use
> doseq=False (which is default argument)

Very good question - Michiel?

Thanks,

Peter

From daniel at treparel.com Thu Mar 1 09:59:42 2012
From: daniel at treparel.com (Daniël van Adrichem)
Date: Thu, 1 Mar 2012 15:59:42 +0100
Subject: [Biopython-dev] Two issues on Bio.Entrez (DTD download fallback, recent change in id lists)
In-Reply-To:
References:
Message-ID:

On 01/03/2012, Peter Cock wrote:
> 2012/3/1 Daniël van Adrichem :
>> Hello list,
>>
>> Firstly I want to report a bug plus suggested fix.
>>
>> Today I noticed a bug which got triggered by missing local DTDs. I was
>> still using 1.58 which does not have the new DTDs.
>>
>> Missing the DTDs locally should be handled by downloading them. This
>> worked for the first DTD, but then on the second one (which is a
>> dependency of the first one) I got a HTTP 404.
>>
>> After investigating I found that the module was making a request for
>> "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD\nlmmedlinecitationset_120101.dtd"
>> Note the backslash right after DTD. It gets turned into a %5C and
>> causes the 404.
>
> That DTD should be in Biopython 1.59 - and hopefully the other
> DTD you mentioned but did not name. Please let us know if there
> are any more we've missed.
>
> https://github.com/biopython/biopython/commit/5f08ccdfe0706f9073bce441609aa86b1ea9d0f4
>

I haven't encountered any missing DTDs since I updated to 1.59

>> The cause of this is usage of os.path.join to concatenate the URL. I
>> am running this on windows, on a platform where the file system uses a
>> forward slash this would work just fine.
>>
>> please find attached a patch to fix this issue.
>
> That makes perfect sense, although as written your patch could
> result in too many slashes being used - thus:

Preventing double slashes is a good thing, nice.

https://github.com/biopython/biopython/commit/c93b32bab5526a830e2cb14f0db782ee1b687715
>
> Would you like to be thanked in the NEWS file and listed as a contributor
> (in the CONTRIB file)?

It is only a single line patch, but if you insist I am fine with it :)

>> Secondly I want to comment on the recent change in Bio.Entrez.efetch
>> (commit 01b091cd4679b58d7e478734324528dd9d52f3ed). While this change
>> did fix the problem, I think this might be achieved in a cleaner way.
>>
>> Please see the code that is used to format the options on the url (in
>> Bio.Entrez._open):
>>
>> options = urllib.urlencode(params, doseq=True)
>>
>> the doseq argument specifically. Its documentation states:
>> "If any values in the query arg are sequences and doseq is true, each
>> sequence element is converted to a separate parameter."
>>
>> So this was the reason for the "id=1&id=2&id=3" formatting. Without
>> doseq set this would turn into: "id=1,2,3"
>>
>> If this doseq functionality is not needed for other params (I am
>> unsure of this), I suggest to revert the change in efetch() and use
>> doseq=False (which is default argument)
>
> Very good question - Michiel?

Ok, what I wrote here isn't really accurate. Using

    urllib.urlencode({'id': range(3)})

returns

    'id=%5B0%2C+1%2C+2%5D'

Note the %5B (square bracket open) and %5D (square bracket close).
Apparently urlencode takes str(range(3)), which is '[0, 1, 2]'.

Weirdly enough the URL with the [ and ] surrounding the id list seems
to be accepted, which is why I thought my suggestion worked.

So looking at it again I suggest keeping the code as it is right now.
Maybe only make sure the iterable consists of strings only, since
','.join does not accept anything else. Something like this would do,
I think:

    keywds["id"] = ",".join(map(str, keywds["id"]))

>
> Thanks,

Thank you

--
Daniël van Adrichem
Treparel Information Solutions b.v.
Delftechpark 26
2628XH Delft
The Netherlands

From eric.talevich at gmail.com Thu Mar 1 12:49:19 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 1 Mar 2012 12:49:19 -0500
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To:
References: <4F4BAE23.7070402@gmail.com>
Message-ID:

On Thu, Mar 1, 2012 at 7:02 AM, Peter Cock wrote:
> On Mon, Feb 27, 2012 at 4:24 PM, Robert Buels wrote:
> > Hi all,
> >
> > As kindly pointed out by Reece Hart, the previous email I sent out calling
> > for Google Summer of Code project ideas, had the wrong due date for project
> > ideas in it.
> >
> > I actually want them to all be in place by Friday, March 2, which is this
> > coming Friday.
> >
>
> See http://lists.open-bio.org/pipermail/biopython/2012-February/007726.html
> for the original complete email.
>
> That deadline is upon us (tomorrow), so where are we with GSoC 2012 ideas?
> http://biopython.org/wiki/Google_Summer_of_Code
>
> Are any of the areas touched on in the "Biopython 1.60 plans and beyond"
> thread suitable?
>

Perhaps:

Bio.Struct
----------

We have a lot of ideas and incomplete pieces of code from
previous GSoCs that could be sorted out in one summer.
However, taking on another GSoC student might just add to
the heap; this might need to be Eric and João's Summer of
Code instead.

Here's one semi-coherent project idea that could fly:

Overhaul Biopython's parsing infrastructure for protein
primary, secondary and tertiary structures

- Refactor PDBParser and parse_pdb_header to allow parsing
  amino-acid sequences from SEQRES lines (header) and ATOM
  records (body) without building the PDB structure object,
  i.e. without using numpy
- Write a pure-Python replacement for parsing mmCIF files.
  (The module MMCIF2Dict already does almost all the work;
  lex+yacc just manages a fairly simple state machine for
  recognizing comments, special sub-sections, etc.)
- Wrap the parsers for PDB, PDBML and mmCIF under a common
  I/O interface under the Bio.Struct namespace
- Add parsing support for protein secondary structures,
  based on the relevant PDB records or (perhaps) DSSP
  output. (Note that João did some work on this already.)

Variants
--------

So, from the Biopython 1.60 thread:

- James Casbon has offered to merge PyVCF into Biopython, right?
- BCF, the binary form of VCF (via blocked gzip), may also
  be worthwhile to support
- GVF, the Genome Variation Format, appears to be intended
  to be competitive with VCF. It's probably at least as well
  thought-out as VCF, sight unseen. It's based on GFF.

Synthesizing the above, we have a GSoC project that looks like:

- Help merge PyVCF into Biopython (w/ James's support -- I
  don't mean to volunteer him for this in absentia)?
- Write a GVF parser that emits the same object type as
  PyVCF, potentially also using existing GFF code
- Time permitting, look into blocked gzip support for VCF
  (BCF), also looking at SAM/BAM for inspiration and
  reusable code.

> SearchIO?
> ---------
>
> I'm wondering if a Biopython SearchIO would make a good project,
> that I might supervise. This name is obviously based on BioPerl. I
> would be aiming for iterator based parser/writer framework (like SeqIO
> and AlignIO) for pairwise 'sequence' searches initially, but have also
> been thinking about indexing - at least by query, ideally also by match,
> to allow random access akin to what Bio.SeqIO.index offers.
>
> In some cases the results would also be pairwise sequence alignments,
> in which case some code can be shared/linked with AlignIO. In other
> cases all you get is co-ordinates of the query and match plus some
> kind of score. Therefore this could include a hierarchical SearchIO
> result object structure for minimal matches up to full pairwise alignments.
>
> I'd hope to cover BLAST XML, BLAST tabular, HMMER tabular (not
> really sequence vs sequence, but HMM vs sequence), RPS-BLAST
> (again not really sequence vs sequence). Perhaps this could also tie
> into the Bio.Motif code as well (if we consider things like PSSM vs
> sequence in the same framework).
>
> You can already do some of this in Biopython (e.g. BLAST XML
> parsing, and there is some HMMER work on branches), but I'm
> hoping for a unified API here.
>
>

Interesting. It would be very nice if the objects emitted by SearchIO could
be easily fed into GenomeDiagram.

-Eric

From anaryin at gmail.com Thu Mar 1 13:00:41 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 1 Mar 2012 19:00:41 +0100
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To:
References: <4F4BAE23.7070402@gmail.com>
Message-ID:

>
> Bio.Struct
> ----------
>
> We have a lot of ideas and incomplete pieces of code from
> previous GSoCs that could be sorted out in one summer.
> However, taking on another GSoC student might just add to
> the heap; this might need to be Eric and João's Summer of
> Code instead.
>

The new student would have to be familiar with the regular Bio.PDB code
plus whatever code I wrote and Mikael wrote. Maybe a bit too much?

If by "*this might need to be Eric and João's Summer of Code instead.*" you
mean we could work together on it like in a SoC project, I think it would
be the best idea. Making a plan just like for SoC but working outside of it,
leaving the vacancy for another person/project.

Otherwise I don't know how well OBF will take yet another Bio.PDB project
since the previous two haven't been merged...

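For reference, the first Bio.Struct bullet in Eric's proposal (reading amino-acid sequences from SEQRES lines without building the structure object or importing numpy) could look roughly like the sketch below. It assumes the fixed-column SEQRES layout of the PDB format (chain identifier in column 12, residue names from column 20 onwards). The function name seqres_sequences is hypothetical, not an existing Bio.PDB call, and non-standard residues are simply reported as "X".

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(lines):
    """Return {chain_id: one-letter sequence} using only SEQRES records.

    Hypothetical helper: it never touches ATOM records, so no structure
    object (and no numpy) is needed.
    """
    residues_by_chain = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        chain_id = line[11]             # column 12 (1-based) holds the chain ID
        residues = line[19:].split()    # residue names start at column 20
        residues_by_chain.setdefault(chain_id, []).extend(residues)
    return {
        chain: "".join(THREE_TO_ONE.get(name, "X") for name in names)
        for chain, names in residues_by_chain.items()
    }


# Made-up records, built with a format string so the columns line up.
example_pdb_lines = [
    "HEADER    EXAMPLE",
    "SEQRES %3d %s %4d  %s" % (1, "A", 4, "GLY ILE VAL GLU"),
    "SEQRES %3d %s %4d  %s" % (1, "B", 3, "PHE VAL MSE"),
]
print(seqres_sequences(example_pdb_lines))
```

The same loop could yield sequence records per chain instead of a dict if it were wired into a SeqIO-style parser; that integration is left open here.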
From p.j.a.cock at googlemail.com Thu Mar 1 13:03:49 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 1 Mar 2012 18:03:49 +0000
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To:
References: <4F4BAE23.7070402@gmail.com>
Message-ID:

2012/3/1 Eric Talevich :
>
> Here's one semi-coherent project idea that could fly:
>
> Overhaul Biopython's parsing infrastructure for protein
> primary, secondary and tertiary structures
>
> - Refactor PDBParser and parse_pdb_header to allow parsing
>   amino-acid sequences from SEQRES lines (header) and ATOM
>   records (body) without building the PDB structure object,
>   i.e. without using numpy
> - Write a pure-Python replacement for parsing mmCIF files.
>   (The module MMCIF2Dict already does almost all the work;
>   lex+yacc just manages a fairly simple state machine for
>   recognizing comments, special sub-sections, etc.)
> - Wrap the parsers for PDB, PDBML and mmCIF under a common
>   I/O interface under the Bio.Struct namespace
> - Add parsing support for protein secondary structures,
>   based on the relevant PDB records or (perhaps) DSSP
>   output. (Note that João did some work on this already.)

Do you think you could mentor that? One serious downside
would be even more work on PDB-related code which will
make future merging even harder. We do need to tackle the
GSoC backlog as a priority.

> Variants
> --------
>
> So, from the Biopython 1.60 thread:
>
> - James Casbon has offered to merge PyVCF into Biopython, right?
> - BCF, the binary form of VCF (via blocked gzip), may also
>   be worthwhile to support
> - GVF, the Genome Variation Format, appears to be intended
>   to be competitive with VCF. It's probably at least as well
>   thought-out as VCF, sight unseen. It's based on GFF.
>
> Synthesizing the above, we have a GSoC project that looks like:
>
> - Help merge PyVCF into Biopython (w/ James's support -- I
>   don't mean to volunteer him for this in absentia)?
> - Write a GVF parser that emits the same object type as
>   PyVCF, potentially also using existing GFF code
> - Time permitting, look into blocked gzip support for VCF
>   (BCF), also looking at SAM/BAM for inspiration and
>   reusable code.

Sounds interesting - who might be willing to mentor it?

>> SearchIO?
>> ---------
>>
>> I'm wondering if a Biopython SearchIO would make a good project,
>> that I might supervise. This name is obviously based on BioPerl. I
>> would be aiming for iterator based parser/writer framework (like SeqIO
>> and AlignIO) for pairwise 'sequence' searches initially, but have also
>> been thinking about indexing - at least by query, ideally also by match,
>> to allow random access akin to what Bio.SeqIO.index offers.
>>
>> In some cases the results would also be pairwise sequence alignments,
>> in which case some code can be shared/linked with AlignIO. In other
>> cases all you get is co-ordinates of the query and match plus some
>> kind of score. Therefore this could include a hierarchical SearchIO
>> result object structure for minimal matches up to full pairwise
>> alignments.
>>
>> I'd hope to cover BLAST XML, BLAST tabular, HMMER tabular (not
>> really sequence vs sequence, but HMM vs sequence), RPS-BLAST
>> (again not really sequence vs sequence). Perhaps this could also tie
>> into the Bio.Motif code as well (if we consider things like PSSM vs
>> sequence in the same framework).
>>
>> You can already do some of this in Biopython (e.g. BLAST XML
>> parsing, and there is some HMMER work on branches), but I'm
>> hoping for a unified API here.
>>
>
> Interesting. It would be very nice if the objects emitted by SearchIO
> could be easily fed into GenomeDiagram.

Funnily enough, that is one of my motivations - specifically for
doing ACT style diagrams comparing multiple genomes to each other.
I've just started putting some examples into the Tutorial on this
today, where I say ideally you'd parse some BLAST output or
whatever, but here I'm manually typing in a list of links to draw ;)

Peter

From chris.mit7 at gmail.com Thu Mar 1 13:03:57 2012
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Thu, 1 Mar 2012 13:03:57 -0500
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To:
References: <4F4BAE23.7070402@gmail.com>
Message-ID:

I'm unsure if this is the best place for this, but I would be willing to
undertake the VCF work as a GSoC student. I've been working on structural
variants in whole genome sequencing/RNA-seq/protein levels already, so this
would dovetail nicely into my existing work (and be a nice thing for a CV
:))

Chris

On Thu, Mar 1, 2012 at 1:00 PM, João Rodrigues wrote:

> >
> > Bio.Struct
> > ----------
> >
> > We have a lot of ideas and incomplete pieces of code from
> > previous GSoCs that could be sorted out in one summer.
> > However, taking on another GSoC student might just add to
> > the heap; this might need to be Eric and João's Summer of
> > Code instead.
> >
>
> The new student would have to be familiar with the regular Bio.PDB code
> plus whatever code I wrote and Mikael wrote. Maybe a bit too much?
>
> If by "*this might need to be Eric and João's Summer of Code instead.*" you
> mean we could work together on it like in a SoC project, I think it would
> be the best idea. Making a plan just like for SoC but working outside of it
> leaving the vacancy for another person/project.
>
> Otherwise I don't know how well will OBF take yet another Bio.PDB project
> since the previous two haven't been merged...
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From eric.talevich at gmail.com Thu Mar 1 13:14:55 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 1 Mar 2012 13:14:55 -0500
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To:
References: <4F4BAE23.7070402@gmail.com>
Message-ID:

On Thu, Mar 1, 2012 at 1:00 PM, João Rodrigues wrote:

>> Bio.Struct
>> ----------
>>
>> We have a lot of ideas and incomplete pieces of code from
>> previous GSoCs that could be sorted out in one summer.
>> However, taking on another GSoC student might just add to
>> the heap; this might need to be Eric and João's Summer of
>> Code instead.
>>
>
> The new student would have to be familiar with the regular Bio.PDB code
> plus whatever code I wrote and Mikael wrote. Maybe a bit too much?
>
> If by "*this might need to be Eric and João's Summer of Code instead.*"
> you mean we could work together one it like in a SoC project, I think it
> would be the best idea. Making a plan just like for SoC but working outside
> of it leaving the vacancy for another person/project.
>
> Otherwise I don't know how well will OBF take yet another Bio.PDB project
> since the previous two haven't been merged...
>

Those are my thoughts exactly. :)

From eric.talevich at gmail.com Thu Mar 1 13:30:19 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 1 Mar 2012 13:30:19 -0500
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To:
References: <4F4BAE23.7070402@gmail.com>
Message-ID: 2012/3/1 Peter Cock > 2012/3/1 Eric Talevich : > > > > Here's one semi-coherent project idea that could fly: > > > > Overhaul Biopython's parsing infrastructure for protein > > primary, secondary and tertiary structures > > > > - Refactor PDBParser and parse_pdb_header to allow parsing > > amino-acid sequences from SEQRES lines (header) and ATOM > > records (body) without building the PDB structure object, > > i.e. without using numpy > > - Write a pure-Python replacement for parsing mmCIF files. > > (The module MMCIF2Dict already does almost all the work; > > lex+yacc just manages a fairly simple state machine for > > recognizing comments, special sub-sections, etc.) > > - Wrap the parsers for PDB, PDBML and mmCIF under a common > > I/O interface under the Bio.Struct namespace > > - Add parsing support for protein secondary structures, > > based on the relevant PDB records or (perhaps) DSSP > > output. (Note that Jo?o did some work on this already.) > > Do you think you could mentor that? One serious downside > would be even more work on PDB related code which will > make future merging even harder. We do need to tackle the > GSoC back log as a priority. > I would serve if called upon, but I think it's best if we set this one aside for E&J SoC (JESoC?) rather than GSoC this year. > > > Variants > > -------- > > > > So, from the Biopython 1.60 thread: > > > > - James Casbon has offered to merge PyVCF into Biopython, right? > > - BCF, the binary form of VCF (via blocked gzip), may also > > be worthwhile to support > > - GVF, the Genome Variation Format, appears to be intended > > to be competitive with VCF. It's probably at least as well > > thought-out as VCF, sight unseen. It's based on GFF. > > > > Synthesizing the above, we have a GSoC project that looks like: > > > > - Help merge PyVCF into Python (w/ James's support -- I > > don't mean to volunteer him for this in absentia)? 
> > - Write a GVF parser that emits the same object type as > > PyVCF, potentially also using existing GFF code > > - Time permitting, look into blocked gzip support for VCF > > (BCF), also looking at SAM/BAM for inspiration and > > reusable code. > > Sounds interesting - who might be willing to mentor it? > Does someone feel comfortable asking James for his thoughts on this? I'm not especially well qualified to mentor this, though I could assist as a secondary mentor if needed. Any other Biopython devs/users well acquainted with VCF/PyVCF? From rodrigo.faccioli at gmail.com Thu Mar 1 13:44:14 2012 From: rodrigo.faccioli at gmail.com (Rodrigo Faccioli) Date: Thu, 1 Mar 2012 15:44:14 -0300 Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas In-Reply-To: References: <4F4BAE23.7070402@gmail.com>

Message-ID:

Hi,

Although I'm not enough of a specialist to be a mentor, I have experience
implementing the reading of the SEQRES section in PDBParser. In fact, I have
already implemented it and I'm able to share it with the Biopython project.

Best regards,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structural Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-8739
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5
Personal Blogg - http://rodrigofaccioli.blogspot.com/

On Thu, Mar 1, 2012 at 3:30 PM, Eric Talevich wrote:

> 2012/3/1 Peter Cock
>
> > 2012/3/1 Eric Talevich :
> > >
> > > Here's one semi-coherent project idea that could fly:
> > >
> > > Overhaul Biopython's parsing infrastructure for protein
> > > primary, secondary and tertiary structures
> > >
> > > - Refactor PDBParser and parse_pdb_header to allow parsing
> > >   amino-acid sequences from SEQRES lines (header) and ATOM
> > >   records (body) without building the PDB structure object,
> > >   i.e. without using numpy
> > > - Write a pure-Python replacement for parsing mmCIF files.
> > >   (The module MMCIF2Dict already does almost all the work;
> > >   lex+yacc just manages a fairly simple state machine for
> > >   recognizing comments, special sub-sections, etc.)
> > > - Wrap the parsers for PDB, PDBML and mmCIF under a common
> > >   I/O interface under the Bio.Struct namespace
> > > - Add parsing support for protein secondary structures,
> > >   based on the relevant PDB records or (perhaps) DSSP
> > >   output. (Note that João did some work on this already.)
> >
> > Do you think you could mentor that? One serious downside
> > would be even more work on PDB related code which will
> > make future merging even harder. We do need to tackle the
> > GSoC back log as a priority.
> > > > I would serve if called upon, but I think it's best if we set this one > aside for E&J SoC (JESoC?) rather than GSoC this year. > > > > > > > Variants > > > -------- > > > > > > So, from the Biopython 1.60 thread: > > > > > > - James Casbon has offered to merge PyVCF into Biopython, right? > > > - BCF, the binary form of VCF (via blocked gzip), may also > > > be worthwhile to support > > > - GVF, the Genome Variation Format, appears to be intended > > > to be competitive with VCF. It's probably at least as well > > > thought-out as VCF, sight unseen. It's based on GFF. > > > > > > Synthesizing the above, we have a GSoC project that looks like: > > > > > > - Help merge PyVCF into Python (w/ James's support -- I > > > don't mean to volunteer him for this in absentia)? > > > - Write a GVF parser that emits the same object type as > > > PyVCF, potentially also using existing GFF code > > > - Time permitting, look into blocked gzip support for VCF > > > (BCF), also looking at SAM/BAM for inspiration and > > > reusable code. > > > > Sounds interesting - who might be willing to mentor it? > > > > Does someone feel comfortable asking James for his thoughts on this? > > I'm not especially well qualified to mentor this, though I could assist as > a secondary mentor if needed. Any other Biopython devs/users well > acquainted with VCF/PyVCF? > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From chapmanb at 50mail.com Thu Mar 1 20:43:02 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 01 Mar 2012 20:43:02 -0500 Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas In-Reply-To: References: <4F4BAE23.7070402@gmail.com>

Message-ID: <87399rkcbd.fsf@fastmail.fm> Peter and Eric; > > Variants > > -------- > > Synthesizing the above, we have a GSoC project that looks like: > > > > - Help merge PyVCF into Python (w/ James's support -- I > > ? don't mean to volunteer him for this in absentia)? > > - Write a GVF parser that emits the same object type as > > ? PyVCF, potentially also using existing GFF code > > - Time permitting, look into blocked gzip support for VCF > > ? (BCF), also looking at SAM/BAM for inspiration and > > ? reusable code. > > Sounds interesting - who might be willing to mentor it? This is a great idea. Reece and I proposed a variant project last year, and Reece has already e-mailed me this year about trying again. He was planning on re-vamping the description on the GSoC page for 2012: http://biopython.org/wiki/Google_Summer_of_Code so hopefully we can incorporate several aspects of this. From my experience I would prioritize BCF/Tabix files since you see a lot of those in practice. For GVF we could certainly leverage the GFF parser since it is GFF with variant keywords. Practically, I would love to settle on one format for this and VCF seems to have the most tool uptake so far. > >> SearchIO? > >> --------- +1 for this as well. Great ideas, Brad From p.j.a.cock at googlemail.com Fri Mar 2 06:53:54 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 2 Mar 2012 11:53:54 +0000 Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas In-Reply-To: <87399rkcbd.fsf@fastmail.fm> References: <4F4BAE23.7070402@gmail.com>

<87399rkcbd.fsf@fastmail.fm> Message-ID: On Fri, Mar 2, 2012 at 1:43 AM, Brad Chapman wrote: > > Peter and Eric; > >> > Variants >> > -------- >> > Synthesizing the above, we have a GSoC project that looks like: >> > >> > - Help merge PyVCF into Python (w/ James's support -- I >> > ? don't mean to volunteer him for this in absentia)? >> > - Write a GVF parser that emits the same object type as >> > ? PyVCF, potentially also using existing GFF code >> > - Time permitting, look into blocked gzip support for VCF >> > ? (BCF), also looking at SAM/BAM for inspiration and >> > ? reusable code. >> >> Sounds interesting - who might be willing to mentor it? > > This is a great idea. Reece and I proposed a variant project last year, > and Reece has already e-mailed me this year about trying again. He was > planning on re-vamping the description on the GSoC page for 2012: > > http://biopython.org/wiki/Google_Summer_of_Code Excellent - can you and/or Reece polish that wiki text today? We don't need it to be perfect or that detailed at this stage, do we? > so hopefully we can incorporate several aspects of this. From my > experience I would prioritize BCF/Tabix files since you see a lot of > those in practice. Right. It sounds like my BGZF code (blocked gzip) should be helpful for BCF as well. > For GVF we could certainly leverage the GFF parser since it is GFF with > variant keywords. Practically, I would love to settle on one format for > this and VCF seems to have the most tool uptake so far. That could go in as a potential aim too then. >> >> SearchIO? >> >> --------- > > +1 for this as well. Great ideas, > Brad I've started to write up that on the wiki page now. Peter From eric.talevich at gmail.com Fri Mar 2 09:44:13 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 2 Mar 2012 09:44:13 -0500 Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas In-Reply-To: References: <4F4BAE23.7070402@gmail.com>

Message-ID:

On Fri, Mar 2, 2012 at 9:02 AM, James Casbon wrote:

> On 1 March 2012 18:30, Eric Talevich wrote:
> > 2012/3/1 Peter Cock
> >> 2012/3/1 Eric Talevich :
> >> > Variants
> >> > --------
> >> >
> >> > So, from the Biopython 1.60 thread:
> >> >
> >> > - James Casbon has offered to merge PyVCF into Biopython, right?
> >> > - BCF, the binary form of VCF (via blocked gzip), may also
> >> >   be worthwhile to support
> >> > - GVF, the Genome Variation Format, appears to be intended
> >> >   to be competitive with VCF. It's probably at least as well
> >> >   thought-out as VCF, sight unseen. It's based on GFF.
> >> >
> >> > Synthesizing the above, we have a GSoC project that looks like:
> >> >
> >> > - Help merge PyVCF into Python (w/ James's support -- I
> >> >   don't mean to volunteer him for this in absentia)?
> >> > - Write a GVF parser that emits the same object type as
> >> >   PyVCF, potentially also using existing GFF code
> >> > - Time permitting, look into blocked gzip support for VCF
> >> >   (BCF), also looking at SAM/BAM for inspiration and
> >> >   reusable code.
> >>
> >> Sounds interesting - who might be willing to mentor it?
> >>
> >
> > Does someone feel comfortable asking James for his thoughts on this?
> >
> > I'm not especially well qualified to mentor this, though I could assist as
> > a secondary mentor if needed. Any other Biopython devs/users well
> > acquainted with VCF/PyVCF?
>
> I'm willing to co-mentor this.
>
> I would think the easiest route would be to initially add bcf support,
> then GVF.
>

Thanks for volunteering, James! I've added your name to the list of mentors
on the wiki page:
biopython.org/wiki/Google_Summer_of_Code

From reece at harts.net Fri Mar 2 13:25:25 2012
From: reece at harts.net (Reece Hart)
Date: Fri, 2 Mar 2012 10:25:25 -0800
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To: <87399rkcbd.fsf@fastmail.fm>
References: <4F4BAE23.7070402@gmail.com>

<87399rkcbd.fsf@fastmail.fm>
Message-ID:

>
> > > - Help merge PyVCF into Python (w/ James's support -- I
> > >   don't mean to volunteer him for this in absentia)?
> > > - Write a GVF parser that emits the same object type as
> > >   PyVCF, potentially also using existing GFF code
> > > - Time permitting, look into blocked gzip support for VCF
> > >   (BCF), also looking at SAM/BAM for inspiration and
> > >   reusable code.
> >
> > Sounds interesting - who might be willing to mentor it?
>
> Reece and I proposed a variant project last year,
> and Reece has already e-mailed me this year about trying again. He was
> planning on re-vamping the description on the GSoC page for 2012:

Sorry... I was offline for a couple of days.

I'm still very interested in pursuing this. Perhaps James, Brad, and I can
co-mentor one or more projects around 1) variant representation, 2) variant
conversion/canonicalization, 3) variant IO. I have good familiarity with
HGVS, BED, and VCF. I don't have a good enough handle to know the best
strategy for using BioPython internals to achieve the aims.

I can't guarantee that I'll finish edits today, but I'll aspire to that.

-Reece

From chapmanb at 50mail.com Fri Mar 2 15:51:24 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 02 Mar 2012 15:51:24 -0500
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To:
References: <4F4BAE23.7070402@gmail.com>

<87399rkcbd.fsf@fastmail.fm>
Message-ID: <87booeu3oz.fsf@fastmail.fm>


Reece;

> I'm still very interested in pursuing this. Perhaps James, Brad, and I can
> co-mentor one or more projects around 1) variant representation, 2) variant
> conversion/canonicalization, 3) variant IO. I have good familiarity with
> HGVS, BED, and VCF. I don't have a good enough handle to know the best
> strategy for using BioPython internals to achieve the aims.

This is perfect, thanks for sending along the google doc writeup. That
looks great as is, I'd suggest adding it to the wiki. We still have a
week to edit before the official Google deadline and then time to edit
again before students start submitting.

Thanks again for this, looking forward to seeing some awesome student
applications,
Brad

From redmine at redmine.open-bio.org Sun Mar 4 13:45:43 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 4 Mar 2012 18:45:43 +0000
Subject: [Biopython-dev] [Biopython - Feature #3330] (New) Delay imports of sre_compile and CodonTable during SeqIO import
Message-ID:

Issue #3330 has been reported by Martin Mokrejš.

----------------------------------------
Feature #3330: Delay imports of sre_compile and CodonTable during SeqIO import
https://redmine.open-bio.org/issues/3330

Author: Martin Mokrejš
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL:

I had a look why biopython takes a long while to parse my FASTA files. I used the standard trace module of python and see that it imports many more packages than I really need. For example, the CodonTable import is really costly, also sre_compile import is expensive. Could they be imported during their first actual use? Something like lazy evaluation (http://wiki.python.org/moin/PythonSpeed/PerformanceTips)?

$ python -m trace --trace -- test.py 1>/tmp/test.out 2>test.err
$ cat test.py
#! /usr/bin/env python
from Bio import SeqIO
for _record in SeqIO.parse("test.fasta", 'fasta'):
    print(_record.id)
$ cat test.fasta
>test
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
$ awk '{print $1}' /tmp/test.out | sed -e 's/(.*//' | sort | uniq -c | sort -r | head -n 100
 180797 sre_compile.py
 177955 CodonTable.py
  64743 sre_parse.py
  17858 ---
   1410 locale.py
   1212 QualityIO.py
   1177 urllib.py
   1077 __init__.py
    963 IUPACData.py
    561 urlparse.py
    488 re.py
    365 base64.py
    272 collections.py
    258 os.py
[cut]
$

Whatever you can do in this regard will be much appreciated. ;-) I have biopython-1.58.

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Sun Mar 4 14:35:25 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 4 Mar 2012 19:35:25 +0000
Subject: [Biopython-dev] [Biopython - Bug #2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
References:
Message-ID:

Issue #2619 has been updated by Lenna Peterson.

Is the desire to use a C parser due to performance concerns (30k+ line files)?

I can think of two cross-platform alternatives to flex. Maybe Windows 9 will be *nix based and we can escape these problems.

1) There is at least one python implementation of lex: http://www.dabeaz.com/ply/ (BSD license)

2) The mmCIF parser could possibly be written in core python. There's a perl CIF parser: http://pdb.sdsc.edu/STAR/index.html (UCSD license)

I don't have experience with lexical analysis, but it seems like cross-platform support for CIF is important. In which direction would my efforts be best directed?

----------------------------------------
Bug #2619: Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
https://redmine.open-bio.org/issues/2619

Author: Chris Oldfield
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.48
URL:

MMCIFParser is a documented feature of Bio.PDB, but it is broken by default because the MMCIFlex build is commented out in the distribution setup.py. According to http://osdir.com/ml/python.bio.devel/2006-02/msg00038.html this is because it doesn't compile on Windows. Though the function is documented, the changes needed to enable it are not, so this seems like an installation bug to me.

The fix on linux is to uncomment setup.py lines 486 on. A general work around might be to condition the compile on the os.sys.platform variable. I'd offer a diff, but I'm new to biopython and python in general, so please forgive my ignorance.
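
The workaround suggested in the report, conditioning the MMCIFlex build on the platform, might look roughly like the sketch below. It is only an illustration: the helper name `optional_extensions` is made up, and the source paths and the `fl` library are assumptions about the build rather than a copy of Biopython's setup.py.

```python
import sys


def optional_extensions(platform=None):
    """Return (module_name, sources, libraries) specs to build on a platform.

    Hypothetical helper for illustration only; the source paths and the
    "fl" (flex) library below are assumptions, not taken from setup.py.
    """
    if platform is None:
        platform = sys.platform
    specs = []
    # Assumption: the flex-based lexer only fails to compile on Windows
    # (sys.platform == "win32"), so it is skipped there and built elsewhere.
    if not platform.startswith("win"):
        specs.append(
            (
                "Bio.PDB.mmCIF.MMCIFlex",
                ["Bio/PDB/mmCIF/lex.yy.c", "Bio/PDB/mmCIF/MMCIFlexmodule.c"],
                ["fl"],
            )
        )
    return specs


if __name__ == "__main__":
    for name, sources, libraries in optional_extensions():
        print(name, sources, libraries)
```

Each returned tuple would then be turned into an `Extension(name, sources, libraries=libraries)` entry in setup.py, so Windows builds simply omit the module instead of failing.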
Source install of version 1.48, gentoo linux 2008, x86_64.

From redmine at redmine.open-bio.org Sun Mar 4 14:57:28 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 4 Mar 2012 19:57:28 +0000
Subject: [Biopython-dev] [Biopython - Bug #2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
References:
Message-ID:

Issue #2619 has been updated by Peter Cock.

There was a discussion about this some time ago on our dev mailing list, the consensus as I recall was to aim for a pure Python implementation. That should allow us to support Windows easily, but also support Jython and PyPy, and make supporting Python 3 fairly simple too (have a look at Bio/SeqIO/SffIO.py for an example of a binary parser that works on Python 2 and 3, the only hard bit is explicit bytes vs unicode).

If you would like to work on this, that would be great. I'd suggest a branch on github - but please sign up to the biopython-dev mailing list too. Thanks.

From redmine at redmine.open-bio.org Sun Mar 4 15:21:13 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 4 Mar 2012 20:21:13 +0000
Subject: [Biopython-dev] [Biopython - Bug #2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
References:
Message-ID:

Issue #2619 has been updated by Eric Talevich.

Lenna Peterson wrote:
> Is the desire to use a C parser due to performance concerns (30k+ line files)?

Presumably. Our PDB parser is pure-Python, and the original author has noted dissatisfaction with its speed. The trade-off is portability, and with PyPy getting faster and more widely usable, the argument for pure Python probably wins now.

> 1) There is at least one python implementation of lex: http://www.dabeaz.com/ply/ (BSD license)

PLY is written entirely in Python, and appears to be supported on all the Python versions we support. I haven't used it, but it looks like a good option.

Not sure if we would need to add PLY as a dependency, or if it generates Python files we could check in to Git and distribute directly.

> 2) The mmCIF parser could possibly be written in core python.

This would probably not be difficult. I'm not sure what to expect in terms of performance between flex, PLY, and manual Python "if" statements and string methods. The mmCIF format looks quite machine-friendly, and I think regular expressions could be mostly avoided.
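
As a rough illustration of the "if statements and string methods" approach, here is a minimal sketch of a line-based mmCIF tokenizer. It is not the MMCIFlex replacement itself: the names `tokenize_cif` and `_split_line` are made up for this example, and it only handles comments, whitespace-separated values, quoted strings, and semicolon-delimited multi-line text, without recording whether a token was quoted.

```python
def _split_line(line):
    """Yield the tokens found on a single line (no multi-line text fields)."""
    pos, n = 0, len(line)
    while pos < n:
        char = line[pos]
        if char.isspace():
            pos += 1
        elif char == "#":
            return  # a '#' at the start of a token comments out the rest
        elif char in "'\"":
            # A closing quote only counts if followed by whitespace or
            # the end of the line; otherwise it is part of the value.
            end = pos + 1
            while True:
                end = line.find(char, end)
                if end == -1:
                    raise ValueError("unterminated quoted string: %r" % line)
                if end + 1 == n or line[end + 1].isspace():
                    break
                end += 1
            yield line[pos + 1:end]
            pos = end + 1
        else:
            end = pos
            while end < n and not line[end].isspace():
                end += 1
            yield line[pos:end]
            pos = end


def tokenize_cif(lines):
    """Yield mmCIF tokens as plain strings from an iterable of text lines."""
    text_field = None  # list of lines while inside a ;...; text block
    for line in lines:
        line = line.rstrip("\r\n")
        if text_field is not None:
            if line.startswith(";"):
                # Closing ';' ends the block; emit it as a single token.
                yield "\n".join(text_field)
                text_field = None
                line = line[1:]
            else:
                text_field.append(line)
                continue
        elif line.startswith(";"):
            text_field = [line[1:]]
            continue
        for token in _split_line(line):
            yield token


if __name__ == "__main__":
    demo = ["data_demo", "_entry.id demo  # comment", "loop_"]
    for token in tokenize_cif(demo):
        print(token)
```

The only state carried between lines is whether a semicolon text block is open, which is why plain string methods are enough here and no regular expressions are needed.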
Lenna, if you have some time and interest to look into this, the files to modify or replace are:

Bio/PDB/MMCIF2Dict.py
Bio/PDB/mmCIF/mmcif.lex

The options are:

(a) Write (or use PLY to generate) a pure-Python version of the module Bio.PDB.mmCIF.MMCIFlex. This is currently compiled as a C extension, but a Python version of it could be imported as a backup if the C version isn't available.

(b) Modify MMCIF2Dict directly, and implement the state machine there. I suppose you'd have a separate function/method that reads one line at a time from the file, checks the current state and the contents of the line (e.g. line.startswith('#')), updates the state if needed, and emits tokens.

From reece at harts.net Sun Mar 4 22:20:46 2012
From: reece at harts.net (Reece Hart)
Date: Sun, 4 Mar 2012 19:20:46 -0800
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To: <87booeu3oz.fsf@fastmail.fm>
References: <4F4BAE23.7070402@gmail.com>

<87399rkcbd.fsf@fastmail.fm> <87booeu3oz.fsf@fastmail.fm>
Message-ID:

On Fri, Mar 2, 2012 at 12:51 PM, Brad Chapman wrote:

> I'd suggest adding it to the wiki. We still have a
> week to edit before the official Google deadline and then time to edit
> again before students start submitting.
>

The variant representation component was updated on the wiki. All comments
welcome.

-Reece

From chapmanb at 50mail.com Mon Mar 5 21:12:01 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 05 Mar 2012 21:12:01 -0500
Subject: [Biopython-dev] BOSC 2012 Call for Abstracts
Message-ID: <87y5re32by.fsf@fastmail.fm>

Call for Abstracts for the 13th Annual Bioinformatics Open Source
Conference (BOSC 2012)
A Special Interest Group (SIG) of ISMB 2012

Dates: July 13-14, 2012
Location: Long Beach, California
Web site: http://www.open-bio.org/wiki/BOSC_2012
Email: bosc at open-bio.org
BOSC announcements mailing list:
http://lists.open-bio.org/mailman/listinfo/bosc-announce

Important Dates:
April 13, 2012: Deadline for submitting abstracts
May 7, 2012: Notification of accepted talk abstracts emailed to authors
July 11-12, 2012: Codefest 2012 programming session
July 13-14, 2012: BOSC 2012
July 15-17, 2012: ISMB 2012

The Bioinformatics Open Source Conference (BOSC) is sponsored by the
Open Bioinformatics Foundation (O|B|F), a non-profit group dedicated to
promoting the practice and philosophy of Open Source software
development within the biological research community. To be considered
for acceptance, software systems representing the central topic in a
presentation submitted to BOSC must be licensed with a recognized Open
Source License, and be freely available for download in source code
form.

We invite you to submit one-page abstracts for talks and posters.
This year's session topics are:
- Cloud and Parallel Computing
- Linked Data
- Genome-scale Data Management
- Data Visualization and Imaging
- Translational Bioinformatics
- Software Interoperability (possibly a joint session with BSI-SIG,
  the Bioinformatics Software Interoperability SIG)
- Bioinformatics Open Source Project Updates
- Interfacing with Industry (panel)

Thanks to generous sponsorship from Eagle Genomics and an anonymous
donor, we are pleased to announce a competition for three Student
Travel Awards. Each winner will be awarded $250 to defray the costs of
travel to BOSC 2012.

For instructions on submitting your abstract, please visit
http://www.open-bio.org/wiki/BOSC_2012#Submitting_Abstracts

BOSC 2012 Organizing Committee:
Nomi Harris (chair), Jan Aerts, Brad Chapman, Peter Cock,
Chris Fields, Erwin Frise, Peter Rice

From redmine at redmine.open-bio.org Tue Mar 6 20:20:34 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 7 Mar 2012 01:20:34 +0000
Subject: [Biopython-dev] [Biopython - Bug #2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
References:
Message-ID:

Issue #2619 has been updated by Lenna Peterson.

Peter: I've got a github branch going and I'm on the mailing list.

Eric Talevich wrote:
> PLY is written entirely in Python, and appears to be supported on all the Python versions we support. I haven't used it, but it looks like a good option.
>
> Not sure if we would need to add PLY as a dependency, or if it generates Python files we could check in to Git and distribute directly.
>
> (a) Write (or use PLY to generate) a pure-Python version of the module Bio.PDB.mmCIF.MMCIFlex. This is currently compiled as a C extension, but a Python version of it could be imported as a backup if the C version isn't available.

I've got a good start on (a). It seems like I need to import PLY's lex module.
Is the etiquette to include ply/lex.py in the mmCIF module (as far as I can tell, the author/license allow this), or to list PLY as a dependency? The full yacc functionality from PLY is not needed (just the lex-style tokenizing).

From redmine at redmine.open-bio.org Wed Mar 7 06:52:09 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 7 Mar 2012 11:52:09 +0000
Subject: [Biopython-dev] [Biopython - Bug #2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
References:
Message-ID:

Issue #2619 has been updated by Peter Cock.

Probably it would be best to leave PLY as an external dependency. If this works nicely we can define it in the setup.py as a requirement for automatic installation via easy_install or pip.
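
If PLY stays an external dependency, the mmCIF code would need to cope with it being absent. One possible sketch, under stated assumptions: import the lexer module only when it is first needed and raise a clear error if it is missing. The function name `get_lex_module` and the error text are illustrative and not taken from Biopython; `ply.lex` is PLY's lexer module.

```python
import importlib


def get_lex_module(module_name="ply.lex"):
    """Import and return the lexer module, or raise a helpful ImportError.

    Hypothetical helper: the import happens on first use rather than at
    package import time, so users who never parse mmCIF files do not
    need the optional dependency installed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise ImportError(
            "The mmCIF parser needs the optional dependency %r; "
            "install it with e.g. 'pip install ply'." % module_name
        )
```

With setuptools, the requirement itself could additionally be declared through `install_requires=["ply"]` in setup.py so that pip or easy_install fetch it automatically.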
Note ply is in pypi http://pypi.python.org/pypi/ply/ and supports Python 2 and 3 already.

If using PLY turns out to be non-trivial, then a pure Python CIF parser might be advantageous.

From clements at galaxyproject.org Wed Mar 7 11:48:27 2012
From: clements at galaxyproject.org (Dave Clements)
Date: Wed, 7 Mar 2012 08:48:27 -0800
Subject: [Biopython-dev] Openings at Galaxy
Message-ID:

Hello all,

Want to work on one of the fastest growing open source bioinformatics projects around? The Galaxy Project, a highly successful high throughput data analysis platform for Life Sciences with over 15,000 users worldwide, is hiring.

The Taylor Lab in the Biology and Mathematics & Computer Science at Emory University is looking for software engineers and postdoctoral scholars to work on the Galaxy Project.
We are seeking software engineers with expertise in distributed computing and systems programming, web-based visualization and visual analytics, informatics and data analysis and integration, and bioinformatics application areas such as re-sequencing, de novo assembly, metagenomics, transcriptome analysis and epigenetics. These are full time positions located in Atlanta, GA. See the official posting for full details.

Postdoctoral applicants should have expertise in Bioinformatics and Computational Biology and research interests that complement but extend the lab's current interests: The Galaxy project; distributed and high-performance computing for data intensive science; vertebrate functional genomics; and genomics and epigenomic mechanisms of gene regulation, the role of transcription factors and chromatin structure in global gene expression, development, and differentiation. See the announcement for full details.

The Nekrutenko Lab at the Huck Institutes of Life Sciences at Penn State is seeking highly opinionated and biologically inclined Postdoctoral researchers within the Galaxy Project to develop best practices for analysis of next-generation sequencing data in all areas of Life Sciences where NGS is used. Successful candidates will join a vibrant research group at the Center for Comparative Genomics and Bioinformatics at Penn State University.

Please send your CV to jobs at galaxyproject.org.
Links:
http://wiki.g2.bx.psu.edu/GalaxyIsHiring
http://galaxyproject.org/
http://bx.mathcs.emory.edu/joining/
http://www.bx.psu.edu/

Thanks,

Dave C

--
http://galaxyproject.org/GCC2012
http://galaxyproject.org/
http://getgalaxy.org/
http://usegalaxy.org/
http://galaxyproject.org/wiki/

From p.j.a.cock at googlemail.com Wed Mar 7 12:18:21 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 7 Mar 2012 17:18:21 +0000
Subject: [Biopython-dev] Openings at Galaxy
In-Reply-To:
References:
Message-ID:

On Wed, Mar 7, 2012 at 4:48 PM, Dave Clements wrote:
> Hello all,
>
> Want to work on one of the fastest growing open source bioinformatics
> projects around? The Galaxy Project, a highly successful high throughput
> data analysis platform for Life Sciences with over 15,000 users worldwide,
> is hiring.
> ...
> http://wiki.g2.bx.psu.edu/GalaxyIsHiring

What Dave forgot to mention is Galaxy itself is written in Python :)

Also, although it doesn't (currently) use Biopython internally,
some of the user submitted Galaxy Tools do - and both Brad and
I are quite heavy Galaxy users (looking after local installations
at our workplaces) and have contributed to the project.

Peter

From MatatTHC at gmx.de Fri Mar 9 02:53:01 2012
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Fri, 09 Mar 2012 08:53:01 +0100
Subject: [Biopython-dev] LocationParserError
Message-ID: <20120309075301.15030@gmx.net>

Hi,

I just got the new RefSeq 52 release and found a really strange error
causing an exception:

Bio.GenBank.LocationParserError: join(complement(149815..150200),complement(293787..295573),NC_016402.1:6618..6676,181647..181905)

the accession is: NC_016406. Any ideas?

Matthias

From p.j.a.cock at googlemail.com Fri Mar 9 04:14:18 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 9 Mar 2012 09:14:18 +0000
Subject: [Biopython-dev] LocationParserError
In-Reply-To: <20120309075301.15030@gmx.net>
References: <20120309075301.15030@gmx.net>
Message-ID:

On Fri, Mar 9, 2012 at 7:53 AM, Matthias Bernt wrote:
> Hi,
>
> I just got the new RefSeq 52 release and found a really strange error
> causing an exception:
>
> Bio.GenBank.LocationParserError: join(complement(149815..150200),complement(293787..295573),NC_016402.1:6618..6676,181647..181905)
>
> the accession is: NC_016406. Any ideas?
>
> Matthias

That is the most complicated trans_splicing feature I've seen in a while :)

It says gene nad1 is (trans) spliced from four bits, three from the Silene
vulgaris mitochondria chr1 (i.e. NC_016406.1) from both strands, and
one from mitochondria chr3 (NC_016402.1). Just in case there was any
confusion, the human readable note confirms this - note the CDS feature
has a join of five parts, while the gene has a join of just four - there is
an intron too.

Looks like a bug in our parser...

Peter

From MatatTHC at gmx.de Fri Mar 9 05:06:16 2012
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Fri, 9 Mar 2012 11:06:16 +0100
Subject: [Biopython-dev] LocationParserError
In-Reply-To:
References: <20120309075301.15030@gmx.net>
Message-ID:

Just in case you need more test cases. I send all cases I found (all in mitochondria).

NC_016406 join(complement(149815..150200),complement(293787..295573),NC_016402.1:6618..6676,181647..181905)
NC_016402 join(complement(NC_016406.1:149815..150200),complement(NC_016406.1:293787..295573),6618..6676,NC_016406.1:181647..181905)
NC_016348 join(complement(NC_016362.1:36881..37266),86404..88057,complement(NC_016391.1:55668..55726),complement(NC_016355.1:144070..144328))
NC_016352 join(NC_016397.1:85333..87643,23246..23267,NC_016390.1:38484..40058)
NC_016355 join(complement(NC_016362.1:36881..37266),NC_016348.1:86404..88057,complement(NC_016391.1:55668..55726),complement(144070..144328))
NC_016358 join(NC_016382.1:95989..97509,31194..34899)
NC_016362 join(complement(36881..37266),NC_016348.1:86404..88057,complement(NC_016391.1:55668..55726),complement(NC_016355.1:144070..144328))
NC_016382 join(95989..97509,NC_016358.1:31194..34899)
NC_016390 join(NC_016397.1:85333..87643,NC_016352.1:23246..23267,38484..40058)
NC_016391 join(complement(NC_016362.1:36881..37266),NC_016348.1:86404..88057,complement(55668..55726),complement(NC_016355.1:144070..144328))
NC_016397 join(85333..87643,NC_016352.1:23246..23267,NC_016390.1:38484..40058)

Matthias

2012/3/9 Peter Cock :
> On Fri, Mar 9, 2012 at 7:53 AM, Matthias Bernt wrote:
>> Hi,
>>
>> I just got the new RefSeq 52 release and found a really strange error
>> causing an exception:
>>
>> Bio.GenBank.LocationParserError: join(complement(149815..150200),complement(293787..295573),NC_016402.1:6618..6676,181647..181905)
>>
>> the accession is: NC_016406. Any ideas?
>>
>> Matthias
>
> That is the most complicated trans_splicing feature I've seen in a while :)
>
> It says gene nad1 is (trans) spliced from four bits, three from the Silene
> vulgaris mitochondria chr1 (i.e. NC_016406.1) from both strands, and
> one from mitochondria chr3 (NC_016402.1). Just in case there was any
> confusion, the human readable note confirms this - note the CDS feature
> has a join of five parts, while the gene has a join of just four - there is
> an intron too.
>
> Looks like a bug in our parser...
>
> Peter

From p.j.a.cock at googlemail.com Fri Mar 9 06:23:58 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 9 Mar 2012 11:23:58 +0000
Subject: [Biopython-dev] LocationParserError
In-Reply-To:
References: <20120309075301.15030@gmx.net>

Message-ID:

On Fri, Mar 9, 2012 at 10:06 AM, Matthias Bernt wrote:
> Just in case you need more test cases. I send all cases I found (all
> in mitochondria).

Trying with the current release (Biopython 1.59) I didn't get an
exception with NC_016406 but something wasn't quite right - I
was missing the external exon... which appears to be a bug in
Entrez.

Here is NC_016406 from Entrez using GenBank (with parts),
http://www.ncbi.nlm.nih.gov/nuccore/NC_016406.1?report=gbwithparts&log$=seqview

     gene            join(complement(149815..150200),
                     complement(293787..295573),181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /db_xref="GeneID:11447159"
     CDS             join(complement(149815..150200),
                     complement(295492..295573),complement(293787..293978),
                     181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /codon_start=1
                     /transl_except=(pos:complement(150198..150200),aa:Met)
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="YP_004935334.1"
                     /db_xref="GI:357967323"
                     /db_xref="GeneID:11447159"
                     /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                     SFGLLQPLADGSKLILKEPISPSSANFSLFRMAPVTTFMLSLVARAVVPFDYGMVLSD
                     PNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVPYEVSIGLILITVLI
                     CVGPRNSSEIVMAQKQIWSGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                     EYSSMGSALFFLGEYANMILMSGLCTLLSPGGWPPILDLPISKKIPGSIWFSIKVILF
                     LFLYIWVRAAFPRYRYDQLMGLGRKVFLPLSLARVVAVSGVLVTFQWLP"

Here is NC_016406 from Entrez using GenBank (default, not with parts):
http://www.ncbi.nlm.nih.gov/nuccore/NC_016406.1?report=genbank&log$=seqview

     gene            join(complement(149815..150200),
                     complement(293787..295573),NC_016402.1:6618..6676,
                     181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /db_xref="GeneID:11447159"
     CDS             join(complement(149815..150200),
                     complement(295492..295573),complement(293787..293978),
                     NC_016402.1:6618..6676,181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /codon_start=1
                     /transl_except=(pos:complement(150198..150200),aa:Met)
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="YP_004935334.1"
                     /db_xref="GI:357967323"
                     /db_xref="GeneID:11447159"
                     /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                     SFGLLQPLADGSKLILKEPISPSSANFSLFRMAPVTTFMLSLVARAVVPFDYGMVLSD
                     PNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVPYEVSIGLILITVLI
                     CVGPRNSSEIVMAQKQIWSGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                     EYSSMGSALFFLGEYANMILMSGLCTLLSPGGWPPILDLPISKKIPGSIWFSIKVILF
                     LFLYIWVRAAFPRYRYDQLMGLGRKVFLPLSLARVVAVSGVLVTFQWLP"

Can you see the difference? Using Genbank "with parts" the external
location in this gene and CDS feature has been lost! I will report this
bug to the NCBI.

However, with that hurdle out of the way I found the problem in
Biopython - the regular expression for an external sequence
reference wasn't allowing for an underscore. The fix itself is very
trivial, in Bio/GenBank/__init__.py we replace this line:

_complex_location = r"([a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)" \

with:

_complex_location = r"([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)" \

The commit does this and adds a few tests (and fixes a typo):
https://github.com/biopython/biopython/commit/16efc7bc51b5ccef7f81f443d4b52f490f6fc354

If you are happy installing from source, you can download the latest
code from GitHub, or via git at the command line.

Peter

From MatatTHC at gmx.de Fri Mar 9 07:35:37 2012
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Fri, 9 Mar 2012 13:35:37 +0100
Subject: [Biopython-dev] LocationParserError
In-Reply-To:
References: <20120309075301.15030@gmx.net>

Message-ID: Hi, I also got the files from refseq (through the ftp server: ftp://ftp.ncbi.nih.gov/refseq/release/mitochondrion/mitochondrion.1.genomic.gbff.gz). FYI I'm using 1.57. It's good to know that this is fixed for future releases. Matthias From anaryin at gmail.com Fri Mar 9 10:07:38 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 9 Mar 2012 16:07:38 +0100 Subject: [Biopython-dev] MMCIF Parser Message-ID: Hi all, I was recently in a conference in Heidelberg and I got to know that the PDBe is interested in collaborating with us in building a consolidated Python module for structural bioinformatics. From what I understood they already used our code sometimes. Since there is some movement on the MMCif parser front, maybe it's a good idea to query them and see if they have something implemented already? Asking first not to step on anyone's toes, but it might save time? Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao From p.j.a.cock at googlemail.com Fri Mar 9 10:11:04 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 9 Mar 2012 15:11:04 +0000 Subject: [Biopython-dev] MMCIF Parser In-Reply-To: References: Message-ID: On Fri, Mar 9, 2012 at 3:07 PM, Jo?o Rodrigues wrote: > Hi all, > > I was recently in a conference in Heidelberg and I got to know that the > PDBe is interested in collaborating with us in building a consolidated > Python module for structural bioinformatics. From what I understood they > already used our code sometimes. > > Since there is some movement on the MMCif parser front, maybe it's a good > idea to query them and see if they have something implemented already? > Asking first not to step on anyone's toes, but it might save time? > > Jo?o [...] Rodrigues > http://nmr.chem.uu.nl/~joao Sounds good - you're one of the experts on the Bio.PDB code now after all, so a good person to talk to them. 
Peter From arklenna at gmail.com Fri Mar 9 14:00:18 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Fri, 9 Mar 2012 14:00:18 -0500 Subject: [Biopython-dev] MMCIF Parser In-Reply-To: References:

Message-ID: <03340EC2-004A-49BD-B790-B8F62D7B04B9@gmail.com> I am in the process of implementing the formal grammar of CIF in PLY (python lex yacc). The result should be a strict, robust, extensible CIF parser. It's going very smoothly, and I plan on continuing it regardless as a learning exercise in lexical analysis. Please let me know if PDBe has a robust mmCIF python parser that would make mine redundant. Lenna On Mar 9, 2012, at 10:11, Peter Cock wrote: > On Fri, Mar 9, 2012 at 3:07 PM, Jo?o Rodrigues wrote: >> Hi all, >> >> I was recently in a conference in Heidelberg and I got to know that the >> PDBe is interested in collaborating with us in building a consolidated >> Python module for structural bioinformatics. From what I understood they >> already used our code sometimes. >> >> Since there is some movement on the MMCif parser front, maybe it's a good >> idea to query them and see if they have something implemented already? >> Asking first not to step on anyone's toes, but it might save time? >> >> Jo?o [...] Rodrigues >> http://nmr.chem.uu.nl/~joao > > Sounds good - you're one of the experts on the Bio.PDB code now > after all, so a good person to talk to them. > > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From anaryin at gmail.com Fri Mar 9 14:13:26 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 9 Mar 2012 20:13:26 +0100 Subject: [Biopython-dev] MMCIF Parser In-Reply-To: <03340EC2-004A-49BD-B790-B8F62D7B04B9@gmail.com> References:

<03340EC2-004A-49BD-B790-B8F62D7B04B9@gmail.com> Message-ID: Hi Lenna, First of all, sorry to come so late in the discussion but as I said before, I was in a conference so I didn't really read my email that frequently.. The PDBe have their own parsers and I am yet to find out what kind of dependencies and even if it maps to the same SMCRA model we use. I will keep you informed. I sent them an email today and am waiting for the reply. I will eventually bring the discussion here so maybe we can take the best of both parsers. Nevertheless, thanks for the time and effort you are putting, it will surely be put to good use! :) Best, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao No dia 9 de Mar?o de 2012 20:00, Lenna Peterson escreveu: > I am in the process of implementing the formal grammar of CIF in PLY > (python lex yacc). The result should be a strict, robust, extensible CIF > parser. > > It's going very smoothly, and I plan on continuing it regardless as a > learning exercise in lexical analysis. > > Please let me know if PDBe has a robust mmCIF python parser that would > make mine redundant. > > Lenna > > > On Mar 9, 2012, at 10:11, Peter Cock wrote: > > > On Fri, Mar 9, 2012 at 3:07 PM, Jo?o Rodrigues > wrote: > >> Hi all, > >> > >> I was recently in a conference in Heidelberg and I got to know that the > >> PDBe is interested in collaborating with us in building a consolidated > >> Python module for structural bioinformatics. From what I understood they > >> already used our code sometimes. > >> > >> Since there is some movement on the MMCif parser front, maybe it's a > good > >> idea to query them and see if they have something implemented already? > >> Asking first not to step on anyone's toes, but it might save time? > >> > >> Jo?o [...] Rodrigues > >> http://nmr.chem.uu.nl/~joao > > > > Sounds good - you're one of the experts on the Bio.PDB code now > > after all, so a good person to talk to them. 
> >
> > Peter
> >
> > _______________________________________________
> > Biopython-dev mailing list
> > Biopython-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From redmine at redmine.open-bio.org Mon Mar 12 17:49:37 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 12 Mar 2012 21:49:37 +0000
Subject: [Biopython-dev] [Biopython - Bug #2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
References:
Message-ID:

Issue #2619 has been updated by Lenna Peterson.

I've been reading about flex and found the following line in the book "flex & bison" (John Levine, O'Reilly):

> Most flex programs now use %option noyywrap and provide their own main routine, so they don't need the flex library.

This suggests the flex headers aren't absolutely required for the C module, which would make *nix detection possible. I'll look into this.

Also, I have a prototype with PLY that works fine for small files but doesn't parse a full 30k line CIF file in an acceptable time.

----------------------------------------
Bug #2619: Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
https://redmine.open-bio.org/issues/2619

Author: Chris Oldfield
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.48
URL:

MMCIFParser is a documented feature of Bio.PDB, but it is broken by default because the MMCIFlex build is commented out in the distribution setup.py. According to http://osdir.com/ml/python.bio.devel/2006-02/msg00038.html this is because it doesn't compile on Windows. Though the function is documented, the changes needed to enable it are not, so this seems like an installation bug to me.

The fix on linux is to uncomment setup.py from line 486 on. A general work around might be to condition the compile on the os.sys.platform variable. I'd offer a diff, but I'm new to biopython and python in general, so please forgive my ignorance.

Source install of version 1.48, gentoo linux 2008, x86_64.

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From anaryin at gmail.com Fri Mar 16 06:21:43 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 16 Mar 2012 11:21:43 +0100
Subject: [Biopython-dev] MMCIF Parser
In-Reply-To:
References:

<03340EC2-004A-49BD-B790-B8F62D7B04B9@gmail.com> Message-ID: Hi all, I added Glen from PDBe to the thread. I asked him to have a look to this 'bug' report, his reply is below. "Had a look bug #2619 and it seems the thread was reignited recently by Lenna Peterson so we'll be keeping an eye on it. In terms of an mmCIF Parser we currently use a parser provided by our RCSB partners.It also has C dependencies and after using it, there is much that could be improved, in particular, we'd also like a pure python implementation" Id say Lenna to go ahead and keep on the effort on the parser. Maybe you could share the code on github or so to garner some comments and suggestions? one question though: why is everyone using C for it? i never really used this format so sorry for the ignorance.. Cheers, Jo?o No dia 9 de Mar de 2012 20:13, "Jo?o Rodrigues" escreveu: > Hi Lenna, > > First of all, sorry to come so late in the discussion but as I said > before, I was in a conference so I didn't really read my email that > frequently.. > > > The PDBe have their own parsers and I am yet to find out what kind of > dependencies and even if it maps to the same SMCRA model we use. I will > keep you informed. I sent them an email today and am waiting for the reply. > I will eventually bring the discussion here so maybe we can take the best > of both parsers. Nevertheless, thanks for the time and effort you are > putting, it will surely be put to good use! :) > > > Best, > > Jo?o [...] Rodrigues > http://nmr.chem.uu.nl/~joao > > > > No dia 9 de Mar?o de 2012 20:00, Lenna Peterson escreveu: > >> I am in the process of implementing the formal grammar of CIF in PLY >> (python lex yacc). The result should be a strict, robust, extensible CIF >> parser. >> >> It's going very smoothly, and I plan on continuing it regardless as a >> learning exercise in lexical analysis. >> >> Please let me know if PDBe has a robust mmCIF python parser that would >> make mine redundant. 
>> >> Lenna >> >> >> On Mar 9, 2012, at 10:11, Peter Cock wrote: >> >> > On Fri, Mar 9, 2012 at 3:07 PM, Jo?o Rodrigues >> wrote: >> >> Hi all, >> >> >> >> I was recently in a conference in Heidelberg and I got to know that the >> >> PDBe is interested in collaborating with us in building a consolidated >> >> Python module for structural bioinformatics. From what I understood >> they >> >> already used our code sometimes. >> >> >> >> Since there is some movement on the MMCif parser front, maybe it's a >> good >> >> idea to query them and see if they have something implemented already? >> >> Asking first not to step on anyone's toes, but it might save time? >> >> >> >> Jo?o [...] Rodrigues >> >> http://nmr.chem.uu.nl/~joao >> > >> > Sounds good - you're one of the experts on the Bio.PDB code now >> > after all, so a good person to talk to them. >> > >> > Peter >> > >> > _______________________________________________ >> > Biopython-dev mailing list >> > Biopython-dev at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > > From arklenna at gmail.com Fri Mar 16 18:21:25 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Fri, 16 Mar 2012 18:21:25 -0400 Subject: [Biopython-dev] MMCIF Parser In-Reply-To: References:

<03340EC2-004A-49BD-B790-B8F62D7B04B9@gmail.com>
Message-ID:

Hi João,

Thanks for bringing Glen into the discussion.

My code is in this github branch:
https://github.com/lennax/biopython/tree/ply2

It requires PLY, which I haven't added to setup.py yet. Available here: http://www.dabeaz.com/ply/

Currently the lex (tokenizer) portion runs fine (~30k lines in 3.6 sec on my machine). But the yacc (parser) portion hangs on files over ~5k lines - it's certainly running worse than linear. I'm trying to debug the problem, but I'm also considering using the approach from the current MMCIF2Dict module (which uses python to parse).

Re: your question about C, CIF is only moderately complex but I think the issue is that the files tend to be very long. Lexical analysis can be a computationally intensive process, so any improvements in efficiency that C can offer are beneficial. However, I haven't done any performance comparisons between C (flex/bison) and python (PLY). But considering Jython etc. haven't implemented the C API yet, I'm focusing on a pure python implementation.

Lenna

On Fri, Mar 16, 2012 at 6:21 AM, João Rodrigues wrote:
>
> Hi all,
>
> I added Glen from PDBe to the thread. I asked him to have a look to this 'bug' report, his reply is below.
>
> "Had a look bug #2619 and it seems the thread was reignited recently by Lenna Peterson so we'll be keeping an eye on it. In terms of an mmCIF Parser we currently use a parser provided by our RCSB partners.It also has C dependencies and after using it, there is much that could be improved, in particular, we'd also like a pure python implementation"
>
> Id say Lenna to go ahead and keep on the effort on the parser. Maybe you could share the code on github or so to garner some comments and suggestions?
>
> one question though: why is everyone using C for it? i never really used this format so sorry for the ignorance..
> > Cheers, > > Jo?o > > No dia 9 de Mar de 2012 20:13, "Jo?o Rodrigues" escreveu: > >> Hi Lenna, >> >> First of all, sorry to come so late in the discussion but as I said before, I was in a conference so I didn't really read my email that frequently.. >> >> >> The PDBe have their own parsers and I am yet to find out what kind of dependencies and even if it maps to the same SMCRA model we use. I will keep you informed. I sent them an email today and am waiting for the reply. I will eventually bring the discussion here so maybe we can take the best of both parsers. Nevertheless, thanks for the time and effort you are putting, it will surely be put to good use! :) >> >> >> Best, >> >> Jo?o [...] Rodrigues >> http://nmr.chem.uu.nl/~joao >> >> >> >> No dia 9 de Mar?o de 2012 20:00, Lenna Peterson escreveu: >>> >>> I am in the process of implementing the formal grammar of CIF in PLY (python lex yacc). The result should be a strict, robust, extensible CIF parser. >>> >>> It's going very smoothly, and I plan on continuing it regardless as a learning exercise in lexical analysis. >>> >>> Please let me know if PDBe has a robust mmCIF python parser that would make mine redundant. >>> >>> Lenna >>> >>> >>> On Mar 9, 2012, at 10:11, Peter Cock wrote: >>> >>> > On Fri, Mar 9, 2012 at 3:07 PM, Jo?o Rodrigues wrote: >>> >> Hi all, >>> >> >>> >> I was recently in a conference in Heidelberg and I got to know that the >>> >> PDBe is interested in collaborating with us in building a consolidated >>> >> Python module for structural bioinformatics. From what I understood they >>> >> already used our code sometimes. >>> >> >>> >> Since there is some movement on the MMCif parser front, maybe it's a good >>> >> idea to query them and see if they have something implemented already? >>> >> Asking first not to step on anyone's toes, but it might save time? >>> >> >>> >> Jo?o [...] 
Rodrigues >>> >> http://nmr.chem.uu.nl/~joao >>> > >>> > Sounds good - you're one of the experts on the Bio.PDB code now >>> > after all, so a good person to talk to them. >>> > >>> > Peter >>> > >>> > _______________________________________________ >>> > Biopython-dev mailing list >>> > Biopython-dev at lists.open-bio.org >>> > http://lists.open-bio.org/mailman/listinfo/biopython-dev >> >> From p.j.a.cock at googlemail.com Wed Mar 21 11:27:31 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 21 Mar 2012 15:27:31 +0000 Subject: [Biopython-dev] GSoC SearchIO project Message-ID: Hello all, I'm pleased to see that the GSoC SearchIO project idea I put up has sparked some interest: http://biopython.org/wiki/Google_Summer_of_Code So far three students have inquired about it - and I will try to help them all with feedback on their GSoC proposals if they decide to apply. However, the review process is competitive (including the other OBF projects like BioPerl etc), and we would not want to select multiple students to work on the very same area. i.e. At most one GSoC student would be funded to work on a Biopython SearchIO. Remember that the outline ideas we put up on the wiki are just suggestions - GSoC applicants can and are encouraged to come up with their own ideas. If you can think of something else please get in touch - on this mailing list if you like, or directly with an existing Biopython developer who you think might be able to mentor you. The best projects will be those linked to the kind of analysis you are already doing - e.g. in your degree project. Now for some more detail about what I had in mind for SearchIO. I may be writing too much at this point, so I should stress that you can suggest different approaches/ideas. ------------------------------------------------------------- Grouping of SearchIO results: Terminology differs between tools (which is going to complicate documenting this new code), so here I will try to use the terms from BLAST. 
BLAST output files can be very large - it is not uncommon to search 20,000 predicted genes against the NCBI NR (non-redundant) database to perform some crude annotation transfer. This means it is naive and impractical in general to try to load an entire file in memory. Instead, following the API of Bio.SeqIO, Bio.AlignIO, Bio.Phylo, etc we expect to use an iterator approach.

In normal BLAST you compare one or more query sequences (usually in a FASTA file) against one or more subject sequences (in a BLAST database or a FASTA file). Depending on the score thresholds, the number of matches will vary. Each query sequence may match one or more subject sequences. Each query-subject pair may give more than one pairwise alignment (HSP). For example, if your query sequence has a single copy of domain X (at coordinates 10-100), but there is a similar subject sequence with two copies of domain X (at coordinates 5-95 and 106-194), you might expect to get two pairwise alignments (HSPs), one matching query:10-100 with subject:5-95, and the other matching query:10-100 with subject:106-194.

This means that there are potentially multiple levels of structure which we might want to iterate over. Simplistically I think that Bio.SearchIO.parse(...) should iterate over the results for each query. This means if your query file had 20,000 sequences, using a for loop with Bio.SearchIO.parse(...) would give you 20,000 results. [Well, up to 20,000 results. In some file formats like BLAST tabular, when a query has no results, there is nothing in the output file for it].

Also as in Bio.SeqIO etc, a sister convenience function Bio.SearchIO.read(...) would be for when the results are for one and only one query sequence.

As in Bio.SeqIO.parse(...), the Bio.SearchIO.parse(...) code should make a single pass through the file WITHOUT using any handle tell/seek commands. That way it can be used with any handle object, including stdin (output piped directly into Python) and network handles.
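
As a rough illustration of the per-query iterator idea described above, here is a minimal sketch for the standard 12 column BLAST tabular output. It is not the proposed Bio.SearchIO API: the names TabularHSP, iter_query_results and EXAMPLE_TABULAR are hypothetical, the example rows are invented, and it assumes that all rows for one query are adjacent in the file. It reads the handle once, front to back, and never calls tell or seek.

```python
import io
from collections import namedtuple
from itertools import groupby

# Hypothetical record for one row of 12 column BLAST tabular output.
TabularHSP = namedtuple(
    "TabularHSP",
    "query subject identity length mismatches gap_opens "
    "q_start q_end s_start s_end evalue bit_score",
)

# Invented example data: query q1 hits subject s1 twice (two HSPs),
# echoing the "domain X" example; query q2 has a single hit.
EXAMPLE_TABULAR = (
    "q1\ts1\t90.11\t91\t9\t0\t10\t100\t5\t95\t1e-30\t120\n"
    "q1\ts1\t88.76\t89\t10\t0\t10\t100\t106\t194\t1e-28\t115\n"
    "q2\ts7\t75.00\t40\t10\t0\t1\t40\t3\t42\t1e-05\t42.5\n"
)


def _rows(handle):
    """Yield one TabularHSP per data line, skipping blanks and comments."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) != 12:
            raise ValueError("Expected 12 columns, got %i" % len(f))
        yield TabularHSP(
            f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]),
        )


def iter_query_results(handle):
    """Yield (query_id, list_of_hsps) in file order, in a single pass.

    Only forward reads are used, so this also works on stdin or a
    network handle. Assumes the rows for each query are contiguous.
    """
    for query_id, hsps in groupby(_rows(handle), key=lambda h: h.query):
        yield query_id, list(hsps)


for query_id, hsps in iter_query_results(io.StringIO(EXAMPLE_TABULAR)):
    print(query_id, len(hsps))
```

Only one query's rows are held in memory at a time, which is the point of the iterator approach for very large result files. Queries with no hits simply never appear, as noted above for BLAST tabular.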
Each of these query sequence results would give you zero or more match (subject) sequences, and for each matched sequence there would be one or more pairwise alignments (possibly with the sequence information as in BLAST XML, possibly not as in standard BLAST tabular output).

-------------------------------------------------------------

With regards to the object hierarchy:

For Bio.SeqIO everything uses SeqRecord objects, which are designed to be extendable via the annotations dictionary etc. For Bio.AlignIO everything uses the same MultipleSequenceAlignment object (which uses SeqRecord objects inside it). For Bio.Phylo, there is a common base class for the trees, but there are also more specialised subclasses for the more detailed tree file formats.

The aim is to have a common representation regardless of the file format, making working with and converting between different file formats as easy as possible.

With Bio.SearchIO, some file formats include pairwise alignments (e.g. FASTA -m10, BLAST XML, EMBOSS needle/water) while others do not and only give you match positions as co-ordinates (e.g. BLAST's standard 12 column tabular output). What I was picturing was a base HSP (using BLAST's terminology) class describing a match with coordinates, and a subclass which also holds a pairwise alignment.

The idea would be we could in theory inter-convert between two rich file formats like FASTA -m10 and BLAST XML, or from a rich format to a simple format (e.g. BLAST XML to BLAST tabular). However, you can't convert the standard BLAST tabular output to BLAST XML because so much data is missing.

As part of the GSoC work, I would expect you to write unit tests covering this kind of interconversion. It is not as easy as it might first seem. In fact, even converting from BLAST XML to the standard 12 column BLAST tabular output is surprisingly difficult.
I wrote a Python script to do this as a Galaxy Tool, available on the Galaxy Tool Shed and on my Galaxy Bitbucket repository under the tools branch. Have a look at the comments about sequence complexity masking and its impact:

https://bitbucket.org/peterjc/galaxy-central/src/default/tools/ncbi_blast_plus/blastxml_to_tabular.py
https://bitbucket.org/peterjc/galaxy-central/src/default/tools/ncbi_blast_plus/blastxml_to_tabular.xml
http://toolshed.g2.bx.psu.edu/

Note that Galaxy has a separate BLAST XML to tabular tool which produces a different set of columns including some useful ones that the BLAST+ command line tools don't offer:

https://bitbucket.org/galaxy/galaxy-central/src/default/tools/metag_tools/megablast_xml_parser.py
https://bitbucket.org/galaxy/galaxy-central/src/default/tools/metag_tools/megablast_xml_parser.xml

-------------------------------------------------------------

There is some existing code from when I was exploring this idea last year on this branch, looking at parsing FASTA -m10, BLAST XML, text (all using the existing code in Biopython) and BLAST tabular. This may or may not be helpful. Looking at it now there are things I would do differently if I started again today.

https://github.com/peterjc/biopython/tree/search-io-test

-------------------------------------------------------------

Indexing:

I'm assuming the reader is familiar with the existing Bio.SeqIO.index(...) and Bio.SeqIO.index_db(...) functionality. The idea there is you have a Python dictionary-like object where the keys are record identifiers, and the values are SeqRecord objects which are parsed on demand. This works by recording the file offset where the record begins.

At a minimum I would hope to be able to do the same for SearchIO, where the keys would be query sequence identifiers, and the values would be their search results. That should be quite straightforward, but will still be quite a bit of work to cover all the supported file formats.
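
The offset-recording idea above can be sketched for BLAST tabular output as follows. This is only an illustration under stated assumptions, not how Bio.SeqIO.index is implemented and not a proposed Bio.SearchIO interface: build_offset_index, fetch_rows and EXAMPLE_BYTES are hypothetical names, the rows are invented, and the handle must be opened in binary mode and support tell/seek (unlike the single-pass parser, which must not rely on them).

```python
import io

# Invented example data: two rows for q1, one row for q2.
EXAMPLE_BYTES = (
    b"q1\ts1\t90.11\t91\t9\t0\t10\t100\t5\t95\t1e-30\t120\n"
    b"q1\ts1\t88.76\t89\t10\t0\t10\t100\t106\t194\t1e-28\t115\n"
    b"q2\ts7\t75.00\t40\t10\t0\t1\t40\t3\t42\t1e-05\t42.5\n"
)


def build_offset_index(handle):
    """Map each query identifier to the byte offset of its first row."""
    index = {}
    previous = None
    while True:
        offset = handle.tell()  # position before reading the line
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"#") or not line.strip():
            continue
        query = line.split(b"\t", 1)[0].decode()
        if query != previous:
            if query in index:
                raise ValueError("Rows for %r are not contiguous" % query)
            index[query] = offset
            previous = query
    return index


def fetch_rows(handle, index, query):
    """Seek to a query's first row and parse only that query's rows."""
    handle.seek(index[query])
    rows = []
    for line in iter(handle.readline, b""):
        if line.startswith(b"#") or not line.strip():
            continue
        fields = line.rstrip(b"\r\n").decode().split("\t")
        if fields[0] != query:
            break  # reached the next query's block
        rows.append(fields)
    return rows


handle = io.BytesIO(EXAMPLE_BYTES)
index = build_offset_index(handle)
print(sorted(index))
print(len(fetch_rows(handle, index, "q1")))
```

Only the identifiers and offsets are kept in memory; the rows themselves are re-read and parsed on demand, which is what keeps this kind of index cheap for large files.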
What is more interesting (and this starts to link with the object hierarchy) is double indexing on query name and subject name, to get the list of HSPs for the combination - without having to parse any other results for that query. Perhaps this could be done with a dictionary interface where the keys can also be tuples of (query, subject) identifiers - but that isn't the only option.

-------------------------------------------------------------

I hope I haven't scared you all!

Peter

From p.j.a.cock at googlemail.com Wed Mar 21 12:23:02 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 21 Mar 2012 16:23:02 +0000
Subject: [Biopython-dev] Fwd: [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31)
In-Reply-To:
References:
Message-ID:

Forwarding for public discussion (here or on github)

P.S. Would anyone object to pull request emails going to the dev list?

Peter

---------- Forwarded message ----------
From: Lenna Peterson
Date: Wed, Mar 21, 2012 at 6:28 AM
Subject: [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31)
To: Peter Cock

(All filenames are relative to Bio/PDB)

A small modification of the flex input (mmCIF/mmcif.lex) allows flex to produce generated C (mmCIF/lex.yy.c) that can be compiled without the flex header. Flex generated on Debian stable.

This allows the MMCIFlex module to be built on any platform that supports C modules. A pure Python implementation is in the works.

Further modifications to mmCIF/mmcif.lex and mmCIF/MMCIFlexmodule.c -- function prototype corrections to suppress compiler warnings.

MMCIF2Dict was producing an invalid dict, so I've changed it to subclass dict and it now functions as expected.

Module tested on Debian stable and Mac OS X 10.6.8 Snow Leopard (both Python 2.6.7).
You can merge this Pull Request by running:

  git pull https://github.com/lennax/biopython MMCIFlex

Or you can view, comment on it, or merge it online at:

  https://github.com/biopython/biopython/pull/31

-- Commit Summary --

* Remove flex header dependency of CIF parser.
* Update MMCIFParser call of MMCIF2Dict.
* Cleaned up import.
* Subclassed dict.
* Restored MMCIFParser call to MMCIF2Dict.
* Removed main() from lex input.
* Restored newline.
* Fix C prototype warnings.

-- File Changes --

M Bio/PDB/MMCIF2Dict.py (32)
M Bio/PDB/mmCIF/MMCIFlexmodule.c (6)
M Bio/PDB/mmCIF/lex.yy.c (1456)
M Bio/PDB/mmCIF/mmcif.lex (6)
M setup.py (12)

-- Patch Links --

  https://github.com/biopython/biopython/pull/31.patch
  https://github.com/biopython/biopython/pull/31.diff

---
Reply to this email directly or view it on GitHub:
https://github.com/biopython/biopython/pull/31

From arklenna at gmail.com Wed Mar 21 14:55:02 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Wed, 21 Mar 2012 14:55:02 -0400
Subject: [Biopython-dev] Fwd: [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31)
In-Reply-To:
References:

Message-ID: Just added a minor commit (a header for the generated C). I'm very close (as in, possibly today) to being able to deploy a pure Python (PLY) version of this parser for Jython/PyPy etc. A few questions about that: 1. Should addition of a pure Python parser be a separate pull request or should I add it to this one? 2. How would I add the PLY dependency to setup.py? Lenna On Wed, Mar 21, 2012 at 12:23 PM, Peter Cock wrote: > Forwarding for public discussion (here or on github) > > P.S. Would anyone object to pull request emails going to the dev list? > > Peter > > ---------- Forwarded message ---------- > From: Lenna Peterson > > Date: Wed, Mar 21, 2012 at 6:28 AM > Subject: [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31) > To: Peter Cock > > > (All filenames are relative to Bio/PDB) > > A small modification of the flex input (mmCIF/mmcif.lex) allows flex > to produce generated C (mmCIF/lex.yy.c) that can be compiled without > the flex header. Flex generated on Debian stable. > > This allows the MMCIFlex module to be built on any platform that > supports C modules. A pure Python implementation is in the works. > > Further modifications to mmCIF/mmcif.lex and mmCIF/MMCIFlexmodule.c -- > function prototype corrections to suppress compiler warnings. > > MMCIF2Dict was producing an invalid dict, so I've changed it to > subclass dict and it now functions as expected. > > Module tested on Debian stable and Mac OS X 10.6.8 Snow Leopard (both > Python 2.6.7). > > > You can merge this Pull Request by running: > > ?git pull https://github.com/lennax/biopython MMCIFlex > > Or you can view, comment on it, or merge it online at: > > ?https://github.com/biopython/biopython/pull/31 > > -- Commit Summary -- > > * Remove flex header dependency of CIF parser. > * Update MMCIFParser call of MMCIF2Dict. > * Cleaned up import. > * Subclassed dict. > * Restored MMCIFParser call to MMCIF2Dict. > * Removed main() from lex input. > * Restored newline. 
> * Fix C prototype warnings. > > -- File Changes -- > > M Bio/PDB/MMCIF2Dict.py (32) > M Bio/PDB/mmCIF/MMCIFlexmodule.c (6) > M Bio/PDB/mmCIF/lex.yy.c (1456) > M Bio/PDB/mmCIF/mmcif.lex (6) > M setup.py (12) > > -- Patch Links -- > > ?https://github.com/biopython/biopython/pull/31.patch > ?https://github.com/biopython/biopython/pull/31.diff > > --- > Reply to this email directly or view it on GitHub: > https://github.com/biopython/biopython/pull/31 > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Wed Mar 21 15:55:27 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 21 Mar 2012 19:55:27 +0000 Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31) In-Reply-To: References:

Message-ID:

On Wednesday, March 21, 2012, Lenna Peterson wrote:
> Just added a minor commit (a header for the generated C).
>
> I'm very close (as in, possibly today) to being able to deploy a pure
> Python (PLY) version of this parser for Jython/PyPy etc.
>
> A few questions about that:
>
> 1. Should addition of a pure Python parser be a separate pull request
> or should I add it to this one?

I'd prefer a second pull request after this one is done.

>
> 2. How would I add the PLY dependency to setup.py?
>

Is it in PyPI? If so, then the same way we define NumPy as a dependency.

Peter

From p.j.a.cock at googlemail.com Wed Mar 21 19:24:01 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 21 Mar 2012 23:24:01 +0000
Subject: [Biopython-dev] Fwd: [SO-devel] NCBI GFF3 support
In-Reply-To:
References:
Message-ID:

Good news for GFF3! The long anticipated NCBI GFF3 corrections are happening now.

http://blastedbio.blogspot.co.uk/2011/08/why-are-ncbi-gff3-files-still-broken.html

This should make putting together a good test suite for Brad's GFF code much easier :)

(For anyone not aware, the SO ontology developers mailing list also serves as the GFF3 standard discussion mailing list.)

Peter

---------- Forwarded message ----------
From: Murphy, Terence (NIH/NLM/NCBI) [C]
Date: Wed, Mar 21, 2012 at 6:15 PM
Subject: [SO-devel] NCBI GFF3 support
To: "SO developers (song-devel at lists.sourceforge.net)"

Hi All,

I'm pleased to announce that NCBI has updated their GFF3 export software to the latest specifications (1.20), and is in the process of updating files on the NCBI Genomes FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/). Files are now available for the NCBI annotations of the latest assemblies for human, cow, dog, pig, chicken, and many others, and will be provided as part of future releases. See the README files in each species directory for further details.

For example, the human GRCh37.p5 annotation in top level (chromosome) coordinates is available at:
ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh37.p5_top_level.gff3.gz

Files in the /Bacteria, /Viruses, and other subdirectories are being updated as part of rolling update cycles. Files with this header were produced with the new writer:

##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter

We've folded in a few bug fixes since we started using the new writer in production, and are working to refresh all the files in the near future. So you may see a few anomalies in files produced by annotwriter earlier this year. Files produced in March or later should be almost fine, with the exception of a problem with the 'is_circular=' tag starting with a lowercase 'i' (thanks to Peter for catching that so quickly).

annotwriter is available for download as part of the NCBI C++ Toolkit, but the public toolkit isn't updated very often so the current version is missing many updates made in the last year. An updated version of the toolkit is tentatively scheduled to be released in the next few months, so I would wait for that before trying to use annotwriter yourself for ASN to GFF3 conversion.

Please contact the NCBI Service Desk (info at ncbi.nlm.nih.gov) if you have any questions or suggestions, or you can contact me directly or through this listserv.

Enjoy!

-Terence

-----
Terence Murphy, Ph.D.
RefSeq Project
NCBI/NLM/NIH/DHHS
45 Center Drive, Room 4AS.37D-82
Bethesda, MD 20892-6510
Phone: 00-1-301-402-0990
e-mail: murphyte at ncbi.nlm.nih.gov

From p.j.a.cock at googlemail.com Thu Mar 22 07:18:17 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 22 Mar 2012 11:18:17 +0000
Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31)
In-Reply-To:
References:

Message-ID:

Hi Lenna,

Sorry for the delay, I thought I wrote this on the github pull request.

This may be a silly question, but do we have any unit tests for MMCIF? I had a quick look yesterday and couldn't see any. I would be much happier with a basic unit test so I can check the functionality before and after your fix. Hopefully you can come up with a small data file and a minimal bit of code to check the parser, which we can turn into a new unit test, say test_MMCIF.py

Also should this work?

>>> from Bio.PDB import MMCIF2Dict
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/PDB/MMCIF2Dict.py", line 10, in <module>
    import Bio.PDB.mmCIF.MMCIFlex as MMCIFlex
ImportError: No module named MMCIFlex

[We can postpone Python 3 support to a new issue]

Peter

From ajingnk at gmail.com Thu Mar 22 11:07:02 2012
From: ajingnk at gmail.com (Jing Lu)
Date: Thu, 22 Mar 2012 11:07:02 -0400
Subject: [Biopython-dev] GSoC project
Message-ID:

Hi all,

My name is Jing Lu, a bioinformatics PhD student from UofM. During my research, for biopython, I usually use Bio.PDB, Bio.SeqIO, and Bio.SVDSuperimposer (maybe I can write a small package for structure alignment). I hope I can contribute to the Biopython community and participate in GSoC this summer. I have research experience in both next generation sequencing and chemical informatics. But I am not very sure about who can be my mentor and what is needed for Biopython.

Thank you all.

Best regards,
Jing

--
Jing Lu
Ph.D student in Bioinformatics
Department of Computational Medicine and Bioinformatics,
University of Michigan, Ann Arbor, MI 48105, US

From elke at inf.ethz.ch Thu Mar 22 12:17:29 2012
From: elke at inf.ethz.ch (Elke Schaper)
Date: Thu, 22 Mar 2012 17:17:29 +0100
Subject: [Biopython-dev] providing python code on tandem repeat detectors to Biopython?
References: <9362057C-E59B-4E25-996D-537D9D80DDAB@inf.ethz.ch> Message-ID: <6102D50F-DD66-4E21-B708-0A313CAB8E85@inf.ethz.ch> Hi, We've been working on sequence tandem repeat detectors during the past couple of months in my research lab. A couple of parsers for commonly used detectors, and a repeat class (basically an extension to sequence alignments) have come up along the way. Would there principally be any interest in introducing the code to Biopython? Thanks, Elke name: Elke Schaper institute: Professur f. Informatik phone: +41 44 632 82 60 e-mail: elke at inf.ethz.ch office_location: CAB H 86.2 address: Universitaetstrasse 6 : 8092 Zuerich From arklenna at gmail.com Thu Mar 22 19:15:44 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Thu, 22 Mar 2012 19:15:44 -0400 Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31) In-Reply-To: References:

Message-ID:

Hi Peter,

As far as I can tell, there isn't an MMCIF unit test. I'll work on a minimal data file for testing MMCIFParser and MMCIF2Dict - may take me a day or so as I'm not yet familiar with the unittest class.

Your error indicates that the C module isn't installed (the biopython site package should contain Bio/PDB/mmCIF/MMCIFlex.so).

Re: Python 3, are you referring to the changes to import?

Lenna

On Thu, Mar 22, 2012 at 7:18 AM, Peter Cock wrote:
> Hi Lenna,
>
> Sorry for the delay, I thought I wrote this on the github pull request.
>
> This may be a silly question, but do we have any unit tests for MMCIF?
> I had a quick look yesterday and couldn't see any. I would be much
> happier with a basic unit test so I can check the functionality before
> and after your fix. Hopefully you can come up with a small data file
> and a minimal bit of code to check the parser, which we can turn
> into a new unit test, say test_MMCIF.py
>
> Also should this work?
>
>>>> from Bio.PDB import MMCIF2Dict
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "Bio/PDB/MMCIF2Dict.py", line 10, in <module>
>     import Bio.PDB.mmCIF.MMCIFlex as MMCIFlex
> ImportError: No module named MMCIFlex
>
> [We can postpone Python 3 support to a new issue]
>
> Peter

From w.arindrarto at gmail.com Thu Mar 22 19:31:49 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Fri, 23 Mar 2012 00:31:49 +0100
Subject: [Biopython-dev] GSoC Student Applicant
Message-ID:

Hello Biopython-dev,

I'm Bow, a Genomics Master's Student from Utrecht University. I'm one of the students interested in working in SearchIO, as mentioned previously by Peter. The reason for my interest is simply because I think the feature would be very useful for a lot of people working in biology (with Biopython, of course). I'm currently discussing my options and ideas with Peter, but I would also like to tell the community of my GSoC intentions.
I've been working with Python for almost two years now, publishing some of my programs online in my Github account (http://www.github.com/bow). One of them is actually a contribution to Biopython's 1.58 release last year, the SeqIO parser for ABI trace files. Since then, I've been continuously learning more about Python to prepare myself for more challenging tasks, this one included. That's all for a short (re-)introduction from me. I'm looking forward for an opportunity to contribute more along the summer :). Cheers, Bow From chapmanb at 50mail.com Thu Mar 22 20:02:24 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 22 Mar 2012 20:02:24 -0400 Subject: [Biopython-dev] GSoC Student Applicant In-Reply-To: References: Message-ID: <874ntgtca7.fsf@fastmail.fm> Bow; Thanks for the introduction and glad to hear you're putting together an application for GSoC. Please do keep us up to date as you get your proposal together so everyone can provide feedback. Have fun with it and best of luck, Brad > Hello Biopython-dev, > > I'm Bow, a Genomics Master's Student from Utrecht > University. I'm one of the students interested in working in SearchIO, as > mentioned previously by Peter. The reason for my interest is simply because > I think the feature would be very useful for a lot of people working in > biology (with Biopython, of course). I'm currently discussing my options > and ideas with Peter, but I would also like to tell the community of my > GSoC intentions. > > I've been working with Python for almost two years now, publishing some of > my programs online in my Github account (http://www.github.com/bow). One of > them is actually a contribution to Biopython's 1.58 release last year, the > SeqIO parser for ABI trace files. Since then, I've been continuously > learning more about Python to prepare myself for more challenging tasks, > this one included. > > That's all for a short (re-)introduction from me. 
I'm looking forward for > an opportunity to contribute more along the summer :). > > Cheers, > Bow > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From chapmanb at 50mail.com Thu Mar 22 20:00:17 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 22 Mar 2012 20:00:17 -0400 Subject: [Biopython-dev] providing python code on tandem repeat detectors to Biopython? In-Reply-To: <6102D50F-DD66-4E21-B708-0A313CAB8E85@inf.ethz.ch> References: <9362057C-E59B-4E25-996D-537D9D80DDAB@inf.ethz.ch> <6102D50F-DD66-4E21-B708-0A313CAB8E85@inf.ethz.ch> Message-ID: <878vistcdq.fsf@fastmail.fm> Elke; Welcome and thanks for the e-mail. This sounds like useful functionality. The first step would be to generalize it and make the code available on GitHub. This should give folks something concrete to provide feedback on. Thanks again, Brad > We've been working on sequence tandem repeat detectors during the past > couple of months in my research lab. A couple of parsers for commonly > used detectors, and a repeat class (basically an extension to sequence > alignments) have come up along the way. > > Would there principally be any interest in introducing the code to > Biopython? From p.j.a.cock at googlemail.com Fri Mar 23 06:07:45 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 23 Mar 2012 10:07:45 +0000 Subject: [Biopython-dev] providing python code on tandem repeat detectors to Biopython? In-Reply-To: <878vistcdq.fsf@fastmail.fm> References: <9362057C-E59B-4E25-996D-537D9D80DDAB@inf.ethz.ch> <6102D50F-DD66-4E21-B708-0A313CAB8E85@inf.ethz.ch> <878vistcdq.fsf@fastmail.fm> Message-ID: Elke Schaper wrote: >> We've been working on sequence tandem repeat detectors during the past >> couple of months in my research lab. 
A couple of parsers for commonly
>> used detectors, and a repeat class (basically an extension to sequence
>> alignments) have come up along the way.
>>
>> Would there principally be any interest in introducing the code to
>> Biopython?

On Fri, Mar 23, 2012 at 12:00 AM, Brad Chapman wrote:
>
> Elke;
> Welcome and thanks for the e-mail. This sounds like useful
> functionality. The first step would be to generalize it and make the
> code available on GitHub. This should give folks something concrete to
> provide feedback on.
>
> Thanks again,
> Brad

Hi Elke,

We're starting to do more and more annotation where I work, so this does sound very useful directly (as well as to the Biopython community).

Which commonly used tandem repeat detectors do you have parsers for? I'm curious about their file formats - are they tool specific text files, XML, or something more general, e.g. GFF3?

Brad's right that it would be great to see your current code - assuming you're happy to post it online - and that would make it much easier to give guidance on how best it might be integrated into Biopython.

Assuming it is all your own work, and you are happy with the Biopython license (MIT/BSD style), there will be no problems. However, if you're using any GPL or LGPL code for example, that may complicate things.

Thanks,

Peter

From p.j.a.cock at googlemail.com Fri Mar 23 07:47:15 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 23 Mar 2012 11:47:15 +0000
Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31)
In-Reply-To:
References:

Message-ID: On Thu, Mar 22, 2012 at 11:15 PM, Lenna Peterson wrote: > > Re: Python 3, are you referring to the changes to import? > Probably, but also changes in setup.py are needed to enable C extensions under Python 3. This is because there are API changes needed in the C files (via version checking macros) to support both Python 2 and Python 3. Only some of our C code has been updated to do this. Peter From arklenna at gmail.com Fri Mar 23 19:52:12 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Fri, 23 Mar 2012 19:52:12 -0400 Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31) In-Reply-To: References:

Message-ID: Hi Peter, I've added a unit test (Tests/test_MMCIF.py). It currently tests whether a polypeptide is extracted properly. I've also come across what appears to be a full-featured Python CIF converter/validator. The license looks BSD-compatible. Reference: http://journals.iucr.org/j/issues/2006/04/00/wf5020/index.html PyPI: http://pypi.python.org/pypi/PyCifRW/3.3 Source: http://sourceforge.net/projects/pycifrw.berlios/ I haven't tested it extensively, but if we can use it, no sense reinventing the wheel. Lenna On Fri, Mar 23, 2012 at 7:47 AM, Peter Cock wrote: > On Thu, Mar 22, 2012 at 11:15 PM, Lenna Peterson wrote: >> >> Re: Python 3, are you referring to the changes to import? >> > > Probably, but also changes in setup.py are needed to enable > C extensions under Python 3. This is because there are API > changes needed in the C files (via version checking macros) > to support both Python 2 and Python 3. Only some of our C > code has been updated to do this. > > Peter From redmine at redmine.open-bio.org Mon Mar 26 18:21:24 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 26 Mar 2012 22:21:24 +0000 Subject: [Biopython-dev] [Biopython - Bug #3333] (New) PhyloXML writer fails to include is_aligned attribute with mol_seq elements Message-ID: Issue #3333 has been reported by Eric Talevich. ---------------------------------------- Bug #3333: PhyloXML writer fails to include is_aligned attribute with mol_seq elements https://redmine.open-bio.org/issues/3333 Author: Eric Talevich Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: First reported here: http://lists.open-bio.org/pipermail/biopython/2012-March/007847.html Steps to reproduce: 1. Load a tree, convert to PhyloXML

from Bio import Phylo
from StringIO import StringIO
tree = Phylo.read(StringIO('(a,(b,c));'), 'newick').as_phyloxml()

2. Add a sequence

from Bio.Phylo import PhyloXML as PX
tree.clade[0].sequences = [PX.Sequence(type='dna', mol_seq=PX.MolSeq('AAA', is_aligned=False))]

3. Verify that the sequence information has been set -- mol_seq has is_aligned set

print tree

Clade(branch_length=1.0, name='a')
    Sequence(type='dna')
        MolSeq(value='AAA', is_aligned=False)

4. View the PhyloXML representation -- mol_seq is missing the is_aligned attribute!

print tree.format('phyloxml')

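...
<clade>
  <name>c</name>
  <branch_length>1.0</branch_length>
  <sequence type="dna">
    <mol_seq>AAA</mol_seq>
  </sequence>
</clade>
...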
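
For comparison, here is a rough sketch of what the writer would be expected to emit for the sequence added in step 2. It uses only the standard library's xml.etree.ElementTree rather than Bio.Phylo, so build_clade and mol_seq_has_is_aligned are hypothetical helpers invented for illustration, not part of Biopython. The lowercase true/false spelling assumes is_aligned is declared as an XML boolean in the phyloXML schema.

```python
# Sketch only: builds the expected XML with the standard library,
# not with Bio.Phylo's PhyloXML writer. Helper names are hypothetical.
import xml.etree.ElementTree as ET


def build_clade(name, branch_length, seq_type, residues, is_aligned):
    """Return a <clade> element whose <mol_seq> carries is_aligned."""
    clade = ET.Element("clade")
    ET.SubElement(clade, "name").text = name
    ET.SubElement(clade, "branch_length").text = str(branch_length)
    sequence = ET.SubElement(clade, "sequence", {"type": seq_type})
    # Assumption: is_aligned is an XML boolean, hence lowercase text.
    mol_seq = ET.SubElement(
        sequence, "mol_seq", {"is_aligned": str(is_aligned).lower()}
    )
    mol_seq.text = residues
    return clade


def mol_seq_has_is_aligned(xml_text):
    """True if there is at least one <mol_seq> and all carry is_aligned."""
    root = ET.fromstring(xml_text)
    # Strip any "{namespace}" prefix so namespaced output is handled too.
    nodes = [n for n in root.iter() if n.tag.rsplit("}", 1)[-1] == "mol_seq"]
    return bool(nodes) and all("is_aligned" in n.attrib for n in nodes)


expected_xml = ET.tostring(
    build_clade("a", 1.0, "dna", "AAA", False), encoding="unicode"
)
print(expected_xml)
```

The same mol_seq_has_is_aligned check could be pointed at the text produced in step 4 to confirm whether the attribute is missing from the real writer output.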

---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin
-- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From w.arindrarto at gmail.com Wed Mar 28 17:20:53 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 28 Mar 2012 23:20:53 +0200 Subject: [Biopython-dev] GSoC Student Applicant In-Reply-To: <874ntgtca7.fsf@fastmail.fm> References: <874ntgtca7.fsf@fastmail.fm> Message-ID: Hi everyone, I have just finished my first proposal draft. Here is the link: https://docs.google.com/document/d/1wi46mTZCzKooxZGWMrdZpsJ-bjpp9fD50NF60exPKqg/edit Looking forward to your comments / thoughts / critiques, Bow On Fri, Mar 23, 2012 at 01:02, Brad Chapman wrote: > > Bow; > Thanks for the introduction and glad to hear you're putting together an > application for GSoC. Please do keep us up to date as you get your > proposal together so everyone can provide feedback. Have fun with it and > best of luck, > Brad > > > Hello Biopython-dev, > > > > I'm Bow, a Genomics Master's Student from Utrecht > > University. I'm one of the students interested in working in SearchIO, as > > mentioned previously by Peter. The reason for my interest is simply > because > > I think the feature would be very useful for a lot of people working in > > biology (with Biopython, of course). I'm currently discussing my options > > and ideas with Peter, but I would also like to tell the community of my > > GSoC intentions. > > > > I've been working with Python for almost two years now, publishing some > of > > my programs online in my Github account (http://www.github.com/bow). > One of > > them is actually a contribution to Biopython's 1.58 release last year, > the > > SeqIO parser for ABI trace files. Since then, I've been continuously > > learning more about Python to prepare myself for more challenging tasks, > > this one included. > > > > That's all for a short (re-)introduction from me. 
I'm looking forward for > > an opportunity to contribute more along the summer :). > > > > Cheers, > > Bow > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Thu Mar 29 09:17:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 29 Mar 2012 14:17:56 +0100 Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31) In-Reply-To: References:

Message-ID:

On Fri, Mar 23, 2012 at 11:52 PM, Lenna Peterson wrote:
> Hi Peter,
>
> I've added a unit test (Tests/test_MMCIF.py). It currently tests
> whether a polypeptide is extracted properly.

Lovely - I've cherry-picked that (two commits) to the master branch, and made sure it is skipped gracefully if the C code is not being compiled (still the default). I'll then have a go on a few different platforms to check everything is working as expected.

One thing I did spot breaks under Python 3 is this line in Bio/PDB/MMCIFParser.py,

from string import letters

Quoting: http://docs.python.org/release/3.1.3/whatsnew/3.0.html
>> string.letters and its friends (string.lowercase and string.uppercase)
>> are gone. Use string.ascii_letters etc. instead. (The reason for the
>> removal is that string.letters and friends had locale-specific behavior,
>> which is a bad idea for such attractively-named global "constants".)

Do you see any risk with switching this to string.ascii_letters instead?

Peter

From arklenna at gmail.com Thu Mar 29 10:04:37 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Thu, 29 Mar 2012 10:04:37 -0400
Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31)
In-Reply-To:
References:

Message-ID:

Hi Peter,

Sounds great. Thanks for the update.

As far as I can tell, the only thing string.letters is used for is the PDB iCode. I'm fairly certain the iCode allows only ASCII letters, so string.ascii_letters should be fine.

Lenna

On Thu, Mar 29, 2012 at 9:17 AM, Peter Cock wrote:
> On Fri, Mar 23, 2012 at 11:52 PM, Lenna Peterson wrote:
>> Hi Peter,
>>
>> I've added a unit test (Tests/test_MMCIF.py). It currently tests
>> whether a polypeptide is extracted properly.
>
> Lovely - I've cherry-picked that (two commits) to the master branch,
> and made sure it is skipped gracefully if the C code is not being
> compiled (still the default). I'll then have a go on a few different
> platforms to check everything is working as expected.
>
> One thing I did spot breaks under Python 3 is this line in
> Bio/PDB/MMCIFParser.py,
>
> from string import letters
>
> Quoting: http://docs.python.org/release/3.1.3/whatsnew/3.0.html
>>> string.letters and its friends (string.lowercase and string.uppercase)
>>> are gone. Use string.ascii_letters etc. instead. (The reason for the
>>> removal is that string.letters and friends had locale-specific behavior,
>>> which is a bad idea for such attractively-named global "constants".)
>
> Do you see any risk with switching this to string.ascii_letters instead?
>
> Peter

From p.j.a.cock at googlemail.com Thu Mar 29 10:05:46 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 29 Mar 2012 15:05:46 +0100
Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31)
In-Reply-To:
References:

Message-ID: Hi Lenna, Have you tried your branch on Windows yet? It worked for me under my Python 2.5 setup using mingw32, C:\repositories\biopython>c:\python26\python setup.py install ... building 'Bio.PDB.mmCIF.MMCIFlex' extension creating build\temp.win32-2.5\Release\bio\pdb creating build\temp.win32-2.5\Release\bio\pdb\mmcif C:\cygwin\usr\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IBio -Ic:\python25\include -Ic:\python25\PC -c Bio/PDB/mmCIF/lex.yy.c -o build\temp.win32-2.5\Release\bio\pdb\mmcif\lex.yy.o lex.yy.c:1046: warning: 'yyunput' defined but not used C:\cygwin\usr\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IBio -Ic:\python25\include -Ic:\python25\PC -c Bio/PDB/mmCIF/MMCIFlexmodule.c -o build\temp.win32-2.5\Release\bio\pdb\mmcif\mmciflexmodule.o writing build\temp.win32-2.5\Release\bio\pdb\mmcif\MMCIFlex.def C:\cygwin\usr\bin\gcc.exe -mno-cygwin -shared -s build\temp.win32-2.5\Release\bio\pdb\mmcif\lex.yy.o build\temp.win32-2.5\Release\bio\pdb\mmcif\mmciflexmodule.o build\temp.win32-2.5\Release\bio\pdb\mmcif\MMCIFlex.def -Lc:\python25\libs -Lc:\python25\PCBuild -lpython25 -lmsvcr71 -o build\lib.win32-2.5\Bio\PDB\mmCIF\MMCIFlex.pyd ... That worked fine and test_MMCIF.py is happy. However, MSVC v9 is not: C:\repositories\biopython>c:\python26\python setup.py install ... building 'Bio.PDB.mmCIF.MMCIFlex' extension C:\Program Files\Microsoft Visual Studio 9.0\VC\BIN\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IBio -Ic:\python26\include -Ic:\python26\PC /TcBio/PDB/mmCIF/lex.yy.c /Fobuild\temp.win32-2.6\Release\Bio/PDB/mmCIF/lex.yy.obj lex.yy.c Bio/PDB/mmCIF/lex.yy.c(12) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory error: command '"C:\Program Files\Microsoft Visual Studio 9.0\VC\BIN\cl.exe"' failed with exit status 2 The same with Python 2.7 and the Microsoft compiler. Switching from this in Bio/PDB/mmCIF.yy.c: #include to this: #include lets it compile (although with some warnings) and test_MMCIF.py passes. 
It should be conditional of course, but I'm unclear if that is the appropriate fix or not though.

Peter

From anaryin at gmail.com Thu Mar 29 10:06:10 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 29 Mar 2012 16:06:10 +0200
Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31)
In-Reply-To:
References:

Message-ID:

Great work Lenna, thanks for taking care of this!

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

On 29 March 2012 at 16:04, Lenna Peterson wrote:
> Hi Peter,
>
> Sounds great. Thanks for the update.
>
> As far as I can tell, the only thing string.letters is used for is the
> PDB iCode. I'm fairly certain the iCode allows only ASCII letters, so
> string.ascii_letters should be fine.
>
> Lenna
>
> On Thu, Mar 29, 2012 at 9:17 AM, Peter Cock wrote:
> > On Fri, Mar 23, 2012 at 11:52 PM, Lenna Peterson wrote:
> >> Hi Peter,
> >>
> >> I've added a unit test (Tests/test_MMCIF.py). It currently tests
> >> whether a polypeptide is extracted properly.
> >
> > Lovely - I've cherry-picked that (two commits) to the master branch,
> > and made sure it is skipped gracefully if the C code is not being
> > compiled (still the default). I'll then have a go on a few different
> > platforms to check everything is working as expected.
> >
> > One thing I did spot breaks under Python 3 is this line in
> > Bio/PDB/MMCIFParser.py,
> >
> > from string import letters
> >
> > Quoting: http://docs.python.org/release/3.1.3/whatsnew/3.0.html
> >>> string.letters and its friends (string.lowercase and string.uppercase)
> >>> are gone. Use string.ascii_letters etc. instead. (The reason for the
> >>> removal is that string.letters and friends had locale-specific behavior,
> >>> which is a bad idea for such attractively-named global "constants".)
> >
> > Do you see any risk with switching this to string.ascii_letters instead?
> > > > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Thu Mar 29 10:08:54 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 29 Mar 2012 15:08:54 +0100 Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31) In-Reply-To: References:

Message-ID: On Thu, Mar 29, 2012 at 3:04 PM, Lenna Peterson wrote: > Hi Peter, > > Sounds great. Thanks for the update. > > As far as I can tell, the only thing string.letters is used for is the > PDB iCode. I'm fairly certain the iCode allows only ASCII letters, so > string.ascii_letters should be fine. > > Lenna Thanks, I made that change: https://github.com/biopython/biopython/commit/a365b3ac347f9400f291769f9bcb1d62ac712c9f Peter From MatatTHC at gmx.de Thu Mar 29 10:38:05 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Thu, 29 Mar 2012 16:38:05 +0200 Subject: [Biopython-dev] SeqIO circular Message-ID: Hi, Is it possible to get the property if a genome is circular / linear from SeqIO applied to genbank files? I could not find it. There is also a related bugreport: http://bugzilla.open-bio.org/show_bug.cgi?id=2578 I used the old parser before and switched to SeqIO which I really like for the possibilities to parse different formats... but I really need the information. Matthias From andrew.sczesnak at med.nyu.edu Thu Mar 29 11:52:59 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Thu, 29 Mar 2012 11:52:59 -0400 Subject: [Biopython-dev] MAF Parser/Indexer Message-ID: <4F74855B.9000603@med.nyu.edu> Hi all, I would like to start a discussion about what is needed to make the AlignIO.MafIO parser and indexer ready for the next release. If anyone is unfamiliar with MAF (Multiple Alignment Format), it is the file format that eukaryote genome-to-genome multiple alignments produced by multiz are stored in. 
The exact specs are here:

http://genome.ucsc.edu/FAQ/FAQformat.html#format5

Some use cases are discussed in this paper, which implements (I believe) most of the same functionality of the MafIO class in Galaxy:

http://www.ncbi.nlm.nih.gov/pubmed/21775304

The branch of my biopython fork that contains the class:

https://github.com/polyatail/biopython/tree/alignio-maf

The class is implemented as a reader/writer compatible with the AlignIO API, but implements its own indexer (MafIO.MafIndex) based on SeqIO.index_db(). At the time, this seemed like the best way to implement this, as MAF is explicitly designed for genome-to-genome alignments while other formats are not. If we can assume a MAF file contains such an alignment, we can index it by genome coordinates and allow random access to intervals. This is especially useful since it is often desirable to retrieve the spliced multiple alignment of a multi-exonic transcript, which can be used to determine sequence conservation, construct a phylogenetic tree for a particular gene, or pull out orthologs of a large number of genes at once.

The code consists of the reader, writer, and indexer classes in AlignIO/MafIO.py, test files in Tests/MAF, and unit tests specific to the indexer in Tests/test_MafIO_index.py.

I would really appreciate any feedback and suggestions, and if anyone has an opportunity to use this feature it would be great to get some feedback on its operation.

Thanks!

Andrew

From p.j.a.cock at googlemail.com Thu Mar 29 11:58:44 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 29 Mar 2012 16:58:44 +0100
Subject: [Biopython-dev] SeqIO circular
In-Reply-To:
References:
Message-ID:

On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt wrote:
> Hi,
>
> Is it possible to get the property if a genome is circular / linear
> from SeqIO applied to genbank files? I could not find it.
> > There is also a related bugreport: > http://bugzilla.open-bio.org/show_bug.cgi?id=2578 > > I used the old parser before and switched to SeqIO which I really like > for the possibilities to parse different formats... but I really need > the information. Does anyone happen to have a BioPerl + BioSQL setup installed and working? IIRC checking that to make sure however we store the circular was compatible was the only real hurdle. Peter From chapmanb at 50mail.com Thu Mar 29 21:22:18 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 29 Mar 2012 21:22:18 -0400 Subject: [Biopython-dev] GSoC Student Applicant In-Reply-To: References: <874ntgtca7.fsf@fastmail.fm> Message-ID: <87r4wa6fxx.fsf@fastmail.fm> Bow; > I have just finished my first proposal draft. Here is the link: > https://docs.google.com/document/d/1wi46mTZCzKooxZGWMrdZpsJ-bjpp9fD50NF60exPKqg/edit > > Looking forward to your comments / thoughts / critiques, Thanks for putting this together. It would be helpful if you enabled editing, or at least comments, so we could leave feedback in the document directly. My general thoughts: - You should include your long version and move this up earlier in the document. GSoC are as much about the students as they are about the project, and reviewers will have strong interest in you as a person. - Your timeline should be much more detailed. You want it broken down week by week by planned features and specific deliverables: code, tests and documentation. Mentors use the project plan to ensure everything is on track during the summer, so it's important to be as detailed as possible. - You might want to expand a bit on your research obligations for the summer. Your research + GSoC timeline sounds like you've left yourself no chance for eating or you know, talking with other people in real life. It's good to be sure you have a realistic set of responsibilities so you don't overcommit and sacrifice either your masters work or GSoC. 
Hope this is helpful and thanks again for the work, Brad From arklenna at gmail.com Fri Mar 30 17:55:56 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Fri, 30 Mar 2012 17:55:56 -0400 Subject: [Biopython-dev] GSoC genomic variant proposal Message-ID: Hi all, I realize time is short, but I am still in the planning phase of my GSoC proposal! I wanted to take a moment to formally introduce myself to the dev list. I am affiliated with Purdue University, located in Indiana, USA and best known for engineering (Neil Armstrong is a famous graduate). I hold a bachelor of arts in biology from Mount Holyoke College in Massachusetts. I have extensive wet lab experience with genetics; I'm currently working in a lab genotyping mice (the research is intestinal lipid metabolism). In August, I begin a PhD in interdisciplinary life science at Purdue, and I anticipate that my research will fall somewhere in the field of bioinformatics/computational biology. I hope to use biopython extensively! In my spare time, other than programming, I enjoy ballroom dance, science fiction novels, board games, and sailing. I've been programming for about 6 years and using python for 4; other languages with which I'm familiar include Perl/CGI, HTML/CSS, PHP, SQL (primarily MySQL and SQLite), and C++/C. I place a high value on object oriented design and execution. I understand the basics of formal grammar and have some experience with lex/flex as well as PLY (python lex/yacc). My work so far with biopython has been on the CIF parsing module. One of my primary goals for the genomic variants project would be to implement as much polymorphism and abstraction as possible, for the benefit of both users and future developers. I'm working on a proposal for the genomic variants project, and while I understand the basics of molecular biology and genetics, I lack firsthand experience with the type of workflow that would occur in the context of genomic variants. 
If anyone can supply a few examples, it would be greatly appreciated. I hope to have a proposal draft ready for feedback by Monday. Regards, Lenna Peterson github.com/lennax From w.arindrarto at gmail.com Fri Mar 30 19:11:34 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sat, 31 Mar 2012 01:11:34 +0200 Subject: [Biopython-dev] GSoC Student Applicant In-Reply-To: <87r4wa6fxx.fsf@fastmail.fm> References: <874ntgtca7.fsf@fastmail.fm> <87r4wa6fxx.fsf@fastmail.fm> Message-ID: On Fri, Mar 30, 2012 at 03:22, Brad Chapman wrote: > > > Bow; > > > I have just finished my first proposal draft. Here is the link: > > > > https://docs.google.com/document/d/1wi46mTZCzKooxZGWMrdZpsJ-bjpp9fD50NF60exPKqg/edit > > > > Looking forward to your comments / thoughts / critiques, > > Thanks for putting this together. It would be helpful if you enabled > editing, or at least comments, so we could leave feedback in the > document directly. > > My general thoughts: > > - You should include your long version and move this up earlier in the > document. GSoC are as much about the students as they are about the > project, and reviewers will have strong interest in you as a person. > > - Your timeline should be much more detailed. You want it broken down > week by week by planned features and specific deliverables: code, > tests and documentation. Mentors use the project plan to ensure > everything is on track during the summer, so it's important to be as > detailed as possible. > > - You might want to expand a bit on your research obligations for the > summer. Your research + GSoC timeline sounds like you've left yourself > no chance for eating or you know, talking with other people in real > life. It's good to be sure you have a realistic set of > responsibilities so you don't overcommit and sacrifice either your > masters work or GSoC. > > Hope this is helpful and thanks again for the work, > Brad Hi Brad, Thank you for the comments and suggestions.
I've added a few more details to my personal profile and put it up front. My project details have also been broken down into single weeks. And I've edited the commenting permission. As for my other obligations, I didn't mean to give that impression. I added a little bit more detail about the project itself, but I'm not sure about the time that I should write. I estimate that at most, for each weekday, I spend 8 hours doing my Master's project on my lab's campus. Since the project started, I usually use the remainder of the time (~6 hours/day) for my own personal programming projects. I plan to use the personal programming time slot for my GSoC instead, if accepted. Should I be this thorough in the proposal? Thanks again :), Bow From arklenna at gmail.com Sat Mar 31 16:48:14 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Sat, 31 Mar 2012 16:48:14 -0400 Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31) In-Reply-To: References:

Message-ID: On Thu, Mar 29, 2012 at 10:05 AM, Peter Cock wrote: > Hi Lenna, > > Have you tried your branch on Windows yet? > > It worked for me under my Python 2.5 setup using mingw32, > > C:\repositories\biopython>c:\python26\python setup.py install > ... > building 'Bio.PDB.mmCIF.MMCIFlex' extension > creating build\temp.win32-2.5\Release\bio\pdb > creating build\temp.win32-2.5\Release\bio\pdb\mmcif > C:\cygwin\usr\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IBio > -Ic:\python25\include -Ic:\python25\PC -c Bio/PDB/mmCIF/lex.yy.c -o > build\temp.win32-2.5\Release\bio\pdb\mmcif\lex.yy.o > lex.yy.c:1046: warning: 'yyunput' defined but not used > C:\cygwin\usr\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IBio > -Ic:\python25\include -Ic:\python25\PC -c > Bio/PDB/mmCIF/MMCIFlexmodule.c -o > build\temp.win32-2.5\Release\bio\pdb\mmcif\mmciflexmodule.o > writing build\temp.win32-2.5\Release\bio\pdb\mmcif\MMCIFlex.def > C:\cygwin\usr\bin\gcc.exe -mno-cygwin -shared -s > build\temp.win32-2.5\Release\bio\pdb\mmcif\lex.yy.o > build\temp.win32-2.5\Release\bio\pdb\mmcif\mmciflexmodule.o > build\temp.win32-2.5\Release\bio\pdb\mmcif\MMCIFlex.def > -Lc:\python25\libs -Lc:\python25\PCBuild -lpython25 -lmsvcr71 -o > build\lib.win32-2.5\Bio\PDB\mmCIF\MMCIFlex.pyd > ... > > That worked fine and test_MMCIF.py is happy. However, MSVC v9 is not: > > C:\repositories\biopython>c:\python26\python setup.py install > ... > building 'Bio.PDB.mmCIF.MMCIFlex' extension > C:\Program Files\Microsoft Visual Studio 9.0\VC\BIN\cl.exe /c /nologo > /Ox /MD /W3 /GS- /DNDEBUG -IBio -Ic:\python26\include -Ic:\python26\PC > /TcBio/PDB/mmCIF/lex.yy.c > /Fobuild\temp.win32-2.6\Release\Bio/PDB/mmCIF/lex.yy.obj > lex.yy.c > Bio/PDB/mmCIF/lex.yy.c(12) : fatal error C1083: Cannot open include > file: 'unistd.h': No such file or directory > error: command '"C:\Program Files\Microsoft Visual Studio > 9.0\VC\BIN\cl.exe"' failed with exit status 2 > > The same with Python 2.7 and the Microsoft compiler. 
Switching > from this in Bio/PDB/mmCIF/lex.yy.c: > > #include <unistd.h> > > to this: > > #include <io.h> > > lets it compile (although with some warnings) and test_MMCIF.py passes. > It should be conditional of course, but I'm unclear if that is the appropriate > fix or not though. > > Peter Peter, re: Windows, I have some experience with mingw but none at all with MSVC. I haven't yet figured out how to build python C modules on Windows. unistd.h is a POSIX header, so an acceptable short-term solution would be to use io.h for MSVC. If test_MMCIF.py passes on Windows with io.h, the C module is doing what we need it to do. However, I'm leery of further manual modifications to generated C. The lex.yy.c generated on Debian by flex 2.5.35 doesn't include unistd.h, so might work with MSVC. I reverted to the 2003 lex.yy.c in large part to make the diff less messy. Furthermore, I plan to experiment with flex on Windows. I suspect that lex.yy.c generated by Windows flex is more likely to compile on POSIX than the converse. I'll have time to come back to this next weekend; I'm currently working on my GSoC proposal! Lenna From redmine at redmine.open-bio.org Sat Mar 31 19:37:54 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 31 Mar 2012 23:37:54 +0000 Subject: [Biopython-dev] [Biopython - Bug #3335] (New) NameError: global name 'StringIO' is not defined Message-ID: Issue #3335 has been reported by John Comeau. ---------------------------------------- Bug #3335: NameError: global name 'StringIO' is not defined https://redmine.open-bio.org/issues/3335 Author: John Comeau Status: New Priority: Normal Assignee: Category: Target version: URL: Bio.ParserSupport method parse_str uses StringIO but the module does not import it. I don't have a proper patch, just added the import StringIO on line 37 to fix.
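For anyone reading this on a current Python, the failure and the one-line fix can be sketched roughly as below. This is only an illustration, not the Bio.ParserSupport code: the class name LineParser and its toy parse() behaviour are made up for the example, and it assumes Python 3, where StringIO lives in the io module (on Python 2 the equivalent was "import StringIO" followed by StringIO.StringIO(...)).

```python
from io import StringIO  # the missing import; Python 2 spelled this "import StringIO"


class LineParser:
    """Hypothetical stand-in for a parser offering parse() and parse_str()."""

    def parse(self, handle):
        # Toy behaviour: return the stripped, non-empty lines of the handle.
        return [line.strip() for line in handle if line.strip()]

    def parse_str(self, string):
        # Wrap the string in a file-like object so parse() can be reused.
        # If the import at the top of the module is missing, this line is
        # where the NameError is raised.
        return self.parse(StringIO(string))


records = LineParser().parse_str("first\n\nsecond\n")
```

The point is only that parse_str() depends on a module-level name, so the error shows up at call time rather than at import time, which is presumably why it went unnoticed.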
The modified file: bash-3.2$ grep -n StringIO /data1/igm3/sw/packages/python-2.7/lib/python2.7/site-packages/biopython-1.59-py2.7-linux-x86_64.egg/Bio/ParserSupport.py 37:import StringIO 57: return self.parse(StringIO.StringIO(string)) I realize maintaining the legacy parsers is not a high priority for the BioPython team, but I need this to work, or write my own parser, or switch to BioPerl, and the latter two options are not happy ones for me. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Thu Mar 1 12:02:58 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 1 Mar 2012 12:02:58 +0000 Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas In-Reply-To: <4F4BAE23.7070402@gmail.com> References: <4F4BAE23.7070402@gmail.com> Message-ID: On Mon, Feb 27, 2012 at 4:24 PM, Robert Buels wrote: > Hi all, > > As kindly pointed out by Reece Hart, the previous email I sent out calling > for Google Summer of Code project ideas, had the wrong due date for project > ideas in it. > > I actually want them to all be in place by Friday, March 2, which is this > coming Friday. > See http://lists.open-bio.org/pipermail/biopython/2012-February/007726.html for the original complete email. That deadline is upon us (tomorrow), so where are we with GSoC 2012 ideas? http://biopython.org/wiki/Google_Summer_of_Code Are any of the areas touched on in the "Biopython 1.60 plans and beyond" thread suitable? Python 3? --------- In terms of 'software engineering' we might be able to put together something for Python 3 support (there are still some C extensions to do), but I'm not sure if there is enough work there. SearchIO? 
--------- I'm wondering if a Biopython SearchIO would make a good project, that I might supervise. This name is obviously based on BioPerl. I would be aiming for iterator based parser/writer framework (like SeqIO and AlignIO) for pairwise 'sequence' searches initially, but have also been thinking about indexing - at least by query, ideally also by match, to allow random access akin to what Bio.SeqIO.index offers. In some cases the results would also be pairwise sequence alignments, in which case some code can be shared/linked with AlignIO. In other cases all you get is co-ordinates of the query and match plus some kind of score. Therefore this could include a hierarchical SearchIO result object structure for minimal matches up to full pairwise alignments. I'd hope to cover BLAST XML, BLAST tabular, HMMER tabular (not really sequence vs sequence, but HMM vs sequence), RPS-BLAST (again not really sequence vs sequence). Perhaps this could also tie into the Bio.Motif code as well (if we consider things like PSSM vs sequence in the same framework). You can already do some of this in Biopython (e.g. BLAST XML parsing, and there is some HMMER work on branches), but I'm hoping for a unified API here. Peter From daniel at treparel.com Thu Mar 1 12:21:42 2012 From: daniel at treparel.com (=?UTF-8?Q?Dani=C3=ABl_van_Adrichem?=) Date: Thu, 1 Mar 2012 13:21:42 +0100 Subject: [Biopython-dev] Two issues on Bio.Entrez (DTD download fallback, recent change in id lists) Message-ID: Hello list, Firstly I want to report a bug plus suggested fix. Today I noticed a bug which got triggered by missing local DTDs. I was still using 1.58 which does not have the new DTDs. Missing the DTDs locally should be handled by downloading them. This worked for the first DTD, but then on the second one (which is a dependency of the first one) I got a HTTP 404. 
After investigating I found that the module was making a request for "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD\nlmmedlinecitationset_120101.dtd" Note the backslash right after DTD. It gets turned into a %5C and causes the 404. The cause of this is usage of os.path.join to concatenate the URL. I am running this on Windows; on a platform where the file system uses a forward slash this would work just fine. Please find attached a patch to fix this issue. Secondly I want to comment on the recent change in Bio.Entrez.efetch (commit 01b091cd4679b58d7e478734324528dd9d52f3ed). While this change did fix the problem, I think this might be achieved in a cleaner way. Please see the code that is used to format the options on the url (in Bio.Entrez._open): options = urllib.urlencode(params, doseq=True) the doseq argument specifically. Its documentation states: "If any values in the query arg are sequences and doseq is true, each sequence element is converted to a separate parameter." So this was the reason for the "id=1&id=2&id=3" formatting. Without doseq set this would turn into: "id=1,2,3" If this doseq functionality is not needed for other params (I am unsure of this), I suggest reverting the change in efetch() and using doseq=False (which is the default argument). Thanks! -- Daniël van Adrichem Treparel Information Solutions b.v. Delftechpark 26 2628XH Delft The Netherlands -------------- next part -------------- A non-text attachment was scrubbed... Name: Parser.py.diff Type: application/octet-stream Size: 592 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Thu Mar 1 13:34:39 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 1 Mar 2012 13:34:39 +0000 Subject: [Biopython-dev] Two issues on Bio.Entrez (DTD download fallback, recent change in id lists) In-Reply-To: References: Message-ID: 2012/3/1 Daniël van Adrichem : > Hello list, > > Firstly I want to report a bug plus suggested fix.
> > Today I noticed a bug which got triggered by missing local DTDs. I was > still using 1.58 which does not have the new DTDs. > > Missing the DTDs locally should be handled by downloading them. This > worked for the first DTD, but then on the second one (which is a > dependency of the first one) I got a HTTP 404. > > After investigating I found that the module was making a request for > "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD\nlmmedlinecitationset_120101.dtd" > Note the backslash right after DTD. It gets turned into a %5C and > causes the 404. That DTD should be in Biopython 1.59 - and hopefully the other DTD you mentioned but did not name. Please let us know if there are any more we've missed. https://github.com/biopython/biopython/commit/5f08ccdfe0706f9073bce441609aa86b1ea9d0f4 > The cause of this is usage of os.path.join to concatenate the URL. I > am running this on windows, on a platform where the file system uses a > forward slash this would work just fine. > > please find attached a patch to fix this issue. That makes perfect sense, although as written your patch could result in too many slashes being used - thus: https://github.com/biopython/biopython/commit/c93b32bab5526a830e2cb14f0db782ee1b687715 Would you like to be thanked in the NEWS file and listed as a contributor (in the CONTRIB file)? > Secondly I want to comment on the recent change in Bio.Entrez.efetch > (commit 01b091cd4679b58d7e478734324528dd9d52f3ed). While this change > did fix the problem, I think this might be achieved in a cleaner way. > > Please see the code that is used to format the options on the url (in > Bio.Entrez._open): > > options = urllib.urlencode(params, doseq=True) > > the doseq argument specifically. Its documentation states: > "If any values in the query arg are sequences and doseq is true, each > sequence element is converted to a separate parameter." > > So this was the reason for the "id=1&id=2&id=3" formatting. 
Without > doseq set this would turn into: "id=1,2,3" > > If this doseq functionality is not needed for other params (I am > unsure of this), I suggest to revert the change in efetch() and use > doseq=False (which is default argument) Very good question - Michiel? Thanks, Peter From daniel at treparel.com Thu Mar 1 14:59:42 2012 From: daniel at treparel.com (=?UTF-8?Q?Dani=C3=ABl_van_Adrichem?=) Date: Thu, 1 Mar 2012 15:59:42 +0100 Subject: [Biopython-dev] Two issues on Bio.Entrez (DTD download fallback, recent change in id lists) In-Reply-To: References:

Message-ID: On 01/03/2012, Peter Cock wrote: > 2012/3/1 Daniël van Adrichem : >> Hello list, >> >> Firstly I want to report a bug plus suggested fix. >> >> Today I noticed a bug which got triggered by missing local DTDs. I was >> still using 1.58 which does not have the new DTDs. >> >> Missing the DTDs locally should be handled by downloading them. This >> worked for the first DTD, but then on the second one (which is a >> dependency of the first one) I got a HTTP 404. >> >> After investigating I found that the module was making a request for >> "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD\nlmmedlinecitationset_120101.dtd" >> Note the backslash right after DTD. It gets turned into a %5C and >> causes the 404. > > That DTD should be in Biopython 1.59 - and hopefully the other > DTD you mentioned but did not name. Please let us know if there > are any more we've missed. > > https://github.com/biopython/biopython/commit/5f08ccdfe0706f9073bce441609aa86b1ea9d0f4 > I haven't encountered any missing DTDs since I updated to 1.59. >> The cause of this is usage of os.path.join to concatenate the URL. I >> am running this on windows, on a platform where the file system uses a >> forward slash this would work just fine. >> >> please find attached a patch to fix this issue. > > That makes perfect sense, although as written your patch could > result in too many slashes being used - thus: Preventing double slashes is a good thing, nice. https://github.com/biopython/biopython/commit/c93b32bab5526a830e2cb14f0db782ee1b687715 > > Would you like to be thanked in the NEWS file and listed as a contributor > (in the CONTRIB file)? It is only a single line patch, but if you insist I am fine with it :) >> Secondly I want to comment on the recent change in Bio.Entrez.efetch >> (commit 01b091cd4679b58d7e478734324528dd9d52f3ed). While this change >> did fix the problem, I think this might be achieved in a cleaner way.
>> >> Please see the code that is used to format the options on the url (in >> Bio.Entrez._open): >> >> options = urllib.urlencode(params, doseq=True) >> >> the doseq argument specifically. Its documentation states: >> "If any values in the query arg are sequences and doseq is true, each >> sequence element is converted to a separate parameter." >> >> So this was the reason for the "id=1&id=2&id=3" formatting. Without >> doseq set this would turn into: "id=1,2,3" >> >> If this doseq functionality is not needed for other params (I am >> unsure of this), I suggest to revert the change in efetch() and use >> doseq=False (which is default argument) > > Very good question - Michiel? Ok, what I wrote here isn't really accurate. Using urllib.urlencode({'id': range(3)}) returns 'id=%5B0%2C+1%2C+2%5D'; note the %5B (square bracket open) and %5D (square bracket close). Apparently urlencode takes str(range(3)), which is '[0, 1, 2]'. Weirdly enough the URL with the [ and ] surrounding the id list seems to be accepted, which is why I thought my suggestion worked. So looking at it again I suggest keeping the code as it is right now. Maybe only make sure the iterable consists of strings only, since ','.join does not accept anything else. Something like this would do I think: keywords["id"] = ",".join(map(str, keywds["id"])) > > Thanks, Thank you -- Daniël van Adrichem Treparel Information Solutions b.v.
Delftechpark 26 2628XH Delft The Netherlands From eric.talevich at gmail.com Thu Mar 1 17:49:19 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 1 Mar 2012 12:49:19 -0500 Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas In-Reply-To: References: <4F4BAE23.7070402@gmail.com> Message-ID: On Thu, Mar 1, 2012 at 7:02 AM, Peter Cock wrote: > On Mon, Feb 27, 2012 at 4:24 PM, Robert Buels wrote: > > Hi all, > > > > As kindly pointed out by Reece Hart, the previous email I sent out > calling > > for Google Summer of Code project ideas, had the wrong due date for > project > > ideas in it. > > > > I actually want them to all be in place by Friday, March 2, which is this > > coming Friday. > > > > See > http://lists.open-bio.org/pipermail/biopython/2012-February/007726.html > for the original complete email. > > That deadline is upon us (tomorrow), so where are we with GSoC 2012 ideas? > http://biopython.org/wiki/Google_Summer_of_Code > > Are any of the areas touched on in the "Biopython 1.60 plans and beyond" > thread suitable? > Perhaps: Bio.Struct ---------- We have a lot of ideas and incomplete pieces of code from previous GSoCs that could be sorted out in one summer. However, taking on another GSoC student might just add to the heap; this might need to be Eric and João's Summer of Code instead. Here's one semi-coherent project idea that could fly: Overhaul Biopython's parsing infrastructure for protein primary, secondary and tertiary structures - Refactor PDBParser and parse_pdb_header to allow parsing amino-acid sequences from SEQRES lines (header) and ATOM records (body) without building the PDB structure object, i.e. without using numpy - Write a pure-Python replacement for parsing mmCIF files. (The module MMCIF2Dict already does almost all the work; lex+yacc just manages a fairly simple state machine for recognizing comments, special sub-sections, etc.)
- Wrap the parsers for PDB, PDBML and mmCIF under a common I/O interface under the Bio.Struct namespace - Add parsing support for protein secondary structures, based on the relevant PDB records or (perhaps) DSSP output. (Note that João did some work on this already.) Variants -------- So, from the Biopython 1.60 thread: - James Casbon has offered to merge PyVCF into Biopython, right? - BCF, the binary form of VCF (via blocked gzip), may also be worthwhile to support - GVF, the Genome Variation Format, appears to be intended to be competitive with VCF. It's probably at least as well thought-out as VCF, sight unseen. It's based on GFF. Synthesizing the above, we have a GSoC project that looks like: - Help merge PyVCF into Biopython (w/ James's support -- I don't mean to volunteer him for this in absentia)? - Write a GVF parser that emits the same object type as PyVCF, potentially also using existing GFF code - Time permitting, look into blocked gzip support for VCF (BCF), also looking at SAM/BAM for inspiration and reusable code. > SearchIO? > --------- > > I'm wondering if a Biopython SearchIO would make a good project, > that I might supervise. This name is obviously based on BioPerl. I > would be aiming for iterator based parser/writer framework (like SeqIO > and AlignIO) for pairwise 'sequence' searches initially, but have also > been thinking about indexing - at least by query, ideally also by match, > to allow random access akin to what Bio.SeqIO.index offers. > > In some cases the results would also be pairwise sequence alignments, > in which case some code can be shared/linked with AlignIO. In other > cases all you get is co-ordinates of the query and match plus some > kind of score. Therefore this could include a hierarchical SearchIO > result object structure for minimal matches up to full pairwise alignments.
> > I'd hope to cover BLAST XML, BLAST tabular, HMMER tabular (not > really sequence vs sequence, but HMM vs sequence), RPS-BLAST > (again not really sequence vs sequence). Perhaps this could also tie > into the Bio.Motif code as well (if we consider things like PSSM vs > sequence in the same framework). > > You can already do some of this in Biopython (e.g. BLAST XML > parsing, and there is some HMMER work on branches), but I'm > hoping for a unified API here. > > Interesting. It would be very nice if the objects emitted by SearchIO could be easily fed into GenomeDiagram. -Eric From anaryin at gmail.com Thu Mar 1 18:00:41 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 1 Mar 2012 19:00:41 +0100 Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas In-Reply-To: References: <4F4BAE23.7070402@gmail.com> Message-ID: > > Bio.Struct > ---------- > > We have a lot of ideas and incomplete pieces of code from > previous GSoCs that could be sorted out in one summer. > However, taking on another GSoC student might just add to > the heap; this might need to be Eric and João's Summer of > Code instead. > The new student would have to be familiar with the regular Bio.PDB code plus whatever code I wrote and Mikael wrote. Maybe a bit too much? If by "*this might need to be Eric and João's Summer of Code instead.*" you mean we could work together on it like in a SoC project, I think it would be the best idea. Making a plan just like for SoC but working outside of it, leaving the vacancy for another person/project. Otherwise I don't know how well OBF will take yet another Bio.PDB project since the previous two haven't been merged...
From p.j.a.cock at googlemail.com Thu Mar 1 18:03:49 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 1 Mar 2012 18:03:49 +0000 Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas In-Reply-To: References: <4F4BAE23.7070402@gmail.com> Message-ID: 2012/3/1 Eric Talevich : > > Here's one semi-coherent project idea that could fly: > > Overhaul Biopython's parsing infrastructure for protein > primary, secondary and tertiary structures > > - Refactor PDBParser and parse_pdb_header to allow parsing > amino-acid sequences from SEQRES lines (header) and ATOM > records (body) without building the PDB structure object, > i.e. without using numpy > - Write a pure-Python replacement for parsing mmCIF files. > (The module MMCIF2Dict already does almost all the work; > lex+yacc just manages a fairly simple state machine for > recognizing comments, special sub-sections, etc.) > - Wrap the parsers for PDB, PDBML and mmCIF under a common > I/O interface under the Bio.Struct namespace > - Add parsing support for protein secondary structures, > based on the relevant PDB records or (perhaps) DSSP > output. (Note that João did some work on this already.) Do you think you could mentor that? One serious downside would be even more work on PDB related code which will make future merging even harder. We do need to tackle the GSoC backlog as a priority. > Variants > -------- > > So, from the Biopython 1.60 thread: > > - James Casbon has offered to merge PyVCF into Biopython, right? > - BCF, the binary form of VCF (via blocked gzip), may also > be worthwhile to support > - GVF, the Genome Variation Format, appears to be intended > to be competitive with VCF. It's probably at least as well > thought-out as VCF, sight unseen. It's based on GFF. > > Synthesizing the above, we have a GSoC project that looks like: > > - Help merge PyVCF into Biopython (w/ James's support -- I >
don't mean to volunteer him for this in absentia)? > - Write a GVF parser that emits the same object type as > PyVCF, potentially also using existing GFF code > - Time permitting, look into blocked gzip support for VCF > (BCF), also looking at SAM/BAM for inspiration and > reusable code. Sounds interesting - who might be willing to mentor it? >> SearchIO? >> --------- >> >> I'm wondering if a Biopython SearchIO would make a good project, >> that I might supervise. This name is obviously based on BioPerl. I >> would be aiming for iterator based parser/writer framework (like SeqIO >> and AlignIO) for pairwise 'sequence' searches initially, but have also >> been thinking about indexing - at least by query, ideally also by match, >> to allow random access akin to what Bio.SeqIO.index offers. >> >> In some cases the results would also be pairwise sequence alignments, >> in which case some code can be shared/linked with AlignIO. In other >> cases all you get is co-ordinates of the query and match plus some >> kind of score. Therefore this could include a hierarchical SearchIO >> result object structure for minimal matches up to full pairwise >> alignments. >> >> I'd hope to cover BLAST XML, BLAST tabular, HMMER tabular (not >> really sequence vs sequence, but HMM vs sequence), RPS-BLAST >> (again not really sequence vs sequence). Perhaps this could also tie >> into the Bio.Motif code as well (if we consider things like PSSM vs >> sequence in the same framework). >> >> You can already do some of this in Biopython (e.g. BLAST XML >> parsing, and there is some HMMER work on branches), but I'm >> hoping for a unified API here. >> > > Interesting. It would be very nice if the objects emitted by SearchIO > could be easily fed into GenomeDiagram. Funnily enough, that is one of my motivations - specifically for doing ACT style diagrams comparing multiple genomes to each other.
I've just started putting some examples into the Tutorial on this today, where I say ideally you'd parse some BLAST output or whatever, but here I'm manually typing in a list of links to draw ;) Peter From chris.mit7 at gmail.com Thu Mar 1 18:03:57 2012 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Thu, 1 Mar 2012 13:03:57 -0500 Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas In-Reply-To: References: <4F4BAE23.7070402@gmail.com> Message-ID: I'm unsure if this is the best place for this, but I would be willing to undertake the VCF work as a GSoC student. I've been working on structural variants in whole genome sequencing/rnaseq/protein levels already, so this would dovetail nicely into my existing work (and be a nice thing for a CV :)) Chris On Thu, Mar 1, 2012 at 1:00 PM, João Rodrigues wrote: > > > > Bio.Struct > > ---------- > > > > We have a lot of ideas and incomplete pieces of code from > > previous GSoCs that could be sorted out in one summer. > > However, taking on another GSoC student might just add to > > the heap; this might need to be Eric and João's Summer of > > Code instead. > > > > The new student would have to be familiar with the regular Bio.PDB code > plus whatever code I wrote and Mikael wrote. Maybe a bit too much? > > If by "*this might need to be Eric and João's Summer of Code instead.*" you > mean we could work together on it like in a SoC project, I think it would > be the best idea. Making a plan just like for SoC but working outside of it > leaving the vacancy for another person/project. > > Otherwise I don't know how well will OBF take yet another Bio.PDB project > since the previous two haven't been merged...
> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From eric.talevich at gmail.com Thu Mar 1 18:14:55 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 1 Mar 2012 13:14:55 -0500 Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas In-Reply-To: References: <4F4BAE23.7070402@gmail.com> Message-ID: On Thu, Mar 1, 2012 at 1:00 PM, João Rodrigues wrote: > Bio.Struct >> ---------- >> >> We have a lot of ideas and incomplete pieces of code from >> previous GSoCs that could be sorted out in one summer. >> However, taking on another GSoC student might just add to >> the heap; this might need to be Eric and João's Summer of >> Code instead. >> > > The new student would have to be familiar with the regular Bio.PDB code > plus whatever code I wrote and Mikael wrote. Maybe a bit too much? > > If by "*this might need to be Eric and João's Summer of Code instead.*" > you mean we could work together on it like in a SoC project, I think it > would be the best idea. Making a plan just like for SoC but working outside > of it leaving the vacancy for another person/project. > > Otherwise I don't know how well will OBF take yet another Bio.PDB project > since the previous two haven't been merged... > Those are my thoughts exactly. :) From eric.talevich at gmail.com Thu Mar 1 18:30:19 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 1 Mar 2012 13:30:19 -0500 Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas In-Reply-To: References: <4F4BAE23.7070402@gmail.com>

Message-ID: 2012/3/1 Peter Cock > 2012/3/1 Eric Talevich : > > > > Here's one semi-coherent project idea that could fly: > > > > Overhaul Biopython's parsing infrastructure for protein > > primary, secondary and tertiary structures > > > > - Refactor PDBParser and parse_pdb_header to allow parsing > > amino-acid sequences from SEQRES lines (header) and ATOM > > records (body) without building the PDB structure object, > > i.e. without using numpy > > - Write a pure-Python replacement for parsing mmCIF files. > > (The module MMCIF2Dict already does almost all the work; > > lex+yacc just manages a fairly simple state machine for > > recognizing comments, special sub-sections, etc.) > > - Wrap the parsers for PDB, PDBML and mmCIF under a common > > I/O interface under the Bio.Struct namespace > > - Add parsing support for protein secondary structures, > > based on the relevant PDB records or (perhaps) DSSP > > output. (Note that João did some work on this already.) > > Do you think you could mentor that? One serious downside > would be even more work on PDB related code which will > make future merging even harder. We do need to tackle the > GSoC back log as a priority. > I would serve if called upon, but I think it's best if we set this one aside for E&J SoC (JESoC?) rather than GSoC this year. > > > Variants > > -------- > > > > So, from the Biopython 1.60 thread: > > > > - James Casbon has offered to merge PyVCF into Biopython, right? > > - BCF, the binary form of VCF (via blocked gzip), may also > > be worthwhile to support > > - GVF, the Genome Variation Format, appears to be intended > > to be competitive with VCF. It's probably at least as well > > thought-out as VCF, sight unseen. It's based on GFF. > > > > Synthesizing the above, we have a GSoC project that looks like: > > > > - Help merge PyVCF into Biopython (w/ James's support -- I > > don't mean to volunteer him for this in absentia)?
> > - Write a GVF parser that emits the same object type as > > PyVCF, potentially also using existing GFF code > > - Time permitting, look into blocked gzip support for VCF > > (BCF), also looking at SAM/BAM for inspiration and > > reusable code. > > Sounds interesting - who might be willing to mentor it? > Does someone feel comfortable asking James for his thoughts on this? I'm not especially well qualified to mentor this, though I could assist as a secondary mentor if needed. Any other Biopython devs/users well acquainted with VCF/PyVCF? From rodrigo.faccioli at gmail.com Thu Mar 1 18:44:14 2012 From: rodrigo.faccioli at gmail.com (Rodrigo Faccioli) Date: Thu, 1 Mar 2012 15:44:14 -0300 Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas In-Reply-To: References: <4F4BAE23.7070402@gmail.com>

Message-ID:

Hi,

Although I'm not enough of a specialist to be a mentor, I have experience
implementing SEQRES-section reading in PDBParser. In fact, I have already
implemented it and can share it with the Biopython project.

Best regards,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structural Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-8739
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5
Personal Blog - http://rodrigofaccioli.blogspot.com/

On Thu, Mar 1, 2012 at 3:30 PM, Eric Talevich wrote:

> 2012/3/1 Peter Cock
>
> > 2012/3/1 Eric Talevich:
> > >
> > > Here's one semi-coherent project idea that could fly:
> > >
> > > Overhaul Biopython's parsing infrastructure for protein
> > > primary, secondary and tertiary structures
> > >
> > > - Refactor PDBParser and parse_pdb_header to allow parsing
> > >   amino-acid sequences from SEQRES lines (header) and ATOM
> > >   records (body) without building the PDB structure object,
> > >   i.e. without using numpy
> > > - Write a pure-Python replacement for parsing mmCIF files.
> > >   (The module MMCIF2Dict already does almost all the work;
> > >   lex+yacc just manages a fairly simple state machine for
> > >   recognizing comments, special sub-sections, etc.)
> > > - Wrap the parsers for PDB, PDBML and mmCIF under a common
> > >   I/O interface under the Bio.Struct namespace
> > > - Add parsing support for protein secondary structures,
> > >   based on the relevant PDB records or (perhaps) DSSP
> > >   output. (Note that João did some work on this already.)
> >
> > Do you think you could mentor that? One serious downside
> > would be even more work on PDB related code which will
> > make future merging even harder. We do need to tackle the
> > GSoC backlog as a priority.
> >
>
> I would serve if called upon, but I think it's best if we set this one
> aside for E&J SoC (JESoC?) rather than GSoC this year.
>
> > >
> > > Variants
> > > --------
> > >
> > > So, from the Biopython 1.60 thread:
> > >
> > > - James Casbon has offered to merge PyVCF into Biopython, right?
> > > - BCF, the binary form of VCF (via blocked gzip), may also
> > >   be worthwhile to support
> > > - GVF, the Genome Variation Format, appears to be intended
> > >   to be competitive with VCF. It's probably at least as well
> > >   thought-out as VCF, sight unseen. It's based on GFF.
> > >
> > > Synthesizing the above, we have a GSoC project that looks like:
> > >
> > > - Help merge PyVCF into Biopython (w/ James's support -- I
> > >   don't mean to volunteer him for this in absentia)?
> > > - Write a GVF parser that emits the same object type as
> > >   PyVCF, potentially also using existing GFF code
> > > - Time permitting, look into blocked gzip support for VCF
> > >   (BCF), also looking at SAM/BAM for inspiration and
> > >   reusable code.
> >
> > Sounds interesting - who might be willing to mentor it?
> >
>
> Does someone feel comfortable asking James for his thoughts on this?
>
> I'm not especially well qualified to mentor this, though I could assist as
> a secondary mentor if needed. Any other Biopython devs/users well
> acquainted with VCF/PyVCF?
>

From chapmanb at 50mail.com Fri Mar 2 01:43:02 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 01 Mar 2012 20:43:02 -0500
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To:
References: <4F4BAE23.7070402@gmail.com>

Message-ID: <87399rkcbd.fsf@fastmail.fm>

Peter and Eric;

> > Variants
> > --------
> > Synthesizing the above, we have a GSoC project that looks like:
> >
> > - Help merge PyVCF into Biopython (w/ James's support -- I
> >   don't mean to volunteer him for this in absentia)?
> > - Write a GVF parser that emits the same object type as
> >   PyVCF, potentially also using existing GFF code
> > - Time permitting, look into blocked gzip support for VCF
> >   (BCF), also looking at SAM/BAM for inspiration and
> >   reusable code.
>
> Sounds interesting - who might be willing to mentor it?

This is a great idea. Reece and I proposed a variant project last year,
and Reece has already e-mailed me this year about trying again. He was
planning on re-vamping the description on the GSoC page for 2012:

http://biopython.org/wiki/Google_Summer_of_Code

so hopefully we can incorporate several aspects of this. From my
experience I would prioritize BCF/Tabix files since you see a lot of
those in practice.

For GVF we could certainly leverage the GFF parser since it is GFF with
variant keywords. Practically, I would love to settle on one format for
this and VCF seems to have the most tool uptake so far.

> >> SearchIO?
> >> ---------

+1 for this as well. Great ideas,
Brad

From p.j.a.cock at googlemail.com Fri Mar 2 11:53:54 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 2 Mar 2012 11:53:54 +0000
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To: <87399rkcbd.fsf@fastmail.fm>
References: <4F4BAE23.7070402@gmail.com>

 <87399rkcbd.fsf@fastmail.fm>
Message-ID:

On Fri, Mar 2, 2012 at 1:43 AM, Brad Chapman wrote:
>
> Peter and Eric;
>
>> > Variants
>> > --------
>> > Synthesizing the above, we have a GSoC project that looks like:
>> >
>> > - Help merge PyVCF into Biopython (w/ James's support -- I
>> >   don't mean to volunteer him for this in absentia)?
>> > - Write a GVF parser that emits the same object type as
>> >   PyVCF, potentially also using existing GFF code
>> > - Time permitting, look into blocked gzip support for VCF
>> >   (BCF), also looking at SAM/BAM for inspiration and
>> >   reusable code.
>>
>> Sounds interesting - who might be willing to mentor it?
>
> This is a great idea. Reece and I proposed a variant project last year,
> and Reece has already e-mailed me this year about trying again. He was
> planning on re-vamping the description on the GSoC page for 2012:
>
> http://biopython.org/wiki/Google_Summer_of_Code

Excellent - can you and/or Reece polish that wiki text today?
We don't need it to be perfect or that detailed at this stage,
do we?

> so hopefully we can incorporate several aspects of this. From my
> experience I would prioritize BCF/Tabix files since you see a lot of
> those in practice.

Right. It sounds like my BGZF code (blocked gzip) should be
helpful for BCF as well.

> For GVF we could certainly leverage the GFF parser since it is GFF with
> variant keywords. Practically, I would love to settle on one format for
> this and VCF seems to have the most tool uptake so far.

That could go in as a potential aim too then.

>> >> SearchIO?
>> >> ---------
>
> +1 for this as well. Great ideas,
> Brad

I've started to write that up on the wiki page now.

Peter

From eric.talevich at gmail.com Fri Mar 2 14:44:13 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 2 Mar 2012 09:44:13 -0500
Subject: [Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas
In-Reply-To:
References: <4F4BAE23.7070402@gmail.com>