From chapmanb at 50mail.com Mon Jul 2 06:36:39 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 02 Jul 2012 06:36:39 -0400
Subject: [Biopython-dev] [GSoC] GSoC python variant update 6
In-Reply-To:
References:
Message-ID: <874npqo3ew.fsf@fastmail.fm>

Lenna;
Thanks for the updates and thoughts. I like the direction you're moving after taking everything you've learned from the SQL experiments. My general suggestions would be:

- Leverage PyVCF for all of the backend parsing. We want to remain compatible with this since merging/interfacing with the work James and everyone is doing is a primary goal. Keeping a similar code structure is a great way to facilitate this.

- For HGVS the general idea is to not be too tied to the VCF format, so I wouldn't worry about strict compatibility but rather use it to inform choices where you feel that things are mirroring VCF structure rather than more general variant representation.

> Another question that may reveal my complete ignorance of haplotypes
> and such: could a polyploid site ever be partially phased? e.g. a
> triploid genotype of 0/1|0?

It's possible but this is kind of a fringe case right now so I wouldn't especially worry about it.

Thanks again,
Brad

From redmine at redmine.open-bio.org Tue Jul 3 04:59:57 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 3 Jul 2012 08:59:57 +0000
Subject: [Biopython-dev] [Biopython - Bug #3368] (New) Bio.GenBank format writer creates invalid start_codon entries.
Message-ID:

Issue #3368 has been reported by Kai Blin.

----------------------------------------
Bug #3368: Bio.GenBank format writer creates invalid start_codon entries.
https://redmine.open-bio.org/issues/3368

Author: Kai Blin
Status: New
Priority: Normal
Assignee: Kai Blin
Category: Main Distribution
Target version:
URL: ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb185.release.notes

When writing genbank output, the module adds quotes around the codon_start qualifier.
This isn't according to the genbank file spec (see URL), and it also breaks the parser of another program I need to use.
BioPython generates:
                     /codon_start="1"
while it should generate:
                     /codon_start=1
Looking over the Bio.GenBank code, it seems like I need to introduce some special handling for un-quoted qualifiers. I'd suggest using the same list as BioPerl:

our %FTQUAL_NO_QUOTE = map {$_ => 1} qw(
    anticodon           citation
    codon               codon_start
    cons_splice         direction
    evidence            label
    mod_base            number
    rpt_type            rpt_unit
    transl_except       transl_table
    usedin
    );
(see https://github.com/bioperl/bioperl-live/blob/master/Bio/SeqIO/genbank.pm#L193)

I'll try to come up with a patch and will be happy for feedback on implementation, coding conventions and the like.

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From hughesadam87 at gmail.com Tue Jul 3 15:19:04 2012
From: hughesadam87 at gmail.com (Adam Hughes)
Date: Tue, 3 Jul 2012 15:19:04 -0400
Subject: [Biopython-dev] Conserved Domains Database Support
Message-ID:

Hi everyone,

I'm new to the BioPython library and was wondering if there was any support for the conserved domains database from NCBI? In particular, the superfamily batch files that their webtool releases. Doing a Google search, there was some interest for this back in 2008; however, they were mainly interested in parsing the HTML output of CDD searches. Now that CDD offers a nice, regular downloadable datatype, has any BioPython support been implemented to work with this?

If not, I'd like to contribute.

The data is simple tab-delimited formats of domain alignments, e.g.:

Q#10000 0 >WHL22.364604.0 superfamily 212291 7 290 1.01528e-138 401.1 cl09099 P-loop_NTPase superfamily 0

I had envisioned a simple class of mainly getters/setters with a few methods such as sorting by Query batches.

~Adam

From p.j.a.cock at googlemail.com Tue Jul 3 18:03:29 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 3 Jul 2012 23:03:29 +0100
Subject: [Biopython-dev] Conserved Domains Database Support
In-Reply-To:
References:
Message-ID:

On Tue, Jul 3, 2012 at 8:19 PM, Adam Hughes wrote:
> Hi everyone,
>
> I'm new to the BioPython library and was wondering if there was any support
> for the conserved domains database from NCBI?
> In particular, the
> superfamily batch files that their webtool releases. Doing a Google
> search, there was some interest for this back in 2008; however, they were
> mainly interested in parsing the HTML output of CDD searches.

HTML scrapers were always a bit of a pain :(

> Now that CDD
> offers a nice, regular downloadable datatype, has any BioPython support
> been implemented to work with this?
>
> If not, I'd like to contribute.
>
> The data is simple tab-delimited formats of domain alignments, e.g.:
>
> Q#10000 0 >WHL22.364604.0 superfamily 212291 7 290
> 1.01528e-138 401.1 cl09099 P-loop_NTPase superfamily
> 0
>
> I had envisioned a simple class of mainly getters/setters with a few
> methods such as sorting by Query batches.
>
> ~Adam

That is interesting - and offers to work on Biopython are always nice.

Is this a file giving domain definitions (HMM or whatever CDD uses), or precomputed search results for different query sequences? Maybe a URL would help - I've not looked at this resource for quite a while.

I used to use the rpsblast tool to run local (offline) searches against CDD databases, and that offered several BLAST output flavours.

Peter

P.S. I'll be away with intermittent email access for the rest of the week.

From w.arindrarto at gmail.com Wed Jul 4 09:03:01 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 4 Jul 2012 15:03:01 +0200
Subject: [Biopython-dev] GSoC Project Update -- 9
Message-ID:

Hello everyone,

The past week I have been working to add PSL parsing support and I've just posted my update here: http://bow.web.id/blog/2012/07/initial-blat-support/

Currently, we have parsing, indexing, and writing support. But this could change (writing might not be supported) due to a possible change in the current object model. I've explained a bit on why this is the case in the post, but to summarize it here, it's because we haven't got a way to properly model segmented HSP sequences.
Peter and I have discussed this a bit, but we haven't figured out an elegant way to solve it for now.

Aside from working on PSL, I also added more tests and started refactoring the code as it's starting to get messy.

That's all my update for the past week. For this week, I'll try to look into other formats and try to come up with possible solutions to the segmented HSP problem.

regards,
Bow

From reece at harts.net Thu Jul 5 15:40:02 2012
From: reece at harts.net (Reece Hart)
Date: Thu, 5 Jul 2012 12:40:02 -0700
Subject: [Biopython-dev] [GSoC] GSoC python variant update 6
In-Reply-To:
References:
Message-ID:

On Fri, Jun 29, 2012 at 11:15 PM, Lenna Peterson wrote:
> For a Python variant object, are there any organizational choices that
> would make it easier for future conversion of a variant to HGVS
> syntax? (this is primarily directed at Reece but I'm open to all
> suggestions)
>

Oh, no, things directed at me!

That's a broad question. I'll try to answer without being long winded.

The essential elements of a sequence variant are a reference to a sequence, the location, and specifics about the operation. The name, allelic depth, etc are all distinct from these elements and I would store them separately in a format-specific record or as a subclass.

I don't have much experience with FeatureLocations, but that might be appropriate. Depending on how far you plan to go with VCF, you'll have to deal with Locations for breakpoints.

For the Occam's Razor version of a model for variation, I'd float this in the community:

variation :=

And I'd test this against representing:
- a single SNP in VCF
- a compound het from VCF
- a variant in RNA
- a variant in CDS coords
- a variant in a protein sequence
- a trinucleotide repeat
(Which the simple model above fails, BTW.)
What makes the uber variant problem hard, I think, is several competing design axes: 1) sequence type (DNA, RNA, protein), 2) coordinate systems (really, CDS in a transcript record), 3) diversity of variant types (SNV, indel, repeat, etc), 4) diversity of auxiliary data (e.g., genotype info from VCF). HGVS makes us think outside merely VCF data: in particular, it adds the nuance of coordinate systems and multiple sequence types. I suspect you should be considering mixins and/or subclassing for some of these needs.

I don't know how to solve any of this complexity. What I do know is that 1) it's too much just for your project, 2) it would be nice to have a design that can be easily extended beyond your project, and 3) therefore, part of your project should be to pave the way for extensions without tackling them.

It's also a good time to put stakes in the ground around internal conventions, such as variants are always represented using interbase coordinates (= 0-based, right-open).

And, if you end up handling just VCF variants, that's cool too.

-Reece

From p.j.a.cock at googlemail.com Sun Jul 8 15:06:03 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 8 Jul 2012 20:06:03 +0100
Subject: [Biopython-dev] Fwd: [VCFtools-spec] The BCF2 quick reference document is up on the 1000G wiki
In-Reply-To: <4FF5C31F.8080502@broadinstitute.org>
References: <4FF5C31F.8080502@broadinstitute.org>
Message-ID:

This could be important for Lenna's GSoC project.

Heng Li had developed the original binary VCF format, BCF, but IIRC he wasn't keen to push it as a standard - see also http://vcftools.sourceforge.net/specs.html and http://vcftools.sourceforge.net/bcf.pdf

It looks like BCF2 could be more widely used...
Peter

---------- Forwarded message ----------
From: Eric Banks
Date: Thu, Jul 5, 2012 at 5:38 PM
Subject: [VCFtools-spec] The BCF2 quick reference document is up on the 1000G wiki
To: "1000ANALYSIS at LIST.NIH.GOV" <1000ANALYSIS at list.nih.gov>, "vcftools-spec at lists.sourceforge.net"

Hi everyone,

At the last 1000G meeting we discussed BCF2, the official binary version of VCF. The quick reference guide for BCF2 is now linked from the main VCF page on the 1000G wiki; you can access it directly here:
http://www.1000genomes.org/sites/1000genomes.org/files/documents/bcfv2.pdf

I take no credit for the document itself, which is really the work of Heng and Mark. At this point, both the GATK and samtools can produce BCF files (and they will soon become our standard output format). We encourage other producers of VCF to consider moving over to BCF2 too.

Best,
Eric

--
Eric Banks, PhD
Broad Institute of Harvard and MIT
ebanks at broadinstitute.org
617-714-7636

From arklenna at gmail.com Mon Jul 9 00:33:57 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Mon, 9 Jul 2012 00:33:57 -0400
Subject: [Biopython-dev] GSoC python variant update 7
Message-ID:

Post: http://arklenna.tumblr.com/post/26812132902/

Synopsis:

This week, I wrote a script for PyVCF that can filter a file by sample as it's being parsed. It's currently named `vcf_sample_filter.py`.
It's designed to be functional from the command line, the Python interpreter, or as a module.

Next up: come up with a generic-via-extensibility representation of a variant. I'm working through some examples and should have a basic outline soon.

Lenna

From p.j.a.cock at googlemail.com Mon Jul 9 07:33:44 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 9 Jul 2012 12:33:44 +0100
Subject: [Biopython-dev] Fwd: [VCFtools-spec] The BCF2 quick reference document is up on the 1000G wiki
In-Reply-To: <87hathp42x.fsf@fastmail.fm>
References: <4FF5C31F.8080502@broadinstitute.org> <87hathp42x.fsf@fastmail.fm>
Message-ID:

On Mon, Jul 9, 2012 at 12:27 PM, Brad Chapman wrote:
>
> Peter;
> Thanks for the heads up. I'm excited about BCF2 and am hopeful it'll
> help with some of the painful parts of VCF, like subsetting large files
> by samples. There is also a page about it on the Broad wiki with more details:
>
> http://www.broadinstitute.org/gsa/wiki/index.php/BCF2
>
> In terms of the representation, this stays close to VCF so shouldn't
> change a lot of the API people see. The main changes would be on the
> backend side where we'd like to be able to swap in and out BCF2 and VCF
> (and GVF) transparently with no visible change to the programmer.
>
> Brad

Yes - that's what we should be aiming for, much like the SAM/BAM duality which has worked really well for sequence alignments.

Note that like BAM, BCF and BCF2 are both compressed with BGZF - support for which we included in Biopython 1.60. This can be combined with the Python struct module to parse the binary data (and with a little more effort will support both Python 2 and 3, see the SFF code for pointers or ask me).
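
As a rough illustration of the struct approach mentioned above, the sketch below reads a small fixed-layout binary header from gzip-compressed bytes. The header layout (a 3-byte magic, two version bytes, a little-endian 32-bit length) and the names `HEADER` and `read_header` are made up for the example and are not taken from the BCF2 specification. Since BGZF blocks are valid gzip members, the standard gzip module stands in here for Biopython's BGZF support.

```python
import gzip
import io
import struct

# Hypothetical layout for illustration only: 3-byte magic, major and minor
# version bytes, then a little-endian unsigned 32-bit payload length.
HEADER = struct.Struct("<3sBBI")


def read_header(handle):
    """Read one header from a binary handle and return its parts and payload."""
    raw = handle.read(HEADER.size)
    if len(raw) != HEADER.size:
        raise ValueError("truncated header")
    magic, major, minor, length = HEADER.unpack(raw)
    payload = handle.read(length)
    return magic, (major, minor), payload


# Build a small compressed example in memory and read it back.
body = b"##fileformat=example\n"
blob = gzip.compress(HEADER.pack(b"XYZ", 2, 1, len(body)) + body)
with gzip.open(io.BytesIO(blob), "rb") as handle:
    magic, version, payload = read_header(handle)
```

Using an explicit byte order character ("<") in the format string turns off native alignment padding, so the same format should describe the same bytes on any platform.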
Peter

From chapmanb at 50mail.com Mon Jul 9 07:27:18 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 09 Jul 2012 07:27:18 -0400
Subject: [Biopython-dev] Fwd: [VCFtools-spec] The BCF2 quick reference document is up on the 1000G wiki
In-Reply-To:
References: <4FF5C31F.8080502@broadinstitute.org>
Message-ID: <87hathp42x.fsf@fastmail.fm>

Peter;
Thanks for the heads up. I'm excited about BCF2 and am hopeful it'll help with some of the painful parts of VCF, like subsetting large files by samples. There is also a page about it on the Broad wiki with more details:

http://www.broadinstitute.org/gsa/wiki/index.php/BCF2

In terms of the representation, this stays close to VCF so shouldn't change a lot of the API people see. The main changes would be on the backend side where we'd like to be able to swap in and out BCF2 and VCF (and GVF) transparently with no visible change to the programmer.

Brad

> This could be important for Lenna's GSoC project.
>
> Heng Li had developed the original binary VCF format,
> BCF, but IIRC he wasn't keen to push it as a standard -
> see also http://vcftools.sourceforge.net/specs.html and
> http://vcftools.sourceforge.net/bcf.pdf
>
> It looks like BCF2 could be more widely used...
>
> Peter
>
>
> ---------- Forwarded message ----------
> From: Eric Banks
> Date: Thu, Jul 5, 2012 at 5:38 PM
> Subject: [VCFtools-spec] The BCF2 quick reference document is up on
> the 1000G wiki
> To: "1000ANALYSIS at LIST.NIH.GOV" <1000ANALYSIS at list.nih.gov>,
> "vcftools-spec at lists.sourceforge.net"
>
>
>
> Hi everyone,
>
> At the last 1000G meeting we discussed BCF2, the official binary version
> of VCF. The quick reference guide for BCF2 is now linked from the main
> VCF page on the 1000G wiki; you can access it directly here:
> http://www.1000genomes.org/sites/1000genomes.org/files/documents/bcfv2.pdf
>
> I take no credit for the document itself, which is really the work of
> Heng and Mark.
> At this point, both the GATK and samtools can produce
> BCF files (and they will soon become our standard output format). We
> encourage other producers of VCF to consider moving over to BCF2 too.
>
> Best,
> Eric
>
> --
> Eric Banks, PhD
> Broad Institute of Harvard and MIT
> ebanks at broadinstitute.org
> 617-714-7636

From redmine at redmine.open-bio.org Mon Jul 9 18:40:18 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 9 Jul 2012 22:40:18 +0000
Subject: [Biopython-dev] [Biopython - Bug #3368] Bio.GenBank format writer creates invalid start_codon entries.
References:
Message-ID:

Issue #3368 has been updated by Peter Cock.

Assignee changed from Kai Blin to Biopython Dev Mailing List

Apologies for apparently ignoring you - I've just changed the assignee (back to) Biopython Dev Mailing List, since no-one was getting any of these updates by email. Well, I wasn't at least :(

I'm wary about changing the parser to give integers instead of ints - that seems likely to break existing scripts.

The whitelist approach in https://github.com/kblin/biopython/commit/4dec86810a42743967981b74c81a6fb8e17004e4 seems a better bet.
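
The whitelist idea can be sketched in a few lines. This is not the code from the linked commit: the names `UNQUOTED_QUALIFIERS` and `format_qualifier` are invented for the example, and the keys are simply copied from the BioPerl list quoted in the original report.

```python
# Qualifier keys written without quotes, copied from the BioPerl list
# quoted in the bug report. The names used here are illustrative only.
UNQUOTED_QUALIFIERS = frozenset([
    "anticodon", "citation", "codon", "codon_start", "cons_splice",
    "direction", "evidence", "label", "mod_base", "number",
    "rpt_type", "rpt_unit", "transl_except", "transl_table", "usedin",
])


def format_qualifier(key, value):
    """Return a feature qualifier string such as /codon_start=1."""
    if key in UNQUOTED_QUALIFIERS:
        return "/%s=%s" % (key, value)
    # Any key not on the whitelist keeps its quotes.
    return '/%s="%s"' % (key, value)
```

With this shape the parsed values can stay as strings, which avoids the script-breaking change discussed above; only the writer decides whether to emit quotes.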

From redmine at redmine.open-bio.org Mon Jul 9 18:49:46 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 9 Jul 2012 22:49:46 +0000
Subject: [Biopython-dev] [Biopython - Bug #3368] (Closed) Bio.GenBank format writer creates invalid start_codon entries.
References:
Message-ID:

Issue #3368 has been updated by Kai Blin.

Status changed from New to Closed
% Done changed from 0 to 100

Applied in changeset commit:aa594ed9a85838d43ab321b756dff07bedfbb126.

From redmine at redmine.open-bio.org Tue Jul 10 02:28:27 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 10 Jul 2012 06:28:27 +0000
Subject: [Biopython-dev] [Biopython - Bug #3368] Bio.GenBank format writer creates invalid start_codon entries.
References:
Message-ID:

Issue #3368 has been updated by Kai Blin.

Peter Cock wrote:
> Apologies for apparently ignoring you - I've just changed the assignee (back to) Biopython Dev Mailing List, since no-one was getting any of these updates by email. Well, I wasn't at least :(

No worries. In turn, I just learned that Redmine merges the Bugzilla fields "assignee" and "QA contact" into one field, and the "QA contact" meaning is the more important one for the way this project is run. :)

>
> I'm wary about changing the parser to give integers instead of ints - that seems likely to break existing scripts.

You mean integers instead of strings, I guess. But yes, I can see the danger of breaking existing scripts, seeing how I even had to fix the "we only parse strings" assumption twice in the remaining parser code.

I'm happy with the patch you pushed, thanks a lot.

From redmine at redmine.open-bio.org Tue Jul 10 03:37:09 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 10 Jul 2012 07:37:09 +0000
Subject: [Biopython-dev] [Biopython - Bug #3368] Bio.GenBank format writer creates invalid start_codon entries.
References:
Message-ID:

Issue #3368 has been updated by Peter Cock.

We changed to RedMine from Bugzilla relatively recently - still learning its quirks ;)

And yes, regarding the string/integer typo.

From redmine at redmine.open-bio.org Wed Jul 11 13:49:47 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 11 Jul 2012 17:49:47 +0000
Subject: [Biopython-dev] [Biopython - Feature #3219] Port Biopython documentation to Sphinx
References:
Message-ID:

Issue #3219 has been updated by Wibowo Arindrarto.

Hi everyone,

Just as an FYI which may or may not be useful, I just stumbled on a Biopython sphinx documentation here: http://www.bio-cloud.info/Biopython/en/index.html. Its sphinx source says it was generated just about a year ago (July 2011). The creator has a personal webpage, but it's not really clear how to contact him/her.

----------------------------------------
Feature #3219: Port Biopython documentation to Sphinx
https://redmine.open-bio.org/issues/3219

Author: Eric Talevich
Status: In Progress
Priority: Low
Assignee: Biopython Dev Mailing List
Category: Documentation
Target version:
URL:

Currently we use Epydoc for the API reference documentation, and LaTeX (to PDF via pdflatex, and HTML via hevea) for the tutorial. There's some material on the wiki to consider, too.

A number of Python projects, including CPython, now use Sphinx for documentation. Content is written in reStructuredText format, and can be pulled from both standalone .rst files and Python docstrings.
This offers several advantages:
(i) API documentation will be prettier and easier to navigate;
(ii) the Tutorial will be easier to edit for those not fluent in LaTeX;
(iii) Since the API reference and Tutorial will be written in the same markup, potentially even pulling from some shared sources, it will be easier to address redundant or overlapping portions between the two, avoiding inconsistencies.

See:
http://sphinx.pocoo.org/
http://docutils.sourceforge.net/

Mailing list discussion:
http://lists.open-bio.org/pipermail/biopython-dev/2010-July/007977.html

Numpy's approach:
http://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt

From redmine at redmine.open-bio.org Wed Jul 11 14:13:40 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 11 Jul 2012 18:13:40 +0000
Subject: [Biopython-dev] [Biopython - Feature #3219] Port Biopython documentation to Sphinx
References:
Message-ID:

Issue #3219 has been updated by Peter Cock.

Seems they were interested in translating it into Chinese, based on a Google translation of this post: http://www.bio-cloud.info/blog/?p=57

From MatatTHC at gmx.de Sun Jul 15 09:56:33 2012
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Sun, 15 Jul 2012 15:56:33 +0200
Subject: [Biopython-dev] SeqIO circular
In-Reply-To:
References:
Message-ID:

Hi,

just wanted to ask if there is any new ideas wrt. this bug. I just wanted to get the residue type from a genbank file and realised that this is also not parsed. This reminded me of this bug. While searching the web I found that the problem also affects others.

http://biopython.org/pipermail/biopython-dev/2011-July/009055.html

I would really like to see some progress here. And of course I would like to help. But I do not know how. I tried to dig in the bioperl sources - but the problem is that I don't speak perl.

Matthias

2012/5/3 Peter Cock :
>
> On Saturday, April 28, 2012, Matthias Bernt wrote:
>>
>> Dear developers,
>>
>> I would like to suggest a quick "fix" for the problem. Currently the
>> parser just returns true per default for the circular property. This
>> is a wrong piece of information for all circular sequences.
>> Furthermore its not possible to detect if the parser did return true
>> because it is its default value or if its really from the data. So I
>> suggest to return None if the parser does not parse the information.
>>
>> What do you think? This should be possible with minimal effort.
>>
>
>
> The parsing side of this is trivial - the only piece missing is
> how best to present the information in the SeqRecord for
> BioSQL compatibility (and perhaps some extra work on our
> BioSQL bindings). That requires someone to test where
> BioPerl stores this in BioSQL (as that is the reference
> implementation).
>
> Without that, a "quick fix" will mostly likely create a bug in
> our BioSQL support - in that we wouldn't store the circular
> field in the same way as the other Bio* implementations.
>
> Peter
>

From arklenna at gmail.com Tue Jul 17 13:48:33 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 17 Jul 2012 13:48:33 -0400
Subject: [Biopython-dev] GSoC python variant update 7
Message-ID:

Hi all,

New blog post: http://arklenna.tumblr.com/post/27418058203/

Last week, Reece suggested trying to represent a variety of variants with just five identifiers: accession, start, stop, pre_seq, and post_seq.

I've started a very minimal Variant object (in https://github.com/lennax/biopython/blob/variant2/Bio/Variant/variant.py), using `FeatureLocation` for its location. This uses zero-based, right-open coordinates, similar to array counting in Python. In contrast, HGVS and VCF both count from 1.

I've created a list of variant types each represented in HGVS, VCF (if possible), and my new Python representation. It can be found on the blog post. Please let me know if there are any errors in my interpretation of these variant types.
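
To make the coordinate convention concrete, here is a minimal sketch built on the five identifiers named above. It is not the class in the linked variant.py: the `SimpleVariant` name and the `one_based` helper are invented for the example, `pre_seq` and `post_seq` are assumed to mean the sequence before and after the change, and only a plain substitution is covered.

```python
from collections import namedtuple

# Illustrative container only. start/stop are zero-based and right-open,
# like Python slices; pre_seq/post_seq are assumed to be the sequence
# before and after the change.
SimpleVariant = namedtuple(
    "SimpleVariant", ["accession", "start", "stop", "pre_seq", "post_seq"])


def one_based(variant):
    """Convert a non-empty zero-based, right-open span to 1-based inclusive."""
    if variant.stop <= variant.start:
        raise ValueError("empty span (e.g. an insertion) needs separate handling")
    return variant.start + 1, variant.stop


reference = "ACGTACGT"
snv = SimpleVariant("example_acc", 3, 4, "T", "C")
# The span slices the reference directly, with no off-by-one adjustment.
assert reference[snv.start:snv.stop] == snv.pre_seq
first, last = one_based(snv)
```

Insertions have an empty span in this convention (start equal to stop), which is why the helper refuses them rather than guessing how a 1-based format would anchor them.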
Thanks, Lenna From redmine at redmine.open-bio.org Wed Jul 18 10:52:24 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 Jul 2012 14:52:24 +0000 Subject: [Biopython-dev] [Biopython - Bug #3374] (New) Newick.Tree.randomized not working Message-ID: Issue #3374 has been reported by Aleksey Kladov. ---------------------------------------- Bug #3374: Newick.Tree.randomized not working https://redmine.open-bio.org/issues/3374 Author: Aleksey Kladov Status: New Priority: Normal Assignee: Category: Target version: URL: My code is
from Bio.Phylo import BaseTree, Newick

t = Newick.Tree.randomized(5)
It throws an exception:
Traceback (most recent call last):
  File "/home/kladov/ab_lab/Rosalind/rosalind-problems/rosalind_problems/phyltree/__init__.py", line 55, in 
    t = Newick.Tree.randomized(5)
  File "/usr/local/lib/python2.7/dist-packages/Bio/Phylo/BaseTree.py", line 725, in randomized
    terminals.extend(newterms)
TypeError: 'NoneType' object is not iterable
It looks like the problem is here (file BaseTree.py):
newsplit = random.choice(terminals)
newterms = newsplit.split(branch_length=branch_length) #problem: split returns None...
if branch_stdev:
    # Add some noise to the branch lengths
    for nt in newterms:
        nt.branch_length = max(0,
            random.gauss(branch_length, branch_stdev))
terminals.remove(newsplit)
terminals.extend(newterms) # and now we try to extend with None =(
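The failure mode can be reproduced without Biopython by using a stand-in class. This is only a sketch: `Node`, its `split` method and the `randomized` helper below are hypothetical and do not mirror the real Bio.Phylo code. They only show that the loop above works once `split` hands back the clades it created.

```python
import random


class Node:
    """Hypothetical stand-in for a tree clade (not Bio.Phylo's Clade)."""

    def __init__(self, name, branch_length=None):
        self.name = name
        self.branch_length = branch_length
        self.children = []

    def split(self, n=2, branch_length=1.0):
        """Attach n new children and return them to the caller."""
        new = [Node("%s%d" % (self.name, i), branch_length) for i in range(n)]
        self.children.extend(new)
        # Without this return the caller receives None and
        # terminals.extend(None) raises the TypeError shown above.
        return new


def randomized(ntaxa, branch_length=1.0, branch_stdev=None):
    """Grow a random bifurcating tree with ntaxa terminal nodes."""
    root = Node("n")
    terminals = [root]
    while len(terminals) < ntaxa:
        newsplit = random.choice(terminals)
        newterms = newsplit.split(branch_length=branch_length)
        if branch_stdev:
            # Add some noise to the branch lengths
            for nt in newterms:
                nt.branch_length = max(0, random.gauss(branch_length, branch_stdev))
        terminals.remove(newsplit)
        terminals.extend(newterms)
    return root, terminals


root, tips = randomized(5)
print(len(tips))  # 5
```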
I suppose that split not only should do actual split of a clade, but also return a list of two new clades. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From chapmanb at 50mail.com Wed Jul 18 14:29:36 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 18 Jul 2012 14:29:36 -0400 Subject: [Biopython-dev] [Biopython] when is a SeqRecord not a SeqRecord In-Reply-To: References: <87y5mhkxtc.fsf@fastmail.fm> Message-ID: <87hat4ly7j.fsf@fastmail.fm> Dilara; Apologies, I missed that the second mail had updated code. > This works as you pointed out because filtered_rec is explicitly defined. > Now if I want to do this > > from Bio import SeqIO > mod = (check_meanQ(rec, q_thresholdd) for rec in > SeqIO.parse("hiseq_pe_test.fastq", "fastq")) > count = SeqIO.write(mod, "filtered_hiseq_pe_test.fastq", "fastq") > print "Modified %i records" %count > > It doesn't work because of some of the records are None. 
So I tried doing > this The approach I'd take it to clean up check_meanQ to be explicit about the return values: > def check_meanQ(rec, q_threshold): > seqlen=len(rec) > quality_scores=array(rec.letter_annotations["phred_quality"]) > if round(quality_scores.mean()) <= q_threshold: > print "Discarded ", rec.id, "because mean Q was", > round(quality_scores.mean()) > badrec = None > if round(quality_scores.mean()) > q_threshold: > goodrec = rec > > return goodrec def check_meanQ(rec, q_threshold): quality_scores=array(rec.letter_annotations["phred_quality"]) if round(quality_scores.mean()) <= q_threshold: print "Discarded ", rec.id, "because mean Q was", round(quality_scores.mean()) return None else: return rec Then explicitly check for None values and remove them when writing: > count = SeqIO.write(mod, "filtered_hiseq_pe_test.fastq", "fastq") count = SeqIO.write((x for x in mod if x is not None), "filtered_hiseq_pe_test.fastq", "fastq") Hope this helps, Brad From w.arindrarto at gmail.com Wed Jul 18 15:49:37 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 18 Jul 2012 21:49:37 +0200 Subject: [Biopython-dev] GSoC Project Update -- 10 Message-ID: Hi everyone, I've just posted two new updates for my GSoC project, here: http://bow.web.id/blog/2012/07/parsing-blast-plain-text-files-in-searchio/ and here: http://bow.web.id/blog/2012/07/exonerate-in-searchio/ The first one is about a somewhat unofficial new format to be supported by SearchIO: the BLAST plain text output. I know that current Biopython text parser is obsoleted, but I figure it still could be useful for some to have a similar model in SearchIO. It is unofficial since it's basically a wrapper around the current parser, and after discussing things with Peter, it doesn't seem wise to say that we officially support parsing the format. Especially when NCBI itself does not guarantee a stable style between each BLAST release. 
I should note that I've also made a small change to the current NCBIStandalone code, as there were some problems when I tried to parse BLAST 2.2.26+ text output with multiple queries.

The second one is about the program I've been spending most of my time on: Exonerate. We now have three Exonerate formats that SearchIO can parse and index: `exonerate-text`, for human-readable alignments, `exonerate-vulgar`, for vulgar lines, and `exonerate-cigar`, for cigar lines. It's one of the more interesting formats I've been working on so far :), since it has so much information in it. I've tried to capture them as sensibly as possible, and I made a small demonstration using it in my post.

In addition to writing these two formats, I've also written their tests. Now, having finished almost all of the parsers, I'm planning to devote more time to start writing the documentation during the coming weeks.

regards,
Bow

From p.j.a.cock at googlemail.com Thu Jul 19 05:17:17 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 19 Jul 2012 10:17:17 +0100
Subject: [Biopython-dev] [Biopython] when is a SeqRecord not a SeqRecord
In-Reply-To: <87hat4ly7j.fsf@fastmail.fm>
References: <87y5mhkxtc.fsf@fastmail.fm> <87hat4ly7j.fsf@fastmail.fm>
Message-ID:

On Wed, Jul 18, 2012 at 7:29 PM, Brad Chapman wrote:
>
> Dilara;
> Apologies, I missed that the second mail had updated code.
>
>> This works as you pointed out because filtered_rec is explicitly defined.
>> Now if I want to do this
>>
>> from Bio import SeqIO
>> mod = (check_meanQ(rec, q_thresholdd) for rec in
>> SeqIO.parse("hiseq_pe_test.fastq", "fastq"))
>> count = SeqIO.write(mod, "filtered_hiseq_pe_test.fastq", "fastq")
>> print "Modified %i records" %count
>>
>> It doesn't work because of some of the records are None. So I tried doing
>> this
>
> The approach I'd take it to clean up check_meanQ to be explicit about
> the return values:
>
>> def check_meanQ(rec, q_threshold):
>>     seqlen=len(rec)
>>     quality_scores=array(rec.letter_annotations["phred_quality"])
>>     if round(quality_scores.mean()) <= q_threshold:
>>         print "Discarded ", rec.id, "because mean Q was",
>> round(quality_scores.mean())
>>         badrec = None
>>     if round(quality_scores.mean()) > q_threshold:
>>         goodrec = rec
>>
>>     return goodrec
>
> def check_meanQ(rec, q_threshold):
>     quality_scores=array(rec.letter_annotations["phred_quality"])
>     if round(quality_scores.mean()) <= q_threshold:
>         print "Discarded ", rec.id, "because mean Q was", round(quality_scores.mean())
>         return None
>     else:
>         return rec

That should work - although since you are not actually modifying the record at all I'd suggest a check function returning a boolean (True or False). Then you could use this in a generator expression like this:

    def check_meanQ(rec, q_threshold):
        quality_scores = array(rec.letter_annotations["phred_quality"])
        return round(quality_scores.mean()) > q_threshold

    records = SeqIO.parse("hiseq_pe_test.fastq", "fastq")
    count = SeqIO.write((x for x in records if check_meanQ(x, q_threshold)),
                        "filtered_hiseq_pe_test.fastq", "fastq")

(Untested - there could be a typo in there)

Peter.

P.S. Since this isn't directly about new development work on Biopython itself, the main mailing list would be more appropriate for this kind of question in future. Thanks

From p.j.a.cock at googlemail.com Sun Jul 22 10:19:31 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 22 Jul 2012 15:19:31 +0100
Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations
Message-ID:

Dear all,

One of the 'warts' in the current SeqRecord/SeqFeature object model is how non-trivial features are stored - in particular joins (in the terminology of GenBank/EMBL).
Previous discussions include: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005830.html ... http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009183.html http://lists.open-bio.org/pipermail/biopython-dev/2011-October/009221.html Consider a single gene like this from NC_000932 in our test suite: complement(join(97999..98793,69611..69724)) Currently that becomes three SeqFeature objects, a parent object present in the SeqRecord's feature list, and two child objects (one for each exon) within that parent feature's sub_features list. The parent feature gets a location which summarises the span, so start 97999-1 (Pythonic counting), end 69724, and strand -1. This usage of the sub_features property in this way has been present in Biopython for a very long time, and prevents us using it for nesting features based on the parent/child relationship models used in GFF (e.g. gene and CDS, or gene, mRNA, CDS, and exon). As Brad and I had discussed, a new separate mechanism might be added for explicit parent/child relationships between SeqFeature objects useful for GFF3, since the current name sub_features has this historical baggage. Suggestion ========== What I had proposed was we get rid of sub_features (deprecate it, so for the next couple of releases our parser and BioSQL access will populate it, but our writers and the BioSQL loader will ignore it) and replace it with a new subclass of the FeatureLocation object specifically for these compound locations. This will then map much more closely to the tables used in BioSQL, and therefore I suspect the BioPerl object model too. Once the sub_feature support is dropped, the objects for the complement(join(97999..98793,69611..69724)) example becomes just one SeqFeature object, whose location is a new CompoundLocation containing two parts (the two exons). Note that in order to handle mixed strand features and to make iteration etc simpler, the parts are stored in the biological order (5' to 3'). 
To put this another way, for this example I find it helps to think about example this as the old EMBL variant form of the location string: join(complement(69611..69724),complement(97999..98793)) i.e. The first part of this gene (the 5' end of the gene) is complement(69611..69724), and the last part (with the 3' end of the gene) is complement(97999..98793). For iteration over the bases of this CompoundLocation you'd get 69723, 69723, ..., 69610 (the first exon), then 98792, ..., 97998 (the second exon) which is exactly what happens now when iterating over the parent SeqFeature. This is what I have tried to do on this branch: https://github.com/peterjc/biopython/tree/f_loc4 As part of this, adding two FeatureLocations will give a CompoundLocation - similarly you can add a simple FeatureLocation and a CompoundLocation or two CompoundLocation objects. I think this makes creating a SeqFeature describing a Eukaryotic gene model MUCH simpler than with the existing approach. (A potential refinement not implemented yet would be to merge abutting exact locations automatically, so that adding 123..456 and 457..999 would give 123..999 instead of join(123..456,457..999), but that might be too much magic?) Impact ====== What does this mean for Biopython users? It will only really affect people using annotated nucleotide files, (i.e. GenBank or EMBL files), and only those doing anything clever with 'join' type features. The deprecation process will allow scripts just reading files to continue to be used unmodified in the short term. However, as the branch currently stands, scripts building SeqFeature objects using sub_features would have to be updated immediately. I believe this is only going to affect a handful of people though, and will (once done) simplify their code. Thoughts? I've tried to balance backwards compatibility with providing something more intuitive - and fixing this should help with merging the GFF support. 
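To make the proposed iteration order concrete, here is a small stand-alone sketch of the idea. The `SimpleLocation` and `JoinedLocation` classes are invented for this illustration and are not the classes on the f_loc4 branch; they only model exact positions, a strand, addition, and iteration over parts stored in 5' to 3' order.

```python
class SimpleLocation:
    """Toy exact location using 0-based, right-open coordinates."""

    def __init__(self, start, end, strand=1):
        self.start, self.end, self.strand = start, end, strand

    def __len__(self):
        return self.end - self.start

    def __iter__(self):
        # Reverse strand parts are walked from high to low coordinates.
        if self.strand == -1:
            return iter(range(self.end - 1, self.start - 1, -1))
        return iter(range(self.start, self.end))

    def __add__(self, other):
        if isinstance(other, SimpleLocation):
            return JoinedLocation([self, other])
        if isinstance(other, JoinedLocation):
            return JoinedLocation([self] + other.parts)
        return NotImplemented


class JoinedLocation:
    """Toy compound location: an ordered list of simple parts."""

    def __init__(self, parts):
        self.parts = list(parts)

    def __len__(self):
        return sum(len(part) for part in self.parts)

    def __iter__(self):
        for part in self.parts:
            for position in part:
                yield position

    def __add__(self, other):
        if isinstance(other, SimpleLocation):
            return JoinedLocation(self.parts + [other])
        if isinstance(other, JoinedLocation):
            return JoinedLocation(self.parts + other.parts)
        return NotImplemented


# complement(join(97999..98793,69611..69724)), parts given 5' to 3':
first_exon = SimpleLocation(69610, 69724, strand=-1)
second_exon = SimpleLocation(97998, 98793, strand=-1)
gene = first_exon + second_exon

positions = list(gene)
print(len(gene))                       # 909
print(positions[0], positions[113])    # 69723 69610
print(positions[114], positions[-1])   # 98792 97998
```

The 1-based GenBank coordinates become 0-based starts here (69611 becomes 69610), matching the "Pythonic counting" mentioned above.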
Peter From chapmanb at 50mail.com Mon Jul 23 09:05:34 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 23 Jul 2012 09:05:34 -0400 Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations In-Reply-To: References: Message-ID: <87629emxup.fsf@fastmail.fm> Peter; Thanks for working through the sub_feature issue and coming up with this proposal. I'm 100% on board with converting over to something more general and this looks like a great approach. A couple of quick thoughts: - Would it be possible to have a back-compatible 'sub_features' that reconstituted features based on the compound location? This could help us avoid breaking scripts that use sub_features, even if we no longer fill those in going forward. - How do you envision storing GFF feature hierarchies? The location object is more lightweight with only position and strand information. Nested child GFF features would have key/value pairs associated with them as well. Would we want to use sub_features (or some new nested structure) for these? Brad > Dear all, > > One of the 'warts' in the current SeqRecord/SeqFeature object > model is how non-trivial features are stored - in particular joins > (in the terminology of GenBank/EMBL). > > Previous discussions include: > http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005830.html > ... > http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009183.html > http://lists.open-bio.org/pipermail/biopython-dev/2011-October/009221.html > > Consider a single gene like this from NC_000932 in our test > suite: complement(join(97999..98793,69611..69724)) > > Currently that becomes three SeqFeature objects, a parent > object present in the SeqRecord's feature list, and two child > objects (one for each exon) within that parent feature's > sub_features list. The parent feature gets a location which > summarises the span, so start 97999-1 (Pythonic counting), > end 69724, and strand -1. 
>
> [...]
> > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Mon Jul 23 12:02:45 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 23 Jul 2012 17:02:45 +0100 Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations In-Reply-To: <87629emxup.fsf@fastmail.fm> References: <87629emxup.fsf@fastmail.fm> Message-ID: On Mon, Jul 23, 2012 at 2:05 PM, Brad Chapman wrote: > > Peter; > Thanks for working through the sub_feature issue and coming up with this > proposal. I'm 100% on board with converting over to something more > general and this looks like a great approach. > > A couple of quick thoughts: > > - Would it be possible to have a back-compatible 'sub_features' that > reconstituted features based on the compound location? This could help > us avoid breaking scripts that use sub_features, even if we no longer > fill those in going forward. When you say 'use' do you mean populate and modify? Use in the read-only sense is already covered - in that any Biopython code generating complex SeqFeature objects would (in the short term) populate both the sub_feature AND the new compound location. Things get very hairy if we want to support edits to the sub_features also automatically updating the new compound location (and vice versa). So I don't want to do that. > - How do you envision storing GFF feature hierarchies? The location > object is more lightweight with only position and strand information. Only in the simple cases. In addition to single line GFF features, you have joins expressed by multiple GFF lines with a common ID. Also, it seems quite possible that GFF3 will add a new tag entry to describe fuzzy locations in future, see e.g. 
this thread http://sourceforge.net/mailarchive/message.php?msg_id=28240013 > Nested child GFF features would have key/value pairs associated with > them as well. Would we want to use sub_features (or some new nested > structure) for these? Absolutely a new nested structure - reusing sub_features would just cause too much confusion. This might be done with a parent attribute and/or a children list - perhaps with weak references to avoid garbage collection problems with freeing memory. Peter From kai.blin at biotech.uni-tuebingen.de Tue Jul 24 03:34:47 2012 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Tue, 24 Jul 2012 09:34:47 +0200 Subject: [Biopython-dev] How to add unit tests Message-ID: <500E5017.5020303@biotech.uni-tuebingen.de> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi folks, I've sent Wibowo a patch implementing a parser for yet another format. He asked for some tests, and I'm happy to provide them. Or at least I would be if I was clear on how to add them. Some modules seem to use doctests, some seem to have something home-grown. Where would I put the sequence files to parse during the tests? Hope you can shed some light on this, Kai - -- Dipl.-Inform. 
Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universit?t T?bingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 T?bingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQEcBAEBAgAGBQJQDlAXAAoJEKM5lwBiwTTPI7AH/RWsVXeymP2r6WuDeyzCL+oe S9OEqy7hQGc89ktd1HLn8LVid4baA5f31zPXaPsBdjwFfZT/8l3khjXp3JhOOsQJ wsKsqS985MiswkI0ZzTc598LhOt0oVz2cCPynLFFpj8K9f9OL5PdFKm9owS1urmP 919TBaRX7AWN/qyv3vCztMwvxrMYPz6hKw78oHikJP+i6rtEKYVyVYrvtqBBn0E4 7J/Hfkh+aqAgYR1YlWYCrNlHGM6xJpXmwwIPZp1C1Fgb2sFPsXcHLEQi9KydB7SK m+fosoow40BJbIerBYyUNGOcAkW5yuObLk99UYcYq26LEhUjDcpqNM8C2OtW4NI= =pu0a -----END PGP SIGNATURE----- From w.arindrarto at gmail.com Tue Jul 24 04:11:15 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 24 Jul 2012 10:11:15 +0200 Subject: [Biopython-dev] How to add unit tests In-Reply-To: <500E5017.5020303@biotech.uni-tuebingen.de> References: <500E5017.5020303@biotech.uni-tuebingen.de> Message-ID: Hi Kai, Indeed there seem to be different kinds of testing in the Biopython tests suite. I can't really answer for test suites other than SearchIO, so I'll try to explain what I have in mind for SearchIO. In general, SearchIO has been using the unittest module, with roughly one test file being tested by one test function. Here are the 'consensus' that I'm using: 1. All test files are stored in a folder in the 'Tests' folder according to the program name. For example, HMMER files are all stored in the HMMER folder. 2. In each of those program-specific folder, there is a README file listing the test files and what they are. 3. The naming scheme of the test files may differ slightly between each program, but they are always consistent in the same program. 
Taking another example from the HMMER file, you can see that they are named like so: 'format_version_program_number.out'. So the first test file for the text output format from hmmpfam version 2.11 would be named "text_211_hmmpfam_001.out". 4. As for the contents of these files, it is basically up to you. I myself try to cover at least cases where there are single and multiple queries, and try to make them as short as possible (although sometimes it's not really that short). 4. The python test file itself are named like so: 'test_SearchIO_{format}.py'. This test file only tests parsing and/or reading-related code, and maybe some format-specific tests. If one program has several formats that differ slightly, they are grouped in one test file named 'test_SearchIO_{program}.py' Tests for indexing and writing are written in 'test_SearchIO_index.py' and 'test_SearchIO_write.py' for now. There's also a file called 'search_tests_common.py' that tests for equality between two different QueryResult objects (all their attributes and the items they contain), but so far this is only used in indexing and writing tests. 5. As for the doctests, they are meant to use the files in each program-specific folder as well. You are free to add extra files that showcases the important features of your parser; a file that's not used by the actual unittest suite. However, as you can see, the doctests are very little at the moment, as I am also still in the process of writing them. For now, I'm prioritizing the unittests first. I hope that helps :), and thanks again for the patch! regards, Bow On Tue, Jul 24, 2012 at 9:34 AM, Kai Blin wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi folks, > > I've sent Wibowo a patch implementing a parser for yet another format. > He asked for some tests, and I'm happy to provide them. Or at least I > would be if I was clear on how to add them. Some modules seem to use > doctests, some seem to have something home-grown. 
Where would I put > the sequence files to parse during the tests? > > Hope you can shed some light on this, > Kai > > - -- > Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de > Institute for Microbiology and Infection Medicine > Division of Microbiology/Biotechnology > Eberhard-Karls-Universit?t T?bingen > Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 > D-72076 T?bingen Fax : ++49 7071 29-5979 > Germany > Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.10 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iQEcBAEBAgAGBQJQDlAXAAoJEKM5lwBiwTTPI7AH/RWsVXeymP2r6WuDeyzCL+oe > S9OEqy7hQGc89ktd1HLn8LVid4baA5f31zPXaPsBdjwFfZT/8l3khjXp3JhOOsQJ > wsKsqS985MiswkI0ZzTc598LhOt0oVz2cCPynLFFpj8K9f9OL5PdFKm9owS1urmP > 919TBaRX7AWN/qyv3vCztMwvxrMYPz6hKw78oHikJP+i6rtEKYVyVYrvtqBBn0E4 > 7J/Hfkh+aqAgYR1YlWYCrNlHGM6xJpXmwwIPZp1C1Fgb2sFPsXcHLEQi9KydB7SK > m+fosoow40BJbIerBYyUNGOcAkW5yuObLk99UYcYq26LEhUjDcpqNM8C2OtW4NI= > =pu0a > -----END PGP SIGNATURE----- > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Tue Jul 24 05:33:14 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Jul 2012 10:33:14 +0100 Subject: [Biopython-dev] How to add unit tests In-Reply-To: <500E5017.5020303@biotech.uni-tuebingen.de> References: <500E5017.5020303@biotech.uni-tuebingen.de> Message-ID: On Tue, Jul 24, 2012 at 8:34 AM, Kai Blin wrote: > Hi folks, > > I've sent Wibowo a patch implementing a parser for yet another format. > He asked for some tests, and I'm happy to provide them. Or at least I > would be if I was clear on how to add them. Some modules seem to use > doctests, some seem to have something home-grown. Where would I put > the sequence files to parse during the tests? 
>
> Hope you can shed some light on this,
> Kai

There is a whole chapter on our testing setup in the Tutorial,
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

In short:

All the recent unit tests use the standard library unittest module. Older unit tests use a home grown print-and-compare approach where we have a copy of the expected output as a file on disk.

Input files (of reasonable size, with provenance information if possible - i.e. where they came from) are under the Tests folder in sub-directories by type or module. In the case of hmmer2, put them somewhere based on where Bow is putting his hmmer3 files.

We also use doctests (in the code) for short illustrative examples with no external dependencies. We do use doctest style embedded examples which do have dependencies (e.g. network access), but we don't run them as unit tests to avoid test failures.

We also use doctests in the LaTeX source of the tutorial, run via test_Tutorial.py - again only things without dependencies.

Peter

From arklenna at gmail.com Tue Jul 24 12:57:34 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 24 Jul 2012 12:57:34 -0400
Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations
In-Reply-To:
References:
Message-ID:

On Sun, Jul 22, 2012 at 10:19 AM, Peter Cock wrote:
> Dear all,
>
> One of the 'warts' in the current SeqRecord/SeqFeature object
> model is how non-trivial features are stored - in particular joins
> (in the terminology of GenBank/EMBL).
>
> Previous discussions include:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005830.html
> ...
> [...]
>
> Thoughts?
I've tried to balance backwards compatibility > with providing something more intuitive - and fixing this > should help with merging the GFF support. > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev Hi Peter, I have been testing the new CompoundLocation w.r.t. coordinate mapping and for the most part, I find it simplifies things. The documentation suggests using + to combine FeatureLocations, which invites the use of sum. However, sum doesn't work properly. I explain why in my StackOverflow question: http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior I have considered a number of workarounds: 1. Implementing __radd__ on FeatureLocation to return self if other == 0 allows sum() to work in place, but I am uncomfortable with hard-coding such a condition. 2. Changing the location to subclass set and use xrange for generation would easily allow a number of things: an empty location (FeatureLocation(0,0) prints as [0:0]), union for iteration, and the 'magic' of merging abutting locations that you mention. However, using + and sum() on sets is dubious from a mathematically pure standpoint, and this would only work for ExactPositions. Note that I haven't attempted this yet and it may have disadvantages even for ExactPositions that I've failed to imagine. Let me know your thoughts. 
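A minimal sketch of the sum() behaviour described above, using a toy stand-in class rather than the real FeatureLocation (ToyLocation and ToyLocationWithRadd are hypothetical names, and the join{...} output only imitates the style shown in this thread). sum() starts from the integer 0, so its first step is 0 + location, which fails unless the class defines __radd__:

```python
from functools import reduce
import operator


class ToyLocation:
    """Hypothetical stand-in for a location; each part is a (start, end) tuple."""

    def __init__(self, *parts):
        self.parts = list(parts)

    def __add__(self, other):
        # Adding two locations concatenates their parts.
        if isinstance(other, ToyLocation):
            return ToyLocation(*(self.parts + other.parts))
        return NotImplemented

    def __repr__(self):
        text = ", ".join("[%i:%i]" % part for part in self.parts)
        return text if len(self.parts) == 1 else "join{%s}" % text


class ToyLocationWithRadd(ToyLocation):
    """Workaround 1: treat 0 + location as the location itself."""

    def __radd__(self, other):
        if other == 0:
            return self
        return NotImplemented


locations = [ToyLocation((3, 6)), ToyLocation((10, 13)), ToyLocation((20, 25))]

# With no __radd__, the implicit start value of 0 makes sum() raise TypeError.
try:
    sum(locations)
    sum_failed = False
except TypeError:
    sum_failed = True

# Alternatives that need no change to the class:
combined_reduce = reduce(operator.add, locations)
combined_start = sum(locations[1:], locations[0])

# Workaround 1 lets plain sum() work.
combined_radd = sum([ToyLocationWithRadd((3, 6)), ToyLocationWithRadd((10, 13))])
```

The reduce() and explicit-start forms avoid special-casing zero inside the class, at the cost of being less obvious to a user who reaches for sum() first.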
Lenna From p.j.a.cock at googlemail.com Tue Jul 24 13:19:31 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Jul 2012 18:19:31 +0100 Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations In-Reply-To: References: Message-ID: On Tue, Jul 24, 2012 at 5:57 PM, Lenna Peterson wrote: >> This is what I have tried to do on this branch: >> https://github.com/peterjc/biopython/tree/f_loc4 >> >> As part of this, adding two FeatureLocations will give a >> CompoundLocation - similarly you can add a simple >> FeatureLocation and a CompoundLocation or two >> CompoundLocation objects. I think this makes creating >> a SeqFeature describing a Eukaryotic gene model >> MUCH simpler than with the existing approach. >> >> (A potential refinement not implemented yet would be >> to merge abutting exact locations automatically, so that >> adding 123..456 and 457..999 would give 123..999 >> instead of join(123..456,457..999), but that might be >> too much magic?) > > Hi Peter, > > I have been testing the new CompoundLocation w.r.t. coordinate mapping > and for the most part, I find it simplifies things. That's encouraging. > The documentation suggests using + to combine FeatureLocations, which > invites the use of sum. However, sum doesn't work properly. I explain > why in my StackOverflow question: > http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior Huh, I hadn't anticipated that - but I agree trying to use sum seems natural. > I have considered a number of workarounds: > > 1. Implementing __radd__ on FeatureLocation to return self if other == > 0 allows sum() to work in place, but I am uncomfortable with > hard-coding such a condition. Another idea is to define FeatureLocation or CompoundFeature addition with an integer to expose the current private method _shift. i.e. Apply an offset to the co-ordinates. Something I'd been pondering as a (previously unrelated) enhancement. 
In this interpretation, adding zero would have no effect on the co-ordinates and thus as a side effect should also make sum(locations) work. We'd need to test this to see if that actually works. > 2. Changing the location to subclass set and use xrange for generation > would easily allow a number of things: an empty location > (FeatureLocation(0,0) prints as [0:0]), union for iteration, and the > 'magic' of merging abutting locations that you mention. However, using > + and sum() on sets is dubious from a mathematically pure standpoint, > and this would only work for ExactPositions. Note that I haven't > attempted this yet and it may have disadvantages even for > ExactPositions that I've failed to imagine. > > Let me know your thoughts. I wouldn't think of FeatureLocation(0,0) aka [0:0] as an empty location, but rather as a between location - in this case between the last and first base on a circular genome. In Genbank notation for a circular genome of length 1234, this would be 1234^1 (already an annoying special case we have to handle in the parser and the writer - although I'd have to check the code to see if we store this as [0:0] or [1234:1234] since both make sense). On the other hand, a CompoundLocation with zero parts might make sense. There is something to be said for simply have a single (upgraded) FeatureLocation object with a parts list, which in the typical case would be length one, and proxy methods for start/end as currently defined in CompoundLocation. Maybe I should try that on another branch... it might be more elegant overall. Peter From matthew.tien89 at gmail.com Tue Jul 24 14:36:05 2012 From: matthew.tien89 at gmail.com (Matthew Tien) Date: Tue, 24 Jul 2012 13:36:05 -0500 Subject: [Biopython-dev] Extended Amino Acid Chains Message-ID: To whom it may concern, I am currently developing a program in Biopython that creates amino acid chains from an inputted AA sequence. 
The program would output a single amino acid chain in an extended conformation. Is this something of interest to the developers of Biopython? I am using basic calculus to calculate the position of the atoms in the protein residues and using known protein geometries and database information from PDB.org and the Dunbrack group (http://dunbrack.fccc.edu/). This program is an extension of my current research in calculating Relative Solvent Accessibilities of protein residues. Thank you for your time, Matthew Tien -- B.S. Biochemistry, University of Texas at Austin PhD. student, University of Chicago Marcotte Lab and Wilke Group alt. Matthew.Tien at yahoo.com 361-876-0942 From rodrigo.faccioli at gmail.com Tue Jul 24 15:01:58 2012 From: rodrigo.faccioli at gmail.com (Rodrigo Faccioli) Date: Tue, 24 Jul 2012 16:01:58 -0300 Subject: [Biopython-dev] Extended Amino Acid Chains In-Reply-To: References: Message-ID: Hi Matthew, If I understood your email correctly, I have already implemented this. However, it has not been put into the Biopython project yet. At the moment, I don't have time to do it alone. In [1] there is an example of my code. In my project I extended the Biopython classes and wrote my own parser because I had to work with files and a database in the same code. I therefore believed that was the best approach :-). So, if this is what you wanted, we can work together to put it into the Biopython project.
[1] https://dl.dropbox.com/u/4270818/workingFcfrpStructure.py Best regards, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structural Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-8739 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 Personal Blogg - http://rodrigofaccioli.blogspot.com/ On Tue, Jul 24, 2012 at 3:36 PM, Matthew Tien wrote: > To whom it may concern, > > I am currently developing a program in Biopython that creates amino acid > chains from an inputted AA sequence. The program would output a single > amino acid chain in an extended conformation. Is this something of interest > to the developers of Biopython? > > I am using basic calculus to calculate the position of the atoms in the > protein residues and using known protein geometries and database > information from PDB.org and the Dunbrack group >. > This program is an extension of my current research in calculating Relative > Solvent Accessibilities of protein residues. > > Thank you for your time, > Matthew Tien > > -- > B.S. Biochemistry, University of Texas at Austin > PhD. student, University of Chicago > Marcotte Lab and Wilke Group > alt. Matthew.Tien at yahoo.com > 361-876-0942 > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From arklenna at gmail.com Tue Jul 24 17:08:44 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 24 Jul 2012 17:08:44 -0400 Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations In-Reply-To: References: Message-ID: >> The documentation suggests using + to combine FeatureLocations, which >> invites the use of sum. However, sum doesn't work properly. 
I explain >> why in my StackOverflow question: >> http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior > > Huh, I hadn't anticipated that - but I agree trying to use sum seems > natural. > >> I have considered a number of workarounds: >> >> 1. Implementing __radd__ on FeatureLocation to return self if other == >> 0 allows sum() to work in place, but I am uncomfortable with >> hard-coding such a condition. > > Another idea is to define FeatureLocation or CompoundFeature > addition with an integer to expose the current private method _shift. > i.e. Apply an offset to the co-ordinates. Something I'd been pondering > as a (previously unrelated) enhancement. In this interpretation, adding > zero would have no effect on the co-ordinates and thus as a side > effect should also make sum(locations) work. We'd need to test this > to see if that actually works. Yes, this works fine: Modifying FeatureLocation.__add__ with the condition: if isinstance(other, int): return self._shift(other) and adding FeatureLocation.__radd__: def __radd__(self, other): return self.__add__(other) After these changes, FeatureLocation(3,6) + 3 yields [6:9] and sum([FeatureLocation(3,6), FeatureLocation(10,13)]) yields join{[3:6], [10:13]}. (+ of FeatureLocations also still works, as does summing lists with length > 2) > >> 2. Changing the location to subclass set and use xrange for generation >> would easily allow a number of things: an empty location >> (FeatureLocation(0,0) prints as [0:0]), union for iteration, and the >> 'magic' of merging abutting locations that you mention. However, using >> + and sum() on sets is dubious from a mathematically pure standpoint, >> and this would only work for ExactPositions. Note that I haven't >> attempted this yet and it may have disadvantages even for >> ExactPositions that I've failed to imagine. >> >> Let me know your thoughts. 
> > I wouldn't think of FeatureLocation(0,0) aka [0:0] as an empty > location, but rather as a between location - in this case between > the last and first base on a circular genome. In Genbank notation > for a circular genome of length 1234, this would be 1234^1 > (already an annoying special case we have to handle in the > parser and the writer - although I'd have to check the code > to see if we store this as [0:0] or [1234:1234] since both make > sense). If the length is 1234, [1234] would be an index error. I don't think [1233:1233] would make sense either; for space-counted genomic coordinates (http://alternateallele.blogspot.co.uk/2012/03/genome-coordinate-conventions.html), the index refers to the space to the left of the base pair. By that convention, [0:0] would refer to the gap between the last base and the first base. > > On the other hand, a CompoundLocation with zero parts might > make sense. There is something to be said for simply have > a single (upgraded) FeatureLocation object with a parts list, > which in the typical case would be length one, and proxy > methods for start/end as currently defined in CompoundLocation. > Maybe I should try that on another branch... it might be more > elegant overall. > I haven't tested sum() on CompoundLocations but I would guess they would need similar treatment to FeatureLocation. Should CompoundLocation + int also shift each part? I agree that an "upgraded" FeatureLocation could be more elegant. From p.j.a.cock at googlemail.com Tue Jul 24 17:38:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Jul 2012 22:38:59 +0100 Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations In-Reply-To: References: Message-ID: On Tue, Jul 24, 2012 at 10:08 PM, Lenna Peterson wrote: >>> The documentation suggests using + to combine FeatureLocations, which >>> invites the use of sum. However, sum doesn't work properly. 
I explain >>> why in my StackOverflow question: >>> http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior >> >> Huh, I hadn't anticipated that - but I agree trying to use sum seems >> natural. >> >>> I have considered a number of workarounds: >>> >>> 1. Implementing __radd__ on FeatureLocation to return self if other == >>> 0 allows sum() to work in place, but I am uncomfortable with >>> hard-coding such a condition. >> >> Another idea is to define FeatureLocation or CompoundFeature >> addition with an integer to expose the current private method _shift. >> i.e. Apply an offset to the co-ordinates. Something I'd been pondering >> as a (previously unrelated) enhancement. In this interpretation, adding >> zero would have no effect on the co-ordinates and thus as a side >> effect should also make sum(locations) work. We'd need to test this >> to see if that actually works. > > Yes, this works fine: > > Modifying FeatureLocation.__add__ with the condition: > > if isinstance(other, int): > return self._shift(other) > > and adding FeatureLocation.__radd__: > > def __radd__(self, other): > return self.__add__(other) > > After these changes, FeatureLocation(3,6) + 3 yields [6:9] and > sum([FeatureLocation(3,6), FeatureLocation(10,13)]) yields join{[3:6], > [10:13]}. (+ of FeatureLocations also still works, as does summing > lists with length > 2) OK - good. That might be worthwhile then. >>> 2. Changing the location to subclass set and use xrange for generation >>> would easily allow a number of things: an empty location >>> (FeatureLocation(0,0) prints as [0:0]), union for iteration, and the >>> 'magic' of merging abutting locations that you mention. However, using >>> + and sum() on sets is dubious from a mathematically pure standpoint, >>> and this would only work for ExactPositions. Note that I haven't >>> attempted this yet and it may have disadvantages even for >>> ExactPositions that I've failed to imagine. 
>>> >>> Let me know your thoughts. >> >> I wouldn't think of FeatureLocation(0,0) aka [0:0] as an empty >> location, but rather as a between location - in this case between >> the last and first base on a circular genome. In Genbank notation >> for a circular genome of length 1234, this would be 1234^1 >> (already an annoying special case we have to handle in the >> parser and the writer - although I'd have to check the code >> to see if we store this as [0:0] or [1234:1234] since both make >> sense). > > If the length is 1234, [1234] would be an index error. I don't think > [1233:1233] would make sense either; for space-counted genomic > coordinates (http://alternateallele.blogspot.co.uk/2012/03/genome-coordinate-conventions.html), > the index refers to the space to the left of the base pair. By that > convention, [0:0] would refer to the gap between the last base and the > first base. The point is that with a circular sequence of length n, base 0 is also base n, so [0:0] is sort of the same as [n:n], or [n:0]. Of these I guess [0,0] is the most sensible representation for following Python norms. But we digress - this certainly isn't an 'empty location', something which doesn't really make sense (other than in the sense of None meaning missing data). >> >> On the other hand, a CompoundLocation with zero parts might >> make sense. There is something to be said for simply have >> a single (upgraded) FeatureLocation object with a parts list, >> which in the typical case would be length one, and proxy >> methods for start/end as currently defined in CompoundLocation. >> Maybe I should try that on another branch... it might be more >> elegant overall. >> > > I haven't tested sum() on CompoundLocations but I would guess they > would need similar treatment to FeatureLocation. Should > CompoundLocation + int also shift each part? If we make those changes to the FeatureLocation, then yes, the CompoundLocation should get them too. 
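A minimal sketch of the offset interpretation discussed above, where adding an integer shifts the coordinates and a compound location shifts each of its parts. ToyLocation and ToyCompound are hypothetical stand-ins, not the classes on the f_loc4 branch; only the _shift name comes from this thread, and its body here is an assumption:

```python
class ToyLocation:
    """Hypothetical simple location with integer start and end."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def _shift(self, offset):
        # Return a new location moved by offset; the original is unchanged.
        return ToyLocation(self.start + offset, self.end + offset)

    def __add__(self, other):
        if isinstance(other, int):
            return self._shift(other)
        if isinstance(other, ToyLocation):
            return ToyCompound([self, other])
        if isinstance(other, ToyCompound):
            return ToyCompound([self] + other.parts)
        return NotImplemented

    def __radd__(self, other):
        # Reached for int + location, e.g. the initial 0 used by sum().
        if isinstance(other, int):
            return self._shift(other)
        return NotImplemented

    def __repr__(self):
        return "[%i:%i]" % (self.start, self.end)


class ToyCompound:
    """Hypothetical compound location holding an ordered list of parts."""

    def __init__(self, parts):
        self.parts = list(parts)

    def _shift(self, offset):
        # Shifting a compound location shifts every part by the same offset.
        return ToyCompound([part._shift(offset) for part in self.parts])

    def __add__(self, other):
        if isinstance(other, int):
            return self._shift(other)
        if isinstance(other, ToyLocation):
            return ToyCompound(self.parts + [other])
        if isinstance(other, ToyCompound):
            return ToyCompound(self.parts + other.parts)
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, int):
            return self._shift(other)
        return NotImplemented

    def __repr__(self):
        return "join{%s}" % ", ".join(repr(part) for part in self.parts)


shifted = ToyLocation(3, 6) + 3
joined = sum([ToyLocation(3, 6), ToyLocation(10, 13)])
shifted_compound = joined + 100
```

With these definitions, adding zero is simply a shift by zero, so sum() works as a side effect without a hard-coded "other == 0" test.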
> I agree that an "upgraded" FeatureLocation could be more > elegant. It could turn out to be simpler having just one location object... certainly worth trying out before committing this branch as is. Peter From anaryin at gmail.com Wed Jul 25 03:59:52 2012 From: anaryin at gmail.com (João Rodrigues) Date: Wed, 25 Jul 2012 09:59:52 +0200 Subject: [Biopython-dev] Extended Amino Acid Chains In-Reply-To: References: Message-ID: Hey Matthew, Rodrigo, The only problem I see with incorporating such a "feature" in Biopython is that you would need a topology and parameters for the amino acids, and these are often force-field dependent. Therefore, the quantity of data to add to the distribution would be quite big and you'd need someone to keep updating it as force fields (ffs) evolve. Or am I seeing this from a completely wrong angle? Cheers, João [...] Rodrigues http://nmr.chem.uu.nl/~joao 2012/7/24 Rodrigo Faccioli > Hi Mathew, > > If I understood your email, I have already implemented it. However, it is > not put into BioPython project yet. In this moment, I don't have time to do > it alone. > > In [1] there is an example of my code. In my project I extended the > BioPython classes and created my parser because I had to work with files > and database in same code. Therefore, I believed that it was the best :-). > > So, if it is what you wanted, we can work together to put it into BioPython > project.
> > [1] https://dl.dropbox.com/u/4270818/workingFcfrpStructure.py > > Best regards, > > -- > Rodrigo Antonio Faccioli > Ph.D Student in Electrical Engineering > University of Sao Paulo - USP > Engineering School of Sao Carlos - EESC > Department of Electrical Engineering - SEL > Intelligent System in Structural Bioinformatics > http://laips.sel.eesc.usp.br > Phone: 55 (16) 3373-8739 > Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 > Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 > Personal Blogg - http://rodrigofaccioli.blogspot.com/ > > > > On Tue, Jul 24, 2012 at 3:36 PM, Matthew Tien >wrote: > > > To whom it may concern, > > > > I am currently developing a program in Biopython that creates amino acid > > chains from an inputted AA sequence. The program would output a single > > amino acid chain in an extended conformation. Is this something of > interest > > to the developers of Biopython? > > > > I am using basic calculus to calculate the position of the atoms in the > > protein residues and using known protein geometries and database > > information from PDB.org and the Dunbrack group < > http://dunbrack.fccc.edu/ > > >. > > This program is an extension of my current research in calculating > Relative > > Solvent Accessibilities of protein residues. > > > > Thank you for your time, > > Matthew Tien > > > > -- > > B.S. Biochemistry, University of Texas at Austin > > PhD. student, University of Chicago > > Marcotte Lab and Wilke Group > > alt. 
Matthew.Tien at yahoo.com > > 361-876-0942 > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From w.arindrarto at gmail.com Thu Jul 26 15:04:03 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 26 Jul 2012 21:04:03 +0200 Subject: [Biopython-dev] Biopython.org down? Message-ID: Hi everyone, I have been trying to access the main site (biopython.org) since yesterday night to no avail. Upon checking http://www.downforeveryoneorjustme.com/biopython.org, it seems like the site is really down. And it's not just biopython, apparently all other open-bio sites (open-bio.org, bioperl.org, bioruby.org) are down as well. Does anybody know what's going on? regards, Bow From p.j.a.cock at googlemail.com Fri Jul 27 17:03:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 27 Jul 2012 22:03:56 +0100 Subject: [Biopython-dev] Biopython.org down? In-Reply-To: References: Message-ID: Yes, as mentioned on Twitter the fiber cable connection of the hosting site was severed in an accident - which also took our mailing list server offline as well as the websites :( Looks like everything is back now :) Peter On Thu, Jul 26, 2012 at 8:04 PM, Wibowo Arindrarto wrote: > Hi everyone, > > I have been trying to access the main site (biopython.org) since yesterday > night to no avail. Upon checking > http://www.downforeveryoneorjustme.com/biopython.org, it seems like the > site is really down. And it's not just biopython, apparently all other > open-bio sites (open-bio.org, bioperl.org, bioruby.org) are down as well. > > Does anybody know what's going on? 
> > regards, > Bow > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From w.arindrarto at gmail.com Fri Jul 27 17:07:39 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 27 Jul 2012 23:07:39 +0200 Subject: [Biopython-dev] Biopython.org down? In-Reply-To: References: Message-ID: Hi Lenna and Peter, Ah yes, I saw the tweet from @Biopython some time after I sent the email. I knew it was pretty bad when I didn't get the usual mail-received notification. Anyway, good to see it's online now :). regards, Bow On Fri, Jul 27, 2012 at 11:03 PM, Peter Cock wrote: > Yes, as mentioned on Twitter the fiber cable connection of the hosting > site was severed in an accident - which also took our mailing list server > offline as well as the websites :( > > Looks like everything is back now :) > > Peter > > On Thu, Jul 26, 2012 at 8:04 PM, Wibowo Arindrarto > wrote: > > Hi everyone, > > > > I have been trying to access the main site (biopython.org) since > yesterday > > night to no avail. Upon checking > > http://www.downforeveryoneorjustme.com/biopython.org, it seems like the > > site is really down. And it's not just biopython, apparently all other > > open-bio sites (open-bio.org, bioperl.org, bioruby.org) are down as > well. > > > > Does anybody know what's going on? > > > > regards, > > Bow > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From arklenna at gmail.com Fri Jul 27 17:23:50 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Fri, 27 Jul 2012 17:23:50 -0400 Subject: [Biopython-dev] GSoC python variant update 8 Message-ID: It appears that this email didn't make it to the list due to the catastrophe yesterday. I apologize if anyone receives two copies! 
Link: http://arklenna.tumblr.com/post/28082157403/ Post: I previously proposed the implementation of a method for PyVCF that would quickly scan the entire file and provide useful summary statistics. The idea is shamelessly copied from Brad's GFF parser (see https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this method is helpful because the annotations on a sequence can vary widely. However, I no longer think this would be useful for VCF: 1. Most importantly, the VCF headers generally contain a complete listing of all of the types of information contained in the file. It's technically optional, but I hope that the most commonly used variant callers produce accurate headers. However, if there is a prevalence of files with a mismatch between headers and actual INFO/FORMAT fields, please let me know. 2. Next, any listing of ranges of data such as POS or QUAL might as well be coupled with actual filtering. This would be different if a presentation of the distribution of quality scores would be necessary to set an appropriate threshold. It would also depend on the ratio of speed between the range scan and the filtering (i.e. whether a possible second filter would be unacceptably time consuming). 3. Finally, and perhaps most importantly, many files are so large that scanning an entire file would take too long. Setting a limit and displaying updated information in real time (i.e. writing to `sys.stdout` with '\r', https://gist.github.com/3161269 ) could overcome this issue. If any VCF users can think of a great reason to scan a VCF file before filtering it, please get in touch. ------- I added the method `as_SeqFeature()` to my basic variant class, but it's still incomplete. Some of this is in flux due to forthcoming changes to FeatureLocation. I'm currently working on expanding the coordinate mapper Reece posted to the dev list a couple years ago (see http://biopython.org/pipermail/biopython/2010-June/006598.html ). Expect an update on that very soon. 
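A minimal sketch of the limited scan with in-place progress mentioned in point 3, assuming plain tab-separated VCF text and using only the standard library. scan_vcf_lines is a hypothetical helper, not part of PyVCF, and for simplicity it reports a single POS range without separating chromosomes:

```python
import io
import sys


def scan_vcf_lines(lines, limit=None, progress_every=1000, out=sys.stdout):
    """Collect POS and QUAL ranges from VCF-like lines, up to `limit` records."""
    count = 0
    pos_range = None
    qual_range = None
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])  # column 2 is POS
        if pos_range is None:
            pos_range = (pos, pos)
        else:
            pos_range = (min(pos_range[0], pos), max(pos_range[1], pos))
        if fields[5] != ".":  # column 6 is QUAL; "." marks a missing value
            qual = float(fields[5])
            if qual_range is None:
                qual_range = (qual, qual)
            else:
                qual_range = (min(qual_range[0], qual), max(qual_range[1], qual))
        count += 1
        if count % progress_every == 0:
            # "\r" returns to the start of the line so the counter updates in place.
            out.write("\r%i records scanned" % count)
            out.flush()
        if limit is not None and count >= limit:
            break
    out.write("\r%i records scanned\n" % count)
    return {"records": count, "pos": pos_range, "qual": qual_range}


example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "20\t14370\t.\tG\tA\t29\tPASS\t.\n"
    "20\t17330\t.\tT\tA\t3\tq10\t.\n"
    "20\t1110696\t.\tA\tG\t67\tPASS\t.\n"
)
summary = scan_vcf_lines(example, limit=2, progress_every=1)
```

Because the scan stops at the limit, the reported ranges describe only the records seen so far, which is the trade-off against scanning a very large file completely.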
Best, Lenna From arklenna at gmail.com Thu Jul 26 18:30:35 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Thu, 26 Jul 2012 18:30:35 -0400 Subject: [Biopython-dev] GSoC python variant update 8 Message-ID: Link: http://arklenna.tumblr.com/post/28082157403/ Post: I previously proposed the implementation of a method for PyVCF that would quickly scan the entire file and provide useful summary statistics. The idea is shamelessly copied from Brad's GFF parser (see https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this method is helpful because the annotations on a sequence can vary widely. However, I no longer think this would be useful for VCF: 1. Most importantly, the VCF headers generally contain a complete listing of all of the types of information contained in the file. It's technically optional, but I hope that the most commonly used variant callers produce accurate headers. However, if there is a prevalence of files with a mismatch between headers and actual INFO/FORMAT fields, please let me know. 2. Next, any listing of ranges of data such as POS or QUAL might as well be coupled with actual filtering. This would be different if a presentation of the distribution of quality scores would be necessary to set an appropriate threshold. It would also depend on the ratio of speed between the range scan and the filtering (i.e. whether a possible second filter would be unacceptably time consuming). 3. Finally, and perhaps most importantly, many files are so large that scanning an entire file would take too long. Setting a limit and displaying updated information in real time (i.e. writing to `sys.stdout` with '\r', https://gist.github.com/3161269 ) could overcome this issue. If any VCF users can think of a great reason to scan a VCF file before filtering it, please get in touch. ------- I added the method `as_SeqFeature()` to my basic variant class, but it's still incomplete. Some of this is in flux due to forthcoming changes to FeatureLocation. 
I'm currently working on expanding the coordinate mapper Reece posted to the dev list a couple years ago (see http://biopython.org/pipermail/biopython/2010-June/006598.html ). Expect an update on that very soon. Best, Lenna From chris.mit7 at gmail.com Fri Jul 27 19:17:13 2012 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Fri, 27 Jul 2012 19:17:13 -0400 Subject: [Biopython-dev] GSoC python variant update 8 In-Reply-To: References: Message-ID: Sorry for my brevity, but one great reason to scan a VCF file is to know where your variants are for downstream analysis. For instance, when analyzing RNA-Seq data for features such as Allele Specific Expression, having quick access to where variants are located is essential. On Thu, Jul 26, 2012 at 6:30 PM, Lenna Peterson wrote: > Link: http://arklenna.tumblr.com/post/28082157403/ > > Post: > > I previously proposed the implementation of a method for PyVCF that > would quickly scan the entire file and provide useful summary > statistics. The idea is shamelessly copied from Brad's GFF parser (see > https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this > method is helpful because the annotations on a sequence can vary > widely. However, I no longer think this would be useful for VCF: > > 1. Most importantly, the VCF headers generally contain a complete > listing of all of the types of information contained in the file. It's > technically optional, but I hope that the most commonly used variant > callers produce accurate headers. However, if there is a prevalence of > files with a mismatch between headers and actual INFO/FORMAT fields, > please let me know. > > 2. Next, any listing of ranges of data such as POS or QUAL might as > well be coupled with actual filtering. This would be different if a > presentation of the distribution of quality scores would be necessary > to set an appropriate threshold. It would also depend on the ratio of > speed between the range scan and the filtering (i.e. 
whether a > possible second filter would be unacceptably time consuming). > > 3. Finally, and perhaps most importantly, many files are so large that > scanning an entire file would take too long. Setting a limit and > displaying updated information in real time (i.e. writing to > `sys.stdout` with '\r', https://gist.github.com/3161269 ) could > overcome this issue. > > If any VCF users can think of a great reason to scan a VCF file before > filtering it, please get in touch. > > ------- > > I added the method `as_SeqFeature()` to my basic variant class, but > it's still incomplete. Some of this is in flux due to forthcoming > changes to FeatureLocation. > > I'm currently working on expanding the coordinate mapper Reece posted > to the dev list a couple years ago (see > http://biopython.org/pipermail/biopython/2010-June/006598.html ). > Expect an update on that very soon. > > Best, > > Lenna > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From rodrigo.faccioli at gmail.com Wed Jul 25 14:16:57 2012 From: rodrigo.faccioli at gmail.com (Rodrigo Faccioli) Date: Wed, 25 Jul 2012 15:16:57 -0300 Subject: [Biopython-dev] Extended Amino Acid Chains In-Reply-To: References: Message-ID: Hi Joao, What I understood about the Tien's idea, your angle is correct. However, I would like to say is that the use of force-field could be implemented in biopython through xml files since each xml file represents a version of ff. I'm not an expertise in ff. In fact, I have been studying only charmm27 mainly its implementation at gromacs. So, I believe that we can base on gromacs topology files and create a specific xml file for charmm27, for example. Maybe create a parser to read these gromacs files. Furthermore, the use of ff in biopython could be used in other implementations such as checking the structure. 
In this way, we could create a command such as check_charmm27(structure), which would return a list of errors in the structure according to the charmm27 ff. Does that make sense? This email is only an idea. Best regards, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structural Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-8739 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 Personal Blogg - http://rodrigofaccioli.blogspot.com/ On Wed, Jul 25, 2012 at 4:59 AM, João Rodrigues wrote: > Hey Matthew, Rodrigo, > > The only problem I see with incorporating such "feature" in Biopython is > that you would need a topology and parameters for the aminoacids and these > are often forcefield dependent. Therefore, the quantity of data to add to > the distribution would be quite big and you'd need someone to keep updating > it as ffs evolve. Or am I seeing this from a completely wrong angle? > > Cheers, > > João [...] Rodrigues > http://nmr.chem.uu.nl/~joao > > > > > 2012/7/24 Rodrigo Faccioli > >> Hi Mathew, >> >> If I understood your email, I have already implemented it. However, it is >> not put into BioPython project yet. In this moment, I don't have time to do >> it alone. >> >> In [1] there is an example of my code. In my project I extended the >> BioPython classes and created my parser because I had to work with files >> and database in same code. Therefore, I believed that it was the best :-). >> >> So, if it is what you wanted, we can work together to put it into >> BioPython >> project.
>> >> [1] https://dl.dropbox.com/u/4270818/workingFcfrpStructure.py >> >> Best regards, >> >> -- >> Rodrigo Antonio Faccioli >> Ph.D Student in Electrical Engineering >> University of Sao Paulo - USP >> Engineering School of Sao Carlos - EESC >> Department of Electrical Engineering - SEL >> Intelligent System in Structural Bioinformatics >> http://laips.sel.eesc.usp.br >> Phone: 55 (16) 3373-8739 >> Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 >> Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 >> Personal Blogg - http://rodrigofaccioli.blogspot.com/ >> >> >> >> On Tue, Jul 24, 2012 at 3:36 PM, Matthew Tien > >wrote: >> >> > To whom it may concern, >> > >> > I am currently developing a program in Biopython that creates amino acid >> > chains from an inputted AA sequence. The program would output a single >> > amino acid chain in an extended conformation. Is this something of >> interest >> > to the developers of Biopython? >> > >> > I am using basic calculus to calculate the position of the atoms in the >> > protein residues and using known protein geometries and database >> > information from PDB.org and the Dunbrack group < >> http://dunbrack.fccc.edu/ >> > >. >> > This program is an extension of my current research in calculating >> Relative >> > Solvent Accessibilities of protein residues. >> > >> > Thank you for your time, >> > Matthew Tien >> > >> > -- >> > B.S. Biochemistry, University of Texas at Austin >> > PhD. student, University of Chicago >> > Marcotte Lab and Wilke Group >> > alt. 
Matthew.Tien at yahoo.com >> > 361-876-0942 >> > _______________________________________________ >> > Biopython-dev mailing list >> > Biopython-dev at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > > From jeff.hussmann at gmail.com Mon Jul 30 18:03:58 2012 From: jeff.hussmann at gmail.com (Jeff Hussmann) Date: Mon, 30 Jul 2012 17:03:58 -0500 Subject: [Biopython-dev] back_table in Bio.Data.CodonTable Message-ID: Hello all - Bio.Data.CodonTable currently has a variable back_table that provides a mapping from an amino acid to single (arbitrary?) codon that encodes the amino acid. Is there any interest in adding a full_back_table (or some other suitable name) that would provide a mapping from an amino acid to a list of all codons that encode it? If so, I will submit a pull request. I have been using this myself for some projects on synonymous codon usage. - Jeff From p.j.a.cock at googlemail.com Tue Jul 31 05:06:14 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 31 Jul 2012 10:06:14 +0100 Subject: [Biopython-dev] back_table in Bio.Data.CodonTable In-Reply-To: References: Message-ID: On Mon, Jul 30, 2012 at 11:03 PM, Jeff Hussmann wrote: > Hello all - > > Bio.Data.CodonTable currently has a variable back_table that provides > a mapping from an amino acid to single (arbitrary?) codon that encodes > the amino acid. The current code (which I doubt is widely used) does pick an arbitrary codon (using a sort to ensure this is consistent between Python versions). As noted in the comments, there are more useful alternatives - but the example of doing this on usage frequency is organism specific so can't be hard coded. 
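For illustration only, a full back table along the lines Jeff describes could be sketched as below. This is a self-contained sketch, not code from Bio.Data.CodonTable: the names `build_forward_table` and `build_full_back_table` are made up, the standard genetic code is spelled out by hand in NCBI's TCAG listing order, and filing the stop codons under '*' is a choice made for this sketch. The codons for each amino acid are sorted in 'TCAG' order so that the result is deterministic.

```python
from itertools import product

# Assumption for this sketch: bases in the conventional 'TCAG' order.
BASES = "TCAG"
# NCBI translation table 1 (standard code), one letter per codon,
# listed with the first base varying slowest and the third fastest.
STANDARD_AAS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)


def build_forward_table():
    """Return a dict mapping each of the 64 codons to an amino acid (or '*')."""
    codons = ("".join(triplet) for triplet in product(BASES, repeat=3))
    return dict(zip(codons, STANDARD_AAS))


def build_full_back_table(forward_table, base_order=BASES):
    """Return a dict mapping each amino acid to a list of all its codons.

    Codons are sorted position by position according to base_order,
    so the lists come out the same on every run.
    """
    rank = {base: index for index, base in enumerate(base_order)}
    full_back_table = {}
    for codon in sorted(forward_table, key=lambda c: [rank[b] for b in c]):
        full_back_table.setdefault(forward_table[codon], []).append(codon)
    return full_back_table


if __name__ == "__main__":
    table = build_full_back_table(build_forward_table())
    for amino_acid in sorted(table):
        print(amino_acid, table[amino_acid])
```

Passing a different `base_order` (for example 'GATC') changes only the order of the codons within each list, not which codons are listed.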
> Is there any interest in adding a full_back_table (or some other > suitable name) that would provide a mapping from an amino acid to a > list of all codons that encode it? If so, I will submit a pull > request. I have been using this myself for some projects on synonymous > codon usage. Something like that would be more useful - sure, do a pull request from the current master branch. See also past discussions about back translation of sequences, e.g. http://lists.open-bio.org/pipermail/biopython/2012-April/007901.html Peter From p.j.a.cock at googlemail.com Tue Jul 31 06:37:35 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 31 Jul 2012 11:37:35 +0100 Subject: [Biopython-dev] Travis Continuous Integration testing & pull requests Message-ID: Hi all, I'm cross posting as this is an announcement. Please keep any follow up discussion to the relevant project specific mailing list, or if general open-bio-l please. Those following the OBF blog or the OBF or Bio* Twitter accounts will have already seen this, which I posted yesterday: http://news.open-bio.org/news/2012/07/travis-ci-for-testing/ In summary, since earlier this year BioRuby and then Biopython and BioPerl have been using Travis-CI.org (a hosted continuous integration service for the open source community) to run their unit tests automatically whenever their GitHub repositories are updated. 
In addition we now have TravisCI automatically running our tests on any new GitHub pull requests - supported by an OBF donation to Travis-CI, see: http://about.travis-ci.org/blog/announcing-pull-request-support/ Currently BioJava only uses GitHub as an SVN mirror - but this should still let you start using TravisCI for automated testing: http://about.travis-ci.org/docs/user/languages/java/ For EMBOSS, this is another incentive to convert from CVS to github - TravisCI recently announced support for C/C++ projects: http://about.travis-ci.org/blog/support_for_go_c_and_cpp/ http://about.travis-ci.org/docs/user/languages/c/ Potentially there are other OBF projects where this would be useful too. Regards, Peter From jeff.hussmann at gmail.com Tue Jul 31 15:07:42 2012 From: jeff.hussmann at gmail.com (Jeff Hussmann) Date: Tue, 31 Jul 2012 14:07:42 -0500 Subject: [Biopython-dev] back_table in Bio.Data.CodonTable In-Reply-To: References: Message-ID: It seems desirable to have each amino acid's list of codons be given in a deterministic order. I have been sorting lexicographically using the ordering 'TCAG'. This is referred to as the 'conventional ordering' in CodonTable.__str__. The most flexible solution would be to take the ordering from self.nucleotide_alphabet.letters, but this would give 'GATC' for any CodonTable using IUPAC.unambiguous_dna as its nucleotide alphabet. Are there any Biopython-wide conventions here? On Tue, Jul 31, 2012 at 4:06 AM, Peter Cock wrote: > On Mon, Jul 30, 2012 at 11:03 PM, Jeff Hussmann wrote: >> Hello all - >> >> Bio.Data.CodonTable currently has a variable back_table that provides >> a mapping from an amino acid to single (arbitrary?) codon that encodes >> the amino acid. > > The current code (which I doubt is widely used) does pick an arbitrary > codon (using a sort to ensure this is consistent between Python versions). 
> As noted in the comments, there are more useful alternatives - but the > example of doing this on usage frequency is organism specific so > can't be hard coded. > >> Is there any interest in adding a full_back_table (or some other >> suitable name) that would provide a mapping from an amino acid to a >> list of all codons that encode it? If so, I will submit a pull >> request. I have been using this myself for some projects on synonymous >> codon usage. > > Something like that would be more useful - sure, do a pull request > from the current master branch. > > See also past discussions about back translation of sequences, e.g. > http://lists.open-bio.org/pipermail/biopython/2012-April/007901.html > > Peter From zcharlop at mail.rockefeller.edu Tue Jul 31 20:37:27 2012 From: zcharlop at mail.rockefeller.edu (Zachary Charlop-Powers) Date: Wed, 1 Aug 2012 00:37:27 +0000 Subject: [Biopython-dev] Genome Diagram Default Behavior Message-ID: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> Hello Biopython, I am writing about a small feature that I would like to see implemented (and could possibly help to implement: I haven't contributed before and am not sure exactly how tough this will be). When using Genome Diagram to draw features you can specify which strand to put a feature on. If the strand is positive it will go above the track in the positive-facing direction and if negative it will go below the track in the negative-facing direction (see http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc200). That's a great behavior. However, if you use strand="None", Genome Diagram will draw the features inline with the track and always in the positive direction. For myself, and probably others, keeping the direction of the features is immensely useful as you can often get a sense of operon structure in prokaryote genomes just by looking at the genes.
Of course the forward and the minus strands can be drawn but condensing small sections of genes to a single track saves space when making images. So, would it be possible to change the default behavior of Genome Diagram to draw features inline (strand="None"), but to preserve their orientation? best, zach cp From chapmanb at 50mail.com Mon Jul 2 10:36:39 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 02 Jul 2012 06:36:39 -0400 Subject: [Biopython-dev] [GSoC] GSoC python variant update 6 In-Reply-To: References: Message-ID: <874npqo3ew.fsf@fastmail.fm> Lenna; Thanks for the updates and thoughts. I like the direction you're moving after taking everything you've learned from the SQL experiments. My general suggestions would be: - Leverage PyVCF for all of the backend parsing. We want to remain compatible with this since merging/interfacing with the work James and everyone is doing is a primary goal. Keeping a similar code structure is a great way to facilitate this. - For HGVS the general idea is to not be too tied to the VCF format, so I wouldn't worry about strict compatibility but rather use it to inform choices where you feel that things are mirroring VCF structure rather than more general variant representation. > Another question that may reveal my complete ignorance of haplotypes > and such: could a polyploid site ever be partially phased? e.g. a > triploid genotype of 0/1|0? It's possible but this is kind of a fringe case right now so I wouldn't especially worry about it. Thanks again, Brad From redmine at redmine.open-bio.org Tue Jul 3 08:59:57 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 3 Jul 2012 08:59:57 +0000 Subject: [Biopython-dev] [Biopython - Bug #3368] (New) Bio.GenBank format writer creates invalid start_codon entries. Message-ID: Issue #3368 has been reported by Kai Blin. ---------------------------------------- Bug #3368: Bio.GenBank format writer creates invalid start_codon entries. 
https://redmine.open-bio.org/issues/3368 Author: Kai Blin Status: New Priority: Normal Assignee: Kai Blin Category: Main Distribution Target version: URL: ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb185.release.notes When writing genbank output, the module adds quotes around the codon_start qualifier. This isn't according to the genbank file spec (see URL), and it also breaks the parser of another program I need to use.
BioPython generates:
                     /codon_start="1"
while it should generate:
                     /codon_start=1
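As a rough illustration of the behaviour being asked for (this is not the Bio.GenBank writer code), a qualifier formatter could leave the quotes off for a fixed set of keys. The names `NO_QUOTE_KEYS` and `format_qualifier` are hypothetical, the keys are copied from the BioPerl list quoted further down in this report, and the 21-space indent simply mirrors the two example lines above.

```python
# Hypothetical sketch, not the actual Bio.GenBank implementation.
# Keys copied from the BioPerl %FTQUAL_NO_QUOTE list quoted in this report.
NO_QUOTE_KEYS = frozenset([
    "anticodon", "citation", "codon", "codon_start", "cons_splice",
    "direction", "evidence", "label", "mod_base", "number",
    "rpt_type", "rpt_unit", "transl_except", "transl_table", "usedin",
])


def format_qualifier(key, value, indent=21):
    """Return one feature qualifier line, quoting the value only if needed.

    The indent default is an assumption taken from the example lines
    in this bug report.
    """
    if key in NO_QUOTE_KEYS:
        text = "/%s=%s" % (key, value)
    else:
        text = '/%s="%s"' % (key, value)
    return " " * indent + text


if __name__ == "__main__":
    print(format_qualifier("codon_start", 1))
    print(format_qualifier("product", "hypothetical protein"))
```

Keeping the list in one module-level set means the writer needs only a membership test per qualifier, and the set can be extended without touching the formatting logic.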
Looking over the Bio.GenBank code, it seems like I need to introduce some special handling for un-quoted qualifiers. I'd suggest using the same list as BioPerl:

our %FTQUAL_NO_QUOTE = map {$_ => 1} qw(
    anticodon           citation
    codon               codon_start
    cons_splice         direction
    evidence            label
    mod_base            number
    rpt_type            rpt_unit
    transl_except       transl_table
    usedin
    );
(see https://github.com/bioperl/bioperl-live/blob/master/Bio/SeqIO/genbank.pm#L193) I'll try to come up with a patch and will be happy for feedback on implementation, coding conventions and the like. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From hughesadam87 at gmail.com Tue Jul 3 19:19:04 2012 From: hughesadam87 at gmail.com (Adam Hughes) Date: Tue, 3 Jul 2012 15:19:04 -0400 Subject: [Biopython-dev] Conserved Domains Database Support Message-ID: Hi everyone, I'm new to the BioPython library and was wondering if there was any support for the conserved domains database from NCBI? In particular, the superfamily batch files that their webtool releases. A Google search shows there was some interest in this back in 2008; however, people were mainly interested in parsing the HTML output of CDD searches. Now that CDD offers a nice, regular downloadable datatype, has any BioPython support been implemented to work with this? If not, I'd like to contribute. The data is in a simple tab-delimited format of domain alignments, e.g.: Q#10000 0 >WHL22.364604.0 superfamily 212291 7 290 1.01528e-138 401.1 cl09099 P-loop_NTPase superfamily 0 I had envisioned a simple class of mainly getters/setters with a few methods such as sorting by Query batches. ~Adam From p.j.a.cock at googlemail.com Tue Jul 3 22:03:29 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 3 Jul 2012 23:03:29 +0100 Subject: [Biopython-dev] Conserved Domains Database Support In-Reply-To: References: Message-ID: On Tue, Jul 3, 2012 at 8:19 PM, Adam Hughes wrote: > Hi everyone, > > I'm new to the BioPython library and was wondering if there was any support > for the conserved domains database from NCBI?
In particular, the > superfamily batch files that their webtool releases. Doing a Google > search, there was some interest for this back in 2008; however, they were > mainly interested in parsing the HTML output of CDD searches. HTML scrapers were always a bit of a pain :( > Now that CDD > offers a nice, regular downloadable datatype, has any BioPython support > been implemented to work with this? > > If not, I'd like to contribute. > > The data is simple tab-delmited formats of domain alignments, E.G.: > > Q#10000 0 >WHL22.364604.0 superfamily 212291 7 290 > 1.01528e-138 401.1 cl09099 P-loop_NTPase superfamily > 0 > > I had envisioned a simple class of mainly getters/setters with a few > methods such as sorting by Query batches. > > ~Adam That is interesting - and offers to work on Biopython are always nice. Is this a file giving domain definitions (HMM or whatever CDD uses), or precomputed search results for different query sequences? Maybe a URL would help - I've not looked at this resource for quite a while. I used to use the rpsblast tool to run local (offline) searches against CDD databases, and that offered several BLAST output flavours. Peter P.S. I'll be away with intermittent email access for the rest of the week. From w.arindrarto at gmail.com Wed Jul 4 13:03:01 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 4 Jul 2012 15:03:01 +0200 Subject: [Biopython-dev] GSoC Project Update -- 9 Message-ID: Hello everyone, The past week I have been working to add PSL parsing support and I've just posted my update here: http://bow.web.id/blog/2012/07/initial-blat-support/ Currently, we have parsing, indexing, and writing support. But this could change (writing might not be supported) due to a possible change in the current object model. I've explained a bit about why this is the case in the post, but to summarize it here, it's because we haven't got a way to properly model segmented HSP sequences.
Peter and I have discussed this a bit, but we haven't figured out an elegant way to solve it for now. Aside from working on PSL, I also added more tests and started refactoring the code as it's starting to get messy. That's all for my update on the past week. For this week, I'll try to look into other formats and try to come up with possible solutions to the segmented HSP problem. regards, Bow From reece at harts.net Thu Jul 5 19:40:02 2012 From: reece at harts.net (Reece Hart) Date: Thu, 5 Jul 2012 12:40:02 -0700 Subject: [Biopython-dev] [GSoC] GSoC python variant update 6 In-Reply-To: References: Message-ID: On Fri, Jun 29, 2012 at 11:15 PM, Lenna Peterson wrote: > For a Python variant object, are there any organizational choices that > would make it easier for future conversion of a variant to HGVS > syntax? (this is primarily directed at Reece but I'm open to all > suggestions) > Oh, no, things directed at me! That's a broad question. I'll try to answer without being long-winded. The essential elements of a sequence variant are a reference to a sequence, the location, and specifics about the operation. The name, allelic depth, etc. are all distinct from these elements and I would store them separately in a format-specific record or as a subclass. I don't have much experience with FeatureLocations, but that might be appropriate. Depending on how far you plan to go with VCF, you'll have to deal with Locations for breakpoints. For the Occam's Razor version of a model for variation, I'd float this in the community: variation := And I'd test this against representing: - a single SNP in VCF - a compound het from VCF - a variant in RNA - a variant in CDS coords - a variant in a protein sequence - a trinucleotide repeat (Which the simple model above fails, BTW.)
What makes the uber variant problem hard, I think, is several competing design axes: 1) sequence type (DNA, RNA, protein), 2) coordinate systems (really, CDS in a transcript record), 3) diversity of variant types (SNV, indel, repeat, etc), 4) diversity of auxiliary data (e.g., genotype info from VCF). HGVS makes us think outside merely VCF data: in particular, it adds the nuance of coordinate systems and multiple sequence types. I suspect you should be considering mixins and/or subclassing for some of these needs. I don't know how to solve any of this complexity. What I do know is that 1) it's too much just for your project, 2) it would be nice to have a design that can be easily extended beyond your project, and 3) therefore, part of your project should be to pave the way for extensions without tackling them. It's also a good time to put stakes in the ground around internal conventions, such as variants are always represented using interbase coordinates (= 0-based, right-open). And, if you end up handling just VCF variants, that's cool too. -Reece From p.j.a.cock at googlemail.com Sun Jul 8 19:06:03 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 8 Jul 2012 20:06:03 +0100 Subject: [Biopython-dev] Fwd: [VCFtools-spec] The BCF2 quick reference document is up on the 1000G wiki In-Reply-To: <4FF5C31F.8080502@broadinstitute.org> References: <4FF5C31F.8080502@broadinstitute.org> Message-ID: This could be important for Lenna's GSoC project. Heng Li had developed the original binary VCF format, BCF, but IIRC he wasn't keen to push it as a standard - see also http://vcftools.sourceforge.net/specs.html and http://vcftools.sourceforge.net/bcf.pdf It looks like BCF2 could be more widely used... 
Peter ---------- Forwarded message ---------- From: Eric Banks Date: Thu, Jul 5, 2012 at 5:38 PM Subject: [VCFtools-spec] The BCF2 quick reference document is up on the 1000G wiki To: "1000ANALYSIS at LIST.NIH.GOV" <1000ANALYSIS at list.nih.gov>, "vcftools-spec at lists.sourceforge.net" Hi everyone, At the last 1000G meeting we discussed BCF2, the official binary version of VCF. The quick reference guide for BCF2 is now linked from the main VCF page on the 1000G wiki; you can access it directly here: http://www.1000genomes.org/sites/1000genomes.org/files/documents/bcfv2.pdf I take no credit for the document itself, which is really the work of Heng and Mark. At this point, both the GATK and samtools can produce BCF files (and they will soon become our standard output format). We encourage other producers of VCF to consider moving over to BCF2 too. Best, Eric -- Eric Banks, PhD Broad Institute of Harvard and MIT ebanks at broadinstitute.org 617-714-7636 ------------------------------------------------------------------------------ Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ _______________________________________________ VCFtools-spec mailing list VCFtools-spec at lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/vcftools-spec From arklenna at gmail.com Mon Jul 9 04:33:57 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 9 Jul 2012 00:33:57 -0400 Subject: [Biopython-dev] GSoC python variant update 7 Message-ID: Post: http://arklenna.tumblr.com/post/26812132902/ Synopsis: This week, I wrote a script for PyVCF that can filter a file by sample as it's being parsed. It's currently named `vcf_sample_filter.py`. 
It's designed to be functional from the command line, the Python interpreter, or as a module. Next up: come up with a generic-via-extensibility representation of a variant. I'm working through some examples and should have a basic outline soon. Lenna From p.j.a.cock at googlemail.com Mon Jul 9 11:33:44 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 9 Jul 2012 12:33:44 +0100 Subject: [Biopython-dev] Fwd: [VCFtools-spec] The BCF2 quick reference document is up on the 1000G wiki In-Reply-To: <87hathp42x.fsf@fastmail.fm> References: <4FF5C31F.8080502@broadinstitute.org> <87hathp42x.fsf@fastmail.fm> Message-ID: On Mon, Jul 9, 2012 at 12:27 PM, Brad Chapman wrote: > > Peter; > Thanks for the heads up. I'm excited about BCF2 and am hopeful it'll > help with some of the painful parts of VCF, like subsetting large files > by samples. There is also a page about it on the Broad wiki with more details: > > http://www.broadinstitute.org/gsa/wiki/index.php/BCF2 > > In terms of the representation, this stays close to VCF so shouldn't > change a lot of the API people see. The main changes would be on the > backend side where we'd like to be able to swap in and out BCF2 and VCF > (and GVF) transparently with no visible change to the programmer. > > Brad Yes - that's what we should be aiming for, much like the SAM/BAM duality which has worked really well for sequence alignments. Note that like BAM, BCF and BCF2 are both compressed with BGZF - support for which we included in Biopython 1.60. This can be combined with the Python struct module to parse the binary data (and with a little more effort will support both Python 2 and 3, see the SFF code for pointers or ask me). 
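To make the struct idea a little more concrete, here is a toy sketch (Python 3 only). Please note the assumptions: the record layout (a little-endian unsigned 32-bit length followed by that many payload bytes) is invented for this example and is NOT the BCF2 layout, the function names are made up, and plain gzip from the standard library stands in for BGZF, which is a gzip-compatible blocked variant.

```python
import gzip
import io
import struct

# Hypothetical record layout for illustration only (not BCF2):
# <uint32 little-endian length><payload bytes>
LENGTH_FORMAT = "<I"
LENGTH_SIZE = struct.calcsize(LENGTH_FORMAT)


def write_toy_records(payloads):
    """Pack each payload with a length prefix and gzip the whole stream."""
    raw = b"".join(
        struct.pack(LENGTH_FORMAT, len(payload)) + payload
        for payload in payloads
    )
    return gzip.compress(raw)


def read_toy_records(blob):
    """Decompress the stream and return the list of payloads."""
    records = []
    with gzip.GzipFile(fileobj=io.BytesIO(blob)) as handle:
        while True:
            header = handle.read(LENGTH_SIZE)
            if not header:
                break  # clean end of stream
            if len(header) < LENGTH_SIZE:
                raise ValueError("Truncated length prefix")
            (length,) = struct.unpack(LENGTH_FORMAT, header)
            payload = handle.read(length)
            if len(payload) < length:
                raise ValueError("Truncated payload")
            records.append(payload)
    return records


if __name__ == "__main__":
    blob = write_toy_records([b"chr1", b"rs123"])
    print(read_toy_records(blob))
```

The explicit "<" in the format string fixes the byte order and size regardless of platform, which is the main thing to get right when reading a binary format on different machines.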
Peter From chapmanb at 50mail.com Mon Jul 9 11:27:18 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 09 Jul 2012 07:27:18 -0400 Subject: [Biopython-dev] Fwd: [VCFtools-spec] The BCF2 quick reference document is up on the 1000G wiki In-Reply-To: References: <4FF5C31F.8080502@broadinstitute.org> Message-ID: <87hathp42x.fsf@fastmail.fm> Peter; Thanks for the heads up. I'm excited about BCF2 and am hopeful it'll help with some of the painful parts of VCF, like subsetting large files by samples. There is also a page about it on the Broad wiki with more details: http://www.broadinstitute.org/gsa/wiki/index.php/BCF2 In terms of the representation, this stays close to VCF so shouldn't change a lot of the API people see. The main changes would be on the backend side where we'd like to be able to swap in and out BCF2 and VCF (and GVF) transparently with no visible change to the programmer. Brad > This could be important for Lenna's GSoC project. > > Heng Li had developed the original binary VCF format, > BCF, but IIRC he wasn't keen to push it as a standard - > see also http://vcftools.sourceforge.net/specs.html and > http://vcftools.sourceforge.net/bcf.pdf > > It looks like BCF2 could be more widely used... > > Peter > > > ---------- Forwarded message ---------- > From: Eric Banks > Date: Thu, Jul 5, 2012 at 5:38 PM > Subject: [VCFtools-spec] The BCF2 quick reference document is up on > the 1000G wiki > To: "1000ANALYSIS at LIST.NIH.GOV" <1000ANALYSIS at list.nih.gov>, > "vcftools-spec at lists.sourceforge.net" > > > > Hi everyone, > > At the last 1000G meeting we discussed BCF2, the official binary version > of VCF. The quick reference guide for BCF2 is now linked from the main > VCF page on the 1000G wiki; you can access it directly here: > http://www.1000genomes.org/sites/1000genomes.org/files/documents/bcfv2.pdf > > I take no credit for the document itself, which is really the work of > Heng and Mark. 
At this point, both the GATK and samtools can produce > BCF files (and they will soon become our standard output format). We > encourage other producers of VCF to consider moving over to BCF2 too. > > Best, > Eric > > -- > Eric Banks, PhD > Broad Institute of Harvard and MIT > ebanks at broadinstitute.org > 617-714-7636 > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > VCFtools-spec mailing list > VCFtools-spec at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/vcftools-spec > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From redmine at redmine.open-bio.org Mon Jul 9 22:40:18 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 9 Jul 2012 22:40:18 +0000 Subject: [Biopython-dev] [Biopython - Bug #3368] Bio.GenBank format writer creates invalid start_codon entries. References: Message-ID: Issue #3368 has been updated by Peter Cock. Assignee changed from Kai Blin to Biopython Dev Mailing List Apologies for apparently ignoring you - I've just changed the assignee (back to) Biopython Dev Mailing List, since no-one was getting any of these updates by email. Well, I wasn't at least :( I'm wary about changing the parser to give integers instead of ints - that seems likely to break existing scripts. The whitelist approach in https://github.com/kblin/biopython/commit/4dec86810a42743967981b74c81a6fb8e17004e4 seems a better bet. 
---------------------------------------- Bug #3368: Bio.GenBank format writer creates invalid start_codon entries. https://redmine.open-bio.org/issues/3368 Author: Kai Blin Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb185.release.notes When writing genbank output, the module adds quotes around the codon_start qualifier. This isn't according to the genbank file spec (see URL), and it also breaks the parser of another program I need to use.
BioPython generates:
                     /codon_start="1"
while it should generate:
                     /codon_start=1
Looking over the Bio.GenBank code, it seems like I need to introduce some special handling for un-quoted qualifiers. I'd suggest using the same list as BioPerl:

our %FTQUAL_NO_QUOTE = map {$_ => 1} qw(
    anticodon           citation
    codon               codon_start
    cons_splice         direction
    evidence            label
    mod_base            number
    rpt_type            rpt_unit
    transl_except       transl_table
    usedin
    );
(see https://github.com/bioperl/bioperl-live/blob/master/Bio/SeqIO/genbank.pm#L193) I'll try to come up with a patch and will be happy for feedback on implementation, coding conventions and the like. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Mon Jul 9 22:49:46 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 9 Jul 2012 22:49:46 +0000 Subject: [Biopython-dev] [Biopython - Bug #3368] (Closed) Bio.GenBank format writer creates invalid start_codon entries. References: Message-ID: Issue #3368 has been updated by Kai Blin. Status changed from New to Closed % Done changed from 0 to 100 Applied in changeset commit:aa594ed9a85838d43ab321b756dff07bedfbb126. ---------------------------------------- Bug #3368: Bio.GenBank format writer creates invalid start_codon entries. https://redmine.open-bio.org/issues/3368 Author: Kai Blin Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb185.release.notes When writing genbank output, the module adds quotes around the codon_start qualifier. This isn't according to the genbank file spec (see URL), and it also breaks the parser of another program I need to use.
BioPython generates:
                     /codon_start="1"
while it should generate:
                     /codon_start=1
Looking over the Bio.GenBank code, it seems like I need to introduce some special handling for un-quoted qualifiers. I'd suggest using the same list as BioPerl:

our %FTQUAL_NO_QUOTE = map {$_ => 1} qw(
    anticodon           citation
    codon               codon_start
    cons_splice         direction
    evidence            label
    mod_base            number
    rpt_type            rpt_unit
    transl_except       transl_table
    usedin
    );
(see https://github.com/bioperl/bioperl-live/blob/master/Bio/SeqIO/genbank.pm#L193) I'll try to come up with a patch and will be happy for feedback on implementation, coding conventions and the like. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Tue Jul 10 06:28:27 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 10 Jul 2012 06:28:27 +0000 Subject: [Biopython-dev] [Biopython - Bug #3368] Bio.GenBank format writer creates invalid start_codon entries. References: Message-ID: Issue #3368 has been updated by Kai Blin. Peter Cock wrote: > Apologies for apparently ignoring you - I've just changed the assignee (back to) Biopython Dev Mailing List, since no-one was getting any of these updates by email. Well, I wasn't at least :( No worries. In turn, I just learned that Redmine merges the Bugzilla fields "assignee" and "QA contact" into one field, and the "QA contact" meaning is the more important one for the way this project is run. :) > > I'm wary about changing the parser to give integers instead of ints - that seems likely to break existing scripts. You mean integers instead of strings, I guess. But yes, I can see the danger of breaking existing scripts, seeing how I even had to fix the "we only parse strings" assumption twice in the remaining parser code. I'm happy with the patch you pushed, thanks a lot. ---------------------------------------- Bug #3368: Bio.GenBank format writer creates invalid start_codon entries. https://redmine.open-bio.org/issues/3368 Author: Kai Blin Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb185.release.notes When writing genbank output, the module adds quotes around the codon_start qualifier. 
This isn't according to the genbank file spec (see URL), and it also breaks the parser of another program I need to use.
BioPython generates:
                     /codon_start="1"
while it should generate:
                     /codon_start=1
Looking over the Bio.GenBank code, it seems like I need to introduce some special handling for un-quoted qualifiers. I'd suggest using the same list as BioPerl:

our %FTQUAL_NO_QUOTE = map {$_ => 1} qw(
    anticodon           citation
    codon               codon_start
    cons_splice         direction
    evidence            label
    mod_base            number
    rpt_type            rpt_unit
    transl_except       transl_table
    usedin
    );
(see https://github.com/bioperl/bioperl-live/blob/master/Bio/SeqIO/genbank.pm#L193) I'll try to come up with a patch and will be happy for feedback on implementation, coding conventions and the like. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Tue Jul 10 07:37:09 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 10 Jul 2012 07:37:09 +0000 Subject: [Biopython-dev] [Biopython - Bug #3368] Bio.GenBank format writer creates invalid start_codon entries. References: Message-ID: Issue #3368 has been updated by Peter Cock. We changed to RedMine from Bugzilla relatively recently - still learning its quirks ;) And yes, regarding the string/integer typo. ---------------------------------------- Bug #3368: Bio.GenBank format writer creates invalid start_codon entries. https://redmine.open-bio.org/issues/3368 Author: Kai Blin Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb185.release.notes When writing genbank output, the module adds quotes around the codon_start qualifier. This isn't according to the genbank file spec (see URL), and it also breaks the parser of another program I need to use.
BioPython generates:
                     /codon_start="1"
while it should generate:
                     /codon_start=1
Looking over the Bio.GenBank code, it seems like I need to introduce some special handling for un-quoted qualifiers. I'd suggest using the same list as BioPerl:

our %FTQUAL_NO_QUOTE = map {$_ => 1} qw(
    anticodon           citation
    codon               codon_start
    cons_splice         direction
    evidence            label
    mod_base            number
    rpt_type            rpt_unit
    transl_except       transl_table
    usedin
    );
(see https://github.com/bioperl/bioperl-live/blob/master/Bio/SeqIO/genbank.pm#L193) I'll try to come up with a patch and will be happy for feedback on implementation, coding conventions and the like. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed Jul 11 17:49:47 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 11 Jul 2012 17:49:47 +0000 Subject: [Biopython-dev] [Biopython - Feature #3219] Port Biopython documentation to Sphinx References: Message-ID: Issue #3219 has been updated by Wibowo Arindrarto. Hi everyone, Just as an FYI which may or may not be useful, I just stumbled on a Biopython sphinx documentation here: http://www.bio-cloud.info/Biopython/en/index.html. Its sphinx source says it was generated just about a year ago (July 2011). The creator has a personal webpage, but it's not really clear how to contact him/her. ---------------------------------------- Feature #3219: Port Biopython documentation to Sphinx https://redmine.open-bio.org/issues/3219 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: Currently we use Epydoc for the API reference documentation, and LaTeX (to PDF via pdflatex, and HTML via hevea) for the tutorial. There's some material on the wiki to consider, too. A number of Python projects, including CPython, now use Sphinx for documentation. Content is written in reStructuredText format, and can be pulled from both standalone .rst files and Python docstrings. 
This offers several advantages: (i) API documentation will be prettier and easier to navigate; (ii) the Tutorial will be easier to edit for those not fluent in LaTeX; (iii) Since the API reference and Tutorial will be written in the same markup, potentially even pulling from some shared sources, it will be easier to address redundant or overlapping portions between the two, avoiding inconsistencies. See: http://sphinx.pocoo.org/ http://docutils.sourceforge.net/ Mailing list discussion: http://lists.open-bio.org/pipermail/biopython-dev/2010-July/007977.html Numpy's approach: http://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed Jul 11 18:13:40 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 11 Jul 2012 18:13:40 +0000 Subject: [Biopython-dev] [Biopython - Feature #3219] Port Biopython documentation to Sphinx References: Message-ID: Issue #3219 has been updated by Peter Cock. Seems they were interested in translating it into Chinese, based on a Google translation of this post: http://www.bio-cloud.info/blog/?p=57 ---------------------------------------- Feature #3219: Port Biopython documentation to Sphinx https://redmine.open-bio.org/issues/3219 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: Currently we use Epydoc for the API reference documentation, and LaTeX (to PDF via pdflatex, and HTML via hevea) for the tutorial. There's some material on the wiki to consider, too. A number of Python projects, including CPython, now use Sphinx for documentation. Content is written in reStructuredText format, and can be pulled from both standalone .rst files and Python docstrings. 
This offers several advantages: (i) API documentation will be prettier and easier to navigate; (ii) the Tutorial will be easier to edit for those not fluent in LaTeX; (iii) Since the API reference and Tutorial will be written in the same markup, potentially even pulling from some shared sources, it will be easier to address redundant or overlapping portions between the two, avoiding inconsistencies. See: http://sphinx.pocoo.org/ http://docutils.sourceforge.net/ Mailing list discussion: http://lists.open-bio.org/pipermail/biopython-dev/2010-July/007977.html Numpy's approach: http://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From MatatTHC at gmx.de Sun Jul 15 13:56:33 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Sun, 15 Jul 2012 15:56:33 +0200 Subject: [Biopython-dev] SeqIO circular In-Reply-To: References: Message-ID: Hi, just wanted to ask if there is any new ideas wrt. this bug. I just wanted to get the residue type from a genbank file and realised that this is also not parsed. This reminded me of this bug. While searching the web I found that the problem also affects others. http://biopython.org/pipermail/biopython-dev/2011-July/009055.html I would really like to see some progress here. And of course I would like to help. But I do not know how. I tried to dig in the bioperl sources - but the problem is that I don't speak perl. Matthias 2012/5/3 Peter Cock : > > On Saturday, April 28, 2012, Matthias Bernt wrote: >> >> Dear developers, >> >> I would like to suggest a quick "fix" for the problem. Currently the >> parser just returns true per default for the circular property. This >> is a wrong piece of information for all circular sequences. 
>> Furthermore its not possible to detect if the parser did return true >> because it is its default value or if its really from the data. So I >> suggest to return None if the parser does not parse the information. >> >> What do you think? This should be possible with minimal effort. >> > > > The parsing side of this is trivial - the only piece missing is > how best to present the information in the SeqRecord for > BioSQL compatibility (and perhaps some extra work on our > BioSQL bindings). That requires someone to test where > BioPerl stores this in BioSQL (as that is the reference > implementation). > > Without that, a "quick fix" will mostly likely create a bug in > our BioSQL support - in that we wouldn't store the circular > field in the same way as the other Bio* implementations. > > Peter > From arklenna at gmail.com Tue Jul 17 17:48:33 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 17 Jul 2012 13:48:33 -0400 Subject: [Biopython-dev] GSoC python variant update 7 Message-ID: Hi all, New blog post: http://arklenna.tumblr.com/post/27418058203/ Last week, Reece suggested trying to represent a variety of variants with just five identifiers: accession, start, stop, pre_seq, and post_seq. I've started a very minimal Variant object (in https://github.com/lennax/biopython/blob/variant2/Bio/Variant/variant.py), using `FeatureLocation` for its location. This uses zero-based, right-open coordinates, similar to array counting in Python. In contrast, HGVS and VCF both count from 1. I've created a list of variant types each represented in HGVS, VCF (if possible), and my new Python representation. It can be found on the blog post. Please let me know if there are any errors in my interpretation of these variant types. 
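The coordinate difference described above can be made concrete with a small sketch. The helper names here are hypothetical (they are not part of Biopython, PyVCF, or the variant branch); they only illustrate converting a 1-based, inclusive interval, as used by HGVS and VCF, to zero-based, right-open coordinates and back:

```python
def one_based_to_zero_based(start, end):
    """Convert a 1-based, inclusive interval to 0-based, right-open.

    Hypothetical helper for illustration only.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the inclusive end equals the exclusive end.
    return start - 1, end


def zero_based_to_one_based(start, end):
    """Convert a 0-based, right-open interval to 1-based, inclusive.

    Empty intervals (start == end) have no 1-based inclusive form,
    so this sketch rejects them.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


seq = "ACGTACGT"
start, end = one_based_to_zero_based(3, 5)  # bases 3..5 in 1-based counting
print(seq[start:end])  # Python slicing uses the 0-based, right-open form
```

A convenient property of the right-open form is that the length of the interval is simply end - start, which matches Python slicing.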
Thanks, Lenna From redmine at redmine.open-bio.org Wed Jul 18 14:52:24 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 Jul 2012 14:52:24 +0000 Subject: [Biopython-dev] [Biopython - Bug #3374] (New) Newick.Tree.randomized not working Message-ID: Issue #3374 has been reported by Aleksey Kladov. ---------------------------------------- Bug #3374: Newick.Tree.randomized not working https://redmine.open-bio.org/issues/3374 Author: Aleksey Kladov Status: New Priority: Normal Assignee: Category: Target version: URL: My code is
from Bio.Phylo import BaseTree, Newick

t = Newick.Tree.randomized(5)
It throws an exception:
Traceback (most recent call last):
  File "/home/kladov/ab_lab/Rosalind/rosalind-problems/rosalind_problems/phyltree/__init__.py", line 55, in 
    t = Newick.Tree.randomized(5)
  File "/usr/local/lib/python2.7/dist-packages/Bio/Phylo/BaseTree.py", line 725, in randomized
    terminals.extend(newterms)
TypeError: 'NoneType' object is not iterable
It looks like the problem is here (file BaseTree.py):
newsplit = random.choice(terminals)
newterms = newsplit.split(branch_length=branch_length) #problem: split returns None...
if branch_stdev:
    # Add some noise to the branch lengths
    for nt in newterms:
        nt.branch_length = max(0,
            random.gauss(branch_length, branch_stdev))
terminals.remove(newsplit)
terminals.extend(newterms) # and now we try to extend with None =(
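The failure annotated in the snippet above can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical stand-in Clade class (not Bio.Phylo's own class), in which split() returns the children it creates so that the calling loop can extend its list of terminals:

```python
import random


class Clade:
    """Hypothetical stand-in for a tree node; not Bio.Phylo's class."""

    def __init__(self, branch_length=None):
        self.branch_length = branch_length
        self.clades = []

    def split(self, n=2, branch_length=1.0):
        """Attach n new child clades and return them as a list."""
        children = [Clade(branch_length=branch_length) for _ in range(n)]
        self.clades.extend(children)
        # Returning None here instead would reproduce the TypeError
        # raised by terminals.extend(newterms) in the caller.
        return children


def randomized(n_taxa, branch_length=1.0, branch_stdev=None):
    """Grow a random bifurcating tree with n_taxa terminal clades."""
    root = Clade()
    terminals = [root]
    while len(terminals) < n_taxa:
        newsplit = random.choice(terminals)
        newterms = newsplit.split(branch_length=branch_length)
        if branch_stdev:
            # Add some noise to the branch lengths
            for nt in newterms:
                nt.branch_length = max(0, random.gauss(branch_length, branch_stdev))
        terminals.remove(newsplit)
        terminals.extend(newterms)
    return root, terminals


root, terminals = randomized(5)
print(len(terminals))
```

Each pass through the loop removes one terminal and adds two, so the loop stops once the requested number of terminals is reached.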
I suppose that split not only should do actual split of a clade, but also return a list of two new clades. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From chapmanb at 50mail.com Wed Jul 18 18:29:36 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 18 Jul 2012 14:29:36 -0400 Subject: [Biopython-dev] [Biopython] when is a SeqRecord not a SeqRecord In-Reply-To: References: <87y5mhkxtc.fsf@fastmail.fm> Message-ID: <87hat4ly7j.fsf@fastmail.fm> Dilara; Apologies, I missed that the second mail had updated code. > This works as you pointed out because filtered_rec is explicitly defined. > Now if I want to do this > > from Bio import SeqIO > mod = (check_meanQ(rec, q_thresholdd) for rec in > SeqIO.parse("hiseq_pe_test.fastq", "fastq")) > count = SeqIO.write(mod, "filtered_hiseq_pe_test.fastq", "fastq") > print "Modified %i records" %count > > It doesn't work because of some of the records are None. 
> So I tried doing this

The approach I'd take is to clean up check_meanQ to be explicit about
the return values:

> def check_meanQ(rec, q_threshold):
>     seqlen=len(rec)
>     quality_scores=array(rec.letter_annotations["phred_quality"])
>     if round(quality_scores.mean()) <= q_threshold:
>         print "Discarded ", rec.id, "because mean Q was", round(quality_scores.mean())
>         badrec = None
>     if round(quality_scores.mean()) > q_threshold:
>         goodrec = rec
>
>     return goodrec

def check_meanQ(rec, q_threshold):
    quality_scores=array(rec.letter_annotations["phred_quality"])
    if round(quality_scores.mean()) <= q_threshold:
        print "Discarded ", rec.id, "because mean Q was", round(quality_scores.mean())
        return None
    else:
        return rec

Then explicitly check for None values and remove them when writing:

> count = SeqIO.write(mod, "filtered_hiseq_pe_test.fastq", "fastq")

count = SeqIO.write((x for x in mod if x is not None),
                    "filtered_hiseq_pe_test.fastq", "fastq")

Hope this helps,
Brad

From w.arindrarto at gmail.com Wed Jul 18 19:49:37 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 18 Jul 2012 21:49:37 +0200
Subject: [Biopython-dev] GSoC Project Update -- 10
Message-ID:

Hi everyone,

I've just posted two new updates for my GSoC project, here:
http://bow.web.id/blog/2012/07/parsing-blast-plain-text-files-in-searchio/
and here: http://bow.web.id/blog/2012/07/exonerate-in-searchio/

The first one is about a somewhat unofficial new format to be supported
by SearchIO: the BLAST plain text output. I know that the current
Biopython text parser is obsolete, but I figure it could still be useful
for some to have a similar model in SearchIO. It is unofficial since
it's basically a wrapper around the current parser, and after discussing
things with Peter, it doesn't seem wise to say that we officially
support parsing the format, especially when NCBI itself does not
guarantee a stable style between each BLAST release.
I should note that I've also made a small change to the current
NCBIStandalone code, as there were some problems when I tried to parse
BLAST 2.2.26+ text output with multiple queries.

The second one is about the program I've been spending most of my time
on: Exonerate. We now have three Exonerate formats that SearchIO can
parse and index: `exonerate-text`, for human-readable alignments,
`exonerate-vulgar`, for vulgar lines, and `exonerate-cigar`, for cigar
lines. It's one of the more interesting formats I've been working on so
far :), since it has so much information in it. I've tried to capture
them as sensibly as possible, and I made a small demonstration using it
in my post.

In addition to writing these two formats, I've also written their tests.
Now, having finished almost all of the parsers, I'm planning to devote
more time to start writing the documentation during the coming weeks.

regards,
Bow

From p.j.a.cock at googlemail.com Thu Jul 19 09:17:17 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 19 Jul 2012 10:17:17 +0100
Subject: [Biopython-dev] [Biopython] when is a SeqRecord not a SeqRecord
In-Reply-To: <87hat4ly7j.fsf@fastmail.fm>
References: <87y5mhkxtc.fsf@fastmail.fm> <87hat4ly7j.fsf@fastmail.fm>
Message-ID:

On Wed, Jul 18, 2012 at 7:29 PM, Brad Chapman wrote:
>
> Dilara;
> Apologies, I missed that the second mail had updated code.
>
>> This works as you pointed out because filtered_rec is explicitly defined.
>> Now if I want to do this
>>
>> from Bio import SeqIO
>> mod = (check_meanQ(rec, q_thresholdd) for rec in
>> SeqIO.parse("hiseq_pe_test.fastq", "fastq"))
>> count = SeqIO.write(mod, "filtered_hiseq_pe_test.fastq", "fastq")
>> print "Modified %i records" %count
>>
>> It doesn't work because of some of the records are None.
>> So I tried doing this
>
> The approach I'd take is to clean up check_meanQ to be explicit about
> the return values:
>
>> def check_meanQ(rec, q_threshold):
>>     seqlen=len(rec)
>>     quality_scores=array(rec.letter_annotations["phred_quality"])
>>     if round(quality_scores.mean()) <= q_threshold:
>>         print "Discarded ", rec.id, "because mean Q was", round(quality_scores.mean())
>>         badrec = None
>>     if round(quality_scores.mean()) > q_threshold:
>>         goodrec = rec
>>
>>     return goodrec
>
> def check_meanQ(rec, q_threshold):
>     quality_scores=array(rec.letter_annotations["phred_quality"])
>     if round(quality_scores.mean()) <= q_threshold:
>         print "Discarded ", rec.id, "because mean Q was", round(quality_scores.mean())
>         return None
>     else:
>         return rec

That should work - although since you are not actually modifying the
record at all I'd suggest a check function returning a boolean (True
or False). Then you could use this in a generator expression like this:

def check_meanQ(rec, q_threshold):
    quality_scores=array(rec.letter_annotations["phred_quality"])
    return round(quality_scores.mean()) > q_threshold

records = SeqIO.parse("hiseq_pe_test.fastq", "fastq")
count = SeqIO.write((x for x in records if check_meanQ(x, q_threshold)),
                    "filtered_hiseq_pe_test.fastq", "fastq")

(Untested - there could be a typo in there)

Peter.

P.S. Since this isn't directly about new development work on Biopython
itself, the main mailing list would be more appropriate for this kind of
question in future. Thanks

From p.j.a.cock at googlemail.com Sun Jul 22 14:19:31 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 22 Jul 2012 15:19:31 +0100
Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations
Message-ID:

Dear all,

One of the 'warts' in the current SeqRecord/SeqFeature object
model is how non-trivial features are stored - in particular joins
(in the terminology of GenBank/EMBL).

Previous discussions include:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005830.html
...
http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009183.html
http://lists.open-bio.org/pipermail/biopython-dev/2011-October/009221.html

Consider a single gene like this from NC_000932 in our test
suite: complement(join(97999..98793,69611..69724))

Currently that becomes three SeqFeature objects, a parent
object present in the SeqRecord's feature list, and two child
objects (one for each exon) within that parent feature's
sub_features list. The parent feature gets a location which
summarises the span, so start 97999-1 (Pythonic counting),
end 69724, and strand -1.

This usage of the sub_features property in this way has
been present in Biopython for a very long time, and prevents
us using it for nesting features based on the parent/child
relationship models used in GFF (e.g. gene and CDS, or
gene, mRNA, CDS, and exon).

As Brad and I had discussed, a new separate mechanism
might be added for explicit parent/child relationships
between SeqFeature objects useful for GFF3, since the
current name sub_features has this historical baggage.

Suggestion
==========

What I had proposed was we get rid of sub_features
(deprecate it, so for the next couple of releases our parser
and BioSQL access will populate it, but our writers and
the BioSQL loader will ignore it) and replace it with a new
subclass of the FeatureLocation object specifically for these
compound locations.

This will then map much more closely to the tables used
in BioSQL, and therefore I suspect the BioPerl object
model too.

Once the sub_feature support is dropped, the objects for
the complement(join(97999..98793,69611..69724)) example
becomes just one SeqFeature object, whose location is a new
CompoundLocation containing two parts (the two exons).

Note that in order to handle mixed strand features and
to make iteration etc simpler, the parts are stored in the
biological order (5' to 3').
To put this another way, for this
example I find it helps to think about this example as the
old EMBL variant form of the location string:

join(complement(69611..69724),complement(97999..98793))

i.e. The first part of this gene (the 5' end of the gene)
is complement(69611..69724), and the last part (with
the 3' end of the gene) is complement(97999..98793).

For iteration over the bases of this CompoundLocation
you'd get 69723, 69722, ..., 69610 (the first exon), then
98792, ..., 97998 (the second exon) which is exactly what
happens now when iterating over the parent SeqFeature.

This is what I have tried to do on this branch:
https://github.com/peterjc/biopython/tree/f_loc4

As part of this, adding two FeatureLocations will give a
CompoundLocation - similarly you can add a simple
FeatureLocation and a CompoundLocation or two
CompoundLocation objects. I think this makes creating
a SeqFeature describing a Eukaryotic gene model
MUCH simpler than with the existing approach.

(A potential refinement not implemented yet would be
to merge abutting exact locations automatically, so that
adding 123..456 and 457..999 would give 123..999
instead of join(123..456,457..999), but that might be
too much magic?)

Impact
======

What does this mean for Biopython users? It will only
really affect people using annotated nucleotide files,
(i.e. GenBank or EMBL files), and only those doing
anything clever with 'join' type features.

The deprecation process will allow scripts just reading
files to continue to be used unmodified in the short
term.

However, as the branch currently stands, scripts
building SeqFeature objects using sub_features
would have to be updated immediately. I believe
this is only going to affect a handful of people
though, and will (once done) simplify their code.

Thoughts? I've tried to balance backwards compatibility
with providing something more intuitive - and fixing this
should help with merging the GFF support.
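The iteration order described above can be sketched without Biopython. This is an illustration only: the tuple representation and the function name are assumptions, not the API on the f_loc4 branch. Each part is a (start, end, strand) tuple in zero-based, right-open coordinates, and the parts are listed in biological order (5' to 3'):

```python
def iter_positions(parts):
    """Yield the positions covered by a compound location, part by part.

    parts is a list of (start, end, strand) tuples using 0-based,
    right-open coordinates, already sorted in biological order.
    Hypothetical helper for illustration only.
    """
    for start, end, strand in parts:
        if strand == -1:
            # Reverse strand: walk from the highest position downwards.
            for pos in range(end - 1, start - 1, -1):
                yield pos
        else:
            for pos in range(start, end):
                yield pos


# join(complement(69611..69724),complement(97999..98793))
parts = [(69610, 69724, -1), (97998, 98793, -1)]
positions = list(iter_positions(parts))

first_exon_length = 69724 - 69610
print(positions[0], positions[first_exon_length - 1])  # ends of the first exon
print(positions[first_exon_length], positions[-1])     # ends of the second exon
```

Storing the parts in biological order means the iteration needs no sorting or strand-dependent reordering of the parts themselves, which is the simplification the proposal is after.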
Peter From chapmanb at 50mail.com Mon Jul 23 13:05:34 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 23 Jul 2012 09:05:34 -0400 Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations In-Reply-To: References: Message-ID: <87629emxup.fsf@fastmail.fm> Peter; Thanks for working through the sub_feature issue and coming up with this proposal. I'm 100% on board with converting over to something more general and this looks like a great approach. A couple of quick thoughts: - Would it be possible to have a back-compatible 'sub_features' that reconstituted features based on the compound location? This could help us avoid breaking scripts that use sub_features, even if we no longer fill those in going forward. - How do you envision storing GFF feature hierarchies? The location object is more lightweight with only position and strand information. Nested child GFF features would have key/value pairs associated with them as well. Would we want to use sub_features (or some new nested structure) for these? Brad > Dear all, > > One of the 'warts' in the current SeqRecord/SeqFeature object > model is how non-trivial features are stored - in particular joins > (in the terminology of GenBank/EMBL). > > Previous discussions include: > http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005830.html > ... > http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009183.html > http://lists.open-bio.org/pipermail/biopython-dev/2011-October/009221.html > > Consider a single gene like this from NC_000932 in our test > suite: complement(join(97999..98793,69611..69724)) > > Currently that becomes three SeqFeature objects, a parent > object present in the SeqRecord's feature list, and two child > objects (one for each exon) within that parent feature's > sub_features list. The parent feature gets a location which > summarises the span, so start 97999-1 (Pythonic counting), > end 69724, and strand -1. 
> > This usage of the sub_features property in this way has > been present in Biopython for a very long time, and prevents > us using it for nesting features based on the parent/child > relationship models used in GFF (e.g. gene and CDS, or > gene, mRNA, CDS, and exon). > > As Brad and I had discussed, a new separate mechanism > might be added for explicit parent/child relationships > between SeqFeature objects useful for GFF3, since the > current name sub_features has this historical baggage. > > Suggestion > ========== > > What I had proposed was we get rid of sub_features > (deprecate it, so for the next couple of releases our parser > and BioSQL access will populate it, but our writers and > the BioSQL loader will ignore it) and replace it with a new > subclass of the FeatureLocation object specifically for these > compound locations. > > This will then map much more closely to the tables used > in BioSQL, and therefore I suspect the BioPerl object > model too. > > Once the sub_feature support is dropped, the objects for > the complement(join(97999..98793,69611..69724)) example > becomes just one SeqFeature object, whose location is a new > CompoundLocation containing two parts (the two exons). > > Note that in order to handle mixed strand features and > to make iteration etc simpler, the parts are stored in the > biological order (5' to 3'). To put this another way, for this > example I find it helps to think about example this as the > old EMBL variant form of the location string: > > join(complement(69611..69724),complement(97999..98793)) > > i.e. The first part of this gene (the 5' end of the gene) > is complement(69611..69724), and the last part (with > the 3' end of the gene) is complement(97999..98793). > > For iteration over the bases of this CompoundLocation > you'd get 69723, 69723, ..., 69610 (the first exon), then > 98792, ..., 97998 (the second exon) which is exactly what > happens now when iterating over the parent SeqFeature. 
> > This is what I have tried to do on this branch: > https://github.com/peterjc/biopython/tree/f_loc4 > > As part of this, adding two FeatureLocations will give a > CompoundLocation - similarly you can add a simple > FeatureLocation and a CompoundLocation or two > CompoundLocation objects. I think this makes creating > a SeqFeature describing a Eukaryotic gene model > MUCH simpler than with the existing approach. > > (A potential refinement not implemented yet would be > to merge abutting exact locations automatically, so that > adding 123..456 and 457..999 would give 123..999 > instead of join(123..456,457..999), but that might be > too much magic?) > > Impact > ====== > > What does this mean for Biopython users? It will only > really affect people using annotated nucleotide files, > (i.e. GenBank or EMBL files), and only those doing > anything clever with 'join' type features. > > The deprecation process will allow scripts just reading > files to continue to be used unmodified in the short > term. > > However, as the branch currently stands, scripts > building SeqFeature objects using sub_features > would have to be updated immediately. I believe > this is only going to affect a handful of people > though, and will (once done) simplify their code. > > Thoughts? I've tried to balance backwards compatibility > with providing something more intuitive - and fixing this > should help with merging the GFF support. 
> > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Mon Jul 23 16:02:45 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 23 Jul 2012 17:02:45 +0100 Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations In-Reply-To: <87629emxup.fsf@fastmail.fm> References: <87629emxup.fsf@fastmail.fm> Message-ID: On Mon, Jul 23, 2012 at 2:05 PM, Brad Chapman wrote: > > Peter; > Thanks for working through the sub_feature issue and coming up with this > proposal. I'm 100% on board with converting over to something more > general and this looks like a great approach. > > A couple of quick thoughts: > > - Would it be possible to have a back-compatible 'sub_features' that > reconstituted features based on the compound location? This could help > us avoid breaking scripts that use sub_features, even if we no longer > fill those in going forward. When you say 'use' do you mean populate and modify? Use in the read-only sense is already covered - in that any Biopython code generating complex SeqFeature objects would (in the short term) populate both the sub_feature AND the new compound location. Things get very hairy if we want to support edits to the sub_features also automatically updating the new compound location (and vice versa). So I don't want to do that. > - How do you envision storing GFF feature hierarchies? The location > object is more lightweight with only position and strand information. Only in the simple cases. In addition to single line GFF features, you have joins expressed by multiple GFF lines with a common ID. Also, it seems quite possible that GFF3 will add a new tag entry to describe fuzzy locations in future, see e.g. 
this thread http://sourceforge.net/mailarchive/message.php?msg_id=28240013 > Nested child GFF features would have key/value pairs associated with > them as well. Would we want to use sub_features (or some new nested > structure) for these? Absolutely a new nested structure - reusing sub_features would just cause too much confusion. This might be done with a parent attribute and/or a children list - perhaps with weak references to avoid garbage collection problems with freeing memory. Peter From kai.blin at biotech.uni-tuebingen.de Tue Jul 24 07:34:47 2012 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Tue, 24 Jul 2012 09:34:47 +0200 Subject: [Biopython-dev] How to add unit tests Message-ID: <500E5017.5020303@biotech.uni-tuebingen.de> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi folks, I've sent Wibowo a patch implementing a parser for yet another format. He asked for some tests, and I'm happy to provide them. Or at least I would be if I was clear on how to add them. Some modules seem to use doctests, some seem to have something home-grown. Where would I put the sequence files to parse during the tests? Hope you can shed some light on this, Kai - -- Dipl.-Inform. 
Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universit?t T?bingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 T?bingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQEcBAEBAgAGBQJQDlAXAAoJEKM5lwBiwTTPI7AH/RWsVXeymP2r6WuDeyzCL+oe S9OEqy7hQGc89ktd1HLn8LVid4baA5f31zPXaPsBdjwFfZT/8l3khjXp3JhOOsQJ wsKsqS985MiswkI0ZzTc598LhOt0oVz2cCPynLFFpj8K9f9OL5PdFKm9owS1urmP 919TBaRX7AWN/qyv3vCztMwvxrMYPz6hKw78oHikJP+i6rtEKYVyVYrvtqBBn0E4 7J/Hfkh+aqAgYR1YlWYCrNlHGM6xJpXmwwIPZp1C1Fgb2sFPsXcHLEQi9KydB7SK m+fosoow40BJbIerBYyUNGOcAkW5yuObLk99UYcYq26LEhUjDcpqNM8C2OtW4NI= =pu0a -----END PGP SIGNATURE----- From w.arindrarto at gmail.com Tue Jul 24 08:11:15 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 24 Jul 2012 10:11:15 +0200 Subject: [Biopython-dev] How to add unit tests In-Reply-To: <500E5017.5020303@biotech.uni-tuebingen.de> References: <500E5017.5020303@biotech.uni-tuebingen.de> Message-ID: Hi Kai, Indeed there seem to be different kinds of testing in the Biopython tests suite. I can't really answer for test suites other than SearchIO, so I'll try to explain what I have in mind for SearchIO. In general, SearchIO has been using the unittest module, with roughly one test file being tested by one test function. Here are the 'consensus' that I'm using: 1. All test files are stored in a folder in the 'Tests' folder according to the program name. For example, HMMER files are all stored in the HMMER folder. 2. In each of those program-specific folder, there is a README file listing the test files and what they are. 3. The naming scheme of the test files may differ slightly between each program, but they are always consistent in the same program. 
Taking another example from the HMMER folder, you can see that they are
named like so: 'format_version_program_number.out'. So the first test
file for the text output format from hmmpfam version 2.11 would be named
"text_211_hmmpfam_001.out".

4. As for the contents of these files, it is basically up to you. I
myself try to cover at least the cases where there are single and
multiple queries, and try to make the files as short as possible
(although sometimes they are not really that short).

5. The Python test files themselves are named like so:
'test_SearchIO_{format}.py'. Each such file only tests parsing and/or
reading-related code, and maybe some format-specific behaviour. If one
program has several formats that differ slightly, they are grouped in
one test file named 'test_SearchIO_{program}.py'. Tests for indexing and
writing are written in 'test_SearchIO_index.py' and
'test_SearchIO_write.py' for now. There's also a file called
'search_tests_common.py' that tests for equality between two different
QueryResult objects (all their attributes and the items they contain),
but so far this is only used in indexing and writing tests.

6. As for the doctests, they are meant to use the files in each
program-specific folder as well. You are free to add extra files that
showcase the important features of your parser, i.e. files that are not
used by the actual unittest suite. However, as you can see, there are
very few doctests at the moment, as I am still in the process of
writing them. For now, I'm prioritizing the unittests.

I hope that helps :), and thanks again for the patch!

regards,
Bow

On Tue, Jul 24, 2012 at 9:34 AM, Kai Blin wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi folks,
>
> I've sent Wibowo a patch implementing a parser for yet another format.
> He asked for some tests, and I'm happy to provide them. Or at least I
> would be if I was clear on how to add them. Some modules seem to use
> doctests, some seem to have something home-grown.
Where would I put > the sequence files to parse during the tests? > > Hope you can shed some light on this, > Kai > > - -- > Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de > Institute for Microbiology and Infection Medicine > Division of Microbiology/Biotechnology > Eberhard-Karls-Universit?t T?bingen > Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 > D-72076 T?bingen Fax : ++49 7071 29-5979 > Germany > Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.10 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iQEcBAEBAgAGBQJQDlAXAAoJEKM5lwBiwTTPI7AH/RWsVXeymP2r6WuDeyzCL+oe > S9OEqy7hQGc89ktd1HLn8LVid4baA5f31zPXaPsBdjwFfZT/8l3khjXp3JhOOsQJ > wsKsqS985MiswkI0ZzTc598LhOt0oVz2cCPynLFFpj8K9f9OL5PdFKm9owS1urmP > 919TBaRX7AWN/qyv3vCztMwvxrMYPz6hKw78oHikJP+i6rtEKYVyVYrvtqBBn0E4 > 7J/Hfkh+aqAgYR1YlWYCrNlHGM6xJpXmwwIPZp1C1Fgb2sFPsXcHLEQi9KydB7SK > m+fosoow40BJbIerBYyUNGOcAkW5yuObLk99UYcYq26LEhUjDcpqNM8C2OtW4NI= > =pu0a > -----END PGP SIGNATURE----- > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Tue Jul 24 09:33:14 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Jul 2012 10:33:14 +0100 Subject: [Biopython-dev] How to add unit tests In-Reply-To: <500E5017.5020303@biotech.uni-tuebingen.de> References: <500E5017.5020303@biotech.uni-tuebingen.de> Message-ID: On Tue, Jul 24, 2012 at 8:34 AM, Kai Blin wrote: > Hi folks, > > I've sent Wibowo a patch implementing a parser for yet another format. > He asked for some tests, and I'm happy to provide them. Or at least I > would be if I was clear on how to add them. Some modules seem to use > doctests, some seem to have something home-grown. Where would I put > the sequence files to parse during the tests? 
>
> Hope you can shed some light on this,
> Kai

There is a whole chapter on our testing setup in the Tutorial,
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

In short: All the recent unit tests use the standard library unittest
module. Older unit tests use a home-grown print-and-compare approach
where we have a copy of the expected output as a file on disk.

Input files (of reasonable size, with provenance information if
possible - i.e. where they came from) are under the Tests folder
in sub-directories by type or module. In the case of hmmer2,
put them somewhere based on where Bow is putting his hmmer3
files.

We also use doctests (in the code) for short illustrative examples
with no external dependencies. We do use doctest style embedded
examples which do have dependencies (e.g. network access), but
we don't run them as unit tests to avoid test failures. We also use
doctests in the LaTeX source of the tutorial, run via test_Tutorial.py -
again only things without dependencies.

Peter

From arklenna at gmail.com Tue Jul 24 16:57:34 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 24 Jul 2012 12:57:34 -0400
Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations
In-Reply-To:
References:
Message-ID:

On Sun, Jul 22, 2012 at 10:19 AM, Peter Cock wrote:
> Dear all,
>
> One of the 'warts' in the current SeqRecord/SeqFeature object
> model is how non-trivial features are stored - in particular joins
> (in the terminology of GenBank/EMBL).
>
> Previous discussions include:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005830.html
> ...
> http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009183.html > http://lists.open-bio.org/pipermail/biopython-dev/2011-October/009221.html > > Consider a single gene like this from NC_000932 in our test > suite: complement(join(97999..98793,69611..69724)) > > Currently that becomes three SeqFeature objects, a parent > object present in the SeqRecord's feature list, and two child > objects (one for each exon) within that parent feature's > sub_features list. The parent feature gets a location which > summarises the span, so start 97999-1 (Pythonic counting), > end 69724, and strand -1. > > This usage of the sub_features property in this way has > been present in Biopython for a very long time, and prevents > us using it for nesting features based on the parent/child > relationship models used in GFF (e.g. gene and CDS, or > gene, mRNA, CDS, and exon). > > As Brad and I had discussed, a new separate mechanism > might be added for explicit parent/child relationships > between SeqFeature objects useful for GFF3, since the > current name sub_features has this historical baggage. > > Suggestion > ========== > > What I had proposed was we get rid of sub_features > (deprecate it, so for the next couple of releases our parser > and BioSQL access will populate it, but our writers and > the BioSQL loader will ignore it) and replace it with a new > subclass of the FeatureLocation object specifically for these > compound locations. > > This will then map much more closely to the tables used > in BioSQL, and therefore I suspect the BioPerl object > model too. > > Once the sub_feature support is dropped, the objects for > the complement(join(97999..98793,69611..69724)) example > becomes just one SeqFeature object, whose location is a new > CompoundLocation containing two parts (the two exons). > > Note that in order to handle mixed strand features and > to make iteration etc simpler, the parts are stored in the > biological order (5' to 3'). 
To put this another way, for this
> example I find it helps to think about this example as the
> old EMBL variant form of the location string:
>
> join(complement(69611..69724),complement(97999..98793))
>
> i.e. The first part of this gene (the 5' end of the gene)
> is complement(69611..69724), and the last part (with
> the 3' end of the gene) is complement(97999..98793).
>
> For iteration over the bases of this CompoundLocation
> you'd get 69723, 69722, ..., 69610 (the first exon), then
> 98792, ..., 97998 (the second exon) which is exactly what
> happens now when iterating over the parent SeqFeature.
>
> This is what I have tried to do on this branch:
> https://github.com/peterjc/biopython/tree/f_loc4
>
> As part of this, adding two FeatureLocations will give a
> CompoundLocation - similarly you can add a simple
> FeatureLocation and a CompoundLocation or two
> CompoundLocation objects. I think this makes creating
> a SeqFeature describing a Eukaryotic gene model
> MUCH simpler than with the existing approach.
>
> (A potential refinement not implemented yet would be
> to merge abutting exact locations automatically, so that
> adding 123..456 and 457..999 would give 123..999
> instead of join(123..456,457..999), but that might be
> too much magic?)
>
> Impact
> ======
>
> What does this mean for Biopython users? It will only
> really affect people using annotated nucleotide files,
> (i.e. GenBank or EMBL files), and only those doing
> anything clever with 'join' type features.
>
> The deprecation process will allow scripts just reading
> files to continue to be used unmodified in the short
> term.
>
> However, as the branch currently stands, scripts
> building SeqFeature objects using sub_features
> would have to be updated immediately. I believe
> this is only going to affect a handful of people
> though, and will (once done) simplify their code.
>
> Thoughts?
I've tried to balance backwards compatibility > with providing something more intuitive - and fixing this > should help with merging the GFF support. > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev Hi Peter, I have been testing the new CompoundLocation w.r.t. coordinate mapping and for the most part, I find it simplifies things. The documentation suggests using + to combine FeatureLocations, which invites the use of sum. However, sum doesn't work properly. I explain why in my StackOverflow question: http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior I have considered a number of workarounds: 1. Implementing __radd__ on FeatureLocation to return self if other == 0 allows sum() to work in place, but I am uncomfortable with hard-coding such a condition. 2. Changing the location to subclass set and use xrange for generation would easily allow a number of things: an empty location (FeatureLocation(0,0) prints as [0:0]), union for iteration, and the 'magic' of merging abutting locations that you mention. However, using + and sum() on sets is dubious from a mathematically pure standpoint, and this would only work for ExactPositions. Note that I haven't attempted this yet and it may have disadvantages even for ExactPositions that I've failed to imagine. Let me know your thoughts. 
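
To make the sum() problem concrete, here is a minimal sketch. The
`Parts` class is a hypothetical stand-in, not the real FeatureLocation;
it only defines addition with its own type, which is the situation
described above. Because sum() starts from the integer 0, its first
step is 0 + location, and that fails. Passing an explicit start value,
or using functools.reduce, are two standard-library alternatives that
need no change to the class (they are not among the workarounds listed
above).

```python
import functools
import operator


class Parts:
    """Hypothetical toy location: only Parts + Parts is defined."""

    def __init__(self, *ranges):
        self.ranges = tuple(ranges)

    def __add__(self, other):
        if not isinstance(other, Parts):
            return NotImplemented
        return Parts(*(self.ranges + other.ranges))


locs = [Parts((3, 6)), Parts((10, 13)), Parts((20, 25))]

# sum() begins with start=0, so it first evaluates 0 + locs[0].
try:
    sum(locs)
except TypeError as err:
    failure = str(err)

# Alternative 1: give sum() a real location as the start value.
joined_start = sum(locs[1:], locs[0])

# Alternative 2: fold with the + operator; no start value is involved.
joined_reduce = functools.reduce(operator.add, locs)
```

Both alternatives keep the parts in their original order, but neither
copes with an empty list, which may matter for callers that build the
list dynamically.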
Lenna From p.j.a.cock at googlemail.com Tue Jul 24 17:19:31 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Jul 2012 18:19:31 +0100 Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations In-Reply-To: References: Message-ID: On Tue, Jul 24, 2012 at 5:57 PM, Lenna Peterson wrote: >> This is what I have tried to do on this branch: >> https://github.com/peterjc/biopython/tree/f_loc4 >> >> As part of this, adding two FeatureLocations will give a >> CompoundLocation - similarly you can add a simple >> FeatureLocation and a CompoundLocation or two >> CompoundLocation objects. I think this makes creating >> a SeqFeature describing a Eukaryotic gene model >> MUCH simpler than with the existing approach. >> >> (A potential refinement not implemented yet would be >> to merge abutting exact locations automatically, so that >> adding 123..456 and 457..999 would give 123..999 >> instead of join(123..456,457..999), but that might be >> too much magic?) > > Hi Peter, > > I have been testing the new CompoundLocation w.r.t. coordinate mapping > and for the most part, I find it simplifies things. That's encouraging. > The documentation suggests using + to combine FeatureLocations, which > invites the use of sum. However, sum doesn't work properly. I explain > why in my StackOverflow question: > http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior Huh, I hadn't anticipated that - but I agree trying to use sum seems natural. > I have considered a number of workarounds: > > 1. Implementing __radd__ on FeatureLocation to return self if other == > 0 allows sum() to work in place, but I am uncomfortable with > hard-coding such a condition. Another idea is to define FeatureLocation or CompoundFeature addition with an integer to expose the current private method _shift. i.e. Apply an offset to the co-ordinates. Something I'd been pondering as a (previously unrelated) enhancement. 
In this interpretation, adding zero would have no effect on the co-ordinates and thus as a side effect should also make sum(locations) work. We'd need to test this to see if that actually works. > 2. Changing the location to subclass set and use xrange for generation > would easily allow a number of things: an empty location > (FeatureLocation(0,0) prints as [0:0]), union for iteration, and the > 'magic' of merging abutting locations that you mention. However, using > + and sum() on sets is dubious from a mathematically pure standpoint, > and this would only work for ExactPositions. Note that I haven't > attempted this yet and it may have disadvantages even for > ExactPositions that I've failed to imagine. > > Let me know your thoughts. I wouldn't think of FeatureLocation(0,0) aka [0:0] as an empty location, but rather as a between location - in this case between the last and first base on a circular genome. In Genbank notation for a circular genome of length 1234, this would be 1234^1 (already an annoying special case we have to handle in the parser and the writer - although I'd have to check the code to see if we store this as [0:0] or [1234:1234] since both make sense). On the other hand, a CompoundLocation with zero parts might make sense. There is something to be said for simply have a single (upgraded) FeatureLocation object with a parts list, which in the typical case would be length one, and proxy methods for start/end as currently defined in CompoundLocation. Maybe I should try that on another branch... it might be more elegant overall. Peter From matthew.tien89 at gmail.com Tue Jul 24 18:36:05 2012 From: matthew.tien89 at gmail.com (Matthew Tien) Date: Tue, 24 Jul 2012 13:36:05 -0500 Subject: [Biopython-dev] Extended Amino Acid Chains Message-ID: To whom it may concern, I am currently developing a program in Biopython that creates amino acid chains from an inputted AA sequence. 
The program would output a single amino acid chain in an extended
conformation. Is this something of interest to the developers of
Biopython?

I am using basic calculus to calculate the position of the atoms in the
protein residues and using known protein geometries and database
information from PDB.org and the Dunbrack group
<http://dunbrack.fccc.edu/>. This program is an extension of my current
research in calculating Relative Solvent Accessibilities of protein
residues.

Thank you for your time,
Matthew Tien

--
B.S. Biochemistry, University of Texas at Austin
PhD. student, University of Chicago
Marcotte Lab and Wilke Group
alt. Matthew.Tien at yahoo.com
361-876-0942

From rodrigo.faccioli at gmail.com Tue Jul 24 19:01:58 2012
From: rodrigo.faccioli at gmail.com (Rodrigo Faccioli)
Date: Tue, 24 Jul 2012 16:01:58 -0300
Subject: [Biopython-dev] Extended Amino Acid Chains
In-Reply-To:
References:
Message-ID:

Hi Matthew,

If I understood your email, I have already implemented it. However, it
has not been put into the BioPython project yet. At the moment, I don't
have time to do it alone.

In [1] there is an example of my code. In my project I extended the
BioPython classes and created my own parser because I had to work with
files and a database in the same code. Therefore, I believed that it was
the best approach :-).

So, if it is what you wanted, we can work together to put it into the
BioPython project.
[1] https://dl.dropbox.com/u/4270818/workingFcfrpStructure.py Best regards, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structural Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-8739 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 Personal Blogg - http://rodrigofaccioli.blogspot.com/ On Tue, Jul 24, 2012 at 3:36 PM, Matthew Tien wrote: > To whom it may concern, > > I am currently developing a program in Biopython that creates amino acid > chains from an inputted AA sequence. The program would output a single > amino acid chain in an extended conformation. Is this something of interest > to the developers of Biopython? > > I am using basic calculus to calculate the position of the atoms in the > protein residues and using known protein geometries and database > information from PDB.org and the Dunbrack group >. > This program is an extension of my current research in calculating Relative > Solvent Accessibilities of protein residues. > > Thank you for your time, > Matthew Tien > > -- > B.S. Biochemistry, University of Texas at Austin > PhD. student, University of Chicago > Marcotte Lab and Wilke Group > alt. Matthew.Tien at yahoo.com > 361-876-0942 > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From arklenna at gmail.com Tue Jul 24 21:08:44 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 24 Jul 2012 17:08:44 -0400 Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations In-Reply-To: References: Message-ID: >> The documentation suggests using + to combine FeatureLocations, which >> invites the use of sum. However, sum doesn't work properly. 
I explain >> why in my StackOverflow question: >> http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior > > Huh, I hadn't anticipated that - but I agree trying to use sum seems > natural. > >> I have considered a number of workarounds: >> >> 1. Implementing __radd__ on FeatureLocation to return self if other == >> 0 allows sum() to work in place, but I am uncomfortable with >> hard-coding such a condition. > > Another idea is to define FeatureLocation or CompoundFeature > addition with an integer to expose the current private method _shift. > i.e. Apply an offset to the co-ordinates. Something I'd been pondering > as a (previously unrelated) enhancement. In this interpretation, adding > zero would have no effect on the co-ordinates and thus as a side > effect should also make sum(locations) work. We'd need to test this > to see if that actually works. Yes, this works fine: Modifying FeatureLocation.__add__ with the condition: if isinstance(other, int): return self._shift(other) and adding FeatureLocation.__radd__: def __radd__(self, other): return self.__add__(other) After these changes, FeatureLocation(3,6) + 3 yields [6:9] and sum([FeatureLocation(3,6), FeatureLocation(10,13)]) yields join{[3:6], [10:13]}. (+ of FeatureLocations also still works, as does summing lists with length > 2) > >> 2. Changing the location to subclass set and use xrange for generation >> would easily allow a number of things: an empty location >> (FeatureLocation(0,0) prints as [0:0]), union for iteration, and the >> 'magic' of merging abutting locations that you mention. However, using >> + and sum() on sets is dubious from a mathematically pure standpoint, >> and this would only work for ExactPositions. Note that I haven't >> attempted this yet and it may have disadvantages even for >> ExactPositions that I've failed to imagine. >> >> Let me know your thoughts. 
> > I wouldn't think of FeatureLocation(0,0) aka [0:0] as an empty > location, but rather as a between location - in this case between > the last and first base on a circular genome. In Genbank notation > for a circular genome of length 1234, this would be 1234^1 > (already an annoying special case we have to handle in the > parser and the writer - although I'd have to check the code > to see if we store this as [0:0] or [1234:1234] since both make > sense). If the length is 1234, [1234] would be an index error. I don't think [1233:1233] would make sense either; for space-counted genomic coordinates (http://alternateallele.blogspot.co.uk/2012/03/genome-coordinate-conventions.html), the index refers to the space to the left of the base pair. By that convention, [0:0] would refer to the gap between the last base and the first base. > > On the other hand, a CompoundLocation with zero parts might > make sense. There is something to be said for simply have > a single (upgraded) FeatureLocation object with a parts list, > which in the typical case would be length one, and proxy > methods for start/end as currently defined in CompoundLocation. > Maybe I should try that on another branch... it might be more > elegant overall. > I haven't tested sum() on CompoundLocations but I would guess they would need similar treatment to FeatureLocation. Should CompoundLocation + int also shift each part? I agree that an "upgraded" FeatureLocation could be more elegant. From p.j.a.cock at googlemail.com Tue Jul 24 21:38:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Jul 2012 22:38:59 +0100 Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations In-Reply-To: References: Message-ID: On Tue, Jul 24, 2012 at 10:08 PM, Lenna Peterson wrote: >>> The documentation suggests using + to combine FeatureLocations, which >>> invites the use of sum. However, sum doesn't work properly. 
I explain >>> why in my StackOverflow question: >>> http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior >> >> Huh, I hadn't anticipated that - but I agree trying to use sum seems >> natural. >> >>> I have considered a number of workarounds: >>> >>> 1. Implementing __radd__ on FeatureLocation to return self if other == >>> 0 allows sum() to work in place, but I am uncomfortable with >>> hard-coding such a condition. >> >> Another idea is to define FeatureLocation or CompoundFeature >> addition with an integer to expose the current private method _shift. >> i.e. Apply an offset to the co-ordinates. Something I'd been pondering >> as a (previously unrelated) enhancement. In this interpretation, adding >> zero would have no effect on the co-ordinates and thus as a side >> effect should also make sum(locations) work. We'd need to test this >> to see if that actually works. > > Yes, this works fine: > > Modifying FeatureLocation.__add__ with the condition: > > if isinstance(other, int): > return self._shift(other) > > and adding FeatureLocation.__radd__: > > def __radd__(self, other): > return self.__add__(other) > > After these changes, FeatureLocation(3,6) + 3 yields [6:9] and > sum([FeatureLocation(3,6), FeatureLocation(10,13)]) yields join{[3:6], > [10:13]}. (+ of FeatureLocations also still works, as does summing > lists with length > 2) OK - good. That might be worthwhile then. >>> 2. Changing the location to subclass set and use xrange for generation >>> would easily allow a number of things: an empty location >>> (FeatureLocation(0,0) prints as [0:0]), union for iteration, and the >>> 'magic' of merging abutting locations that you mention. However, using >>> + and sum() on sets is dubious from a mathematically pure standpoint, >>> and this would only work for ExactPositions. Note that I haven't >>> attempted this yet and it may have disadvantages even for >>> ExactPositions that I've failed to imagine. 
>>>
>>> Let me know your thoughts.
>>
>> I wouldn't think of FeatureLocation(0,0) aka [0:0] as an empty
>> location, but rather as a between location - in this case between
>> the last and first base on a circular genome. In Genbank notation
>> for a circular genome of length 1234, this would be 1234^1
>> (already an annoying special case we have to handle in the
>> parser and the writer - although I'd have to check the code
>> to see if we store this as [0:0] or [1234:1234] since both make
>> sense).
>
> If the length is 1234, [1234] would be an index error. I don't think
> [1233:1233] would make sense either; for space-counted genomic
> coordinates (http://alternateallele.blogspot.co.uk/2012/03/genome-coordinate-conventions.html),
> the index refers to the space to the left of the base pair. By that
> convention, [0:0] would refer to the gap between the last base and the
> first base.

The point is that with a circular sequence of length n, base 0 is
also base n, so [0:0] is sort of the same as [n:n], or [n:0]. Of
these I guess [0:0] is the most sensible representation for
following Python norms. But we digress - this certainly isn't an
'empty location', something which doesn't really make sense
(other than in the sense of None meaning missing data).

>>
>> On the other hand, a CompoundLocation with zero parts might
>> make sense. There is something to be said for simply have
>> a single (upgraded) FeatureLocation object with a parts list,
>> which in the typical case would be length one, and proxy
>> methods for start/end as currently defined in CompoundLocation.
>> Maybe I should try that on another branch... it might be more
>> elegant overall.
>>
>
> I haven't tested sum() on CompoundLocations but I would guess they
> would need similar treatment to FeatureLocation. Should
> CompoundLocation + int also shift each part?

If we make those changes to the FeatureLocation, then yes, the
CompoundLocation should get them too.
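
As a rough illustration of the integer-offset idea discussed in this
thread, the sketch below uses two toy classes, `Location` and
`Compound`. These are hypothetical stand-ins, not the code on the
f_loc4 branch; only the `_shift` name and the join{...} printing are
borrowed from the messages above. Adding an int shifts the coordinates,
so the 0 that sum() starts from becomes a harmless shift by zero, and
the compound class shifts each of its parts.

```python
class Location:
    """Toy half-open interval [start:end]; NOT Biopython's FeatureLocation."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def _shift(self, offset):
        return Location(self.start + offset, self.end + offset)

    def __add__(self, other):
        if isinstance(other, int):
            return self._shift(other)
        if isinstance(other, Location):
            return Compound([self, other])
        if isinstance(other, Compound):
            return Compound([self] + other.parts)
        return NotImplemented

    def __radd__(self, other):
        # Reached for int + Location, e.g. the 0 that sum() starts with.
        if isinstance(other, int):
            return self._shift(other)
        return NotImplemented

    def __repr__(self):
        return "[%i:%i]" % (self.start, self.end)


class Compound:
    """Toy ordered list of Location parts; NOT Biopython's CompoundLocation."""

    def __init__(self, parts):
        self.parts = list(parts)

    def _shift(self, offset):
        # Shifting a compound location shifts every part by the same offset.
        return Compound([part._shift(offset) for part in self.parts])

    def __add__(self, other):
        if isinstance(other, int):
            return self._shift(other)
        if isinstance(other, Location):
            return Compound(self.parts + [other])
        if isinstance(other, Compound):
            return Compound(self.parts + other.parts)
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, int):
            return self._shift(other)
        return NotImplemented

    def __repr__(self):
        return "join{%s}" % ", ".join(repr(part) for part in self.parts)


shifted = Location(3, 6) + 3
joined = sum([Location(3, 6), Location(10, 13)])
moved = joined + 100
```

One design choice here differs from the __radd__ quoted earlier, which
forwards everything to __add__: this version accepts only integers on
the left-hand side, so the order of the parts cannot be swapped by
accident.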
> I agree that an "upgraded" FeatureLocation could be more
> elegant.

It could turn out to be simpler having just one location
object... certainly worth trying out before committing this
branch as is.

Peter

From anaryin at gmail.com Wed Jul 25 07:59:52 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 25 Jul 2012 09:59:52 +0200
Subject: [Biopython-dev] Extended Amino Acid Chains
In-Reply-To:
References:
Message-ID:

Hey Matthew, Rodrigo,

The only problem I see with incorporating such a "feature" in Biopython
is that you would need a topology and parameters for the amino acids,
and these are often force-field dependent. Therefore, the quantity of
data to add to the distribution would be quite big and you'd need
someone to keep updating it as force fields evolve.

Or am I seeing this from a completely wrong angle?

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

2012/7/24 Rodrigo Faccioli

> Hi Matthew,
>
> If I understood your email, I have already implemented it. However, it
> has not been put into the BioPython project yet. At the moment, I don't
> have time to do it alone.
>
> In [1] there is an example of my code. In my project I extended the
> BioPython classes and created my own parser because I had to work with
> files and a database in the same code. Therefore, I believed that it was
> the best approach :-).
>
> So, if it is what you wanted, we can work together to put it into the
> BioPython project.
> > [1] https://dl.dropbox.com/u/4270818/workingFcfrpStructure.py > > Best regards, > > -- > Rodrigo Antonio Faccioli > Ph.D Student in Electrical Engineering > University of Sao Paulo - USP > Engineering School of Sao Carlos - EESC > Department of Electrical Engineering - SEL > Intelligent System in Structural Bioinformatics > http://laips.sel.eesc.usp.br > Phone: 55 (16) 3373-8739 > Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 > Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 > Personal Blogg - http://rodrigofaccioli.blogspot.com/ > > > > On Tue, Jul 24, 2012 at 3:36 PM, Matthew Tien >wrote: > > > To whom it may concern, > > > > I am currently developing a program in Biopython that creates amino acid > > chains from an inputted AA sequence. The program would output a single > > amino acid chain in an extended conformation. Is this something of > interest > > to the developers of Biopython? > > > > I am using basic calculus to calculate the position of the atoms in the > > protein residues and using known protein geometries and database > > information from PDB.org and the Dunbrack group < > http://dunbrack.fccc.edu/ > > >. > > This program is an extension of my current research in calculating > Relative > > Solvent Accessibilities of protein residues. > > > > Thank you for your time, > > Matthew Tien > > > > -- > > B.S. Biochemistry, University of Texas at Austin > > PhD. student, University of Chicago > > Marcotte Lab and Wilke Group > > alt. 
Matthew.Tien at yahoo.com > > 361-876-0942 > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From w.arindrarto at gmail.com Thu Jul 26 19:04:03 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 26 Jul 2012 21:04:03 +0200 Subject: [Biopython-dev] Biopython.org down? Message-ID: Hi everyone, I have been trying to access the main site (biopython.org) since yesterday night to no avail. Upon checking http://www.downforeveryoneorjustme.com/biopython.org, it seems like the site is really down. And it's not just biopython, apparently all other open-bio sites (open-bio.org, bioperl.org, bioruby.org) are down as well. Does anybody know what's going on? regards, Bow From p.j.a.cock at googlemail.com Fri Jul 27 21:03:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 27 Jul 2012 22:03:56 +0100 Subject: [Biopython-dev] Biopython.org down? In-Reply-To: References: Message-ID: Yes, as mentioned on Twitter the fiber cable connection of the hosting site was severed in an accident - which also took our mailing list server offline as well as the websites :( Looks like everything is back now :) Peter On Thu, Jul 26, 2012 at 8:04 PM, Wibowo Arindrarto wrote: > Hi everyone, > > I have been trying to access the main site (biopython.org) since yesterday > night to no avail. Upon checking > http://www.downforeveryoneorjustme.com/biopython.org, it seems like the > site is really down. And it's not just biopython, apparently all other > open-bio sites (open-bio.org, bioperl.org, bioruby.org) are down as well. > > Does anybody know what's going on? 
> > regards, > Bow > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From w.arindrarto at gmail.com Fri Jul 27 21:07:39 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 27 Jul 2012 23:07:39 +0200 Subject: [Biopython-dev] Biopython.org down? In-Reply-To: References: Message-ID: Hi Lenna and Peter, Ah yes, I saw the tweet from @Biopython some time after I sent the email. I knew it was pretty bad when I didn't get the usual mail-received notification. Anyway, good to see it's online now :). regards, Bow On Fri, Jul 27, 2012 at 11:03 PM, Peter Cock wrote: > Yes, as mentioned on Twitter the fiber cable connection of the hosting > site was severed in an accident - which also took our mailing list server > offline as well as the websites :( > > Looks like everything is back now :) > > Peter > > On Thu, Jul 26, 2012 at 8:04 PM, Wibowo Arindrarto > wrote: > > Hi everyone, > > > > I have been trying to access the main site (biopython.org) since > yesterday > > night to no avail. Upon checking > > http://www.downforeveryoneorjustme.com/biopython.org, it seems like the > > site is really down. And it's not just biopython, apparently all other > > open-bio sites (open-bio.org, bioperl.org, bioruby.org) are down as > well. > > > > Does anybody know what's going on? > > > > regards, > > Bow > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From arklenna at gmail.com Fri Jul 27 21:23:50 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Fri, 27 Jul 2012 17:23:50 -0400 Subject: [Biopython-dev] GSoC python variant update 8 Message-ID: It appears that this email didn't make it to the list due to the catastrophe yesterday. I apologize if anyone receives two copies! 

Link: http://arklenna.tumblr.com/post/28082157403/

Post:

I previously proposed the implementation of a method for PyVCF that
would quickly scan the entire file and provide useful summary
statistics. The idea is shamelessly copied from Brad's GFF parser (see
https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this
method is helpful because the annotations on a sequence can vary
widely. However, I no longer think this would be useful for VCF:

1. Most importantly, the VCF headers generally contain a complete
listing of all of the types of information contained in the file. It's
technically optional, but I hope that the most commonly used variant
callers produce accurate headers. However, if there is a prevalence of
files with a mismatch between headers and actual INFO/FORMAT fields,
please let me know.

2. Next, any listing of ranges of data such as POS or QUAL might as
well be coupled with actual filtering. This would be different if a
presentation of the distribution of quality scores would be necessary
to set an appropriate threshold. It would also depend on the ratio of
speed between the range scan and the filtering (i.e. whether a
possible second filter would be unacceptably time consuming).

3. Finally, and perhaps most importantly, many files are so large that
scanning an entire file would take too long. Setting a limit and
displaying updated information in real time (i.e. writing to
`sys.stdout` with '\r', https://gist.github.com/3161269 ) could
overcome this issue.

If any VCF users can think of a great reason to scan a VCF file before
filtering it, please get in touch.

-------

I added the method `as_SeqFeature()` to my basic variant class, but
it's still incomplete. Some of this is in flux due to forthcoming
changes to FeatureLocation.

I'm currently working on expanding the coordinate mapper Reece posted
to the dev list a couple years ago (see
http://biopython.org/pipermail/biopython/2010-June/006598.html ).
Expect an update on that very soon.
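
The limited scan with real-time progress from point 3 could look roughly
like the sketch below. It is only an illustration and makes several
assumptions: it reads plain tab-separated VCF data lines directly rather
than going through PyVCF, and the names `scan_vcf_lines`, `_widen`,
`limit` and `every` are hypothetical, not part of any existing API.

```python
import io
import sys


def _widen(current, value):
    """Extend a (low, high) range so that it includes value."""
    if current is None:
        return (value, value)
    return (min(current[0], value), max(current[1], value))


def scan_vcf_lines(lines, limit=None, out=sys.stdout, every=1000):
    """Collect POS and QUAL ranges from VCF data lines.

    Stops after `limit` records (if given) and rewrites a single
    progress line on `out` using a carriage return.
    """
    stats = {"records": 0, "pos": None, "qual": None}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
        stats["records"] += 1
        stats["pos"] = _widen(stats["pos"], int(fields[1]))
        if fields[5] != ".":  # "." marks a missing QUAL value
            stats["qual"] = _widen(stats["qual"], float(fields[5]))
        if stats["records"] % every == 0:
            out.write("\r%d records scanned" % stats["records"])
            out.flush()
        if limit is not None and stats["records"] >= limit:
            break
    out.write("\r%d records scanned\n" % stats["records"])
    return stats


# Hypothetical sample data, for illustration only.
sample = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14\n",
    "20\t17330\t.\tT\tA\t3\tq10\tDP=11\n",
    "20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tDP=10\n",
]
summary = scan_vcf_lines(sample, limit=2, out=io.StringIO())
```

Passing an open file handle instead of `sample` would scan a real file
line by line; the `limit` argument keeps the scan bounded for very
large files.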

Best,

Lenna

From arklenna at gmail.com Thu Jul 26 22:30:35 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Thu, 26 Jul 2012 18:30:35 -0400
Subject: [Biopython-dev] GSoC python variant update 8
Message-ID:

Link: http://arklenna.tumblr.com/post/28082157403/

Post:

I previously proposed the implementation of a method for PyVCF that
would quickly scan the entire file and provide useful summary
statistics. The idea is shamelessly copied from Brad's GFF parser (see
https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this
method is helpful because the annotations on a sequence can vary
widely. However, I no longer think this would be useful for VCF:

1. Most importantly, the VCF headers generally contain a complete
listing of all of the types of information contained in the file. It's
technically optional, but I hope that the most commonly used variant
callers produce accurate headers. However, if there is a prevalence of
files with a mismatch between headers and actual INFO/FORMAT fields,
please let me know.

2. Next, any listing of ranges of data such as POS or QUAL might as
well be coupled with actual filtering. This would be different if a
presentation of the distribution of quality scores would be necessary
to set an appropriate threshold. It would also depend on the ratio of
speed between the range scan and the filtering (i.e. whether a
possible second filter would be unacceptably time consuming).

3. Finally, and perhaps most importantly, many files are so large that
scanning an entire file would take too long. Setting a limit and
displaying updated information in real time (i.e. writing to
`sys.stdout` with '\r', https://gist.github.com/3161269 ) could
overcome this issue.

If any VCF users can think of a great reason to scan a VCF file before
filtering it, please get in touch.

-------

I added the method `as_SeqFeature()` to my basic variant class, but
it's still incomplete. Some of this is in flux due to forthcoming
changes to FeatureLocation.

I'm currently working on expanding the coordinate mapper Reece posted
to the dev list a couple years ago (see
http://biopython.org/pipermail/biopython/2010-June/006598.html ).
Expect an update on that very soon.

Best,

Lenna

From chris.mit7 at gmail.com Fri Jul 27 23:17:13 2012
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Fri, 27 Jul 2012 19:17:13 -0400
Subject: [Biopython-dev] GSoC python variant update 8
In-Reply-To:
References:
Message-ID:

Sorry for my brevity, but one great reason to scan a VCF file is to know
where your variants are for downstream analysis. For instance, when
analyzing RNA-Seq data for features such as Allele Specific Expression,
having quick access to where variants are located is essential.

On Thu, Jul 26, 2012 at 6:30 PM, Lenna Peterson wrote:
> Link: http://arklenna.tumblr.com/post/28082157403/
>
> Post:
>
> I previously proposed the implementation of a method for PyVCF that
> would quickly scan the entire file and provide useful summary
> statistics. The idea is shamelessly copied from Brad's GFF parser (see
> https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this
> method is helpful because the annotations on a sequence can vary
> widely. However, I no longer think this would be useful for VCF:
>
> 1. Most importantly, the VCF headers generally contain a complete
> listing of all of the types of information contained in the file. It's
> technically optional, but I hope that the most commonly used variant
> callers produce accurate headers. However, if there is a prevalence of
> files with a mismatch between headers and actual INFO/FORMAT fields,
> please let me know.
>
> 2. Next, any listing of ranges of data such as POS or QUAL might as
> well be coupled with actual filtering. This would be different if a
> presentation of the distribution of quality scores would be necessary
> to set an appropriate threshold. It would also depend on the ratio of
> speed between the range scan and the filtering (i.e. whether a
> possible second filter would be unacceptably time consuming).
>
> 3. Finally, and perhaps most importantly, many files are so large that
> scanning an entire file would take too long. Setting a limit and
> displaying updated information in real time (i.e. writing to
> `sys.stdout` with '\r', https://gist.github.com/3161269 ) could
> overcome this issue.
>
> If any VCF users can think of a great reason to scan a VCF file before
> filtering it, please get in touch.
>
> -------
>
> I added the method `as_SeqFeature()` to my basic variant class, but
> it's still incomplete. Some of this is in flux due to forthcoming
> changes to FeatureLocation.
>
> I'm currently working on expanding the coordinate mapper Reece posted
> to the dev list a couple years ago (see
> http://biopython.org/pipermail/biopython/2010-June/006598.html ).
> Expect an update on that very soon.
>
> Best,
>
> Lenna
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From rodrigo.faccioli at gmail.com Wed Jul 25 18:16:57 2012
From: rodrigo.faccioli at gmail.com (Rodrigo Faccioli)
Date: Wed, 25 Jul 2012 15:16:57 -0300
Subject: [Biopython-dev] Extended Amino Acid Chains
In-Reply-To:
References:
Message-ID:

Hi Joao,

From what I understood of Tien's idea, your angle is correct. However,
what I would like to say is that force fields could be supported in
Biopython through XML files, with each XML file representing one version
of a force field (ff).

I'm not an expert in force fields. In fact, I have only been studying
charmm27, mainly its implementation in GROMACS. So I believe we could
start from the GROMACS topology files and create a specific XML file for
charmm27, for example. Maybe we could write a parser to read these
GROMACS files.

Furthermore, force fields in Biopython could be used for other purposes,
such as checking a structure.
In this way, we could create a command like check_charmm27(structure).
This command would produce a list of errors in the structure based on
the charmm27 ff.

Did I explain that correctly? This email is only an idea.

Best regards,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structural Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-8739
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5
Personal Blog - http://rodrigofaccioli.blogspot.com/

On Wed, Jul 25, 2012 at 4:59 AM, João Rodrigues wrote:
> Hey Matthew, Rodrigo,
>
> The only problem I see with incorporating such a "feature" in Biopython is
> that you would need a topology and parameters for the amino acids and these
> are often forcefield dependent. Therefore, the quantity of data to add to
> the distribution would be quite big and you'd need someone to keep updating
> it as ffs evolve. Or am I seeing this from a completely wrong angle?
>
> Cheers,
>
> João [...] Rodrigues
> http://nmr.chem.uu.nl/~joao
>
>
> 2012/7/24 Rodrigo Faccioli
>
>> Hi Matthew,
>>
>> If I understood your email, I have already implemented it. However, it is
>> not put into BioPython project yet. At the moment, I don't have time to do
>> it alone.
>>
>> In [1] there is an example of my code. In my project I extended the
>> BioPython classes and created my parser because I had to work with files
>> and database in same code. Therefore, I believed that it was the best :-).
>>
>> So, if it is what you wanted, we can work together to put it into
>> BioPython project.
>>
>> [1] https://dl.dropbox.com/u/4270818/workingFcfrpStructure.py
>>
>> Best regards,
>>
>> --
>> Rodrigo Antonio Faccioli
>> Ph.D Student in Electrical Engineering
>> University of Sao Paulo - USP
>> Engineering School of Sao Carlos - EESC
>> Department of Electrical Engineering - SEL
>> Intelligent System in Structural Bioinformatics
>> http://laips.sel.eesc.usp.br
>> Phone: 55 (16) 3373-8739
>> Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
>> Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5
>> Personal Blog - http://rodrigofaccioli.blogspot.com/
>>
>>
>> On Tue, Jul 24, 2012 at 3:36 PM, Matthew Tien wrote:
>>
>> > To whom it may concern,
>> >
>> > I am currently developing a program in Biopython that creates amino acid
>> > chains from an inputted AA sequence. The program would output a single
>> > amino acid chain in an extended conformation. Is this something of interest
>> > to the developers of Biopython?
>> >
>> > I am using basic calculus to calculate the position of the atoms in the
>> > protein residues and using known protein geometries and database
>> > information from PDB.org and the Dunbrack group < http://dunbrack.fccc.edu/ >.
>> > This program is an extension of my current research in calculating Relative
>> > Solvent Accessibilities of protein residues.
>> >
>> > Thank you for your time,
>> > Matthew Tien
>> >
>> > --
>> > B.S. Biochemistry, University of Texas at Austin
>> > PhD. student, University of Chicago
>> > Marcotte Lab and Wilke Group
>> > alt. Matthew.Tien at yahoo.com
>> > 361-876-0942
>> > _______________________________________________
>> > Biopython-dev mailing list
>> > Biopython-dev at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biopython-dev
>> >
>> _______________________________________________
>> Biopython-dev mailing list
>> Biopython-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>>
>

From jeff.hussmann at gmail.com Mon Jul 30 22:03:58 2012
From: jeff.hussmann at gmail.com (Jeff Hussmann)
Date: Mon, 30 Jul 2012 17:03:58 -0500
Subject: [Biopython-dev] back_table in Bio.Data.CodonTable
Message-ID:

Hello all -

Bio.Data.CodonTable currently has a variable back_table that provides
a mapping from an amino acid to single (arbitrary?) codon that encodes
the amino acid.

Is there any interest in adding a full_back_table (or some other
suitable name) that would provide a mapping from an amino acid to a
list of all codons that encode it? If so, I will submit a pull
request. I have been using this myself for some projects on synonymous
codon usage.

- Jeff

From p.j.a.cock at googlemail.com Tue Jul 31 09:06:14 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 31 Jul 2012 10:06:14 +0100
Subject: [Biopython-dev] back_table in Bio.Data.CodonTable
In-Reply-To:
References:
Message-ID:

On Mon, Jul 30, 2012 at 11:03 PM, Jeff Hussmann wrote:
> Hello all -
>
> Bio.Data.CodonTable currently has a variable back_table that provides
> a mapping from an amino acid to single (arbitrary?) codon that encodes
> the amino acid.

The current code (which I doubt is widely used) does pick an arbitrary
codon (using a sort to ensure this is consistent between Python versions).
As noted in the comments, there are more useful alternatives - but the
example of doing this on usage frequency is organism specific so
can't be hard coded.
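
The "arbitrary but consistent" choice described above can be shown with
a minimal sketch. It assumes a plain dict mapping codons to amino acids;
the function name and the toy table are illustrative only and may not
match what Bio.Data.CodonTable does internally.

```python
def make_back_table(forward_table):
    """Map each amino acid to one codon, chosen deterministically."""
    back_table = {}
    # Sorting the codons fixes the iteration order, so the last codon
    # seen for each amino acid is the same on every run.
    for codon in sorted(forward_table):
        back_table[forward_table[codon]] = codon
    return back_table


# Toy subset of a forward table (codon -> amino acid), illustration only.
toy_forward = {"TTA": "L", "TTG": "L", "CTT": "L", "ATG": "M"}
toy_back = make_back_table(toy_forward)
```

With this toy table leucine ends up mapped to the alphabetically last
of its codons, which is consistent but carries no biological meaning.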

> Is there any interest in adding a full_back_table (or some other
> suitable name) that would provide a mapping from an amino acid to a
> list of all codons that encode it? If so, I will submit a pull
> request. I have been using this myself for some projects on synonymous
> codon usage.

Something like that would be more useful - sure, do a pull request
from the current master branch.

See also past discussions about back translation of sequences, e.g.
http://lists.open-bio.org/pipermail/biopython/2012-April/007901.html

Peter

From p.j.a.cock at googlemail.com Tue Jul 31 10:37:35 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 31 Jul 2012 11:37:35 +0100
Subject: [Biopython-dev] Travis Continuous Integration testing & pull requests
Message-ID:

Hi all,

I'm cross posting as this is an announcement. Please keep any
follow up discussion to the relevant project specific mailing
list, or if general open-bio-l please.

Those following the OBF blog or the OBF or Bio* Twitter accounts
will have already seen this, which I posted yesterday:
http://news.open-bio.org/news/2012/07/travis-ci-for-testing/

In summary, since earlier this year BioRuby and then Biopython
and BioPerl have been using Travis-CI.org (a hosted continuous
integration service for the open source community) to run their
unit tests automatically whenever their GitHub repositories are
updated.

In addition we now have TravisCI automatically running our tests
on any new GitHub pull requests - supported by an OBF donation
to Travis-CI, see:
http://about.travis-ci.org/blog/announcing-pull-request-support/

Currently BioJava only uses GitHub as an SVN mirror - but this
should still let you start using TravisCI for automated testing:
http://about.travis-ci.org/docs/user/languages/java/

For EMBOSS, this is another incentive to convert from CVS to
github - TravisCI recently announced support for C/C++ projects:
http://about.travis-ci.org/blog/support_for_go_c_and_cpp/
http://about.travis-ci.org/docs/user/languages/c/

Potentially there are other OBF projects where this would be
useful too.

Regards,

Peter

From jeff.hussmann at gmail.com Tue Jul 31 19:07:42 2012
From: jeff.hussmann at gmail.com (Jeff Hussmann)
Date: Tue, 31 Jul 2012 14:07:42 -0500
Subject: [Biopython-dev] back_table in Bio.Data.CodonTable
In-Reply-To:
References:
Message-ID:

It seems desirable to have each amino acid's list of codons be given
in a deterministic order. I have been sorting lexicographically using
the ordering 'TCAG'. This is referred to as the 'conventional
ordering' in CodonTable.__str__. The most flexible solution would be
to take the ordering from self.nucleotide_alphabet.letters, but this
would give 'GATC' for any CodonTable using IUPAC.unambiguous_dna as
its nucleotide alphabet. Are there any Biopython-wide conventions
here?

On Tue, Jul 31, 2012 at 4:06 AM, Peter Cock wrote:
> On Mon, Jul 30, 2012 at 11:03 PM, Jeff Hussmann wrote:
>> Hello all -
>>
>> Bio.Data.CodonTable currently has a variable back_table that provides
>> a mapping from an amino acid to single (arbitrary?) codon that encodes
>> the amino acid.
>
> The current code (which I doubt is widely used) does pick an arbitrary
> codon (using a sort to ensure this is consistent between Python versions).
> As noted in the comments, there are more useful alternatives - but the
> example of doing this on usage frequency is organism specific so
> can't be hard coded.
>
>> Is there any interest in adding a full_back_table (or some other
>> suitable name) that would provide a mapping from an amino acid to a
>> list of all codons that encode it? If so, I will submit a pull
>> request. I have been using this myself for some projects on synonymous
>> codon usage.
>
> Something like that would be more useful - sure, do a pull request
> from the current master branch.
>
> See also past discussions about back translation of sequences, e.g.
> http://lists.open-bio.org/pipermail/biopython/2012-April/007901.html
>
> Peter
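
The proposed mapping and the 'TCAG' ordering discussed in this thread
might be sketched as follows. This is only an illustration under stated
assumptions: it builds the standard genetic code by hand instead of
using Bio.Data.CodonTable, it files stop codons under "*", and the names
`make_full_back_table` and `codon_sort_key` are hypothetical rather than
existing Biopython functions.

```python
from collections import defaultdict

BASES = "TCAG"  # the 'conventional ordering' mentioned in the thread

# Standard genetic code, one letter per codon, with codons enumerated
# in TCAG order (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

forward_table = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codon_sort_key(codon, order=BASES):
    """Rank a codon lexicographically under the given base ordering."""
    return tuple(order.index(base) for base in codon)


def make_full_back_table(table, order=BASES):
    """Map each amino acid to the sorted list of all codons encoding it."""
    full = defaultdict(list)
    for codon, amino_acid in table.items():
        full[amino_acid].append(codon)
    return {
        amino_acid: sorted(codons, key=lambda c: codon_sort_key(c, order))
        for amino_acid, codons in full.items()
    }


full_back_table = make_full_back_table(forward_table)
gatc_back_table = make_full_back_table(forward_table, order="GATC")
```

Taking the ordering as a parameter leaves the convention question open:
the same function gives 'TCAG' order by default, or 'GATC' order if the
alphabet's own letters are passed in.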