From p.j.a.cock at googlemail.com Mon Oct 3 07:20:21 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 3 Oct 2011 12:20:21 +0100
Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome
Message-ID:

Hi Brad (et al),

You might have seen on Twitter at the end of last week I mentioned
some work to extend Brad's Bio.Graphics.BasicChromosome to allow
features within a chromosome segment, optionally with labels. The
branch is here:

https://github.com/peterjc/biopython/tree/chr_diag

I put together a non-trivial example of showing the tRNA genes in
Arabidopsis as a unit test in test_GraphicsChromosome.py - this is
deliberately showing too many features in order to check the label
placement algorithm:

http://twitpic.com/6sgr1m

This kind of figure is also used for showing SNP placement and genetic
marker loci used in breeding etc. If I had put more (or a more uniform
set of) features you'd get something worthy of the nickname "millipede
diagram", looking like a segmented body (the chromosome) with thousands
of legs (the lines for the labels).

This isn't quite backwards compatible - the old code draws the
chromosomes left aligned within their allocated space, while I put
them centrally in order to draw labels on either side.

Iddo sounded enthusiastic on Twitter. Does this look worth including
as is? Would someone (doesn't have to be Brad) like to test/review it
please?

Thanks,

Peter

From bioinformed at gmail.com Mon Oct 3 17:28:21 2011
From: bioinformed at gmail.com (Kevin Jacobs)
Date: Mon, 3 Oct 2011 17:28:21 -0400
Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome
In-Reply-To:
References:
Message-ID:

On Mon, Oct 3, 2011 at 7:20 AM, Peter Cock wrote:

> You might have seen on Twitter at the end of last week I mentioned
> some work to extend Brad's Bio.Graphics.BasicChromosome to allow
> features within a chromosome segment, optionally with labels.

This looks to be extremely useful.
Is there any support for layouts to stack or pack chromosomes? I'm
thinking of diagrams for humans, where we don't fit as well in linear
displays. I also think supporting chromosome bands would be extremely
useful. These could include full cytobands, centromeres, euchromatic
vs heterochromatic regions, user configurable bands (e.g. linkage
regions, IBD blocks, etc.)

The figure shows off what I'm thinking about the banding and layout,
even though it uses colored circles instead of text labels:

http://www.genome.gov/multimedia/illustrations/GWAS_2011_1.pdf

If there is interest, I may have some time to work on these features
once the basic infrastructure is stable.

Best regards,
-Kevin

From p.j.a.cock at googlemail.com Mon Oct 3 18:24:12 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 3 Oct 2011 23:24:12 +0100
Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome
In-Reply-To:
References:
Message-ID:

On Monday, October 3, 2011, Kevin Jacobs <jacobs at bioinformed.com> <bioinformed at gmail.com> wrote:
> On Mon, Oct 3, 2011 at 7:20 AM, Peter Cock wrote:
>
>> You might have seen on Twitter at the end of last week I mentioned
>> some work to extend Brad's Bio.Graphics.BasicChromosome to allow
>> features within a chromosome segment, optionally with labels.
>
> This looks to be extremely useful. Is there any support for layouts to
> stack or pack chromosomes? I'm thinking of diagrams for humans, where we
> don't fit as well in linear displays. I also think supporting chromosome
> bands would be extremely useful. These could include full cytobands,
> centromeres, euchromatic vs heterochromatic regions, user configurable bands
> (e.g. linkage regions, IBD blocks, etc.)
>
> The figure shows off what I'm thinking about the banding and layout, even
> though it uses colored circles instead of text labels:
> http://www.genome.gov/multimedia/illustrations/GWAS_2011_1.pdf
>
> If there is interest, I may have some time to work on these features once
> the basic infrastructure is stable.
>
> Best regards,
> -Kevin

Hi Kevin,

I'm glad to hear there is some interest in this :)

That example you linked to is interesting - there are several things
of specific interest - and helps as I'm not yet familiar with all the
technical terms you used.

Notches in the chromosome which I assume are centromeres
(I can see how that might be added to the Bio code as another
segment type, similar to the telomeres).

Coloured background regions in the chromosome (should be
able to do this already), some of which are hatched (not possible
right now... would have to look into ReportLab's capabilities here).
This is what you meant by banding?

Multiple coloured dots for labels. Doable, but a nice API might
be tricky.

For layout did you mean the fact this isn't just a row of
chromosomes left to right, but here there are two rows?
I'm inclined to say the user should just move things in
the PDF for a final version using Adobe or Inkscape ;)

Regards,

Peter

From keith.hughitt at gmail.com Tue Oct 4 07:31:51 2011
From: keith.hughitt at gmail.com (Keith Hughitt)
Date: Tue, 4 Oct 2011 07:31:51 -0400
Subject: [Biopython-dev] Creating a NCBIFastaIterator
Message-ID:

Hi all,

I was thinking recently that it would be nice if the FASTA file reader
were able to check for known formats (e.g. NCBI) and then use that
information to choose better values for name, id, etc.

After some discussion with Peter Cock on GitHub, however, he convinced
me that this would be problematic in terms of backwards compatibility,
and that instead a better approach might be to add a new sub-format
("fasta-ncbi") to the list of supported format readers.

This could go something like:

1. Create a new function in SeqIO.FastaIO for parsing NCBI-formatted
   FASTA files. Add it to the mapping of iterators.
2. FastaIO.NCBIFastaIterator will simply call FastaIterator and then
   modify the result by assigning a new id, name, etc (other
   suggestions?)
3. FastaIO.NCBIFastaWriter (modify and subclass FastaIO.FastaWriter?)
4. Modify code that interacts with NCBI services which return FASTA
   files and have it return a NCBIFastaIterator (First use a
   deprecation/warning to let users know of the pending change?)

Does this sound like it would be a useful feature? What about the basic
approach outlined above? Any suggestions?

Keith

From p.j.a.cock at googlemail.com Tue Oct 4 07:46:19 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Oct 2011 12:46:19 +0100
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To:
References:
Message-ID:

On Tue, Oct 4, 2011 at 12:31 PM, Keith Hughitt wrote:
> Hi all,
>
> I was thinking recently that it would be nice if the FASTA file reader were
> able to check for known formats (e.g. NCBI) and then use that information to
> choose better values for name, id, etc.
>
> After some discussion with Peter Cock on GitHub, however, he convinced me
> that this would be problematic in terms of backwards compatibility, and that
> instead a better approach might be to add a new sub-format ("fasta-ncbi") to
> the list of supported format readers.
>
> This could go something like:
>
> 1. Create a new function in SeqIO.FastaIO for parsing NCBI-formatted FASTA
> files. Add it to the mapping of iterators.

Yes.

> 2. FastaIO.NCBIFastaIterator will simply call FastaIterator and then modify
> the result by assigning a new id, name, etc (other suggestions?)

Store the GI number in the SeqRecord's annotation under key "gi"
to match the GenBank parser. There may be other things like this.

If the FASTA header does not match the NCBI style, that should
probably trigger an exception.

> 3. FastaIO.NCBIFastaWriter (modify and subclass FastaIO.FastaWriter?)

This will be harder, but yes in principle.

> 4. Modify code that interacts with NCBI services which return FASTA files
> and have it return a NCBIFastaIterator (First use a deprecation/warning to
> let users know of the pending change?)

No need. I'm pretty sure all the NCBI code (like Bio.Entrez) returns
handles so it is up to the end user to decide what to do with the
data, e.g. parse it with the current SeqIO "fasta" format, or save
it straight to disk.

> Does this sound like it would be a useful feature? What about the basic
> approach outlined above? Any suggestions?
>
> Keith

Yes, it sounds useful. I'm not sure where the most current NCBI
documentation is, but this is a good start:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html

Peter

From chapmanb at 50mail.com Wed Oct 5 08:03:31 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 05 Oct 2011 08:03:31 -0400
Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome
In-Reply-To:
References:
Message-ID: <87k48j8x2k.fsf@sobchak.i-did-not-set--mail-host-address--so-tickle-me>

Peter;

> >> You might have seen on Twitter at the end of last week I mentioned
> >> some work to extend Brad's Bio.Graphics.BasicChromosome to allow
> >> features within a chromosome segment, optionally with labels.

This is awesome, thanks for extending it. All of your tweaks are good
improvements, and I'm +1 for including it in the next release. Please
improve away.

Thanks much,
Brad

From bioinformed at gmail.com Wed Oct 5 09:16:56 2011
From: bioinformed at gmail.com (Kevin Jacobs)
Date: Wed, 5 Oct 2011 09:16:56 -0400
Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome
In-Reply-To:
References:
Message-ID:

On Mon, Oct 3, 2011 at 6:24 PM, Peter Cock wrote:

> Notches in the chromosome which I assume are centromeres
> (I can see how that might be added to the Bio code as another
> segment type, similar to the telomeres).
Yes-- although the visual style for centromeres need not be precisely as
shown in my example.

> Coloured background regions in the chromosome (should be
> able to do this already), some of which are hatched (not possible
> right now... would have to look into ReportLab's capabilities here).
> This is what you meant by banding?

Yes-- being able to show cytobands and custom bands to designate regions
will be very useful for me. As before, I'm not wed to the cross-hatching,
in fact the standard displays use only grayscale.

> Multiple coloured dots for labels. Doable, but a nice API might
> be tricky.

I don't much care about those -- I'd be happy with text labels.

> For layout did you mean the fact this isn't just a row of
> chromosomes left to right, but here there are two rows?
> I'm inclined to say the user should just move things in
> the PDF for a final version using Adobe or Inkscape ;)

Correct. I'd prefer to have some programmatic control of layout, since I'd
hate to have to manually edit every whole-genome plot. Since I'm working
exclusively with human data for now, it would be possible to pre-specify a
few standard layouts and avoid the trouble of supporting dynamic features.

Just let me know when the code is stable enough to start poking around.
I'll float a proposal for what I think could be done to obtain feedback
before I commit much time to coding.

Thanks,
-Kevin

From p.j.a.cock at googlemail.com Wed Oct 5 09:32:34 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 5 Oct 2011 14:32:34 +0100
Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome
In-Reply-To:
References:
Message-ID:

On Wed, Oct 5, 2011 at 2:16 PM, Kevin Jacobs wrote:
> On Mon, Oct 3, 2011 at 6:24 PM, Peter Cock wrote:
>>
>> Notches in the chromosome which I assume are centromeres
>> (I can see how that might be added to the Bio code as another
>> segment type, similar to the telomeres).
>
> Yes-- although the visual style for centromeres need not be precisely as
> shown in my example.
>
>>
>> Coloured background regions in the chromosome (should be
>> able to do this already), some of which are hatched (not possible
>> right now... would have to look into ReportLab's capabilities here).
>> This is what you meant by banding?
>
> Yes-- being able to show cytobands and custom bands to designate regions
> will be very useful for me. As before, I'm not wed to the cross-hatching,
> in fact the standard displays use only grayscale.

OK - simple colours are easy, I can add that to the test case example.

>>
>> Multiple coloured dots for labels. Doable, but a nice API might
>> be tricky.
>
> I don't much care about those -- I'd be happy with text labels.
>

Good.

>>
>> For layout did you mean the fact this isn't just a row of
>> chromosomes left to right, but here there are two rows?
>> I'm inclined to say the user should just move things in
>> the PDF for a final version using Adobe or Inkscape ;)
>
> Correct. I'd prefer to have some programmatic control of layout, since I'd
> hate to have to manually edit every whole-genome plot. Since I'm working
> exclusively with human data for now, it would be possible to pre-specify a
> few standard layouts and avoid the trouble of supporting dynamic features.
> Just let me know when the code is stable enough to start poking around.
> I'll float a proposal for what I think could be done to obtain feedback
> before I commit much time to coding.

Would an option for using multiple rows be enough? It wouldn't be
quite as compact as the tweaked human example you showed - but
probably good enough to print on a single page.

Another option is to do the PDF editing programmatically, for
example with ReportLab. You can embed multiple (smaller) PDF
files within a larger container. It's a bit fiddly, but would be worth
the effort for a major pipeline where you always use the same
(few) organism(s).
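
The multiple-row idea raised in this reply might be sketched roughly as
follows. This is only an illustration of the arithmetic, not part of
Bio.Graphics.BasicChromosome: the function name row_layout and its
arguments are hypothetical, and it assumes a bottom-left origin as used
for PDF page coordinates.

```python
def row_layout(n_items, n_rows, page_width, page_height):
    """Return one (x, y, width, height) box per item.

    Hypothetical helper: items fill rows left to right, top row first,
    and every box has the same size.
    """
    if n_items < 1 or n_rows < 1:
        raise ValueError("need at least one item and one row")
    per_row = -(-n_items // n_rows)  # ceiling division
    box_w = page_width / float(per_row)
    box_h = page_height / float(n_rows)
    boxes = []
    for i in range(n_items):
        row, col = divmod(i, per_row)
        x = col * box_w
        # Origin is assumed to be the bottom-left corner of the page,
        # so the first row sits at the top.
        y = page_height - (row + 1) * box_h
        boxes.append((x, y, box_w, box_h))
    return boxes
```

For 24 human chromosomes in two rows, each chromosome would get one of
24 equal boxes, 12 per row, and could be drawn inside its box.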
Peter

From p.j.a.cock at googlemail.com Wed Oct 5 10:40:56 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 5 Oct 2011 15:40:56 +0100
Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome
In-Reply-To: <87k48j8x2k.fsf@sobchak.i-did-not-set--mail-host-address--so-tickle-me>
References: <87k48j8x2k.fsf@sobchak.i-did-not-set--mail-host-address--so-tickle-me>
Message-ID:

On Wed, Oct 5, 2011 at 1:03 PM, Brad Chapman wrote:
>
> Peter;
>
>> >> You might have seen on Twitter at the end of last week I mentioned
>> >> some work to extend Brad's Bio.Graphics.BasicChromosome to allow
>> >> features within a chromosome segment, optionally with labels.
>
> This is awesome, thanks for extending it. All of your tweaks are good
> improvements, and I'm +1 for including it in the next release. Please
> improve away.

Awesome. I've applied the current branch to the trunk, although I'm
not promising there won't be changes to the new stuff between now
and the next release.

In particular, doing the labels (and their placement) for the whole
of a chromosome (and not just for a segment) would allow us to
squeeze in more labels (e.g. in the example I showed, using the
vertical space currently reserved for the telomeres).

Peter

From p.j.a.cock at googlemail.com Wed Oct 5 17:17:38 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 5 Oct 2011 22:17:38 +0100
Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome
In-Reply-To:
References:
Message-ID:

On Wed, Oct 5, 2011 at 2:32 PM, Peter Cock wrote:
> On Wed, Oct 5, 2011 at 2:16 PM, Kevin Jacobs wrote:
>> On Mon, Oct 3, 2011 at 6:24 PM, Peter Cock wrote:
>>> Coloured background regions in the chromosome (should be
>>> able to do this already), some of which are hatched (not possible
>>> right now... would have to look into ReportLab's capabilities here).
>>> This is what you meant by banding?
>>
>> Yes-- being able to show cytobands and custom bands to designate regions
>> will be very useful for me. As before, I'm not wed to the cross-hatching,
>> in fact the standard displays use only grayscale.
>
> OK - simple colours are easy, I can add that to the test case example.

Done, using some random placements - I didn't manage to find the
real Arabidopsis cytoband data which would have been nicer.

https://github.com/biopython/biopython/commit/24deaca63ba55e28519a4c85650ad74e849f203e

Peter

From p.j.a.cock at googlemail.com Wed Oct 5 18:31:18 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 5 Oct 2011 23:31:18 +0100
Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome
In-Reply-To:
References: <87k48j8x2k.fsf@sobchak.i-did-not-set--mail-host-address--so-tickle-me>
Message-ID:

On Wed, Oct 5, 2011 at 3:40 PM, Peter Cock wrote:
>
> In particular, doing the labels (and their placement) for the whole
> of a chromosome (and not just for a segment) would allow us to
> squeeze in more labels (e.g. in example I showed using the
> vertical space currently reserved for the telomeres).
>

Done,
https://github.com/biopython/biopython/commit/d3d19440bdbaabbf4cd305e43dea627f68cf6ecf

We may want to review how chromosome segment labels work - probably
simplest to add them to the dynamically placed label list, otherwise
the two can overlap.

Peter

From tiagoantao at gmail.com Thu Oct 6 12:17:40 2011
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 6 Oct 2011 17:17:40 +0100
Subject: [Biopython-dev] bio.expasy potential bug?
Message-ID:

Hi,

This might be a red herring but:
http://biopython.org/DIST/docs/api/Bio.ExPASy-module.html :
sprot_search_ful(text, make_wild=None, swissprot=1, trembl=None,
cgi='http://www.expasy.ch/cgi-bin/sprot-search-ful')

That cgi does not exist...

Tiago

--
"If you want to get laid, go to college. If you want an education, go
to the library."
- Frank Zappa

From p.j.a.cock at googlemail.com Thu Oct 6 12:23:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Oct 2011 17:23:03 +0100
Subject: [Biopython-dev] bio.expasy potential bug?
In-Reply-To:
References:
Message-ID:

2011/10/6 Tiago Antão:
> Hi,
>
> This might be a red herring but:
> http://biopython.org/DIST/docs/api/Bio.ExPASy-module.html :
> sprot_search_ful(text, make_wild=None, swissprot=1, trembl=None,
> cgi='http://www.expasy.ch/cgi-bin/sprot-search-ful')
>
> That cgi does not exist...
>
> Tiago

Looks like they've changed the URL or turned off a redirect :(

If you can work out what it should be, please go ahead and fix it.
A working unit test would be good (mark it as requires internet).

Peter

From tiagoantao at gmail.com Thu Oct 6 12:33:11 2011
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 6 Oct 2011 17:33:11 +0100
Subject: [Biopython-dev] bio.expasy potential bug?
In-Reply-To:
References:
Message-ID:

2011/10/6 Peter Cock:
> Looks like they've changed the URL or turned off a redirect :(
>
> If you can work out what it should be, please go ahead and fix it.
> A working unit test would be good (mark it as requires internet).

I will add the bug to redmine. I am currently pressed for time to sort
this out :( I can have a look next week.

From redmine at redmine.open-bio.org Thu Oct 6 13:06:26 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 6 Oct 2011 17:06:26 +0000
Subject: [Biopython-dev] [Biopython - Bug #3301] (New) Bio.ExPASy sprot_search_ful has wrong cgi address
Message-ID:

Issue #3301 has been reported by Tiago Antao.
----------------------------------------
Bug #3301: Bio.ExPASy sprot_search_ful has wrong cgi address
https://redmine.open-bio.org/issues/3301

Author: Tiago Antao
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL:

The Bio.ExPASy sprot_search_ful has a cgi of
http://www.expasy.ch/cgi-bin/sprot-search-ful , but that URL is not
available anymore.

See: http://biopython.org/DIST/docs/api/Bio.ExPASy-module.html

From keith.hughitt at gmail.com Fri Oct 7 07:18:10 2011
From: keith.hughitt at gmail.com (Keith Hughitt)
Date: Fri, 7 Oct 2011 07:18:10 -0400
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To:
References:
Message-ID:

Okay, I took a stab at it. The code is in the master branch of my fork:
https://github.com/khughitt/biopython/blob/75be77cf28d376329577adf5ec41a8880b7faf5c/Bio/SeqIO/FastaIO.py#L73

I wasn't sure what the best choices are for id/name so for now I stored
the gid in id (and also in the annotations), and the accession as name.
Any suggestions?

I also haven't written any test code yet. Should I parameterize
TitleFunctions.simple_check and multi_check, or is there another
approach you would advise?

Keith

From p.j.a.cock at googlemail.com Fri Oct 7 08:49:30 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Oct 2011 13:49:30 +0100
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To:
References:
Message-ID:

On Fri, Oct 7, 2011 at 12:18 PM, Keith Hughitt wrote:
> Okay, I took a stab at it.
> The code is in the master branch of my
> fork: https://github.com/khughitt/biopython/blob/75be77cf28d376329577adf5ec41a8880b7faf5c/Bio/SeqIO/FastaIO.py#L73

You are only handling gi||ref||
whereas the NCBI have a *lot* of other variations to consider:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html

This is quite an open ended bit of work...

> I wasn't sure what the best choices are for id/name so for now I stored the
> gid in id (and also in the annotations), and the accession as name. Any
> suggestions?

I suggest collecting a selection of matched NCBI FASTA and
GenBank/GenPept files, look at how Biopython handles the
GenBank/GenPept version (format name "genbank" alias "gb"
in Bio.SeqIO) and try to make handling the FASTA version as
"fasta-ncbi" do the same.

e.g. From our unit tests (from the NCBI FTP site), these are
a pair:

Tests/GenBank/NC_005816.gb
Tests/GenBank/NC_005816.fna

> I also haven't written any test code yet. Should I parameterize
> TitleFunctions.simple_check and multi_check, or is there
> another approach you would advise?
> Keith

Probably write some completely new tests. e.g. Use the
existing test files mentioned above, and verify that both
the "genbank" and the "fasta-ncbi" parser give the same
results (ignoring things not in the FASTA file of course).

Peter

From andrew.sczesnak at med.nyu.edu Fri Oct 7 11:38:04 2011
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Fri, 07 Oct 2011 11:38:04 -0400
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To:
References:
Message-ID: <4E8F1CDC.8090500@med.nyu.edu>

Adding my unsolicited opinion here, what do y'all think of this
NCBIFasta parser being a more general "callback" parser, where a
function passed to read() or write() translates some arbitrary
delimited-text into an (id, name, description) tuple, as in:

    def x(seqrec):
        # gi||ref||
        y = seqrec.description.strip().split("|")
        # gi acc desc
        return (y[1], y[3], y[4])

    # calls x on every record in the FASTA
    # (the third argument is the proposed callback, not an existing option)
    for seqrec in SeqIO.parse(fp, "fasta", x):
        print(seqrec)

This would be similar to key_function in SeqIO.to_dict() and would
shift the responsibility of handling variation in formats to the user.
Alternatively, a few functions to parse different styles of description
lines could be included in the module.

Andrew

On 10/07/2011 08:49 AM, Peter Cock wrote:
> On Fri, Oct 7, 2011 at 12:18 PM, Keith Hughitt wrote:
>> Okay, I took a stab at it. The code is in the master branch of my
>> fork: https://github.com/khughitt/biopython/blob/75be77cf28d376329577adf5ec41a8880b7faf5c/Bio/SeqIO/FastaIO.py#L73
>
> You are only handling gi||ref||
> whereas the NCBI have a *lot* of other variations to consider:
>
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html
>
> This is quite an open ended bit of work...
>
>> I wasn't sure what the best choices are for id/name so for now I stored the
>> gid in id (and also in the annotations), and the accession as name. Any
>> suggestions?
>
> I suggest collecting a selection of matched NCBI FASTA and
> GenBank/GenPept files, look at how Biopython handles the
> GenBank/GenPept version (format name "genbank" alias "gb"
> in Bio.SeqIO) and try to make handling the FASTA version as
> "fasta-ncbi" do the same.
>
> e.g. From our unit tests (from the NCBI FTP site), these are
> a pair:
>
> Tests/GenBank/NC_005816.gb
> Tests/GenBank/NC_005816.fna
>
>> I also haven't written any test code yet. Should I parameterize
>> TitleFunctions.simple_check and multi_check, or is there
>> another approach you would advise?
>> Keith
>
> Probably write some completely new tests. e.g. Use the
> existing test files mentioned above, and verify that both
> the "genbank" and the "fasta-ncbi" parser give the same
> results (ignoring things not in the FASTA file of course).
>
> Peter

From p.j.a.cock at googlemail.com Fri Oct 7 12:00:52 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Oct 2011 17:00:52 +0100
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To: <4E8F1CDC.8090500@med.nyu.edu>
References: <4E8F1CDC.8090500@med.nyu.edu>
Message-ID:

On Fri, Oct 7, 2011 at 4:38 PM, Andrew Sczesnak wrote:
> Adding my unsolicited opinion here, what do y'all think of this NCBIFasta
> parser being a more general "callback" parser, where a function passed to
> read() or write() translates some arbitrary delimited-text into ...
>
> This would be similar to key_function in SeqIO.to_dict() and would shift the
> responsibility of handling variation in formats to the user. Alternatively,
> a few functions to parse different styles of description lines could be
> included in the module.
>
> Andrew

Hi Andrew,

Interesting idea, although it doesn't fit that well with the current
(deliberately) simple high level Bio.SeqIO.parse/read API,
that doesn't mean we can't do it (see Bio.Phylo.parse).

In this case I fail to see what benefit this gives over the current
situation, where the user can do this themselves with the
current FASTA parser,

e.g. With a function and a generator expression,

    records = (do_ncbi_my_way(record) for record in SeqIO.parse(filename, "fasta"))

or more simply within a loop:

    for record in SeqIO.parse(filename, "fasta"):
        do_ncbi_my_way(record)
        # Do stuff with record

etc.

Maybe it is down to personal preference of coding style?

I would much prefer a new "fasta-ncbi" parser in SeqIO
that handled all the documented NCBI FASTA identifiers.

I'm being negative here - but please don't let that deter you
from posting ideas. This is a public list and we/I welcome
constructive criticism and alternative ideas to the table.
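
As a rough illustration of the do-it-yourself route described above,
here is a minimal sketch that assumes titles of the single form
gi|number|db|accession|description. The helper names parse_gi_title and
do_ncbi_my_way, and the choice to use the accession for both id and
name, are assumptions made for this example and are not Biopython
functions; the many other documented NCBI identifier styles are not
handled.

```python
def parse_gi_title(title):
    """Split a 'gi|number|db|accession|description' FASTA title.

    Hypothetical helper: only this one NCBI pattern is recognised,
    anything else raises ValueError.
    """
    fields = title.strip().split("|", 4)
    if len(fields) != 5 or fields[0] != "gi" or not fields[1].isdigit():
        raise ValueError("Not a gi|...|db|...| style title: %r" % title)
    _, gi, db, accession, description = fields
    return {
        "gi": gi,
        "db": db,
        "accession": accession,
        "description": description.strip(),
    }


def do_ncbi_my_way(record):
    """Hypothetical per-record hook that updates a SeqRecord-like object.

    Assumes record.description holds the whole FASTA title line and
    that record.annotations is a dictionary.
    """
    info = parse_gi_title(record.description)
    record.id = info["accession"]    # assumption: accession as id
    record.name = info["accession"]  # assumption: accession as name
    record.annotations["gi"] = info["gi"]
    record.description = info["description"]
    return record
```

With Biopython installed, this could be combined with the generator
expression shown above, i.e. calling do_ncbi_my_way on each record
returned by SeqIO.parse(filename, "fasta").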
Regards,

Peter

From p.j.a.cock at googlemail.com Fri Oct 7 12:16:55 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Oct 2011 17:16:55 +0100
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To: <4E8F239D.30504@med.nyu.edu>
References: <4E8F1CDC.8090500@med.nyu.edu> <4E8F239D.30504@med.nyu.edu>
Message-ID:

On Fri, Oct 7, 2011 at 5:06 PM, Andrew Sczesnak wrote:
>>
>> Maybe it is down to personal preference of coding style?
>
> I agree, there isn't much difference between specifying the callback
> function in parse() or within the loop. To me, this points out that
> re-implementing a FASTA parser simply for a format of description
> line seems unnecessary.
>
> If a user is interested in extracting a particular piece of information
> from a FASTA description and knows the input format of the file, how
> difficult is it for them to split() it on their own? What exactly are the
> advantages of a separate parser?

Not enough of an advantage for me personally to have gone
and written it myself ;)

I can see some benefits in extracting information from the
NCBI identifier and storing them in the SeqRecord's dbxref
list and annotation dictionary (as consistently with our other
parsers as possible) if you are going to want to use those
fields yourself.

Perhaps Keith can explain his interest with some examples?

Peter

From andrew.sczesnak at med.nyu.edu Fri Oct 7 12:06:53 2011
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Fri, 07 Oct 2011 12:06:53 -0400
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To:
References: <4E8F1CDC.8090500@med.nyu.edu>
Message-ID: <4E8F239D.30504@med.nyu.edu>

On 10/07/2011 12:00 PM, Peter Cock wrote:
> Hi Andrew,
>
> Interesting idea, although it doesn't fit that well with the current
> (deliberately) simple high level Bio.SeqIO.parse/read API,
> that doesn't mean we can't do it (see Bio.Phylo.parse).
>
> In this case I fail to see what benefit this gives over the current
> situation, where the user can do this themselves with the
> current FASTA parser,
>
> e.g. With a function and a generator expression,
>
> records = (do_ncbi_my_way(record) for record in SeqIO.parse(filename, "fasta"))
>
> or more simply within a loop:
>
> for record in SeqIO.parse(filename, "fasta"):
>     do_ncbi_my_way(record)
>     # Do stuff with record
>
> etc.
>
> Maybe it is down to personal preference of coding style?

I agree, there isn't much difference between specifying the callback
function in parse() or within the loop. To me, this points out that
re-implementing a FASTA parser simply for a format of description line
seems unnecessary.

If a user is interested in extracting a particular piece of information
from a FASTA description and knows the input format of the file, how
difficult is it for them to split() it on their own? What exactly are
the advantages of a separate parser?

> I would much prefer a new "fasta-ncbi" parser in SeqIO
> that handled all the documented NCBI FASTA identifiers.
>
> I'm being negative here - but please don't let that deter you
> from posting ideas. This is a public list and we/I welcome
> constructive criticism and alternative ideas to the table.
>
> Regards,
>
> Peter

From keith.hughitt at gmail.com Fri Oct 7 13:02:30 2011
From: keith.hughitt at gmail.com (Keith Hughitt)
Date: Fri, 7 Oct 2011 13:02:30 -0400
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To:
References: <4E8F1CDC.8090500@med.nyu.edu> <4E8F239D.30504@med.nyu.edu>
Message-ID:

It's really just meant to be a bit of "polish." Originally I was
thinking not about having a separate parser but simply extending the
existing FASTA parser to recognize common formats (e.g. NCBI) and choose
better ids, annotations, etc. Since that would create problems in terms
of backwards compatibility, however, adding a new parser seemed like the
next best option.
Part of the goal, personally, was also just to find a small but useful
task I could work on to begin to learn the code and contribute some. It
shouldn't be forced though, so I don't want to contribute something
unless it's actually an improvement.

Keith

On Fri, Oct 7, 2011 at 12:16 PM, Peter Cock wrote:
> On Fri, Oct 7, 2011 at 5:06 PM, Andrew Sczesnak
> wrote:
> >>
> >> Maybe it is down to personal preference of coding style?
> >
> > I agree, there isn't much difference between specifying the callback
> > function in parse() or within the loop. To me, this points out that
> > re-implementing a FASTA parser simply for a format of description
> > line seems unnecessary.
> >
> > If a user is interested in extracting a particular piece of information
> > from a FASTA description and knows the input format of the file, how
> > difficult is it for them to split() it on their own? What exactly are the
> > advantages of a separate parser?
>
> Not enough of an advantage for me personally to have gone
> and written it myself ;)
>
> I can see some benefits in extracting information from the
> NCBI identifier and storing them in the SeqRecord's dbxref
> list and annotation dictionary (as consistently with our other
> parsers as possible) if you are going to want to use those
> fields yourself.
>
> Perhaps Keith can explain his interest with some examples?
>
> Peter

From b.invergo at gmail.com Mon Oct 10 06:36:47 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Mon, 10 Oct 2011 12:36:47 +0200
Subject: [Biopython-dev] Parsing PAML supplementary output
Message-ID: <1318243007.12974.16.camel@localhost.localdomain>

Hi all,

I've received a request to implement the parsing of the main
supplementary output files of the PAML programs ('rst' files).
I can't submit a bug on Bugzilla, so I'll just announce my intention to work on this here on the list. One question though. The rst file for baseml includes an alignment which is in the Phylip sequential format. I thought that it would be nice to parse that directly into a Biopython MultipleSeqAlignment. It's my understanding that Biopython only supports the interleaved format. Would it be worth it for me to extend that functionality to include the sequential format or would it be preferable to convert the alignments to be interleaved within the parser itself? Regards, Brandon Invergo From p.j.a.cock at googlemail.com Mon Oct 10 08:21:52 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 10 Oct 2011 13:21:52 +0100 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: <1318243007.12974.16.camel@localhost.localdomain> References: <1318243007.12974.16.camel@localhost.localdomain> Message-ID: On Mon, Oct 10, 2011 at 11:36 AM, Brandon Invergo wrote: > Hi all, > I've received a request to implement the parsing of the main > supplementary output files of the PAML programs ('rst' files). I can't > submit a bug on Bugzilla, so I'll just announce my intention to work on > this here on the list. That's because we moved to RedMine, there should have been a link on the old Bugzilla page, but anyway its here: https://redmine.open-bio.org/projects/biopython > One question though. The rst file for baseml includes an alignment which > is in the Phylip sequential format. I thought that it would be nice to > parse that directly into a Biopython MultipleSeqAlignment. It's my > understanding that Biopython only supports the interleaved format. Would > it be worth it for me to extend that functionality to include the > sequential format or would it be preferable to convert the alignments to > be interleaved within the parser itself? 
> > Regards, > Brandon Invergo If you can extend the current PHYLIP parser (strict or relaxed) to cover interleaved and sequential, that would be nice. For strict mode at least, we can in principle follow whatever the original PHYLIP tools do to detect this automatically. It may be safer to make it explicit though - from what I recall without seeing the PHYLIP implementation's source code it was not obvious how to do this reliably. Peter From b.invergo at gmail.com Mon Oct 10 09:22:18 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Mon, 10 Oct 2011 15:22:18 +0200 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: References: <1318243007.12974.16.camel@localhost.localdomain> Message-ID: <1318252938.12974.54.camel@localhost.localdomain> Hi Peter > That's because we moved to RedMine, there should have > been a link on the old Bugzilla page, but anyway its here: > https://redmine.open-bio.org/projects/biopython Ok, I'll file an enhancement request there. I didn't see a link on the Bugzilla page and there are still some links to Bugzilla on the wiki, like in the "What's being worked on" section. I missed the Issue Tracker link on the left (incidentally, I think this is a design problem of the typical wiki layout and not Biopython-specific...I never notice the contents of that list), so it might be advisable to include the link under the Contribute heading of the main page. > If you can extend the current PHYLIP parser (strict or relaxed) > to cover interleaved and sequential, that would be nice. For > strict mode at least, we can in principle follow whatever the > original PHYLIP tools do to detect this automatically. It may > be safer to make it explicit though - from what I recall without > seeing the PHYLIP implementation's source code it was not > obvious how to do this reliably. Ok, I'll take a look at the PHYLIP source code to see how they do it there. I'll report back with problems/notable progress/questions. 
Cheers, Brandon From redmine at redmine.open-bio.org Mon Oct 10 09:29:47 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 10 Oct 2011 13:29:47 +0000 Subject: [Biopython-dev] [Biopython - Feature #3303] (New) Support PHYLIP sequential alignment format in AlignIO Message-ID: Issue #3303 has been reported by Brandon Invergo. ---------------------------------------- Feature #3303: Support PHYLIP sequential alignment format in AlignIO https://redmine.open-bio.org/issues/3303 Author: Brandon Invergo Status: New Priority: Normal Assignee: Brandon Invergo Category: Target version: URL: Currently only PHYLIP alignments in the interleaved format can be read by AlignIO however since some programs still work on the sequential format it would be helpful to be able to support that as well. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Mon Oct 10 09:31:13 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 10 Oct 2011 13:31:13 +0000 Subject: [Biopython-dev] [Biopython - Feature #3304] (New) Parse PAML supplementary (rst) output files Message-ID: Issue #3304 has been reported by Brandon Invergo. ---------------------------------------- Feature #3304: Parse PAML supplementary (rst) output files https://redmine.open-bio.org/issues/3304 Author: Brandon Invergo Status: New Priority: Normal Assignee: Brandon Invergo Category: Target version: URL: PAML programs create several output files, the main one of which is already parsed by the Bio.Phylo.PAML modules. 
The primary supplementary output files ('rst' files) also contain information that is useful for some users so they should be parsed as well. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Mon Oct 10 12:35:15 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 10 Oct 2011 17:35:15 +0100 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: <1318252938.12974.54.camel@localhost.localdomain> References: <1318243007.12974.16.camel@localhost.localdomain> <1318252938.12974.54.camel@localhost.localdomain> Message-ID: On Mon, Oct 10, 2011 at 2:22 PM, Brandon Invergo wrote: > Hi Peter > >> That's because we moved to RedMine, there should have >> been a link on the old Bugzilla page, but anyway its here: >> https://redmine.open-bio.org/projects/biopython > > Ok, I'll file an enhancement request there. I didn't see a link on the > Bugzilla page and there are still some links to Bugzilla on the wiki, > like in the "What's being worked on" section. Fixed, thanks. > I missed the Issue Tracker > link on the left (incidentally, I think this is a design problem of the > typical wiki layout and not Biopython-specific...I never notice the > contents of that list), so it might be advisable to include the link > under the Contribute heading of the main page. Good idea, done. 
Peter From p.j.a.cock at googlemail.com Mon Oct 10 17:47:03 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 10 Oct 2011 22:47:03 +0100 Subject: [Biopython-dev] Moving strand & db ref from SeqFeature to FeatureLocation Message-ID: This was on the "SeqFeature start/end and making positions act like ints" thread last month: http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009183.html On Mon, Sep 19, 2011 at 10:03 AM, Peter Cock wrote: >> Well, slightly easier - I have some more dramatic changes to >> the SeqFeature and FeatureLocation objects planned, but I'm >> still playing with this. > > One of the key changes (which can be done without > really changing the API) is to move the database & > accession and the strand from the SeqFeature to the > FeatureLocation. These are intimately connected with > the location, as much as the start/end. > > This is one of the things I've been working on here: > https://github.com/peterjc/biopython/commits/f_loc > > The other key change on that experimental branch > is moving away from sub_features for join locations > (etc). Here I was trying a new CoupoundLocation > object, but am still wondering if this should be done > in the SeqFeature or FeatureLocation object instead > (or if SeqFeature should subclass FeatureLocation). > > Peter That branch needs some manual merge conflict resolution with the integer subclassing position changes that landed on the trunk, which I've started: https://github.com/peterjc/biopython/tree/f_loc2 Would someone like to review that please? It moves the strand, ref and db_ref properties from the SeqFeature object to the FeatureLocation object, implementing read/write proxy methods for backward compatibility. 
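For reviewers, the proxy idea can be pictured with a small sketch. ToyLocation and ToyFeature below are invented stand-ins, not the real Bio.SeqFeature classes on the branch, and only strand and ref are shown; the actual attribute names and behaviour should be checked against the f_loc2 code.

```python
class ToyLocation(object):
    """Toy stand-in for a location that owns the strand and reference."""

    def __init__(self, start, end, strand=None, ref=None):
        self.start = start
        self.end = end
        self.strand = strand
        self.ref = ref


class ToyFeature(object):
    """Toy stand-in for a feature that proxies strand/ref to its location."""

    def __init__(self, location=None, type=""):
        self.location = location
        self.type = type

    def _get_strand(self):
        return self.location.strand

    def _set_strand(self, value):
        # Raises AttributeError when no location has been set yet
        self.location.strand = value

    strand = property(_get_strand, _set_strand,
                      doc="Strand, stored on the location object.")

    def _get_ref(self):
        return self.location.ref

    def _set_ref(self, value):
        self.location.ref = value

    ref = property(_get_ref, _set_ref,
                   doc="Reference, stored on the location object.")
```

Old-style code that reads or writes feature.strand keeps working because the property forwards to the location. In this toy version, writing the strand on a feature whose location is still None fails with an AttributeError.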
Other than the commit which changes the __str__ method (the fine details of which I am happy to tweak with discussion) this should be almost 100% back compatible: https://github.com/peterjc/biopython/commit/fed003821d0d223a7b3042ccc3bdf8442348f043 The one break I am aware of is you can't now create a SeqFeature with an empty location and then try to set the strand or db regs before setting the location object. (which is what the GenBank parser was doing). The motivation is that the strand and (optional) database reference for which the location start/end apply are both essential parts of the location information, and I feel never should have been attached to the SeqFeature directly. Furthermore, this separation is useful as a step towards reworking the current use of the SeqFeature's sub_feature list for multi-part locations (e.g. joins in GenBank/EMBL), more on this later. Thanks, Peter From b.invergo at gmail.com Tue Oct 11 03:51:26 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Tue, 11 Oct 2011 09:51:26 +0200 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: References: <1318243007.12974.16.camel@localhost.localdomain> Message-ID: <1318319486.3137.19.camel@localhost.localdomain> > If you can extend the current PHYLIP parser (strict or relaxed) > to cover interleaved and sequential, that would be nice. For > strict mode at least, we can in principle follow whatever the > original PHYLIP tools do to detect this automatically. It may > be safer to make it explicit though - from what I recall without > seeing the PHYLIP implementation's source code it was not > obvious how to do this reliably. > I checked out the PHYLIP code and yes it's not really obvious how the mode is detected. In fact, it seems that many of the programs ask for user input to specify the format of the alignment. 
So, regarding making it explicit, I'm not sure if this is what you meant but I was thinking it might be simplest to add another Iterator/Writer pair in the PhylipIO module for SequentialPhylip which inherit from the basic Phylip classes, overriding the next() method in the iterator and the write_alignment() method in the writer, much in the way that the RelaxedPhylip classes work. This would mean that there would be no flexibility in the naming rules (ie relaxed vs strict) for the SequentialPhylip format, unless I were to also make a RelaxedSequentialPhylip pair of classes. PAML relaxes the sequence name length restriction to 30 characters and since the whole reason for embarking on this exercise was to support PAML's output of PHYLIP alignments, if only one naming convention is to be implemented I think it would be best to default to the relaxed rules. Slightly unrelated musings: I was thinking that with Biopython's support for reading PHYLIP alignments and Newick trees into objects, at some point it would probably be convenient to make the Bio.Phylo.PAML package more integrated by allowing the user to pass in such objects as arguments rather than writing them to files first; the PAML module could write them to temp files itself. I think some minor changes might have to be made in places (ie for PAML to accept interleaved alignments, the header line must contain an 'I' flag after the seq # and seq len integers) and I'd have to think about how best to allow passing such objects while still retaining the ability to specify filenames without using kludgy, non-pythonic type-checking. Anyway, another task for another day, but I thought I'd throw it out there. 
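Coming back to the sequential format itself, the reading logic a SequentialPhylip iterator would need can be sketched in plain Python as below. This is only an illustration under stated assumptions: read_sequential_phylip is a hypothetical function (not part of PhylipIO), names follow the relaxed convention of "first word on the line", and anything after the two integers in the header is ignored.

```python
def read_sequential_phylip(lines):
    """Return [(name, sequence), ...] from relaxed sequential PHYLIP lines.

    Illustrative sketch only: the name is the first word on a record's first
    line (relaxed rules), and blank lines are skipped.
    """
    rows = [line.strip() for line in lines if line.strip()]
    if not rows:
        raise ValueError("Empty input")
    header = rows[0].split()
    # Only the first two header fields are used; anything after is ignored
    n_seqs, seq_len = int(header[0]), int(header[1])
    records = []
    pos = 1
    for _ in range(n_seqs):
        if pos >= len(rows):
            raise ValueError("Expected %i sequences, found %i"
                             % (n_seqs, len(records)))
        parts = rows[pos].split(None, 1)
        name = parts[0]
        seq = "".join(parts[1].split()) if len(parts) > 1 else ""
        pos += 1
        # Sequential layout: each record is read to completion before the
        # next name appears, so keep consuming lines until it is full
        while len(seq) < seq_len and pos < len(rows):
            seq += "".join(rows[pos].split())
            pos += 1
        if len(seq) != seq_len:
            raise ValueError("Sequence %r has length %i, expected %i"
                             % (name, len(seq), seq_len))
        records.append((name, seq))
    return records
```

The contrast with the interleaved layout is the inner while loop: an interleaved reader would instead read one block for every sequence and then cycle through the names again.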
Regards, Brandon From p.j.a.cock at googlemail.com Tue Oct 11 04:20:52 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Oct 2011 09:20:52 +0100 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: <1318319486.3137.19.camel@localhost.localdomain> References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> Message-ID: On Tue, Oct 11, 2011 at 8:51 AM, Brandon Invergo wrote: >> If you can extend the current PHYLIP parser (strict or relaxed) >> to cover interleaved and sequential, that would be nice. For >> strict mode at least, we can in principle follow whatever the >> original PHYLIP tools do to detect this automatically. It may >> be safer to make it explicit though - from what I recall without >> seeing the PHYLIP implementation's source code it was not >> obvious how to do this reliably. >> > I checked out the PHYLIP code and yes it's not really obvious how the > mode is detected. In fact, it seems that many of the programs ask for > user input to specify the format of the alignment. > > So, regarding making it explicit, I'm not sure if this is what you meant > but I was thinking it might be simplest to add another Iterator/Writer > pair in the PhylipIO module for SequentialPhylip which inherit from the > basic Phylip classes, overriding the next() method in the iterator and > the write_alignment() method in the writer, much in the way that the > RelaxedPhylip classes work. Something like that as a new format variant, yes. > This would mean that there would be no flexibility in the naming rules > (ie relaxed vs strict) for the SequentialPhylip format, unless I were to > also make a RelaxedSequentialPhylip pair of classes. 
PAML relaxes the > sequence name length restriction to 30 characters and since the whole > reason for embarking on this exercise was to support PAML's output of > PHYLIP alignments, if only one naming convention is to be implemented I > think it would be best to default to the relaxed rules. Practical. > Slightly unrelated musings: I was thinking that with Biopython's support > for reading PHYLIP alignments and Newick trees into objects, at some > point it would probably be convenient to make the Bio.Phylo.PAML package > more integrated by allowing the user to pass in such objects as > arguments rather than writing them to files first; the PAML module could > write them to temp files itself. I think some minor changes might have > to be made in places (ie for PAML to accept interleaved alignments, the > header line must contain an 'I' flag after the seq # and seq len > integers) and I'd have to think about how best to allow passing such > objects while still retaining the ability to specify filenames without > using kludgy, non-pythonic type-checking. Anyway, another task for > another day, but I thought I'd throw it out there. Do we need to write the "I" flag in our PHYLIP output? Peter From b.invergo at gmail.com Tue Oct 11 05:33:13 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Tue, 11 Oct 2011 11:33:13 +0200 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> Message-ID: <1318325593.3137.51.camel@localhost.localdomain> > Something like that as a new format variant, yes. > > > ... > > Practical. > Ok, I'll start working on that then. > Do we need to write the "I" flag in our PHYLIP output? It took me a while to hunt down information on PHYLIP flags. 
I found this link which mentions them:
http://www.no.embnet.org/phylipdoc/
They're only used by the program which is using the alignment as input, corresponding to the PHYLIP programs' menu options. In general, they have no effect on the format of the alignment (aside from the 'S'/sequential vs 'I'/interleaved flags). However, some of them might require extra information immediately below the header line, before the alignment starts. This complicates things. (see below for some PAML examples)

However, since there's no real standardization to the use of the phylip format, not all programs pay attention to these flags. In my own work, I've used TCoffee to generate interleaved alignments and then I have to add in the 'I' after the fact. As another example, the current Biopython PhylipIO would not recognize a header line with options as a valid header line, since there would be more than 2 "parts".

So, if some programs can take option flags (at least PHYLIP and PAML programs) while other programs may not like their inclusion, they would need to be treated specially. I would suggest that the PhylipIterator classes be modified to recognize the existence of options, but not necessarily do anything with them, and that the PhylipWriter classes be modified to optionally take a string containing option flags to append to the header line, ie 'I', 'GC', etc.

As for the supplementary information for the options, I'm not sure if those complicate matters beyond the scope of Biopython's intended functionality, or whether there should be yet another optional string argument to the writer. The PhylipIterators would then need to be modified to handle the possible existence of these supplementary lines as well.

Anyway, I don't think this is an immediate concern and I personally wouldn't approach it until I start working on the idea of better integrating the PAML module with the rest of Biopython.
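To make the suggestion about the header line concrete, a flag-tolerant header check and a matching writer helper might look roughly like the sketch below. Both function names are hypothetical (this is not existing PhylipIO code), and the flags are simply carried along as a string without being interpreted.

```python
def split_phylip_header(line):
    """Return (n_seqs, seq_len, flags) from a PHYLIP header line.

    Hypothetical helper: extra fields after the two integers are treated as
    option flags and returned as one string, e.g. "I" or "GC".
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("Not a PHYLIP header: %r" % line)
    try:
        n_seqs, seq_len = int(parts[0]), int(parts[1])
    except ValueError:
        raise ValueError("First two header fields must be integers: %r" % line)
    flags = "".join(parts[2:])
    return n_seqs, seq_len, flags


def format_phylip_header(n_seqs, seq_len, flags=""):
    """Build a header line, appending the option flags only if given."""
    header = " %i %i" % (n_seqs, seq_len)
    if flags:
        header += " " + flags
    return header
```

With something like this, a reader could accept headers such as " 5 855 GC" instead of rejecting them for having more than two parts, and a writer would only emit flags when the caller asks for them.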
-brandon Here are some examples: 5 895 G G 4 3 123123123123123123123123123123123123123123123123123123123123 123123123123123123123123123123123123123123123123123123123123 123123123123123123123123123123123123123123123123123123123123 123123123123123123123123123123123123123123123123123123123123 123123123123123123123123123123123123123123123123123123123123 123123123123123123123123123123123123123123123123123123123123 123123123123123123123123123123123123123123123123123123123123 1231231231231231231231231231231231231 444444444444444444444444444444444444444444444444444444444444 444444444444444444444444444444444444444444444444444444444444 444444444444444444444444444444444444444444444444444444444444 444444444444444444 123123123123123123123123123123123123123123123123123123123123 123123123123123123123123123123123123123123123123123123123123 123123123123123123123123123123123123123123123123123123123123 12312312312312312312312312312312312312312312312312312312312 Human AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGACTTACATCCTCATTACTATT CTGCCTAGCAAACTCAAACTACGAACGCACTCACAGTCGCATCATAATC........ Chimpanzee ......... "The first line of the file contains the option character G. The second line begins with a G at the first column, followed by the number of site classes. The following lines contain the site marks, one for each site in the sequence (or each codon in the case of codonml). The site mark specifies which class each site is from. If there are g classes, the marks should be 1, 2, ..., g, and if g > 9, the marks need to be separated by spaces. The total number of marks must be equal to the total number of sites in each sequence." ******** 5 1000 G G 4 100 200 300 400 Sequence 1 TCGATAGATAGGTTTTAGGGGGGGGGGTAAAAAAAAA....... "This [alignment has 5 sequences of] 1000 nucleotides from 4 genes, obtained from concatenating four genes with 100, 200, 300, and 400 nucleotides from genes 1, 2, 3, and 4, respectively. The" ******** 5 855 GC human GTG CTG TCT CCT ... 
5 sequences, 855 nucleotides, length must be a multiple of three ******** 5 300 G G2 40 60 sequence1 ..... "This data set has 5 sequences, each of 300 nucleotides (100 codons), which are partitioned into two genes, with the first gene having 40 codons and the second gene 60 codons." From p.j.a.cock at googlemail.com Tue Oct 11 05:37:48 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Oct 2011 10:37:48 +0100 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: <1318325593.3137.51.camel@localhost.localdomain> References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain> Message-ID: On Tue, Oct 11, 2011 at 10:33 AM, Brandon Invergo wrote: >> Do we need to write the "I" flag in our PHYLIP output? > > It took me a while to hunt down information on PHYLIP flags. I found > this link which mentions them: > http://www.no.embnet.org/phylipdoc/ > They're only used by the program which is using the alignment as input, > corresponding to the PHYLIP programs' menu options. In general, they > have no affect on the format of the alignment (aside from the > 'S'/sequential vs 'I'/interleaved flags). However, some of them might > require extra information immediately below the header line, before the > alignment starts. This complicates things. (see below for some PAML > examples) Some of those examples don't really look like PHYLIP anymore to me. If there is any simple change to allow the current parser to cope with (but ignore) any extra meta data like this, that sounds sensible (with unit tests of course - grin). 
Peter From b.invergo at gmail.com Tue Oct 11 06:01:59 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Tue, 11 Oct 2011 12:01:59 +0200 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain> Message-ID: <1318327319.3137.70.camel@localhost.localdomain> > Some of those examples don't really look like PHYLIP anymore to me. > > If there is any simple change to allow the current parser to cope > with (but ignore) any extra meta data like this, that sounds sensible > (with unit tests of course - grin). Agreed, it can get quite messy, though look at the link I provided; even the PHYLIP-specific example that they give includes some supplementary info at the top, as well as a tree at the bottom: 4 40 W W 0101001111 0101110101 0101110011 1101010110 dmras1 GTCGTCGTTG GACCTGGAGG CGTGGGCAAG spras GTAGTTGTAG GAGATGGTGG TGTTGGTAAA scras1 GTAGTTGTCG GTGGAGGTGG CGTTGGTAAA scras2 GTCGTCGTTG GTGGTGGTGG TGTTGGTAAA TCCGCGCTCA AGTGCTTTGA TCTGCTTTAA TCTGCTTTGA 1 ((dmras1,ddrasa),((hschras,spras),(scras1,scras2))); I agree that trying to shoehorn that functionality into Biopython as written would be a mess. Another option that I can think of, however, would be to shift such extra formatting duties to the Biopython application interface which needs them, since that's the only place they're relevant. So I could, for example, make a PAML-specific subclass of PhylipWriter which handles all these weird PAML-specific options. Or if there were to be a PHYLIP interface and the program took that above example as input, it would be the duty of the interface to write a file with those options, the alignment and the tree all together. Just a thought. For the short term, though, when I implement the sequential format, I'll go ahead and update the code to at least handle flags in the header line. To handle the supp. 
info should be straightforward, since I believe that each supp. line must begin with the option flag that requires the info; if the option flag exists in the header, ignore any following lines which begin with that flag character. Unit tests will abound.

-brandon

From p.j.a.cock at googlemail.com Tue Oct 11 06:13:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 11 Oct 2011 11:13:03 +0100
Subject: [Biopython-dev] Parsing PAML supplementary output
In-Reply-To: <1318327319.3137.70.camel@localhost.localdomain>
References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain> <1318327319.3137.70.camel@localhost.localdomain>
Message-ID:

On Tue, Oct 11, 2011 at 11:01 AM, Brandon Invergo wrote:
>
>> Some of those examples don't really look like PHYLIP anymore to me.
>>
>> If there is any simple change to allow the current parser to cope
>> with (but ignore) any extra meta data like this, that sounds sensible
>> (with unit tests of course - grin).
>
> Agreed, it can get quite messy, though look at the link I provided; even
> the PHYLIP-specific example that they give includes some supplementary
> info at the top, as well as a tree at the bottom:
>
>     4    40    W
> W         0101001111 0101110101 0101110011
>           1101010110
> dmras1    GTCGTCGTTG GACCTGGAGG CGTGGGCAAG
>
> spras     GTAGTTGTAG GAGATGGTGG TGTTGGTAAA
> scras1    GTAGTTGTCG GTGGAGGTGG CGTTGGTAAA
> scras2    GTCGTCGTTG GTGGTGGTGG TGTTGGTAAA
>           TCCGCGCTCA
>           AGTGCTTTGA
>           TCTGCTTTAA
>           TCTGCTTTGA
> 1
> ((dmras1,ddrasa),((hschras,spras),(scras1,scras2)));
>

I would consider that to be a meta file containing a PHYLIP alignment and a tree, but in itself it isn't a PHYLIP alignment.

That looks like exactly the kind of issue NEXUS was designed to solve: how to embed alignments, trees and other stuff into a single plain text file for input into a phylogenetic tool.
Doesn't PHYLIP have an XML format these days? Trying to parse something like that text (without a formal standard) seems like a painful exercise and long term maintenance headache. Peter From b.invergo at gmail.com Tue Oct 11 06:37:39 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Tue, 11 Oct 2011 12:37:39 +0200 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain> <1318327319.3137.70.camel@localhost.localdomain> Message-ID: <1318329459.3137.82.camel@localhost.localdomain> > I would consider that to be a meta file containing a PHYLIP > alignment and a tree, but in itself it isn't a PHYLIP alignment. > > That looks like exactly the kind of issue NEXUS was designed > to solve: how to embed alignments, trees and other stuff into > a single plain text file for input into a phylogenetic tool. > > Doesn't PHYLIP have an XML format these days? Trying > to parse something like that text (without a formal standard) > seems like a painful exercise and long term maintenance > headache. I'm not suggesting that Biopython parse and store the information because I agree that it would be an unmaintainable nightmare. To bring myself out of the clouds a bit and back to the basics of my original intent: if I work on better integrating the PAML module so that the user can pass a MultipleSeqAlignment object, I will need a way to write that alignment to a file with potentially more information than the default PhylipWriter allows. So, just as simple as that, Bio.Phylo.PAML would need its own alignment writer....something I'm not going to worry about right now. With this mentality, then yes, anything containing such option flags and info is no longer a PHYLIP alignment but is rather an input file to some program. As such, the existing PhylipIO module should *not* be modified to handle this metadata. 
Please ignore all my other half-baked ideas. So, current, phylip-related tasks: - implement SequentialPhylipWriter and SequentialPhylipIterator classes in PhylipIO That's it, I think. I'll revisit this alignment-writing stuff at some other point. One task at a time... -brandon From p.j.a.cock at googlemail.com Tue Oct 11 07:05:48 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Oct 2011 12:05:48 +0100 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: <1318329459.3137.82.camel@localhost.localdomain> References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain> <1318327319.3137.70.camel@localhost.localdomain> <1318329459.3137.82.camel@localhost.localdomain> Message-ID: On Tue, Oct 11, 2011 at 11:37 AM, Brandon Invergo wrote: >> I would consider that to be a meta file containing a PHYLIP >> alignment and a tree, but in itself it isn't a PHYLIP alignment. >> >> That looks like exactly the kind of issue NEXUS was designed >> to solve: how to embed alignments, trees and other stuff into >> a single plain text file for input into a phylogenetic tool. >> >> Doesn't PHYLIP have an XML format these days? Trying >> to parse something like that text (without a formal standard) >> seems like a painful exercise and long term maintenance >> headache. > > I'm not suggesting that Biopython parse and store the information > because I agree that it would be an unmaintainable nightmare. To bring > myself out of the clouds a bit and back to the basics of my original > intent: if I work on better integrating the PAML module so that the user > can pass a MultipleSeqAlignment object, I will need a way to write that > alignment to a file with potentially more information than the default > PhylipWriter allows. So, just as simple as that, Bio.Phylo.PAML would > need its own alignment writer....something I'm not going to worry about > right now. 
> > With this mentality, then yes, anything containing such option flags and > info is no longer a PHYLIP alignment but is rather an input file to some > program. As such, the existing PhylipIO module should *not* be modified > to handle this metadata. Please ignore all my other half-baked ideas. What you could think about is having the Bio.Phylo.PAML create this file, and call the existing PhylipIO module with the handle to write the alignment part - and perhaps the Bio.Phylo module with the handle to write any tree. > So, current, phylip-related tasks: > - implement SequentialPhylipWriter and SequentialPhylipIterator classes > in PhylipIO > > That's it, I think. I'll revisit this alignment-writing stuff at some > other point. One task at a time... > > -brandon That sounds like a manageable step to start with :) Peter From chapmanb at 50mail.com Tue Oct 11 07:20:31 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 11 Oct 2011 07:20:31 -0400 Subject: [Biopython-dev] Moving strand & db ref from SeqFeature to FeatureLocation In-Reply-To: References: Message-ID: <8739ez4vwg.fsf@kunkel.i-did-not-set--mail-host-address--so-tickle-me> Peter; > https://github.com/peterjc/biopython/tree/f_loc2 > > It moves the strand, ref and db_ref properties from > the SeqFeature object to the FeatureLocation object, > implementing read/write proxy methods for backward > compatibility. Thanks for the integer work and for this. I'm agreed that this is a more logical way to store the strand (and cross-ref) information. 
+1 from me on checking it in, Brad From p.j.a.cock at googlemail.com Tue Oct 11 07:28:35 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Oct 2011 12:28:35 +0100 Subject: [Biopython-dev] Moving strand & db ref from SeqFeature to FeatureLocation In-Reply-To: <8739ez4vwg.fsf@kunkel.i-did-not-set--mail-host-address--so-tickle-me> References: <8739ez4vwg.fsf@kunkel.i-did-not-set--mail-host-address--so-tickle-me> Message-ID: On Tue, Oct 11, 2011 at 12:20 PM, Brad Chapman wrote: > > Peter; > >> https://github.com/peterjc/biopython/tree/f_loc2 >> >> It moves the strand, ref and db_ref properties from >> the SeqFeature object to the FeatureLocation object, >> implementing read/write proxy methods for backward >> compatibility. > > Thanks for the integer work and for this. I'm agreed that this is a more > logical way to store the strand (and cross-ref) information. +1 from me > on checking it in, > Brad OK, that's done. Cheers Brad. As I said before, if anyone doesn't like the new printing of the FeatureLocation with how I present the strand and database reference, we can change that. There are examples in the SeqFeature.py and SeqRecord.py docstrings. Regards, Peter From eric.talevich at gmail.com Tue Oct 11 08:55:57 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 11 Oct 2011 08:55:57 -0400 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain> <1318327319.3137.70.camel@localhost.localdomain> Message-ID: On Tue, Oct 11, 2011 at 6:13 AM, Peter Cock wrote: > > That looks like exactly the kind of issue NEXUS was designed > to solve: how to embed alignments, trees and other stuff into > a single plain text file for input into a phylogenetic tool. > > Doesn't PHYLIP have an XML format these days? 
> Trying to parse something like that text (without a formal standard)
> seems like a painful exercise and long term maintenance
> headache.
>
>
The Phylip programs seqboot and retree have XML formats that look almost
like SeqXML and phyloXML, but they're not quite compatible, e.g. attribute
names are slightly different.

This is probably because they were written before those standard formats
existed -- pretty sure the retree XML format, sort of described in Inferring
Phylogenies (2004) as an example of how a future XML tree format might look,
was an inspiration for phyloXML. There hasn't been much development on these
parts of the Phylip codebase lately, though. If someone wanted to write a
patch to bring these formats into compliance with the closest standards, I
bet Joe would accept the patch.

Discussion:
https://www.facebook.com/permalink.php?story_fbid=256082801069968&id=115402811804635

-E

From p.j.a.cock at googlemail.com Tue Oct 11 09:04:20 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 11 Oct 2011 14:04:20 +0100
Subject: [Biopython-dev] Parsing PAML supplementary output
In-Reply-To:
References: <1318243007.12974.16.camel@localhost.localdomain>
 <1318319486.3137.19.camel@localhost.localdomain>
 <1318325593.3137.51.camel@localhost.localdomain>
 <1318327319.3137.70.camel@localhost.localdomain>
Message-ID:

On Tue, Oct 11, 2011 at 1:55 PM, Eric Talevich wrote:
> On Tue, Oct 11, 2011 at 6:13 AM, Peter Cock
> wrote:
>>
>> That looks like exactly the kind of issue NEXUS was designed
>> to solve: how to embed alignments, trees and other stuff into
>> a single plain text file for input into a phylogenetic tool.
>>
>> Doesn't PHYLIP have an XML format these days? Trying
>> to parse something like that text (without a formal standard)
>> seems like a painful exercise and long term maintenance
>> headache.
>>
>
> The Phylip programs seqboot and retree have XML formats that look almost
> like SeqXML and phyloXML, but they're not quite compatible, e.g.
> attribute names are slightly different.
>
> This is probably because they were written before those standard formats
> existed -- pretty sure the retree XML format, sort of described in Inferring
> Phylogenies (2004) as an example of how a future XML tree format might look,
> was an inspiration for phyloXML. There hasn't been much development on these
> parts of the Phylip codebase lately, though. If someone wanted to write a
> patch to bring these formats into compliance with the closest standards, I
> bet Joe would accept the patch.
>
> Discussion:
> https://www.facebook.com/permalink.php?story_fbid=256082801069968&id=115402811804635
>
> -E

Good plan - anyone here familiar with the PHYLIP code base?

Peter

From chapmanb at 50mail.com Thu Oct 13 10:05:57 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 13 Oct 2011 10:05:57 -0400
Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs
Message-ID: <871uuhm1fe.fsf@fastmail.fm>

Hi all;
Biopython's setup.py currently has an interactive question/answer
session to remind users to optionally install NumPy if it's not
present. This is useful for by-hand installations, but problematic with
automated installers.

One useful feature of setuptools is the 'install_requires' attribute in
setup.py. This allows your programs to define the requirements and have
them automatically installed from PyPi. It's a great way to include
useful libraries without having to fret excessively about users
installing dependencies.

Unfortunately if you use install_requires with Biopython, and NumPy is
not installed, automated scripts will get stuck in the question/answer
dialog.
To resolve this issue, I wrote a small patch that adds NumPy to
Biopython's install_requires and skips the Q/A only in cases where it is
installed via pip or easy_install:

https://github.com/chapmanb/biopython/commit/be53d850d721fc82af81bedcd9fb9034b0a2099b

If someone is able to review this, it would be great to get it into
Biopython for the next release.

Brad

From p.j.a.cock at googlemail.com Thu Oct 13 10:20:46 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 13 Oct 2011 15:20:46 +0100
Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs
In-Reply-To: <871uuhm1fe.fsf@fastmail.fm>
References: <871uuhm1fe.fsf@fastmail.fm>
Message-ID:

On Thu, Oct 13, 2011 at 3:05 PM, Brad Chapman wrote:
>
> Hi all;
> Biopython's setup.py currently has an interactive question/answer
> session to remind users to optionally install NumPy if it's not
> present. This is useful for by-hand installations, but problematic with
> automated installers.
>
> One useful feature of setuptools is the 'install_requires' attribute in
> setup.py. This allows your programs to define the requirements and have
> them automatically installed from PyPi. It's a great way to include
> useful libraries without having to fret excessively about users
> installing dependencies.
>
> Unfortunately if you use install_requires with Biopython, and NumPy is
> not installed, automated scripts will get stuck in the question/answer
> dialog. To resolve this issue, I wrote a small patch that adds NumPy to
> Biopython's install_requires and skips the Q/A only in cases where it is
> installed via pip or easy_install:
>
> https://github.com/chapmanb/biopython/commit/be53d850d721fc82af81bedcd9fb9034b0a2099b
>
> If someone is able to review this, it would be great to get it into
> Biopython for the next release.
>
> Brad

I can appreciate the usefulness of this, but don't know
enough about pip and easy_install to comment on the
implementation. Anyone else?
Peter

From eric.talevich at gmail.com Thu Oct 13 14:00:22 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 13 Oct 2011 14:00:22 -0400
Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs
In-Reply-To: <871uuhm1fe.fsf@fastmail.fm>
References: <871uuhm1fe.fsf@fastmail.fm>
Message-ID:

On Thu, Oct 13, 2011 at 10:05 AM, Brad Chapman wrote:

>
> Hi all;
> Biopython's setup.py currently has an interactive question/answer
> session to remind users to optionally install NumPy if it's not
> present. This is useful for by-hand installations, but problematic with
> automated installers.
>
> One useful feature of setuptools is the 'install_requires' attribute in
> setup.py. This allows your programs to define the requirements and have
> them automatically installed from PyPi. It's a great way to include
> useful libraries without having to fret excessively about users
> installing dependencies.
>
> Unfortunately if you use install_requires with Biopython, and NumPy is
> not installed, automated scripts will get stuck in the question/answer
> dialog. To resolve this issue, I wrote a small patch that adds NumPy to
> Biopython's install_requires and skips the Q/A only in cases where it is
> installed via pip or easy_install:
>
>
> https://github.com/chapmanb/biopython/commit/be53d850d721fc82af81bedcd9fb9034b0a2099b
>
> If someone is able to review this, it would be great to get it into
> Biopython for the next release.
>
>
Hi Brad,

Looks cool to me, except the sys.argv parsing gets a little gritty
(understandably):

Line 115:

    if dist_dir.find("egg-dist-tmp") >= 0:

Could this be `if 'egg-dist-tmp' in dist_dir`?

Line 118:

    if sys.argv in [["-c", "develop", "--no-deps"],
                    ["-c", "egg_info"]]:

Does pip allow rearranging arguments? Would `--no-deps -c develop` also be
valid?
If so, should that be added as a third item in the list-of-args?
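
As a rough illustration of the two review comments above, the checks could be written as small helpers. This is only a sketch: the function names are made up for this example, `dist_dir` is assumed to be a path string, and comparing the arguments as a set is just one possible way to ignore their order. Whether pip ever reorders these arguments is not settled in this thread.

```python
def looks_like_easy_install(dist_dir):
    """Substring test written with `in` instead of str.find() >= 0."""
    return "egg-dist-tmp" in dist_dir


def looks_like_pip(argv):
    """Compare argument lists as sets so that their order does not matter."""
    known = [
        frozenset(["-c", "develop", "--no-deps"]),
        frozenset(["-c", "egg_info"]),
    ]
    return frozenset(argv) in known
```

With this approach `--no-deps -c develop` and `-c develop --no-deps` are treated alike, so no third entry is needed in the list.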
-Eric

From chapmanb at 50mail.com Fri Oct 14 06:00:37 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 14 Oct 2011 06:00:37 -0400
Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs
In-Reply-To:
References: <871uuhm1fe.fsf@fastmail.fm>
Message-ID: <87hb3b51ve.fsf@fastmail.fm>

Eric and Peter;
Thanks much for taking a look at this patch.

> Looks cool to me, except the sys.argv parsing gets a little gritty
> (understandably):

Absolutely. Unfortunately the python installation space is pretty
messy. Neither pip nor easy_install gives any formal declaration so you
have to resort to these hacks to infer that they are doing the
install. Luckily I don't think any of these options are something people
would do directly from the command line.

> Line 115:
>
>     if dist_dir.find("egg-dist-tmp") >= 0:
>
> Could this be `if 'egg-dist-tmp' in dist_dir`?

> Line 118:
>
>     if sys.argv in [["-c", "develop", "--no-deps"],
>                     ["-c", "egg_info"]]:
>
> Does pip allow rearranging arguments? Would `--no-deps -c develop` also be
> valid?
> If so, should that be added as a third item in the list-of-args?

Awesome, thanks for the suggestions. I checked both of these in.

Thanks again,
Brad

From p.j.a.cock at googlemail.com Fri Oct 14 06:53:42 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 14 Oct 2011 11:53:42 +0100
Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs
In-Reply-To: <87hb3b51ve.fsf@fastmail.fm>
References: <871uuhm1fe.fsf@fastmail.fm> <87hb3b51ve.fsf@fastmail.fm>
Message-ID:

On Fri, Oct 14, 2011 at 11:00 AM, Brad Chapman wrote:
>
> Awesome, thanks for the suggestions. I checked both of these in.
>

I'll test the branch today, and merge it to the trunk if it looks good
on Python 2 / 3 / Jython / PyPy.
Peter

From p.j.a.cock at googlemail.com Fri Oct 14 06:55:52 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 14 Oct 2011 11:55:52 +0100
Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs
In-Reply-To:
References: <871uuhm1fe.fsf@fastmail.fm> <87hb3b51ve.fsf@fastmail.fm>
Message-ID:

On Fri, Oct 14, 2011 at 11:53 AM, Peter Cock wrote:
> On Fri, Oct 14, 2011 at 11:00 AM, Brad Chapman wrote:
>>
>> Awesome, thanks for the suggestions. I checked both of these in.
>>
>
> I'll test the branch today, and merge it to the trunk if it looks good
> on Python 2 / 3 / Jython / PyPy.
>

$ jython setup.py install
/Users/pjcock/jython2.5.2/Lib/distutils/dist.py:263: UserWarning:
Unknown distribution option: 'install_requires'
  warnings.warn(msg)
running install
running build
running build_py
...

That's with Jython 2.5.2 under Mac OS X Snow Leopard. Same with pypy 1.6,

$ pypy setup.py install
/Users/pjcock/Downloads/Software/pypy-1.6/lib-python/modified-2.7/distutils/dist.py:267:
UserWarning: Unknown distribution option: 'install_requires'
  warnings.warn(msg)
running install
running build
running build_py
...

Can we avoid that warning?

Peter

From chapmanb at 50mail.com Fri Oct 14 08:26:06 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 14 Oct 2011 08:26:06 -0400
Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs
In-Reply-To:
References: <871uuhm1fe.fsf@fastmail.fm> <87hb3b51ve.fsf@fastmail.fm>
Message-ID: <87ehyf4v4x.fsf@fastmail.fm>

Peter;
Thanks for testing this and helping with the merge

> $ jython setup.py install
> /Users/pjcock/jython2.5.2/Lib/distutils/dist.py:263: UserWarning:
> Unknown distribution option: 'install_requires'
> warnings.warn(msg)
[...]
> Can we avoid that warning?

This is a warning from distutils, so you would also see this on regular
ol' Python without setuptools installed.
Likewise it should go away on
jython or pypy if they have setuptools or distribute installed.

Unfortunately I don't have a way around it since this is an argument to
setup. Most modern installations should have setuptools and can take
advantage of install_requires.

If it's a problem we could use 'warnings' to ignore it.

Brad

From cmccoy at fhcrc.org Fri Oct 14 13:11:15 2011
From: cmccoy at fhcrc.org (Connor McCoy)
Date: Fri, 14 Oct 2011 10:11:15 -0700
Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs
Message-ID:

Hi Brad, Eric, and Peter,

Sorry to jump in. Regarding the install_requires warnings:

If you're interested, you can include the distribute_setup.py file
from http://python-distribute.org/distribute_setup.py in BioPython,
and add a short conditional import:

try:
    from setuptools import setup, find_packages
except ImportError:
    import distribute_setup
    distribute_setup.use_setuptools()
    from setuptools import setup, find_packages

Which will download and install distribute if it isn't available in
the python installation; the remainder of the setup can assume
setuptools is available. Sphinx
(https://bitbucket.org/birkenfeld/sphinx/src/f1f641602bb2/setup.py)
and some other projects use this.

Connor

From carlcrott at gmail.com Sun Oct 16 21:24:27 2011
From: carlcrott at gmail.com (carl crott)
Date: Sun, 16 Oct 2011 21:24:27 -0400
Subject: [Biopython-dev] fixes on the tutorials
Message-ID:

So the tutorials I'm running through have some bugs in them ...

would anyone like me to fix these?

tutorial 2.4.1 should be something like:

from Bio import SeqIO
handle = open("ls_orchid.fasta", "rU")
for seq_record in SeqIO.parse(handle, "fasta"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)
handle.close()

and tutorial 2.4.2:

from Bio import SeqIO
handle = open("ls_orchid.gbk", "rU")
for seq_record in SeqIO.parse(handle, "genbank"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)
handle.close()

From chapmanb at 50mail.com Sun Oct 16 21:29:49 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 16 Oct 2011 21:29:49 -0400
Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs
In-Reply-To:
References:
Message-ID: <8739eso16a.fsf@fastmail.fm>

Connor;
Thanks for the idea on the auto-install of setuptools/distribute. I'm
open to this or sticking with the warning, whichever everyone
prefers. Traditionally the setup has tried to be lightweight so you
could install Biopython without anything else, but having distribute
installed is pretty useful so it might be nice to encourage this.

Brad

> Sorry to jump in. Regarding the install_requires warnings:
>
> If you're interested, you can include the distribute_setup.py file
> from http://python-distribute.org/distribute_setup.py in BioPython,
> and add a short conditional import:
>
> try:
>     from setuptools import setup, find_packages
> except ImportError:
>     import distribute_setup
>     distribute_setup.use_setuptools()
>     from setuptools import setup, find_packages
>
> Which will download and install distribute if it isn't available in
> the python installation; the remainder of the setup can assume
> setuptools is available. Sphinx
> (https://bitbucket.org/birkenfeld/sphinx/src/f1f641602bb2/setup.py)
> and some other projects use this.
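
A third option, besides bundling distribute or silencing the warning, would be to pass install_requires to setup() only when setuptools is actually importable, since plain distutils is what reports it as an unknown distribution option. The following is a minimal sketch of that idea and not the patch discussed in this thread; `have_setuptools` and `setup_kwargs` are hypothetical helper names, and the example assumes a Python version that provides `importlib.util.find_spec`.

```python
import importlib.util


def have_setuptools():
    """Return True if setuptools can be imported by this interpreter."""
    return importlib.util.find_spec("setuptools") is not None


def setup_kwargs(with_setuptools, requirements=("numpy",)):
    """Build the optional keyword arguments for setup().

    install_requires is only understood by setuptools, so it is left
    out when setup() would come from plain distutils.
    """
    kwargs = {}
    if with_setuptools:
        kwargs["install_requires"] = list(requirements)
    return kwargs
```

A setup.py could then call something like `setup(name=..., **setup_kwargs(have_setuptools()))`, keeping the install lightweight when setuptools is absent.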

From p.j.a.cock at googlemail.com Mon Oct 17 03:55:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 08:55:54 +0100
Subject: [Biopython-dev] fixes on the tutorials
In-Reply-To:
References:
Message-ID:

On Mon, Oct 17, 2011 at 2:24 AM, carl crott wrote:
> So the tutorials I'm running through have some bugs in them ...
>
> would anyone like me to fix these?
>

Hi Carl,

What's the bug?

>
> tutorial 2.4.1 should be something like:
>
> from Bio import SeqIO
> handle = open("ls_orchid.fasta", "rU")
> for seq_record in SeqIO.parse(handle, "fasta"):
>     print seq_record.id
>     print repr(seq_record.seq)
>     print len(seq_record)
> handle.close()
>

Your example above looks fine (and the tutorial used to say that),
but the current version is shorter:

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)

We could alternatively (now that we've dropped Python 2.4)
open the handle with a with statement.

The same applies to the GenBank example.

Perhaps you are using an old version of Biopython (where
Bio.SeqIO.parse(...) does not accept a filename)?

Could you clarify please,

Thanks,

Peter

From p.j.a.cock at googlemail.com Mon Oct 17 06:10:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 11:10:54 +0100
Subject: [Biopython-dev] [biopython-dev] SeqFeature comparison for equality
Message-ID:

Hi Joshua and everyone,

It looks like Joshua's email (below) got lost in the spam filter
(possibly due to the attachment). The core of his patch was
as follows (there were also lots of white space changes).

@@ -694,6 +714,15 @@ class FeatureLocation(object):
         for i in range(self._start, self._end):
             yield i

+    def __eq__(self, other):
+        """Compares a FeatureLocation for equality"""
+        if not isinstance(other, FeatureLocation):
+            return False
+        if self.start() == other.start() and \
+                self.end() == other.end():
+            return True
+        return False
+
@@ -255,6 +255,26 @@ class SeqFeature(object):
             qualifiers = dict(self.qualifiers.iteritems()),
             sub_features = [f._flip(length) for f in self.sub_features[::-1]])

+    def __eq__(self, other):
+        """Compare between this SeqFeature and other.
+
+        ref, ref_db and qualifiers are not needed for comparison"""
+        if not isinstance(other, SeqFeature):
+            return False
+        if (self.id != ""
+            and other.id != "" and
+            self.id == other.id):
+            return True # Can we trust this?
+        for x in ('location', 'type', 'strand', 'location_operator'):
+            if (getattr(self, x) and getattr(other, x) and \
+                getattr(self, x) != getattr(other, x)):
+                return False
+        for f in self.sub_features:
+            if f not in other.sub_features:
+                return False
+        else:
+            return True
+
     def extract(self, parent_sequence):
         """Extract feature sequence from the supplied parent sequence.

Note the patch will not apply to the trunk, perhaps it is
against the current release?

First (logically), is defining __eq__ for the FeatureLocation,
and second is defining __eq__ for the SeqFeature.

This hides the fact that we need to compare position objects,
e.g. is BeforePosition(5) == ExactPosition(5)?, the answer
is yes, which I have now clarified in the docstrings:
https://github.com/biopython/biopython/commit/55feea75f7ab55eac4ef4e320567d746ce41120a

Other than the fact that I think the ref and ref_db should be
checked when comparing locations, adding location comparison
seems like a good idea. Note that with the recent changes on
the trunk, the strand, ref and ref_db now belong to the
FeatureLocation not the SeqFeature.

Extending this to cover the SeqFeature leaves the ID, type,
etc and is fiddly: Particularly the question of annotation.
These are essentially the same reasons why we don't support
SeqRecord equality.

Joshua - would you like to update your patch against the
code in github, just for the FeatureLocation __eq__ method,
to include the strand, ref and ref_db properties?

Thanks,

Peter

---------- Forwarded message ----------
From: "Joshua Ismael Haase Hernández"
To: biopython-dev at biopython.org
Date: Mon, 17 Oct 2011 01:06:17 -0500
Subject: [patch] SeqFeature comparison for equality

Hi there.

I was working on a testcase for a custom program which should extract
the same features I had planned.

Since SeqFeature lacks a comparison method, there is no easy way to test

for feature in test_gene.features:
    self.assertIn(feature, myparser(file).features)

So I added comparison methods and they work fine. Patch attached.

My changes are under Biopython license.

From p.j.a.cock at googlemail.com Mon Oct 17 11:03:42 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 16:03:42 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

Hi Michiel,

Regarding code using Bio.File, which you asked about
deprecating last month:
http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009144.html

I objected at the time because I was using it for the
TogoWS code I was working on,

On Thu, Sep 8, 2011 at 4:25 PM, Peter Cock wrote:
On Wed, Sep 7, 2011 at 3:36 PM, Peter Cock wrote:
>>> If the server could be relied on to always give an
>>> HTTP error code this wouldn't be needed:
>>>
>>> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py
>>>
>
> ...
>
> [Some of those TogoWS checks are probably superfluous
> right now, I'm still polishing the error handling - some of
> which will rely on TogoWS itself catching more conditions]

I've updated my TogoWS to rely on the HTTP error codes,
and removed the heuristic error detection which required
Bio.File for the UndoHandle. That seems to be working fine
now.

That leaves Bio/SCOP/__init__.py as the only existing or
imminent code using Bio.File, so if we can sort that out,
we can deprecate Bio.File as you suggested.

Regards,

Peter

From anaryin at gmail.com Mon Oct 17 11:13:37 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 17 Oct 2011 17:13:37 +0200
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

Hey Peter, all,

Sorry to peek in. I was going over some code lately together with Eric and
he suggested I use Bio.File as it was done in plenty of Bio.*IO modules.
What is this deprecation about then?

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

2011/10/17 Peter Cock

> Hi Michiel,
>
> Regarding code using Bio.File, which you asked about
> deprecating last month:
>
> http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009144.html
>
> I objected at the time because I was using it for the
> TogoWS code I was working on,
>
> On Thu, Sep 8, 2011 at 4:25 PM, Peter Cock
> wrote:
> On Wed, Sep 7, 2011 at 3:36 PM, Peter Cock
> wrote:
> >>> If the server could be relied on to always give an
> >>> HTTP error code this wouldn't be needed:
> >>>
> >>>
> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py
> >>>
> >
> > ...
> >
> > [Some of those TogoWS checks are probably superfluous
> > right now, I'm still polishing the error handling - some of
> > which will rely on TogoWS itself catching more conditions]
>
> I've updated my TogoWS to rely on the HTTP error codes,
> and removed the heuristic error detection which required
> Bio.File for the UndoHandle. That seems to be working fine
> now.
>
> That leaves Bio/SCOP/__init__.py as the only existing or
> imminent code using Bio.File, so if we can sort that out,
> we can deprecate Bio.File as you suggested.
>
> Regards,
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From p.j.a.cock at googlemail.com Mon Oct 17 11:44:35 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 16:44:35 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

On Mon, Oct 17, 2011 at 4:13 PM, João Rodrigues wrote:
> Hey Peter, all,
> Sorry to peek in. I was going over some code lately together with Eric and
> he suggested I use Bio.File as it was done in plenty of Bio.*IO modules.
> What is this deprecation about then?
> Cheers,

Hi João,

Perhaps you misunderstood Eric, Bio.File is not used widely at all.
See Michiel's email at the start of this thread:
http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009144.html

Peter

From anaryin at gmail.com Mon Oct 17 12:10:40 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 17 Oct 2011 18:10:40 +0200
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

Hi Peter,

To be honest, I didn't see much of a point to use the module but for
consistency's sake.
I grep'ed Bio.File in my biopython dir and I got a few more
modules with Bio.File, don't know if you were aware.

Bio/Application/__init__.py:from Bio import File
Bio/Blast/NCBIStandalone.py:from Bio import File
Bio/PDB/parse_pdb_header.py:from Bio import File
Bio/Phylo/_io.py:from Bio import File
Bio/SCOP/__init__.py:    from Bio import File

Just wanting to clear my doubts about this, thanks!

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

2011/10/17 Peter Cock

> On Mon, Oct 17, 2011 at 4:13 PM, João Rodrigues wrote:
> > Hey Peter, all,
> > Sorry to peek in. I was going over some code lately together with Eric
> and
> > he suggested I use Bio.File as it was done in plenty of Bio.*IO modules.
> > What is this deprecation about then?
> > Cheers,
>
> Hi João,
>
> Perhaps you misunderstood Eric, Bio.File is not used widely at all.
> See Michiel's email at the start of this thread:
>
> http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009144.html
>
> Peter
>

From p.j.a.cock at googlemail.com Mon Oct 17 12:26:14 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 17:26:14 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

On Mon, Oct 17, 2011 at 5:10 PM, João Rodrigues wrote:
> Hi Peter,
> To be honest, I didn't see much of a point to use the module but for
> consistency's sake.

Michiel's point was [at that time] there was very little useful
code if any in Bio.File, so could we deprecate it?

> I grep'ed Bio.File in my biopython dir and I got a few more modules
> with Bio.File, don't know if you were aware.
>
> Bio/Application/__init__.py:from Bio import File
> Bio/Blast/NCBIStandalone.py:from Bio import File
> Bio/PDB/parse_pdb_header.py:from Bio import File
> Bio/Phylo/_io.py:from Bio import File
> Bio/SCOP/__init__.py:    from Bio import File
>
> Just wanting to clear my doubts about this, thanks!
> Cheers,

Oh - I remember now.
We recently added the as_handle context manager
to Bio.File, and that is a useful bit of functionality of
general interest.

At the time I had forgotten about Michiel's suggestion
we deprecate Bio.File, which is unfortunate, but we
can still change this before our next release.

So, should we keep Bio.File for as_handle (even if
everything else in Bio.File is to be deprecated), or
should we move the new as_handle functionality
somewhere else and deprecate all of Bio.File.

Thanks for double checking João,

Peter

From anaryin at gmail.com Mon Oct 17 13:21:28 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Mon, 17 Oct 2011 19:21:28 +0200
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

>
> At the time I had forgotten about Michiel's suggestion
> we deprecate Bio.File, which is unfortunate, but we
> can still change this before our next release.
>
> So, should we keep Bio.File for as_handle (even if
> everything else in Bio.File is to be deprecated), or
> should we move the new as_handle functionality
> somewhere else and deprecate all of Bio.File.
>

I think it doesn't make sense to keep the module for 5 lines of code.

    if isinstance(handleish, basestring):
        with open(handleish, mode) as fp:
            yield fp
    else:
        yield handleish

I'd either place them in __init__.py or just insert them in all Bio.*IO
modules wherever needed. If we had more snippets in common with all *IOs, it
would be valuable and understandable to have a separate module, but as is
it's a bit unnecessary IMHO.

>
> Thanks for double checking João,
>

No problem.
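
For reference, the five lines quoted above only make sense inside a generator wrapped as a context manager. A minimal self-contained sketch follows; it assumes Python 3 (hence `str` rather than `basestring`), and the name `as_handle_sketch` is made up for illustration, so this is not necessarily how Bio.File spells it:

```python
import contextlib
import io


@contextlib.contextmanager
def as_handle_sketch(handleish, mode="r"):
    """Yield a file handle for either a filename or an already-open handle."""
    if isinstance(handleish, str):
        # A filename: open it here, and close it when the with-block ends.
        with open(handleish, mode) as fp:
            yield fp
    else:
        # Already a handle: pass it through; closing stays the caller's job.
        yield handleish


# Usage: an in-memory handle passes straight through and is left open.
buffer = io.StringIO(">seq1\nACGT\n")
with as_handle_sketch(buffer) as handle:
    first_line = handle.readline()
```

A parser written against this helper can accept a filename or a handle with the same `with` statement, which is the uniformity argued for in this thread.
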
Cheers,

João

From hahj87 at gmail.com Mon Oct 17 13:57:53 2011
From: hahj87 at gmail.com (Joshua Ismael Haase Hernández)
Date: Mon, 17 Oct 2011 12:57:53 -0500
Subject: [Biopython-dev] [biopython-dev] SeqFeature comparison for equality
In-Reply-To:
References:
Message-ID:

On 17 October 2011 12:15, Peter Cock wrote:

> Hi Joshua,
>
> Could you CC the biopython-dev mailing list, unless you
> specifically want to discuss something in private?
>

Sorry about that, I thought I was answering to the mailing list.

>
> 2011/10/17 Joshua Ismael Haase Hernández :
> > I'm on it.
> >
> > Will add __eq__ to FeatureLocation on trunk.
>
> Great.
>
> In the short term, you can just work on it directly with a copy of the
> official repository and send me a patch (use git patch > file.patch)
>
> The "best" way is to fork biopython on github, and create your
> own branch with these changes.
>
> > I think BeforeLocation should check if the second is before,
> > After check if it is after, etc, and this can be done in locations.
> >
> > Before I implement those: do you agree?
> >
> > In that case, AbstractLocation instances
> > should check if ExactLocation instances are
> > inside their range, and AbstractLocation
> > instances to be exactly the same.
>

These positions would be the same:

OneOfPosition(5, 11, 15),
ExactPosition(11),
AfterPosition(4),
BeforePosition(16),
WithinPosition(5, 16),

> No. Having tried this myself, it is very complicated.
>

I think I'm missing something, why is it hard?
I see it as a cases listing.

> Also, there are constraints with the Python language
> about equality, hashing and comparisons (e.g. for
> membership in lists, or use as dictionary keys).
>

I don't think anyone should use Features as dictionary keys,
they will use Feature Id for that, but maybe someone wants a
set of features (which just now is like a list of all sequences)...

In which cases should that be a problem?
(I'm a biotechnology engineer, so I don't see all caveats, and I don't
really have deep understanding about how python works)

> The current behaviour of simple comparison of
> the positions as an integer is at least simple.
>
> > About SeqFeature, I think they should be
> > the same if they share all locations.
>
> You don't care about feature type and ID? ;)
>

maybe not, a comparison could skip iterating
the locations if we have the same type and id,
still not sure that's a good method (thus the comment
"# Can we trust this?" on my patch) but a feature
'CDS' is sometimes equivalent to feature 'mRNA',
in that case ID and type would both be different
in seqfeatures.

>
> Peter
>

From p.j.a.cock at googlemail.com Mon Oct 17 14:07:27 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 19:07:27 +0100
Subject: [Biopython-dev] [biopython-dev] SeqFeature comparison for equality
In-Reply-To:
References:
Message-ID:

2011/10/17 Joshua Ismael Haase Hernández :
> ...
>
> These positions would be the same:
>
> OneOfPosition(5, 11, 15),
> ExactPosition(11),
> AfterPosition(4),
> BeforePosition(16),
> WithinPosition(5, 16),

I don't understand what you are asking here. Those
positions do not look the same to me.

>>
>> No. Having tried this myself, it is very complicated.
>
> I think I'm missing something, why is it hard?
> I see it as a cases listing.

Well, try it and write lots of unit tests, and I'll review it.

>>
>> Also, there are constraints with the Python language
>> about equality, hashing and comparisons (e.g. for
>> membership in lists, or use as dictionary keys).
>
> I don't think anyone should use Features as dictionary keys,
> they will use Feature Id for that, but maybe someone wants a
> set of features (which just now is like a list of all sequences)...
>
> In which cases should that be a problem?
> (I'm a biotechnology
> engineer, so I don't see all caveats, and I don't really have
> deep understanding about how python works)

Using positions as dictionary keys seems reasonable.

Using a SeqFeature as a key is not possible as they
are mutable objects.

>> The current behaviour of simple comparison of
>> the positions as an integer is at least simple.
>>
>> > About SeqFeature, I think they should be
>> > the same if they share all locations.
>>
>> You don't care about feature type and ID? ;)
>
> maybe not, a comparison could skip iterating
> the locations if we have the same type and id,
> still not sure that's a good method (thus the comment
> "# Can we trust this?" on my patch) but a feature
> 'CDS' is sometimes equivalent to feature 'mRNA',
> in that case ID and type would both be different
> in seqfeatures.

A gene, mRNA and CDS might all have the same
position, but they are different features.

Peter

From hahj87 at gmail.com Mon Oct 17 14:27:19 2011
From: hahj87 at gmail.com (Joshua Ismael Haase Hernández)
Date: Mon, 17 Oct 2011 13:27:19 -0500
Subject: [Biopython-dev] [biopython-dev] SeqFeature comparison for equality
In-Reply-To:
References:
Message-ID:

On 17 October 2011 13:07, Peter Cock wrote:

> 2011/10/17 Joshua Ismael Haase Hernández :
> > ...
> >
> > These positions would be the same:
> >
> > OneOfPosition(5, 11, 15),
> > ExactPosition(11),
> > AfterPosition(4),
> > BeforePosition(16),
> > WithinPosition(5, 16),
>
> I don't understand what you are asking here. Those
> positions do not look the same to me.
>
>

They are not *exactly* the same, but besides
AfterPosition and BeforePosition,
ExactPosition(11) is included in OneOfPosition(5, 11, 15),
ExactPosition(11) is after AfterPosition(4)
ExactPosition(11) is before BeforePosition(16)
ExactPosition(11) is included in WithinPosition(5, 16)
All positions in OneOfPosition are before BeforePosition,
after AfterPosition, within WithinPosition, and include
ExactPosition.
All positions in WithinPosition are after AfterPosition,
before BeforePosition.

BeforePosition and AfterPosition can't be equal.

How should I name the TestCases?

From p.j.a.cock at googlemail.com Mon Oct 17 15:03:15 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 20:03:15 +0100
Subject: [Biopython-dev] [biopython-dev] SeqFeature comparison for equality
In-Reply-To:
References:
Message-ID:

2011/10/17 Joshua Ismael Haase Hernández :
>
>
> On 17 October 2011 13:07, Peter Cock wrote:
>>
>> 2011/10/17 Joshua Ismael Haase Hernández :
>> > ...
>> >
>> > These positions would be the same:
>> >
>> > OneOfPosition(5, 11, 15),
>> > ExactPosition(11),
>> > AfterPosition(4),
>> > BeforePosition(16),
>> > WithinPosition(5, 16),
>>
>> I don't understand what you are asking here. Those
>> positions do not look the same to me.
>>
>
> They are not *exactly* the same, but besides
> AfterPosition and BeforePosition,
> ExactPosition(11) is included in OneOfPosition(5, 11, 15),
> ExactPosition(11) is after AfterPosition(4)
> ExactPosition(11) is before BeforePosition(16)
> ExactPosition(11) is included in WithinPosition(5, 16)
> All positions in OneOfPosition are before BeforePosition,
> after AfterPosition, within WithinPosition, and include
> ExactPosition.
> All positions in WithinPosition are after AfterPosition,
> before BeforePosition.
> BeforePosition and AfterPosition can't be equal.
>

It might help if you wrote these out explicitly,
e.g. currently:

>>> from Bio.SeqFeature import *
>>> a = BeforePosition(10)
>>> b = AfterPosition(10)
>>> a == b == 10
True

Currently BeforePosition and AfterPosition act like
the integer position for comparison etc. I find this
reasonable given we have to treat them as the
integer for things like extracting the sequence.

> How should I name the TestCases?
>

Something like test_SeqFeature.py and using
unittest.
Most existing tests in this area are in
doctests and test_SeqIO_feature.py

Peter

From andrea at biocomp.unibo.it Tue Oct 18 08:59:05 2011
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Tue, 18 Oct 2011 14:59:05 +0200 (CEST)
Subject: [Biopython-dev] SeqFeature comparison for equality
In-Reply-To:
References:
Message-ID:

Hi,
I don't know if this can help,
but I've been subclassing seqfeature and seqrecord objects to assert
equalities.
I've attached the very simple code for the seqfeature equality
Handling complex location equalities with a given set of rules could be
misleading.
a feature starting in position 11 is different, for me, from one located
at position 12.

Andrea

-------------- next part --------------
A non-text attachment was scrubbed...
Name: seqfeature_eq.py
Type: text/x-python-script
Size: 1505 bytes
Desc: not available
URL:

From p.j.a.cock at googlemail.com Tue Oct 18 09:20:34 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 18 Oct 2011 14:20:34 +0100
Subject: [Biopython-dev] SeqFeature comparison for equality
In-Reply-To:
References:
Message-ID:

On Tue, Oct 18, 2011 at 1:59 PM, Andrea Pierleoni wrote:
> Hi,
> I don't know if this can help,
> but I've been subclassing seqfeature and seqrecord objects to assert
> equalities.
> I've attached the very simple code for the seqfeature equality
> Handling complex location equalities with a given set of rules could be
> misleading.
> a feature starting in position 11 is different, for me, from one located
> at position 12.
>
> Andrea

That looks reasonable for basic SeqFeature comparison,
although comparing the annotations in the qualifiers
dict is debatable (as with SeqRecord object's annotation).
Given the way join locations (etc) are currently handled,
it would be important to also compare the sub-features.

I think it would be more practical to first (and perhaps
only) implement equality testing for FeatureLocation
(checking start, end, strand, ref and db_ref), then
you can compare the location of a SeqFeature easily
with: f1.location == f2.location.
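
The field-by-field comparison suggested above can be sketched without touching Biopython at all. The class below is hypothetical (`LocationSketch` and its field names are assumptions, loosely following the start, end, strand and reference fields mentioned in this thread); freezing the dataclass keeps the generated `__hash__` consistent with `__eq__`, which is the constraint raised earlier about dictionary keys:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocationSketch:
    """Hypothetical stand-in for a feature location; not Biopython's class."""

    start: int
    end: int
    strand: Optional[int] = None
    ref: Optional[str] = None
    ref_db: Optional[str] = None


# Two locations are equal only when every field matches.
loc_a = LocationSketch(5, 16, strand=1)
loc_b = LocationSketch(5, 16, strand=1)
loc_c = LocationSketch(5, 16, strand=-1)

same = loc_a == loc_b            # all fields match
different = loc_a == loc_c       # strand differs
unique = {loc_a, loc_b, loc_c}   # frozen instances are hashable
```

Fuzzy positions (before, after, within) are deliberately left out of this sketch, since how they should compare is exactly what is still under discussion.
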
Peter

From carlcrott at gmail.com Tue Oct 18 12:18:39 2011
From: carlcrott at gmail.com (carl crott)
Date: Tue, 18 Oct 2011 12:18:39 -0400
Subject: [Biopython-dev] fixes on the tutorials
In-Reply-To:
References:
Message-ID:

Peter and other devs,

I'm deeply interested in any kind of HMM applications ... As I'm not quite a
biologist if you guys wanted to 'sic me' on any particular bug related to
these let me know .. however as far as the GIT stuff .. that would be more
of the control for updates and merging all the code that you guys work on
separately.

toodles!

-Carl

On Tue, Oct 18, 2011 at 5:36 AM, Peter Cock wrote:

> On Mon, Oct 17, 2011 at 2:34 PM, Peter Cock wrote:
> > ...
> >
> > P.S. Don't forget to CC the mailing list ;)
>
> Apologies for posting that to the wrong development mailing list
> (samtools rather than biopython), I need to be more careful with
> autocomplete.
>
> Peter
>

--
Carl Crott
Web Applications Engineer
www.black-glass.com
412-610-0600

From mjldehoon at yahoo.com Tue Oct 18 22:39:53 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 18 Oct 2011 19:39:53 -0700 (PDT)
Subject: [Biopython-dev] Bio.File
In-Reply-To:
Message-ID: <1318991993.91732.YahooMailClassic@web161201.mail.bf1.yahoo.com>

Hi Peter,

> That leaves Bio/SCOP/__init__.py as the only existing or
> imminent code using Bio.File, so if we can sort that out,
> we can deprecate Bio.File as you suggested.

In Bio/SCOP/__init__.py, Bio.File.UndoHandle is used in the _open
function, which is an internal function used in the "search" function
in Bio.SCOP. The UndoHandle is used to wrap a handle returned by
urllib.urlopen.

This search function returns a handle to data in HTML format. I don't
think we have a parser for it. This suggests that there is no specific
purpose for UndoHandle in Bio.SCOP._open.

So I would suggest to just remove the UndoHandle from Bio.SCOP._open
and return the urllib.urlopen handle directly.

Any objections?

--Michiel.
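
For readers who have not met the idea, a handle wrapper of this kind lets a caller look at upcoming lines and then put them back, which is what heuristic checks on a server response need. The sketch below illustrates that general push-back technique only; `PushbackHandle` and its method names are invented here and make no claim about how Bio.File.UndoHandle is actually written:

```python
import io


class PushbackHandle:
    """Hypothetical wrapper that allows peeking at lines before reading them."""

    def __init__(self, handle):
        self._handle = handle
        self._saved = []  # lines read ahead but not yet consumed

    def readline(self):
        # Serve saved lines first, then fall back to the wrapped handle.
        if self._saved:
            return self._saved.pop(0)
        return self._handle.readline()

    def peekline(self):
        # Read one line ahead and keep it for the next readline() call.
        if not self._saved:
            self._saved.append(self._handle.readline())
        return self._saved[0]


response = PushbackHandle(io.StringIO("<html>\nError: not found\n"))
looks_like_html = response.peekline().startswith("<html>")
first = response.readline()   # still returns the peeked line
second = response.readline()
```

If nothing downstream ever peeks, the wrapper adds nothing over the raw handle, which is the argument made above for dropping it from Bio.SCOP._open.
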
--- On Mon, 10/17/11, Peter Cock wrote:

> From: Peter Cock
> Subject: Re: [Biopython-dev] Bio.File
> To: "Michiel de Hoon"
> Cc: biopython-dev at biopython.org
> Date: Monday, October 17, 2011, 11:03 AM
> Hi Michiel,
>
> Regarding code using Bio.File, which you asked about
> deprecating last month:
> http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009144.html
>
> I objected at the time because I was using it for the
> TogoWS code I was working on,
>
> On Thu, Sep 8, 2011 at 4:25 PM, Peter Cock wrote:
> On Wed, Sep 7, 2011 at 3:36 PM, Peter Cock wrote:
> >>> If the server could be relied on to always give an
> >>> HTTP error code this wouldn't be needed:
> >>>
> >>> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py
> >>>
> >
> > ...
> >
> > [Some of those TogoWS checks are probably superfluous
> > right now, I'm still polishing the error handling - some of
> > which will rely on TogoWS itself catching more conditions]
>
> I've updated my TogoWS to rely on the HTTP error codes,
> and removed the heuristic error detection which required
> Bio.File for the UndoHandle. That seems to be working fine
> now.
>
> That leaves Bio/SCOP/__init__.py as the only existing or
> imminent code using Bio.File, so if we can sort that out,
> we can deprecate Bio.File as you suggested.
>
> Regards,
>
> Peter
>

From mjldehoon at yahoo.com Tue Oct 18 22:46:33 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 18 Oct 2011 19:46:33 -0700 (PDT)
Subject: [Biopython-dev] Bio.File
In-Reply-To:
Message-ID: <1318992393.44240.YahooMailClassic@web161204.mail.bf1.yahoo.com>

I agree that it doesn't make sense to have a separate module for this.
Even if we put it in Bio/__init__.py, people are likely to forget about
it, and we will end up with some modules that use this code in
Bio/__init__.py and other modules that copy this code in their
source code. As this code is very short, I would just copy it into
the modules that use it.
Best,
--Michiel.

--- On Mon, 10/17/11, João Rodrigues wrote:

I think it doesn't make sense to keep the module for 5 lines of code.

    if isinstance(handleish, basestring):
        with open(handleish, mode) as fp:
            yield fp
    else:
        yield handleish

I'd either place them in __init__.py or just insert them in all Bio.*IO
modules wherever needed. If we had more snippets in common with all *IOs, it
would be valuable and understandable to have a separate module, but as is
it's a bit unnecessary IMHO.

From p.j.a.cock at googlemail.com Wed Oct 19 04:49:27 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 19 Oct 2011 09:49:27 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To: <1318991993.91732.YahooMailClassic@web161201.mail.bf1.yahoo.com>
References: <1318991993.91732.YahooMailClassic@web161201.mail.bf1.yahoo.com>
Message-ID:

On Wed, Oct 19, 2011 at 3:39 AM, Michiel de Hoon wrote:
> Hi Peter,
>
>> That leaves Bio/SCOP/__init__.py as the only existing or
>> imminent code using Bio.File, so if we can sort that out,
>> we can deprecate Bio.File as you suggested.
>
> In Bio/SCOP/__init__.py, Bio.File.UndoHandle is used in the _open
> function, which is an internal function used in the "search" function
> in Bio.SCOP. The UndoHandle is used to wrap a handle returned
> by urllib.urlopen.

Should we change that to use urllib2 for better error handling,
as in Bio.Entrez's _open?

> This search function returns a handle to data in HTML format.
> I don't think we have a parser for it. This suggests that there is
> no specific purpose for UndoHandle in Bio.SCOP._open.

I wonder if that is a sign of URL rot, it would make more sense
to get plain text back. Sadly there were no unit tests for this at
all until now, and I don't yet do anything with the handle other
than confirm we get one!
https://github.com/biopython/biopython/commit/10b94a7b5611edde5fe05f95406d927e5a6a02d9

> So I would suggest to just remove the UndoHandle from
> Bio.SCOP._open and return the urllib.urlopen handle directly.
>
> Any objections?

Sounds fine.

Peter

From p.j.a.cock at googlemail.com Wed Oct 19 04:53:25 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 19 Oct 2011 09:53:25 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To: <1318992393.44240.YahooMailClassic@web161204.mail.bf1.yahoo.com>
References: <1318992393.44240.YahooMailClassic@web161204.mail.bf1.yahoo.com>
Message-ID:

2011/10/19 Michiel de Hoon
>
> I agree that it doesn't make sense to have a separate module for this.

For just the one little function, maybe not. I suspect we may want
more "File related" things like this for Python 3, what with text vs
binary handles and so on, in which case keeping Bio/File.py is
sensible.

> Even if we put it in Bio/__init__.py, people are likely to forget about
> it, and we will end up with some modules that use this code in
> Bio/__init__.py and other modules that copy this code in their
> source code. As this code is very short, I would just copy it into
> the modules that use it.

It may be short, but duplicating this function all over the place
seems like a very bad idea. I think we should just be vigilant in
making sure it is used uniformly wherever we want to accept
either a handle or a filename. Perhaps some of the historically
handle-only parsers should start using it now?

Peter

From anaryin at gmail.com Wed Oct 19 07:46:26 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 19 Oct 2011 13:46:26 +0200
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1318992393.44240.YahooMailClassic@web161204.mail.bf1.yahoo.com>
Message-ID:

Hey Peter,

> For just the one little function, maybe not.
> I suspect we may want
> more "File related" things like this for Python 3, what with text vs
> binary handles and so on, in which case keeping Bio/File.py is
> sensible.
>

What kind of "things" are we talking about here? Could they be anticipated?

>
> It may be short, but duplicating this function all over the place
> seems like a very bad idea. I think we should just be vigilant in
> making sure it is used uniformly wherever we want to accept
> either a handle or a filename. Perhaps some of the historically
> handle-only parsers should start using it now?
>

Duplicating is not a beautiful solution I must agree, but keeping a module
and adding an import statement in every parser for only 5 lines isn't
either.

I suggest we keep Bio.File, deprecating all the other functions, and
meanwhile look at which changes we could include due to Py3.

From p.j.a.cock at googlemail.com Wed Oct 19 08:28:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 19 Oct 2011 13:28:03 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1318992393.44240.YahooMailClassic@web161204.mail.bf1.yahoo.com>
Message-ID:

On Wed, Oct 19, 2011 at 12:46 PM, João Rodrigues wrote:
> Hey Peter,
>
>>
>> For just the one little function, maybe not. I suspect we may want
>> more "File related" things like this for Python 3, what with text vs
>> binary handles and so on, in which case keeping Bio/File.py is
>> sensible.
>
> What kind of "things" are we talking about here? Could they be
> anticipated?
>

For instance, in Python 3 it might be useful for parsing text
files efficiently to use binary mode (i.e. byte strings not unicode)
but also have universal newlines (which I think happens for you
automatically in Python 3 for text mode, i.e. unicode). Surprisingly
open(filename, "rbU") is accepted in Python 3, but it acts like
"rb", typical binary read mode.

>> It may be short, but duplicating this function all over the place
>> seems like a very bad idea.
>> I think we should just be vigilant in
>> making sure it is used uniformly wherever we want to accept
>> either a handle or a filename. Perhaps some of the historically
>> handle-only parsers should start using it now?
>
> Duplicating is not a beautiful solution I must agree, but keeping
> a module and adding an import statement in every parser for
> only 5 lines isn't either.
> I suggest we keep Bio.File, deprecating all the other functions, and
> meanwhile look at which changes we could include due to Py3.

Yes, that's what I am suggesting.

Peter

From mjldehoon at yahoo.com Sat Oct 22 08:17:58 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 22 Oct 2011 05:17:58 -0700 (PDT)
Subject: [Biopython-dev] Bio.File
In-Reply-To:
Message-ID: <1319285878.88223.YahooMailClassic@web161206.mail.bf1.yahoo.com>

OK, done.

Best,
--Michiel

--- On Wed, 10/19/11, Peter Cock wrote:

> From: Peter Cock
> Subject: Re: [Biopython-dev] Bio.File
> To: "Michiel de Hoon"
> Cc: biopython-dev at biopython.org
> Date: Wednesday, October 19, 2011, 4:49 AM
> On Wed, Oct 19, 2011 at 3:39 AM, Michiel de Hoon wrote:
> > Hi Peter,
> >
> >> That leaves Bio/SCOP/__init__.py as the only existing or
> >> imminent code using Bio.File, so if we can sort that out,
> >> we can deprecate Bio.File as you suggested.
> >
> > In Bio/SCOP/__init__.py, Bio.File.UndoHandle is used in the _open
> > function, which is an internal function used in the "search" function
> > in Bio.SCOP. The UndoHandle is used to wrap a handle returned
> > by urllib.urlopen.
>
> Should we change that to use urllib2 for better error handling,
> as in Bio.Entrez's _open?
>
> > This search function returns a handle to data in HTML format.
> > I don't think we have a parser for it. This suggests that there is
> > no specific purpose for UndoHandle in Bio.SCOP._open.
>
> I wonder if that is a sign of URL rot, it would make more sense
> to get plain text back.
> Sadly there were no unit tests for this at
> all until now, and I don't yet do anything with the handle other
> than confirm we get one!
>
> https://github.com/biopython/biopython/commit/10b94a7b5611edde5fe05f95406d927e5a6a02d9
>
> > So I would suggest to just remove the UndoHandle from
> > Bio.SCOP._open and return the urllib.urlopen handle directly.
> >
> > Any objections?
>
> Sounds fine.
>
> Peter
>

From p.j.a.cock at googlemail.com Wed Oct 26 07:11:57 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 26 Oct 2011 12:11:57 +0100
Subject: [Biopython-dev] [Biopython] Pairwise alignment - is it a generic function?
In-Reply-To:
References:
Message-ID:

On Wed, Oct 26, 2011 at 12:02 PM, João Rodrigues wrote:
> Hey Peter,
> Thanks for the answer. How do I pass the matrix and which format should it
> be on? Is there an example I could read?
> João [...] Rodrigues
> http://nmr.chem.uu.nl/~joao

Not that I know of, but adding one to the docstrings and test_pairwise2.py
would be great. I think you use it with a score matrix as a dictionary from
Bio.SubsMat.MatrixInfo

Peter

From eric.talevich at gmail.com Wed Oct 26 09:27:17 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 26 Oct 2011 09:27:17 -0400
Subject: [Biopython-dev] [Biopython] Pairwise alignment - is it a generic function?
In-Reply-To:
References:
Message-ID:

On Wed, Oct 26, 2011 at 7:11 AM, Peter Cock wrote:

> On Wed, Oct 26, 2011 at 12:02 PM, João Rodrigues wrote:
> > Hey Peter,
> > Thanks for the answer. How do I pass the matrix and which format should it
> > be on? Is there an example I could read?
> > João [...] Rodrigues
> > http://nmr.chem.uu.nl/~joao
>
> Not that I know of, but adding one to the docstrings and test_pairwise2.py
> would be great.
I think you use it with a score matrix as a dictionary from
> Bio.SubsMat.MatrixInfo
>
> Peter
>

Here's an example:

from Bio import pairwise2, SeqIO
from Bio.SubsMat.MatrixInfo import blosum62

# pairwise2 works with raw strings, not SeqRecords,
# so take the .seq of each record before calling str()
seq1 = str(SeqIO.read("seq1.fa", "fasta").seq)
seq2 = str(SeqIO.read("seq2.fa", "fasta").seq)

results = pairwise2.align.globalds(seq1, seq2, blosum62, -10, -0.5)
# Returns a list of tuples: (seqA, seqB, score, begin, end)
score = results[0][2]

From anaryin at gmail.com Wed Oct 26 09:31:29 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 26 Oct 2011 15:31:29 +0200
Subject: [Biopython-dev] [Biopython] Pairwise alignment - is it a generic function?
In-Reply-To:
References:
Message-ID:

Hello all,

Coming back after lunch... I managed to load a matrix using this:

from Bio import pairwise2
from Bio.SubsMat import MatrixInfo as m
#print dir(m)
matrix = m.blosum60
pairwise2.align.localdx(seqA, seqB, matrix)

Thanks a lot for the help, it was simple after all, just a bit hard to start with..

From redmine at redmine.open-bio.org Thu Oct 27 00:55:53 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 27 Oct 2011 04:55:53 +0000
Subject: [Biopython-dev] [Biopython - Bug #3308] (New) SeqIO FastaIO: Blank Descriptor causes Indes Out of Range
Message-ID:

Issue #3308 has been reported by Darren Cullerne.
---------------------------------------- Bug #3308: SeqIO FastaIO: Blank Descriptor causes Indes Out of Range https://redmine.open-bio.org/issues/3308 Author: Darren Cullerne Status: New Priority: Normal Assignee: Category: Target version: URL: Entering a FASTA sequence with a blank descriptor: ">" "ACTAGTACTAGATCAGACTACAGTACAGAGAGGACATCTATACTACGAGAGACATACTACTCAGCATACGATAC" Causes the following error: File "C:\Python27\lib\site-packages\Bio\SeqIO\__init__.py", line 532, in parse for r in i: File "C:\Python27\lib\site-packages\Bio\SeqIO\FastaIO.py", line 49, in FastaIterator id = descr.split()[0] IndexError: list index out of range Please let me know if there is any further information you require. Thanks, ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Thu Oct 27 10:03:42 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 27 Oct 2011 14:03:42 +0000 Subject: [Biopython-dev] [Biopython - Bug #3309] (New) GenBank Scanner expects sequence lines to start at position 9 Message-ID: Issue #3309 has been reported by Liam Childs. ---------------------------------------- Bug #3309: GenBank Scanner expects sequence lines to start at position 9 https://redmine.open-bio.org/issues/3309 Author: Liam Childs Status: New Priority: Normal Assignee: Category: Target version: 1.57 URL: Some programs (eg. Vector NTI and Lasegene) produce GenBank files where the sequences start at an index on the line other than index 9. I don't know how tightly defined the GenBank file format is, but if the indent for the start of the sequence can be variable, it seems to me there is a simple fix. 
Current version (Bio/GenBank/Scanner.py:904): line = self.line ... 15 lines if len(line) > 9 and line[9:10]!=' ': raise ValueError("Sequence line mal-formed, '%s'"% line) seq_lines.append(line[idx + 1:]) #remove spaces later Simple fix 1 (variable per file): line = self.line idx = line.find('1') + 1 ... 15 lines if len(line) > idx and line[idx:idx + 1]!=' ': raise ValueError("Sequence line mal-formed, '%s'"% line) seq_lines.append(line[idx + 1:]) #remove spaces later The index can be obtained in any number of ways, this was the simplest I could think of off the top of my head. If sequences are allowed to start at a position other than '1', then maybe a regular expression should be used instead. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Thu Oct 27 10:46:08 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Oct 2011 15:46:08 +0100 Subject: [Biopython-dev] [Biopython] Pairwise alignment - is it a generic function? In-Reply-To: References: Message-ID: On Wed, Oct 26, 2011 at 2:31 PM, Jo?o Rodrigues wrote: > Hello all, > Coming back after lunch... > I managed to load a matrix using this: > > from Bio import pairwise2 > from Bio.SubsMat import MatrixInfo as m > #print dir(m) > matrix = m.blosum60 > pairwise2.align.localdx(seqA, seqB, matrix) > > Thanks a lot for the help, it was simple after all, just a bit hard to start > with.. Hi Jo?o, Could you write a little documentation for the pairwise2 docstring? Just something short based on the above example would be great (ideally as a doctest). 
Thanks,

Peter

From anaryin at gmail.com Thu Oct 27 10:52:25 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 27 Oct 2011 16:52:25 +0200
Subject: [Biopython-dev] [Biopython] Pairwise alignment - is it a generic function?
In-Reply-To:
References:
Message-ID:

Sure thing. The docstring is actually pretty explicit, it's just missing the part that you can get the matrices from SubsMat. Or at least, not that clear. I'll go over it this weekend, maybe earlier.

Best,

João

From p.j.a.cock at googlemail.com Fri Oct 28 12:15:36 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 28 Oct 2011 17:15:36 +0100
Subject: [Biopython-dev] Fwd: [Utilities-announce] Upcoming Release of NCBI EFetch version 2.0
In-Reply-To:
References:
Message-ID:

Hi all,

We may need to update Bio.Entrez for EFetch v2.0 soon, although at first glance there is nothing that will obviously cause trouble...

Peter

---------- Forwarded message ----------
From:
Date: Fri, Oct 28, 2011 at 4:15 PM
Subject: [Utilities-announce] Upcoming Release of NCBI EFetch version 2.0
To: NLM/NCBI List utilities-announce

Upcoming Release of EFetch version 2.0

In November 2011 NCBI plans to release version 2.0 of EFetch. The major changes and updates are as follows:

- EFetch now supports the following databases: biosample, biosystems and sra
- EFetch now has defined default values for &retmode and &rettype for all supported databases (please see Table 1 for all supported values of these parameters)
- EFetch no longer supports &retmode=html; requests containing &retmode=html will return data using the default &retmode value for the specified database (&db)
- EFetch requests including &rettype=docsum will return XML data equivalent to ESummary output

Details about EFetch can be found at http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

An updated, complete listing of supported &rettype and &retmode values can be found at http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1/?report=objectonly

Release notes about this and future releases can be found at http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.Release_Notes

Please write to info at ncbi.nlm.nih.gov if you have any questions about these changes.

_______________________________________________
Utilities-announce mailing list
http://www.ncbi.nlm.nih.gov/mailman/listinfo/utilities-announce

From redmine at redmine.open-bio.org Fri Oct 28 19:45:53 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 28 Oct 2011 23:45:53 +0000
Subject: [Biopython-dev] [Biopython - Feature #3310] (New) HMMER parser
Message-ID:

Issue #3310 has been reported by J M.

----------------------------------------
Feature #3310: HMMER parser
https://redmine.open-bio.org/issues/3310

Author: J M
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL:

This is a parser for the output of hmmsearch from the HMMER package. Given the output of the hmmsearch, this program can retrieve information for each of the alignments including the expected values, the starting and ending positions of each alignment, as well as insert, deletion and mismatch information for each alignment.

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Fri Oct 28 22:00:07 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 29 Oct 2011 02:00:07 +0000 Subject: [Biopython-dev] [Biopython - Bug #3311] (New) GFF parser fails to intelligently break lines Message-ID: Issue #3311 has been reported by gahoo lee. ---------------------------------------- Bug #3311: GFF parser fails to intelligently break lines https://redmine.open-bio.org/issues/3311 Author: gahoo lee Status: New Priority: Normal Assignee: Category: Target version: URL: Move from "BioStar":http://biostar.stackexchange.com/questions/13651/gff-parsing-in-python-is-not-so-perfect I use BCBio.GFF to parse "chr01.gff3":ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.1/chr01.dir/chr01.gff3 and "all.gff3":ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.1/all.dir/all.gff3 . But things didn't work out as I expect. Here's the code: @from BCBio import GFF limits = dict(gff_type = ["gene","mRNA","CDS"]) gff_handle = open('chr01.gff3') for rec in GFF.parse(gff_handle,target_lines=1000,limit_info=limits): #Chromosome seq level for gene_feature in rec.features: #gene level for mRNA_feature in gene_feature.sub_features: #mRNA level print mRNA_feature.type print mRNA_feature.qualifiers['Alias']@ And I got: @Traceback (most recent call last): File "R:\Untitled 1.py", line 14, in print mRNA_feature.qualifiers['Alias'] KeyError: 'Alias'@ And the 'type' is "CDS" which is not correct. When parsing without @target_lines=1000@ everything is ok. But parsing all.gff3 came to the same problem. Maybe all.gff3 is too huge to parse. The problem might be due to the parser did not recognise the entry boudary correctly. 
---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Mon Oct 3 11:20:21 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Oct 2011 12:20:21 +0100 Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome Message-ID: Hi Brad (et al), You might have seen on Twitter at the end of last week I mentioned some work to extend Brad's Bio.Graphics.BasicChromosome to allow features within a chromosome segment, optionally with labels. The branch is here: https://github.com/peterjc/biopython/tree/chr_diag I put together a non-trivial example of showing the tRNA genes in Arabidopsis as a unit test in test_GraphicsChromosome.py - this is deliberately showing too many features in order to check the label placement algorithm: http://twitpic.com/6sgr1m This kind of figure is also used for showing SNP placement and genetic marker loci used in breeding etc. If I had put more (or a more uniform set of) features you'd get something worthy of the nickname "millipede diagram", looking like a segmented body (the chromosome) with thousands of legs (the lines for the labels). This isn't quite backwards compatible - the old code draws the chromosomes left aligned within their allocated space, while I put them centrally in order to draw labels on either side. Iddo sounded enthusiastic on Twitter. Does this look worth including as is? Would someone (doesn't have to be Brad) like to test/review it please? 
Thanks, Peter From bioinformed at gmail.com Mon Oct 3 21:28:21 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Mon, 3 Oct 2011 17:28:21 -0400 Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome In-Reply-To: References: Message-ID: On Mon, Oct 3, 2011 at 7:20 AM, Peter Cock wrote: > You might have seen on Twitter at the end of last week I mentioned > some work to extend Brad's Bio.Graphics.BasicChromosome to allow > features within a chromosome segment, optionally with labels. > > This looks to be extremely useful. Is there any support for layouts to stack or pack chromosomes? I'm thinking of diagrams for humans, where we don't fit as well in linear displays. I also think supporting chromosome bands would be extremely useful. These could include full cytobands, centromeres, euchromatic vs hetrochromatic regions, user configurable bands (e.g. linkage regions, IBD blocks, etc.) The figure shows off what I'm thinking about the banding and layout, even though it uses colored circles instead of text labels: http://www.genome.gov/multimedia/illustrations/GWAS_2011_1.pdf If there is interest, I may have some time to work on these features once the basic infrastructure is stable. Best regards, -Kevin From p.j.a.cock at googlemail.com Mon Oct 3 22:24:12 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Oct 2011 23:24:12 +0100 Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome In-Reply-To: References: Message-ID: On Monday, October 3, 2011, Kevin Jacobs <jacobs at bioinformed.com> < bioinformed at gmail.com> wrote: > On Mon, Oct 3, 2011 at 7:20 AM, Peter Cock wrote: > >> You might have seen on Twitter at the end of last week I mentioned >> some work to extend Brad's Bio.Graphics.BasicChromosome to allow >> features within a chromosome segment, optionally with labels. >> >> > > This looks to be extremely useful. Is there any support for layouts to > stack or pack chromosomes? 
I'm thinking of diagrams for humans, where we > don't fit as well in linear displays. I also think supporting chromosome > bands would be extremely useful. These could include full cytobands, > centromeres, euchromatic vs hetrochromatic regions, user configurable bands > (e.g. linkage regions, IBD blocks, etc.) > > The figure shows off what I'm thinking about the banding and layout, even > though it uses colored circles instead of text labels: > http://www.genome.gov/multimedia/illustrations/GWAS_2011_1.pdf > > If there is interest, I may have some time to work on these features once > the basic infrastructure is stable. > > Best regards, > -Kevin Hi Kevin, I'm glad to hear there is some interest in this :) That example you linked to is interesting - there are several things of specific interest - and helps as I'm not yet familiar with all the technical terms you used. Notches in the chromosome which I assume are centromeres (I can see how that might be added to the Bio code as another segment type, similar to the telomeres). Coloured background regions in the chromosome (should be able to do this already), some of which are hatched (not possible right now... would have to look into ReportLab's capabilities here). This is what you meant by banding? Multiple coloured dots for labels. Doable, but a nice API might be tricky. For layout did you mean the fact this isn't just a row of chromosomes left to right, but here there are two rows? I'm inclined to say the user should just move things in the PDF for a final version using Adobe of Inkscape ;) Regards, Peter From keith.hughitt at gmail.com Tue Oct 4 11:31:51 2011 From: keith.hughitt at gmail.com (Keith Hughitt) Date: Tue, 4 Oct 2011 07:31:51 -0400 Subject: [Biopython-dev] Creating a NCBIFastaIterator Message-ID: Hi all, I was thinking recently that it would be nice if the FASTA file reader were able to check for known formats (e.g. NCBI) and then use that information to choose better values for name, id, etc. 
After some discussion with Peter Cock on GitHub, however, he convinced me that this would be problematic in terms of backwards compatibility, and that instead a better approach might be to add a new sub-format ("fasta-ncbi") to the list of supported format readers. This could go something like: 1. Create a new function in SeqIO.FastaIO for parsing NCBI-formatted FASTA files. Add it the the mapping of iterators. 2. FastaIO.NCBIFasterIterator will simply call FASTAIterator and then modify the result by assigning a new id, name, etc (other suggestions?) 3. FastaIO.NCBIFastaWriter (modify and subclass FastaIO.FastaWriter?) 4. Modify code that interacts with NCBI services which return FASTA files and have it return a NCBIFasterIterator (First use a deprecation/warning to let users know of the pending change?) Does this sound like it would be a useful feature? What about the basic approach outlined above? Any suggestions? Keith From p.j.a.cock at googlemail.com Tue Oct 4 11:46:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Oct 2011 12:46:19 +0100 Subject: [Biopython-dev] Creating a NCBIFastaIterator In-Reply-To: References: Message-ID: On Tue, Oct 4, 2011 at 12:31 PM, Keith Hughitt wrote: > Hi all, > > I was thinking recently that it would be nice if the FASTA file reader were > able to check for known formats (e.g. NCBI) and then use that information to > choose better values for name, id, etc. > > After some discussion with Peter Cock on GitHub, however, he convinced me > that this would be problematic in terms of backwards compatibility, and that > instead a better approach might be to add a new sub-format ("fasta-ncbi") to > the list of supported format readers. > > This could go something like: > > 1. Create a new function in SeqIO.FastaIO for parsing NCBI-formatted FASTA > files. Add it the the mapping of iterators. Yes. > 2. 
FastaIO.NCBIFasterIterator will simply call FASTAIterator and then modify > the result by assigning a new id, name, etc (other suggestions?) Store the GI number in the SeqRecord's annotation under key "gi" to match the GenBank parser. There may be other things like this. If the FASTA header does not match the NCBI style, that should probably trigger an exception. > 3. FastaIO.NCBIFastaWriter (modify and subclass FastaIO.FastaWriter?) This will be harder, but yes in principle. > 4. Modify code that interacts with NCBI services which return FASTA files > and have it return a NCBIFasterIterator (First use a deprecation/warning to > let users know of the pending change?) No need. I'm pretty sure all the NCBI code (like Bio.Entrez) returns handles so it is up to the end user to decide what to do with the data, e.g. parse it with the current SeqIO "fasta" format, or save it straight to disk. > Does this sound like it would be a useful feature? What about the basic > approach outlined above? Any suggestions? > > Keith Yes, it sounds useful. I'm not sure where the most current NCBI documentation is, but this is a good start: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html Peter From chapmanb at 50mail.com Wed Oct 5 12:03:31 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 05 Oct 2011 08:03:31 -0400 Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome In-Reply-To: References: Message-ID: <87k48j8x2k.fsf@sobchak.i-did-not-set--mail-host-address--so-tickle-me> Peter; > >> You might have seen on Twitter at the end of last week I mentioned > >> some work to extend Brad's Bio.Graphics.BasicChromosome to allow > >> features within a chromosome segment, optionally with labels. This is awesome, thanks for extending it. All of your tweaks are good improvements, and I'm +1 for including it in the next release. Please improve away. 
Thanks much, Brad From bioinformed at gmail.com Wed Oct 5 13:16:56 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 5 Oct 2011 09:16:56 -0400 Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome In-Reply-To: References: Message-ID: On Mon, Oct 3, 2011 at 6:24 PM, Peter Cock wrote: > Notches in the chromosome which I assume are centromeres > (I can see how that might be added to the Bio code as another > segment type, similar to the telomeres). > Yes-- although the visual style for centromeres need not be precisely as shown in my example. > Coloured background regions in the chromosome (should be > able to do this already), some of which are hatched (not possible > right now... would have to look into ReportLab's capabilities here). > This is what you meant by banding? > Yes-- being able to show cytobands and custom bands to designate regions will be very useful for me. As before, I'm not wed to the cross-hatching, in fact the standard displays use only grayscale. Multiple coloured dots for labels. Doable, but a nice API might > be tricky. > I don't much care about those -- I'd be happy with text labels. > For layout did you mean the fact this isn't just a row of > chromosomes left to right, but here there are two rows? > I'm inclined to say the user should just move things in > the PDF for a final version using Adobe of Inkscape ;) > Correct. I'd prefer to have some programmatic control of layout, since I'd hate to have to manually edit every whole-genome plot. Since I'm working exclusively with human data for now, it would be possible to pre-specify a few standard layouts and avoid the trouble of supporting dynamic features. Just let me know when the code is stable enough to start poking around. I'll float a proposal for what I think could be done to obtain feedback before I commit much time to coding. 
Thanks,
-Kevin

From p.j.a.cock at googlemail.com Wed Oct 5 13:32:34 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 5 Oct 2011 14:32:34 +0100
Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome
In-Reply-To:
References:
Message-ID:

On Wed, Oct 5, 2011 at 2:16 PM, Kevin Jacobs wrote:
> On Mon, Oct 3, 2011 at 6:24 PM, Peter Cock wrote:
>>
>> Notches in the chromosome which I assume are centromeres
>> (I can see how that might be added to the Bio code as another
>> segment type, similar to the telomeres).
>
> Yes-- although the visual style for centromeres need not be precisely as
> shown in my example.
>
>>
>> Coloured background regions in the chromosome (should be
>> able to do this already), some of which are hatched (not possible
>> right now... would have to look into ReportLab's capabilities here).
>> This is what you meant by banding?
>
> Yes-- being able to show cytobands and custom bands to designate regions
> will be very useful for me. As before, I'm not wed to the cross-hatching,
> in fact the standard displays use only grayscale.

OK - simple colours are easy, I can add that to the test case example.

>>
>> Multiple coloured dots for labels. Doable, but a nice API might
>> be tricky.
>
> I don't much care about those -- I'd be happy with text labels.
>

Good.

>>
>> For layout did you mean the fact this isn't just a row of
>> chromosomes left to right, but here there are two rows?
>> I'm inclined to say the user should just move things in
>> the PDF for a final version using Adobe or Inkscape ;)
>
> Correct. I'd prefer to have some programmatic control of layout, since I'd
> hate to have to manually edit every whole-genome plot. Since I'm working
> exclusively with human data for now, it would be possible to pre-specify a
> few standard layouts and avoid the trouble of supporting dynamic features.
> Just let me know when the code is stable enough to start poking around.
> I'll float a proposal for what I think could be done to obtain feedback
> before I commit much time to coding.

Would an option for using multiple rows be enough? It wouldn't be quite as compact as the tweaked human example you showed - but probably good enough to print on a single page.

Another option is to do the PDF editing programmatically, for example with ReportLab. You can embed multiple (smaller) PDF files within a larger container. It's a bit fiddly, but would be worth the effort for a major pipeline where you always use the same (few) organism(s).

Peter

From p.j.a.cock at googlemail.com Wed Oct 5 14:40:56 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 5 Oct 2011 15:40:56 +0100
Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome
In-Reply-To: <87k48j8x2k.fsf@sobchak.i-did-not-set--mail-host-address--so-tickle-me>
References: <87k48j8x2k.fsf@sobchak.i-did-not-set--mail-host-address--so-tickle-me>
Message-ID:

On Wed, Oct 5, 2011 at 1:03 PM, Brad Chapman wrote:
>
> Peter;
>
>> >> You might have seen on Twitter at the end of last week I mentioned
>> >> some work to extend Brad's Bio.Graphics.BasicChromosome to allow
>> >> features within a chromosome segment, optionally with labels.
>
> This is awesome, thanks for extending it. All of your tweaks are good
> improvements, and I'm +1 for including it in the next release. Please
> improve away.

Awesome. I've applied the current branch to the trunk, although I'm not promising there won't be changes to the new stuff between now and the next release.

In particular, doing the labels (and their placement) for the whole of a chromosome (and not just for a segment) would allow us to squeeze in more labels (e.g. in example I showed using the vertical space currently reserved for the telomeres).
Peter From p.j.a.cock at googlemail.com Wed Oct 5 21:17:38 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Oct 2011 22:17:38 +0100 Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome In-Reply-To: References: Message-ID: On Wed, Oct 5, 2011 at 2:32 PM, Peter Cock wrote: > On Wed, Oct 5, 2011 at 2:16 PM, Kevin Jacobs wrote: >> On Mon, Oct 3, 2011 at 6:24 PM, Peter Cock wrote: >>> Coloured background regions in the chromosome (should be >>> able to do this already), some of which are hatched (not possible >>> right now... would have to look into ReportLab's capabilities here). >>> This is what you meant by banding? >> >> Yes-- being able to show cytobands and custom bands to designate regions >> will be very useful for me. ?As before, I'm not wed to the cross-hatching, >> in fact the standard displays use only grayscale. > > OK - simple colours are easy, I can add that to the test case example. Done, using some random placements - I didn't manage to find the real Arabidopsis cytoband data which would have been nicer. https://github.com/biopython/biopython/commit/24deaca63ba55e28519a4c85650ad74e849f203e Peter From p.j.a.cock at googlemail.com Wed Oct 5 22:31:18 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Oct 2011 23:31:18 +0100 Subject: [Biopython-dev] Enhancements to Bio.Graphics.BasicChromosome In-Reply-To: References: <87k48j8x2k.fsf@sobchak.i-did-not-set--mail-host-address--so-tickle-me> Message-ID: On Wed, Oct 5, 2011 at 3:40 PM, Peter Cock wrote: > > In particular, doing the labels (and their placement) for the whole > of a chromosome (and not just for a segment) would allow us to > squeeze in more labels (e.g. in example I showed using the > vertical space currently reserved for the telomeres). 
>

Done,
https://github.com/biopython/biopython/commit/d3d19440bdbaabbf4cd305e43dea627f68cf6ecf

We may want to review how chromosome segment labels work - probably simplest to add them to the dynamically placed label list, otherwise the two can overlap.

Peter

From tiagoantao at gmail.com Thu Oct 6 16:17:40 2011
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 6 Oct 2011 17:17:40 +0100
Subject: [Biopython-dev] bio.expasy potential bug?
Message-ID:

Hi,

This might be a red herring but:
http://biopython.org/DIST/docs/api/Bio.ExPASy-module.html :
sprot_search_ful(text, make_wild=None, swissprot=1, trembl=None,
cgi='http://www.expasy.ch/cgi-bin/sprot-search-ful')

That cgi does not exist...

Tiago

--
"If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa

From p.j.a.cock at googlemail.com Thu Oct 6 16:23:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Oct 2011 17:23:03 +0100
Subject: [Biopython-dev] bio.expasy potential bug?
In-Reply-To:
References:
Message-ID:

2011/10/6 Tiago Antão :
> Hi,
>
> This might be a red herring but:
> http://biopython.org/DIST/docs/api/Bio.ExPASy-module.html :
> sprot_search_ful(text, make_wild=None, swissprot=1, trembl=None,
> cgi='http://www.expasy.ch/cgi-bin/sprot-search-ful')
>
> That cgi does not exist...
>
> Tiago

Looks like they've changed the URL or turned off a redirect :(

If you can work out what they should be, please go ahead and fix it. A working unit test would be good (mark it as requires internet).

Peter

From tiagoantao at gmail.com Thu Oct 6 16:33:11 2011
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 6 Oct 2011 17:33:11 +0100
Subject: [Biopython-dev] bio.expasy potential bug?
In-Reply-To:
References:
Message-ID:

2011/10/6 Peter Cock :
> Looks like they've changed the URL or turned off a redirect :(
>
> If you can work out what they should be, please go ahead an fix it.
> A working unit test would be good (mark it as requires internet). I will add the bug to redmine. I currently am pressed on time to sort this out :( I can have a look next week. From redmine at redmine.open-bio.org Thu Oct 6 17:06:26 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 6 Oct 2011 17:06:26 +0000 Subject: [Biopython-dev] [Biopython - Bug #3301] (New) Bio.ExPASy sprot_search_ful has wrong cgi address Message-ID: Issue #3301 has been reported by Tiago Antao. ---------------------------------------- Bug #3301: Bio.ExPASy sprot_search_ful has wrong cgi address https://redmine.open-bio.org/issues/3301 Author: Tiago Antao Status: New Priority: Normal Assignee: Category: Target version: URL: The Bio.ExPASy sprot_search_ful has a cgi of http://www.expasy.ch/cgi-bin/sprot-search-ful , but that URL is not available anymore. See: http://biopython.org/DIST/docs/api/Bio.ExPASy-module.html ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From keith.hughitt at gmail.com Fri Oct 7 11:18:10 2011 From: keith.hughitt at gmail.com (Keith Hughitt) Date: Fri, 7 Oct 2011 07:18:10 -0400 Subject: [Biopython-dev] Creating a NCBIFastaIterator In-Reply-To: References: Message-ID: Okay, I took at stab at it. The code is in the master branch of my fork: https://github.com/khughitt/biopython/blob/75be77cf28d376329577adf5ec41a8880b7faf5c/Bio/SeqIO/FastaIO.py#L73 I wasn't sure what the best choices are for id/name so for now I stored the gid in id (and also in the annotations), and the accession as name. Any suggestions? I also haven't written any test code yet. 
Should I parameterize TitleFunctions.simple_check and multi_check, or is there another approach you would advise? Keith From p.j.a.cock at googlemail.com Fri Oct 7 12:49:30 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 7 Oct 2011 13:49:30 +0100 Subject: [Biopython-dev] Creating a NCBIFastaIterator In-Reply-To: References: Message-ID: On Fri, Oct 7, 2011 at 12:18 PM, Keith Hughitt wrote: > Okay, I took at stab at it. The code is in the master branch of my > fork:?https://github.com/khughitt/biopython/blob/75be77cf28d376329577adf5ec41a8880b7faf5c/Bio/SeqIO/FastaIO.py#L73 You are only handling gi||ref|| whereas the NCBI have a *lot* of other variations to consider: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html This is quite an open ended bit of work... > I wasn't sure what the best choices are for id/name so for now I stored the > gid in id (and also in the annotations), and the accession as name. Any > suggestions? I suggest collecting a selection of matched NCBI FASTA and GenBank/GenPept files, and how Biopython handles the GenBank/GenPept version (format name "genbank" alias "gb" in Bio.SeqIO) and try to make handling the FASTA version as "fasta-ncbi" do the same. e.g. From our unit tests (from the NCBI FTP site), these are a pair: Tests/GenBank/NC_005816.gb Tests/GenBank/NC_005816.fna > I also haven't written any test code yet. Should I parameterize > TitleFunctions.simple_check and multi_check, or is there > another approach you would advise? > Keith Probably write some completely new tests. e.g. Use the existing test files mentioned above, and verify that both the "genbank" and the "fasta-ncbi" parser give the same results (ignoring things not in the FASTA file of course). 
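
A rough sketch of the kind of title handling under discussion, assuming only the single gi|...|db|accession| style and nothing from the longer NCBI list. The helper name parse_ncbi_title and the returned keys are hypothetical (this is not part of Bio.SeqIO); the id/name split is an assumption about what the "genbank" parser reports for the matching record and would need checking against Tests/GenBank/NC_005816.gb. A title that does not match raises an exception, as suggested earlier in the thread.

```python
import re

# Hypothetical helper, not part of Bio.SeqIO. It only understands titles of
# the form "gi|<number>|<db>|<accession>| description" and nothing else.
_NCBI_GI_TITLE = re.compile(r"^gi\|(\d+)\|([a-z]+)\|([^|\s]+)\|\s*(.*)$")


def parse_ncbi_title(title):
    """Split an NCBI-style FASTA title (without the leading '>') into fields."""
    match = _NCBI_GI_TITLE.match(title.strip())
    if match is None:
        # A header that is not in the expected NCBI style is treated as an error.
        raise ValueError("Not an NCBI gi|...| style FASTA title: %r" % title)
    gi, database, accession, description = match.groups()
    return {
        "gi": gi,                          # would go in annotations["gi"]
        "db": database,                    # e.g. "ref", "gb", "emb"
        "id": accession,                   # versioned accession (assumed id)
        "name": accession.split(".")[0],   # unversioned accession (assumed name)
        "description": description,
    }


if __name__ == "__main__":
    example = ("gi|45478711|ref|NC_005816.1| Yersinia pestis biovar "
               "Microtus str. 91001 plasmid pPCP1, complete sequence")
    print(parse_ncbi_title(example))
```

A test along the lines described above could then parse the .gb and .fna pair and compare these fields one by one.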

Peter

From andrew.sczesnak at med.nyu.edu Fri Oct 7 15:38:04 2011
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Fri, 07 Oct 2011 11:38:04 -0400
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To:
References:
Message-ID: <4E8F1CDC.8090500@med.nyu.edu>

Adding my unsolicited opinion here, what do y'all think of this NCBIFasta parser being a more general "callback" parser, where a function passed to read() or write() translates some arbitrary delimited-text into an (id, name, description) tuple, as in:

    from Bio import SeqIO

    def x(seqrec):
        # gi||ref||
        y = seqrec.description.strip().split("|")

        # gi acc desc
        return (y[1], y[3], y[4])

    # calls x on every record in the FASTA
    # (the third argument is the proposed callback, not an existing API)
    for seqrec in SeqIO.parse(fp, "fasta", x):
        print(seqrec)

This would be similar to key_function in SeqIO.to_dict() and would shift the responsibility of handling variation in formats to the user. Alternatively, a few functions to parse different styles of description lines could be included in the module.

Andrew

On 10/07/2011 08:49 AM, Peter Cock wrote:
> On Fri, Oct 7, 2011 at 12:18 PM, Keith Hughitt wrote:
>> Okay, I took a stab at it. The code is in the master branch of my
>> fork: https://github.com/khughitt/biopython/blob/75be77cf28d376329577adf5ec41a8880b7faf5c/Bio/SeqIO/FastaIO.py#L73
>
> You are only handling gi||ref||
> whereas the NCBI have a *lot* of other variations to consider:
>
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html
>
> This is quite an open ended bit of work...
>
>> I wasn't sure what the best choices are for id/name so for now I stored the
>> gid in id (and also in the annotations), and the accession as name. Any
>> suggestions?
>
> I suggest collecting a selection of matched NCBI FASTA and
> GenBank/GenPept files, and how Biopython handles the
> GenBank/GenPept version (format name "genbank" alias "gb"
> in Bio.SeqIO) and try to make handling the FASTA version as
> "fasta-ncbi" do the same.
>
> e.g.
> From our unit tests (from the NCBI FTP site), these are
> a pair:
>
> Tests/GenBank/NC_005816.gb
> Tests/GenBank/NC_005816.fna
>
>> I also haven't written any test code yet. Should I parameterize
>> TitleFunctions.simple_check and multi_check, or is there
>> another approach you would advise?
>> Keith
>
> Probably write some completely new tests. e.g. Use the
> existing test files mentioned above, and verify that both
> the "genbank" and the "fasta-ncbi" parser give the same
> results (ignoring things not in the FASTA file of course).
>
> Peter
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From p.j.a.cock at googlemail.com Fri Oct 7 16:00:52 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Oct 2011 17:00:52 +0100
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To: <4E8F1CDC.8090500@med.nyu.edu>
References: <4E8F1CDC.8090500@med.nyu.edu>
Message-ID:

On Fri, Oct 7, 2011 at 4:38 PM, Andrew Sczesnak wrote:
> Adding my unsolicited opinion here, what do y'all think of this NCBIFasta
> parser being a more general "callback" parser, where a function passed to
> read() or write() translates some arbitrary delimited-text into ...
>
> This would be similar to key_function in SeqIO.to_dict() and would shift the
> responsibility of handling variation in formats to the user. Alternatively,
> a few functions to parse different styles of description lines could be
> included in the module.
>
> Andrew

Hi Andrew,

Interesting idea, although it doesn't fit that well with the current (deliberately) simple high level Bio.SeqIO.parse/read API, that doesn't mean we can't do it (see Bio.Phylo.parse).

In this case I fail to see what benefit this gives over the current situation, where the user can do this themselves with the current FASTA parser,

e.g.
With a function and a generator expression,

    records = (do_ncbi_my_way(record) for record in SeqIO.parse(filename, "fasta"))

or more simply within a loop:

    for record in SeqIO.parse(filename, "fasta"):
        do_ncbi_my_way(record)
        # Do stuff with record

etc.

Maybe it is down to personal preference of coding style?

I would much prefer a new "fasta-ncbi" parser in SeqIO that handled all the documented NCBI FASTA identifiers.

I'm being negative here - but please don't let that deter you from posting ideas. This is a public list and we/I welcome constructive criticism and alternative ideas to the table.

Regards,

Peter

From p.j.a.cock at googlemail.com Fri Oct 7 16:16:55 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Oct 2011 17:16:55 +0100
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To: <4E8F239D.30504@med.nyu.edu>
References: <4E8F1CDC.8090500@med.nyu.edu> <4E8F239D.30504@med.nyu.edu>
Message-ID:

On Fri, Oct 7, 2011 at 5:06 PM, Andrew Sczesnak wrote:
>>
>> Maybe it is down to personal preference of coding style?
>
> I agree, there isn't much difference between specifying the callback
> function in parse() or within the loop. To me, this points out that
> re-implementing a FASTA parser simply for a format of description
> line seems unnecessary.
>
> If a user is interested in extracting a particular piece of information
> from a FASTA description and knows the input format of the file, how
> difficult is it for them to split() it on their own? What exactly are the
> advantages of a separate parser?

Not enough of an advantage for me personally to have gone and written it myself ;)

I can see some benefits in extracting information from the NCBI identifier and storing them in the SeqRecord's dbxref list and annotation dictionary (as consistently with our other parsers as possible) if you are going to want to use those fields yourself.

Perhaps Keith can explain his interest with some examples?
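
To make the "wrap the existing parser" pattern from this thread concrete, here is a small sketch that runs without Biopython. The Record namedtuple is only a stand-in for a SeqRecord, do_ncbi_my_way is the placeholder name used earlier in the thread, and the choice of which fields to fill (accession as id, GI number under annotations["gi"]) is an assumption for illustration rather than a settled convention.

```python
from collections import namedtuple

# Stand-in for a SeqRecord so the sketch runs without Biopython installed.
Record = namedtuple("Record", ["id", "description", "annotations"])


def do_ncbi_my_way(record):
    """Return a copy of record re-labelled from a gi|...|ref|...| identifier."""
    fields = record.id.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        # Keep the GI number as an annotation and use the accession as the id.
        annotations = dict(record.annotations, gi=fields[1])
        return record._replace(id=fields[3], annotations=annotations)
    # Anything that does not look like gi|...|ref|... is passed through as-is.
    return record


raw = [
    Record("gi|45478711|ref|NC_005816.1|", "Yersinia pestis plasmid", {}),
    Record("plain_id", "no NCBI prefix", {}),
]

# Lazy post-processing, mirroring the generator expression shown in the thread.
records = (do_ncbi_my_way(r) for r in raw)
for record in records:
    print(record.id, record.annotations)
```

With real data, raw would simply be replaced by the iterator returned from SeqIO.parse(filename, "fasta").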

Peter

From andrew.sczesnak at med.nyu.edu Fri Oct 7 16:06:53 2011
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Fri, 07 Oct 2011 12:06:53 -0400
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To:
References: <4E8F1CDC.8090500@med.nyu.edu>
Message-ID: <4E8F239D.30504@med.nyu.edu>

On 10/07/2011 12:00 PM, Peter Cock wrote:
> Hi Andrew,
>
> Interesting idea, although it doesn't fit that well with the current
> (deliberately) simple high level Bio.SeqIO.parse/read API,
> that doesn't mean we can't do it (see Bio.Phylo.parse).
>
> In this case I fail to see what benefit this gives over the current
> situation, where the user can do this themselves with the
> current FASTA parser,
>
> e.g. With a function and a generator expression,
>
> records = (do_ncbi_my_way(record) for record in SeqIO.parse(filename, "fasta"))
>
> or more simply within a loop:
>
> for record in SeqIO.parse(filename, "fasta"):
>     do_ncbi_my_way(record)
>     # Do stuff with record
>
> etc.
>
> Maybe it is down to personal preference of coding style?

I agree, there isn't much difference between specifying the callback function in parse() or within the loop. To me, this points out that re-implementing a FASTA parser simply for a format of description line seems unnecessary.

If a user is interested in extracting a particular piece of information from a FASTA description and knows the input format of the file, how difficult is it for them to split() it on their own? What exactly are the advantages of a separate parser?

> I would much prefer a new "fasta-ncbi" parser in SeqIO
> that handled all the documented NCBI FASTA identifiers.
>
> I'm being negative here - but please don't let that deter you
> from posting ideas. This is a public list and we/I welcome
> constructive criticism and alternative ideas to the table.
>
> Regards,
>
> Peter

From keith.hughitt at gmail.com Fri Oct 7 17:02:30 2011
From: keith.hughitt at gmail.com (Keith Hughitt)
Date: Fri, 7 Oct 2011 13:02:30 -0400
Subject: [Biopython-dev] Creating a NCBIFastaIterator
In-Reply-To:
References: <4E8F1CDC.8090500@med.nyu.edu> <4E8F239D.30504@med.nyu.edu>
Message-ID:

It's really just meant to be a bit of "polish." Originally I was thinking not about having a separate parser but simply extending the existing FASTA parser to recognize common formats (e.g. NCBI) and choose better ids, annotations, etc. Since that would create problems in terms of backwards compatibility, however, adding a new parser seemed like the next best option.

Part of the goal, personally, was also just to find a small but useful task I could work on to begin to learn the code and contribute some. It shouldn't be forced though, so I don't want to contribute something unless it's actually an improvement.

Keith

On Fri, Oct 7, 2011 at 12:16 PM, Peter Cock wrote:
> On Fri, Oct 7, 2011 at 5:06 PM, Andrew Sczesnak
> wrote:
> >>
> >> Maybe it is down to personal preference of coding style?
> >
> > I agree, there isn't much difference between specifying the callback
> > function in parse() or within the loop. To me, this points out that
> > re-implementing a FASTA parser simply for a format of description
> > line seems unnecessary.
> >
> > If a user is interested in extracting a particular piece of information
> > from a FASTA description and knows the input format of the file, how
> > difficult is it for them to split() it on their own? What exactly are the
> > advantages of a separate parser?
>
> Not enough of an advantage for me personally to have gone
> and written it myself ;)
>
> I can see some benefits in extracting information from the
> NCBI identifier and storing them in the SeqRecord's dbxref
> list and annotation dictionary (as consistently with our other
> parsers as possible) if you are going to want to use those
> fields yourself.
>
> Perhaps Keith can explain his interest with some examples?
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From b.invergo at gmail.com Mon Oct 10 10:36:47 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Mon, 10 Oct 2011 12:36:47 +0200
Subject: [Biopython-dev] Parsing PAML supplementary output
Message-ID: <1318243007.12974.16.camel@localhost.localdomain>

Hi all,

I've received a request to implement the parsing of the main supplementary output files of the PAML programs ('rst' files). I can't submit a bug on Bugzilla, so I'll just announce my intention to work on this here on the list.

One question though. The rst file for baseml includes an alignment which is in the Phylip sequential format. I thought that it would be nice to parse that directly into a Biopython MultipleSeqAlignment. It's my understanding that Biopython only supports the interleaved format. Would it be worth it for me to extend that functionality to include the sequential format or would it be preferable to convert the alignments to be interleaved within the parser itself?

Regards,
Brandon Invergo

From p.j.a.cock at googlemail.com Mon Oct 10 12:21:52 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 10 Oct 2011 13:21:52 +0100
Subject: [Biopython-dev] Parsing PAML supplementary output
In-Reply-To: <1318243007.12974.16.camel@localhost.localdomain>
References: <1318243007.12974.16.camel@localhost.localdomain>
Message-ID:

On Mon, Oct 10, 2011 at 11:36 AM, Brandon Invergo wrote:
> Hi all,
> I've received a request to implement the parsing of the main
> supplementary output files of the PAML programs ('rst' files). I can't
> submit a bug on Bugzilla, so I'll just announce my intention to work on
> this here on the list.

That's because we moved to RedMine, there should have been a link on the old Bugzilla page, but anyway it's here:
https://redmine.open-bio.org/projects/biopython

> One question though. The rst file for baseml includes an alignment which
> is in the Phylip sequential format. I thought that it would be nice to
> parse that directly into a Biopython MultipleSeqAlignment. It's my
> understanding that Biopython only supports the interleaved format. Would
> it be worth it for me to extend that functionality to include the
> sequential format or would it be preferable to convert the alignments to
> be interleaved within the parser itself?
>
> Regards,
> Brandon Invergo

If you can extend the current PHYLIP parser (strict or relaxed) to cover interleaved and sequential, that would be nice. For strict mode at least, we can in principle follow whatever the original PHYLIP tools do to detect this automatically. It may be safer to make it explicit though - from what I recall without seeing the PHYLIP implementation's source code it was not obvious how to do this reliably.

Peter

From b.invergo at gmail.com Mon Oct 10 13:22:18 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Mon, 10 Oct 2011 15:22:18 +0200
Subject: [Biopython-dev] Parsing PAML supplementary output
In-Reply-To:
References: <1318243007.12974.16.camel@localhost.localdomain>
Message-ID: <1318252938.12974.54.camel@localhost.localdomain>

Hi Peter

> That's because we moved to RedMine, there should have
> been a link on the old Bugzilla page, but anyway it's here:
> https://redmine.open-bio.org/projects/biopython

Ok, I'll file an enhancement request there. I didn't see a link on the Bugzilla page and there are still some links to Bugzilla on the wiki, like in the "What's being worked on" section.
I missed the Issue Tracker link on the left (incidentally, I think this is a design problem of the typical wiki layout and not Biopython-specific...I never notice the contents of that list), so it might be advisable to include the link under the Contribute heading of the main page.

> If you can extend the current PHYLIP parser (strict or relaxed)
> to cover interleaved and sequential, that would be nice. For
> strict mode at least, we can in principle follow whatever the
> original PHYLIP tools do to detect this automatically. It may
> be safer to make it explicit though - from what I recall without
> seeing the PHYLIP implementation's source code it was not
> obvious how to do this reliably.

Ok, I'll take a look at the PHYLIP source code to see how they do it there. I'll report back with problems/notable progress/questions.

Cheers,
Brandon

From redmine at redmine.open-bio.org Mon Oct 10 13:29:47 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 10 Oct 2011 13:29:47 +0000
Subject: [Biopython-dev] [Biopython - Feature #3303] (New) Support PHYLIP sequential alignment format in AlignIO
Message-ID:

Issue #3303 has been reported by Brandon Invergo.

----------------------------------------
Feature #3303: Support PHYLIP sequential alignment format in AlignIO
https://redmine.open-bio.org/issues/3303

Author: Brandon Invergo
Status: New
Priority: Normal
Assignee: Brandon Invergo
Category:
Target version:
URL:

Currently only PHYLIP alignments in the interleaved format can be read by AlignIO; however, since some programs still work on the sequential format, it would be helpful to be able to support that as well.
----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Mon Oct 10 13:31:13 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 10 Oct 2011 13:31:13 +0000
Subject: [Biopython-dev] [Biopython - Feature #3304] (New) Parse PAML supplementary (rst) output files
Message-ID:

Issue #3304 has been reported by Brandon Invergo.

----------------------------------------
Feature #3304: Parse PAML supplementary (rst) output files
https://redmine.open-bio.org/issues/3304

Author: Brandon Invergo
Status: New
Priority: Normal
Assignee: Brandon Invergo
Category:
Target version:
URL:

PAML programs create several output files, the main one of which is already parsed by the Bio.Phylo.PAML modules. The primary supplementary output files ('rst' files) also contain information that is useful for some users so they should be parsed as well.
----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From p.j.a.cock at googlemail.com Mon Oct 10 16:35:15 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 10 Oct 2011 17:35:15 +0100
Subject: [Biopython-dev] Parsing PAML supplementary output
In-Reply-To: <1318252938.12974.54.camel@localhost.localdomain>
References: <1318243007.12974.16.camel@localhost.localdomain> <1318252938.12974.54.camel@localhost.localdomain>
Message-ID:

On Mon, Oct 10, 2011 at 2:22 PM, Brandon Invergo wrote:
> Hi Peter
>
>> That's because we moved to RedMine, there should have
>> been a link on the old Bugzilla page, but anyway it's here:
>> https://redmine.open-bio.org/projects/biopython
>
> Ok, I'll file an enhancement request there.
> I didn't see a link on the
> Bugzilla page and there are still some links to Bugzilla on the wiki,
> like in the "What's being worked on" section.

Fixed, thanks.

> I missed the Issue Tracker
> link on the left (incidentally, I think this is a design problem of the
> typical wiki layout and not Biopython-specific...I never notice the
> contents of that list), so it might be advisable to include the link
> under the Contribute heading of the main page.

Good idea, done.

Peter

From p.j.a.cock at googlemail.com Mon Oct 10 21:47:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 10 Oct 2011 22:47:03 +0100
Subject: [Biopython-dev] Moving strand & db ref from SeqFeature to FeatureLocation
Message-ID:

This was on the "SeqFeature start/end and making positions act like ints" thread last month:
http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009183.html

On Mon, Sep 19, 2011 at 10:03 AM, Peter Cock wrote:
>> Well, slightly easier - I have some more dramatic changes to
>> the SeqFeature and FeatureLocation objects planned, but I'm
>> still playing with this.
>
> One of the key changes (which can be done without
> really changing the API) is to move the database &
> accession and the strand from the SeqFeature to the
> FeatureLocation. These are intimately connected with
> the location, as much as the start/end.
>
> This is one of the things I've been working on here:
> https://github.com/peterjc/biopython/commits/f_loc
>
> The other key change on that experimental branch
> is moving away from sub_features for join locations
> (etc). Here I was trying a new CompoundLocation
> object, but am still wondering if this should be done
> in the SeqFeature or FeatureLocation object instead
> (or if SeqFeature should subclass FeatureLocation).
>
> Peter

That branch needs some manual merge conflict resolution with the integer subclassing position changes that landed on the trunk, which I've started:
https://github.com/peterjc/biopython/tree/f_loc2

Would someone like to review that please? It moves the strand, ref and db_ref properties from the SeqFeature object to the FeatureLocation object, implementing read/write proxy methods for backward compatibility.

Other than the commit which changes the __str__ method (the fine details of which I am happy to tweak with discussion) this should be almost 100% back compatible:
https://github.com/peterjc/biopython/commit/fed003821d0d223a7b3042ccc3bdf8442348f043

The one break I am aware of is you can't now create a SeqFeature with an empty location and then try to set the strand or db refs before setting the location object (which is what the GenBank parser was doing).

The motivation is that the strand and (optional) database reference for which the location start/end apply are both essential parts of the location information, and I feel never should have been attached to the SeqFeature directly. Furthermore, this separation is useful as a step towards reworking the current use of the SeqFeature's sub_feature list for multi-part locations (e.g. joins in GenBank/EMBL), more on this later.

Thanks,

Peter

From b.invergo at gmail.com Tue Oct 11 07:51:26 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Tue, 11 Oct 2011 09:51:26 +0200
Subject: [Biopython-dev] Parsing PAML supplementary output
In-Reply-To:
References: <1318243007.12974.16.camel@localhost.localdomain>
Message-ID: <1318319486.3137.19.camel@localhost.localdomain>

> If you can extend the current PHYLIP parser (strict or relaxed)
> to cover interleaved and sequential, that would be nice. For
> strict mode at least, we can in principle follow whatever the
> original PHYLIP tools do to detect this automatically.
> It may
> be safer to make it explicit though - from what I recall without
> seeing the PHYLIP implementation's source code it was not
> obvious how to do this reliably.
>

I checked out the PHYLIP code and yes it's not really obvious how the mode is detected. In fact, it seems that many of the programs ask for user input to specify the format of the alignment.

So, regarding making it explicit, I'm not sure if this is what you meant but I was thinking it might be simplest to add another Iterator/Writer pair in the PhylipIO module for SequentialPhylip which inherit from the basic Phylip classes, overriding the next() method in the iterator and the write_alignment() method in the writer, much in the way that the RelaxedPhylip classes work.

This would mean that there would be no flexibility in the naming rules (ie relaxed vs strict) for the SequentialPhylip format, unless I were to also make a RelaxedSequentialPhylip pair of classes. PAML relaxes the sequence name length restriction to 30 characters and since the whole reason for embarking on this exercise was to support PAML's output of PHYLIP alignments, if only one naming convention is to be implemented I think it would be best to default to the relaxed rules.

Slightly unrelated musings: I was thinking that with Biopython's support for reading PHYLIP alignments and Newick trees into objects, at some point it would probably be convenient to make the Bio.Phylo.PAML package more integrated by allowing the user to pass in such objects as arguments rather than writing them to files first; the PAML module could write them to temp files itself. I think some minor changes might have to be made in places (ie for PAML to accept interleaved alignments, the header line must contain an 'I' flag after the seq # and seq len integers) and I'd have to think about how best to allow passing such objects while still retaining the ability to specify filenames without using kludgy, non-pythonic type-checking.
Anyway, another task for another day, but I thought I'd throw it out there.

Regards,
Brandon

From p.j.a.cock at googlemail.com Tue Oct 11 08:20:52 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 11 Oct 2011 09:20:52 +0100
Subject: [Biopython-dev] Parsing PAML supplementary output
In-Reply-To: <1318319486.3137.19.camel@localhost.localdomain>
References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain>
Message-ID:

On Tue, Oct 11, 2011 at 8:51 AM, Brandon Invergo wrote:
>> If you can extend the current PHYLIP parser (strict or relaxed)
>> to cover interleaved and sequential, that would be nice. For
>> strict mode at least, we can in principle follow whatever the
>> original PHYLIP tools do to detect this automatically. It may
>> be safer to make it explicit though - from what I recall without
>> seeing the PHYLIP implementation's source code it was not
>> obvious how to do this reliably.
>>
> I checked out the PHYLIP code and yes it's not really obvious how the
> mode is detected. In fact, it seems that many of the programs ask for
> user input to specify the format of the alignment.
>
> So, regarding making it explicit, I'm not sure if this is what you meant
> but I was thinking it might be simplest to add another Iterator/Writer
> pair in the PhylipIO module for SequentialPhylip which inherit from the
> basic Phylip classes, overriding the next() method in the iterator and
> the write_alignment() method in the writer, much in the way that the
> RelaxedPhylip classes work.

Something like that as a new format variant, yes.

> This would mean that there would be no flexibility in the naming rules
> (ie relaxed vs strict) for the SequentialPhylip format, unless I were to
> also make a RelaxedSequentialPhylip pair of classes.
> PAML relaxes the
> sequence name length restriction to 30 characters and since the whole
> reason for embarking on this exercise was to support PAML's output of
> PHYLIP alignments, if only one naming convention is to be implemented I
> think it would be best to default to the relaxed rules.

Practical.

> Slightly unrelated musings: I was thinking that with Biopython's support
> for reading PHYLIP alignments and Newick trees into objects, at some
> point it would probably be convenient to make the Bio.Phylo.PAML package
> more integrated by allowing the user to pass in such objects as
> arguments rather than writing them to files first; the PAML module could
> write them to temp files itself. I think some minor changes might have
> to be made in places (ie for PAML to accept interleaved alignments, the
> header line must contain an 'I' flag after the seq # and seq len
> integers) and I'd have to think about how best to allow passing such
> objects while still retaining the ability to specify filenames without
> using kludgy, non-pythonic type-checking. Anyway, another task for
> another day, but I thought I'd throw it out there.

Do we need to write the "I" flag in our PHYLIP output?

Peter

From b.invergo at gmail.com Tue Oct 11 09:33:13 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Tue, 11 Oct 2011 11:33:13 +0200
Subject: [Biopython-dev] Parsing PAML supplementary output
In-Reply-To:
References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain>
Message-ID: <1318325593.3137.51.camel@localhost.localdomain>

> Something like that as a new format variant, yes.
>
> > ...
>
> Practical.
>

Ok, I'll start working on that then.

> Do we need to write the "I" flag in our PHYLIP output?

It took me a while to hunt down information on PHYLIP flags.
I found this link which mentions them:
http://www.no.embnet.org/phylipdoc/
They're only used by the program which is using the alignment as input, corresponding to the PHYLIP programs' menu options. In general, they have no effect on the format of the alignment (aside from the 'S'/sequential vs 'I'/interleaved flags). However, some of them might require extra information immediately below the header line, before the alignment starts. This complicates things. (see below for some PAML examples)

However, since there's no real standardization to the use of the phylip format, not all programs pay attention to these flags. In my own work, I've used TCoffee to generate interleaved alignments and then I have to add in the 'I' after the fact. As another example, the current Biopython PhylipIO would not recognize a header line with options as a valid header line, since there would be more than 2 "parts".

So, if some programs can take options flags (at least PHYLIP and PAML programs) while other programs may not like their inclusion, they would need to be treated specially. I would suggest that the PhylipIterator classes be modified to recognize the existence of options, but not necessarily do anything with them, and that the PhylipWriter classes be modified to optionally take a string containing option flags to append to the header line, ie 'I', 'GC', etc.

As for the supplementary information for the options, I'm not sure if those complicate matters beyond the scope of Biopython's intended functionality, or whether there should be yet another optional string argument to the writer. The PhylipIterators would then need to be modified to handle the possible existence of these supplementary lines as well.

Anyway, I don't think this is an immediate concern and I personally wouldn't approach it until I start working on the idea of better integrating the PAML module with the rest of Biopython.
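
The header handling suggested above (recognise option flags, but do nothing further with them) might look roughly like the following sketch. The function name parse_phylip_header is hypothetical and is not part of Bio.AlignIO.PhylipIO; the sketch assumes the header is two whitespace-separated integers optionally followed by flag letters, and it makes no attempt to deal with the supplementary lines some flags require.

```python
def parse_phylip_header(line):
    """Return (n_seqs, seq_len, flags) from a PHYLIP-style header line.

    Hypothetical helper for illustration; flags are collected but not acted on.
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("Header needs at least two integers: %r" % line)
    # int() raises ValueError itself if either field is not a number.
    n_seqs, seq_len = int(parts[0]), int(parts[1])
    # Anything after the two integers is treated as option flags, e.g. "I" or "GC".
    flags = "".join(parts[2:]).upper()
    return n_seqs, seq_len, flags


for header in ("  5   895   G", "5 855 GC", "4 40"):
    print(parse_phylip_header(header))
```

A writer could take the same flags string as an optional argument and append it to the header line it emits.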

-brandon

Here are some examples:

5 895 G
G 4 3
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
1231231231231231231231231231231231231
444444444444444444444444444444444444444444444444444444444444
444444444444444444444444444444444444444444444444444444444444
444444444444444444444444444444444444444444444444444444444444
444444444444444444
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
12312312312312312312312312312312312312312312312312312312312
Human AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGACTTACATCCTCATTACTATT
CTGCCTAGCAAACTCAAACTACGAACGCACTCACAGTCGCATCATAATC........
Chimpanzee .........

"The first line of the file contains the option character G. The second line begins with a G at the first column, followed by the number of site classes. The following lines contain the site marks, one for each site in the sequence (or each codon in the case of codonml). The site mark specifies which class each site is from. If there are g classes, the marks should be 1, 2, ..., g, and if g > 9, the marks need to be separated by spaces. The total number of marks must be equal to the total number of sites in each sequence."

********

5 1000 G
G 4 100 200 300 400
Sequence 1 TCGATAGATAGGTTTTAGGGGGGGGGGTAAAAAAAAA.......

"This [alignment has 5 sequences of] 1000 nucleotides from 4 genes, obtained from concatenating four genes with 100, 200, 300, and 400 nucleotides from genes 1, 2, 3, and 4, respectively. The"

********

5 855 GC
human GTG CTG TCT CCT ...
5 sequences, 855 nucleotides, length must be a multiple of three

********

5 300 G
G2 40 60
sequence1 .....

"This data set has 5 sequences, each of 300 nucleotides (100 codons), which are partitioned into two genes, with the first gene having 40 codons and the second gene 60 codons."

From p.j.a.cock at googlemail.com Tue Oct 11 09:37:48 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 11 Oct 2011 10:37:48 +0100
Subject: [Biopython-dev] Parsing PAML supplementary output
In-Reply-To: <1318325593.3137.51.camel@localhost.localdomain>
References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain>
Message-ID:

On Tue, Oct 11, 2011 at 10:33 AM, Brandon Invergo wrote:
>> Do we need to write the "I" flag in our PHYLIP output?
>
> It took me a while to hunt down information on PHYLIP flags. I found
> this link which mentions them:
> http://www.no.embnet.org/phylipdoc/
> They're only used by the program which is using the alignment as input,
> corresponding to the PHYLIP programs' menu options. In general, they
> have no effect on the format of the alignment (aside from the
> 'S'/sequential vs 'I'/interleaved flags). However, some of them might
> require extra information immediately below the header line, before the
> alignment starts. This complicates things. (see below for some PAML
> examples)

Some of those examples don't really look like PHYLIP anymore to me.

If there is any simple change to allow the current parser to cope with (but ignore) any extra meta data like this, that sounds sensible (with unit tests of course - grin).
Peter From b.invergo at gmail.com Tue Oct 11 10:01:59 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Tue, 11 Oct 2011 12:01:59 +0200 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain> Message-ID: <1318327319.3137.70.camel@localhost.localdomain> > Some of those examples don't really look like PHYLIP anymore to me. > > If there is any simple change to allow the current parser to cope > with (but ignore) any extra meta data like this, that sounds sensible > (with unit tests of course - grin). Agreed, it can get quite messy, though look at the link I provided; even the PHYLIP-specific example that they give includes some supplementary info at the top, as well as a tree at the bottom: 4 40 W W 0101001111 0101110101 0101110011 1101010110 dmras1 GTCGTCGTTG GACCTGGAGG CGTGGGCAAG spras GTAGTTGTAG GAGATGGTGG TGTTGGTAAA scras1 GTAGTTGTCG GTGGAGGTGG CGTTGGTAAA scras2 GTCGTCGTTG GTGGTGGTGG TGTTGGTAAA TCCGCGCTCA AGTGCTTTGA TCTGCTTTAA TCTGCTTTGA 1 ((dmras1,ddrasa),((hschras,spras),(scras1,scras2))); I agree that trying to shoehorn that functionality into Biopython as written would be a mess. Another option that I can think of, however, would be to shift such extra formatting duties to the Biopython application interface which needs them, since that's the only place they're relevant. So I could, for example, make a PAML-specific subclass of PhylipWriter which handles all these weird PAML-specific options. Or if there were to be a PHYLIP interface and the program took that above example as input, it would be the duty of the interface to write a file with those options, the alignment and the tree all together. Just a thought. For the short term, though, when I implement the sequential format, I'll go ahead and update the code to at least handle flags in the header line. To handle the supp. 
info should be straightforward, since I believe that each supp. line must begin with the option flag that requires the info; if the option flag exists in the header, ignore any following lines which begin with that flag character. Unit tests will abound. -brandon From p.j.a.cock at googlemail.com Tue Oct 11 10:13:03 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Oct 2011 11:13:03 +0100 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: <1318327319.3137.70.camel@localhost.localdomain> References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain> <1318327319.3137.70.camel@localhost.localdomain> Message-ID: On Tue, Oct 11, 2011 at 11:01 AM, Brandon Invergo wrote: > >> Some of those examples don't really look like PHYLIP anymore to me. >> >> If there is any simple change to allow the current parser to cope >> with (but ignore) any extra meta data like this, that sounds sensible >> (with unit tests of course - grin). > > Agreed, it can get quite messy, though look at the link I provided; even > the PHYLIP-specific example that they give includes some supplementary > info at the top, as well as a tree at the bottom: > > 4 40 W > W 0101001111 0101110101 0101110011 > 1101010110 > dmras1 GTCGTCGTTG GACCTGGAGG CGTGGGCAAG > > spras GTAGTTGTAG GAGATGGTGG TGTTGGTAAA > scras1 GTAGTTGTCG GTGGAGGTGG CGTTGGTAAA > scras2 GTCGTCGTTG GTGGTGGTGG TGTTGGTAAA > TCCGCGCTCA > AGTGCTTTGA > TCTGCTTTAA > TCTGCTTTGA > 1 > ((dmras1,ddrasa),((hschras,spras),(scras1,scras2))); > I would consider that to be a meta file containing a PHYLIP alignment and a tree, but in itself it isn't a PHYLIP alignment. That looks like exactly the kind of issue NEXUS was designed to solve: how to embed alignments, trees and other stuff into a single plain text file for input into a phylogenetic tool.
Doesn't PHYLIP have an XML format these days? Trying to parse something like that text (without a formal standard) seems like a painful exercise and long term maintenance headache. Peter From b.invergo at gmail.com Tue Oct 11 10:37:39 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Tue, 11 Oct 2011 12:37:39 +0200 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain> <1318327319.3137.70.camel@localhost.localdomain> Message-ID: <1318329459.3137.82.camel@localhost.localdomain> > I would consider that to be a meta file containing a PHYLIP > alignment and a tree, but in itself it isn't a PHYLIP alignment. > > That looks like exactly the kind of issue NEXUS was designed > to solve: how to embed alignments, trees and other stuff into > a single plain text file for input into a phylogenetic tool. > > Doesn't PHYLIP have an XML format these days? Trying > to parse something like that text (without a formal standard) > seems like a painful exercise and long term maintenance > headache. I'm not suggesting that Biopython parse and store the information because I agree that it would be an unmaintainable nightmare. To bring myself out of the clouds a bit and back to the basics of my original intent: if I work on better integrating the PAML module so that the user can pass a MultipleSeqAlignment object, I will need a way to write that alignment to a file with potentially more information than the default PhylipWriter allows. So, just as simple as that, Bio.Phylo.PAML would need its own alignment writer....something I'm not going to worry about right now. With this mentality, then yes, anything containing such option flags and info is no longer a PHYLIP alignment but is rather an input file to some program. As such, the existing PhylipIO module should *not* be modified to handle this metadata. 
Please ignore all my other half-baked ideas. So, current, phylip-related tasks: - implement SequentialPhylipWriter and SequentialPhylipIterator classes in PhylipIO That's it, I think. I'll revisit this alignment-writing stuff at some other point. One task at a time... -brandon From p.j.a.cock at googlemail.com Tue Oct 11 11:05:48 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Oct 2011 12:05:48 +0100 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: <1318329459.3137.82.camel@localhost.localdomain> References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain> <1318327319.3137.70.camel@localhost.localdomain> <1318329459.3137.82.camel@localhost.localdomain> Message-ID: On Tue, Oct 11, 2011 at 11:37 AM, Brandon Invergo wrote: >> I would consider that to be a meta file containing a PHYLIP >> alignment and a tree, but in itself it isn't a PHYLIP alignment. >> >> That looks like exactly the kind of issue NEXUS was designed >> to solve: how to embed alignments, trees and other stuff into >> a single plain text file for input into a phylogenetic tool. >> >> Doesn't PHYLIP have an XML format these days? Trying >> to parse something like that text (without a formal standard) >> seems like a painful exercise and long term maintenance >> headache. > > I'm not suggesting that Biopython parse and store the information > because I agree that it would be an unmaintainable nightmare. To bring > myself out of the clouds a bit and back to the basics of my original > intent: if I work on better integrating the PAML module so that the user > can pass a MultipleSeqAlignment object, I will need a way to write that > alignment to a file with potentially more information than the default > PhylipWriter allows. So, just as simple as that, Bio.Phylo.PAML would > need its own alignment writer....something I'm not going to worry about > right now. 
> > With this mentality, then yes, anything containing such option flags and > info is no longer a PHYLIP alignment but is rather an input file to some > program. As such, the existing PhylipIO module should *not* be modified > to handle this metadata. Please ignore all my other half-baked ideas. What you could think about is having the Bio.Phylo.PAML create this file, and call the existing PhylipIO module with the handle to write the alignment part - and perhaps the Bio.Phylo module with the handle to write any tree. > So, current, phylip-related tasks: > - implement SequentialPhylipWriter and SequentialPhylipIterator classes > in PhylipIO > > That's it, I think. I'll revisit this alignment-writing stuff at some > other point. One task at a time... > > -brandon That sounds like a manageable step to start with :) Peter From chapmanb at 50mail.com Tue Oct 11 11:20:31 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 11 Oct 2011 07:20:31 -0400 Subject: [Biopython-dev] Moving strand & db ref from SeqFeature to FeatureLocation In-Reply-To: References: Message-ID: <8739ez4vwg.fsf@kunkel.i-did-not-set--mail-host-address--so-tickle-me> Peter; > https://github.com/peterjc/biopython/tree/f_loc2 > > It moves the strand, ref and db_ref properties from > the SeqFeature object to the FeatureLocation object, > implementing read/write proxy methods for backward > compatibility. Thanks for the integer work and for this. I'm agreed that this is a more logical way to store the strand (and cross-ref) information. 
+1 from me on checking it in, Brad From p.j.a.cock at googlemail.com Tue Oct 11 11:28:35 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Oct 2011 12:28:35 +0100 Subject: [Biopython-dev] Moving strand & db ref from SeqFeature to FeatureLocation In-Reply-To: <8739ez4vwg.fsf@kunkel.i-did-not-set--mail-host-address--so-tickle-me> References: <8739ez4vwg.fsf@kunkel.i-did-not-set--mail-host-address--so-tickle-me> Message-ID: On Tue, Oct 11, 2011 at 12:20 PM, Brad Chapman wrote: > > Peter; > >> https://github.com/peterjc/biopython/tree/f_loc2 >> >> It moves the strand, ref and db_ref properties from >> the SeqFeature object to the FeatureLocation object, >> implementing read/write proxy methods for backward >> compatibility. > > Thanks for the integer work and for this. I'm agreed that this is a more > logical way to store the strand (and cross-ref) information. +1 from me > on checking it in, > Brad OK, that's done. Cheers Brad. As I said before, if anyone doesn't like the new printing of the FeatureLocation with how I present the strand and database reference, we can change that. There are examples in the SeqFeature.py and SeqRecord.py docstrings. Regards, Peter From eric.talevich at gmail.com Tue Oct 11 12:55:57 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 11 Oct 2011 08:55:57 -0400 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain> <1318327319.3137.70.camel@localhost.localdomain> Message-ID: On Tue, Oct 11, 2011 at 6:13 AM, Peter Cock wrote: > > That looks like exactly the kind of issue NEXUS was designed > to solve: how to embed alignments, trees and other stuff into > a single plain text file for input into a phylogenetic tool. > > Doesn't PHYLIP have an XML format these days? 
Trying > to parse something like that text (without a formal standard) > seems like a painful exercise and long term maintenance > headache. > > The Phylip programs seqboot and retree have XML formats that look almost like SeqXML and phyloXML, but they're not quite compatible, e.g. attribute names are slightly different. This is probably because they were written before those standard formats existed -- pretty sure the retree XML format, sort of described in Inferring Phylogenies (2004) as an example of how a future XML tree format might look, was an inspiration for phyloXML. There hasn't been much development on these parts of the Phylip codebase lately, though. If someone wanted to write a patch to bring these formats into compliance with the closest standards, I bet Joe would accept the patch. Discussion: https://www.facebook.com/permalink.php?story_fbid=256082801069968&id=115402811804635 -E From p.j.a.cock at googlemail.com Tue Oct 11 13:04:20 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Oct 2011 14:04:20 +0100 Subject: [Biopython-dev] Parsing PAML supplementary output In-Reply-To: References: <1318243007.12974.16.camel@localhost.localdomain> <1318319486.3137.19.camel@localhost.localdomain> <1318325593.3137.51.camel@localhost.localdomain> <1318327319.3137.70.camel@localhost.localdomain> Message-ID: On Tue, Oct 11, 2011 at 1:55 PM, Eric Talevich wrote: > On Tue, Oct 11, 2011 at 6:13 AM, Peter Cock > wrote: >> >> That looks like exactly the kind of issue NEXUS was designed >> to solve: how to embed alignments, trees and other stuff into >> a single plain text file for input into a phylogenetic tool. >> >> Doesn't PHYLIP have an XML format these days? Trying >> to parse something like that text (without a formal standard) >> seems like a painful exercise and long term maintenance >> headache. >> > > The Phylip programs seqboot and retree have XML formats that look almost > like SeqXML and phyloXML, but they're not quite compatible, e.g. 
attribute > names are slightly different. > > This is probably because they were written before those standard formats > existed -- pretty sure the retree XML format, sort of described in Inferring > Phylogenies (2004) as an example of how a future XML tree format might look, > was an inspiration for phyloXML. There hasn't been much development on these > parts of the Phylip codebase lately, though. If someone wanted to write a > patch to bring these formats into compliance with the closest standards, I > bet Joe would accept the patch. > > Discussion: > https://www.facebook.com/permalink.php?story_fbid=256082801069968&id=115402811804635 > > -E Good plan - anyone here familiar with the PHYLIP code base? Peter From chapmanb at 50mail.com Thu Oct 13 14:05:57 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 13 Oct 2011 10:05:57 -0400 Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs Message-ID: <871uuhm1fe.fsf@fastmail.fm> Hi all; Biopython's setup.py currently has an interactive question/answer session to remind users to optionally install NumPy if it's not present. This is useful for by-hand installations, but problematic with automated installers. One useful feature of setuptools is the 'install_requires' attribute in setup.py. This allows your programs to define the requirements and have them automatically installed from PyPi. It's a great way to include useful libraries without having to fret excessively about users installing dependencies. Unfortunately if you use install_requires with Biopython, and NumPy is not installed, automated scripts will get stuck in the question/answer dialog. 
To resolve this issue, I wrote a small patch that adds NumPy to Biopython's install_requires and skips the Q/A only in cases where it is installed via pip or easy_install: https://github.com/chapmanb/biopython/commit/be53d850d721fc82af81bedcd9fb9034b0a2099b If someone is able to review this, it would be great to get it into Biopython for the next release. Brad From p.j.a.cock at googlemail.com Thu Oct 13 14:20:46 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Oct 2011 15:20:46 +0100 Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs In-Reply-To: <871uuhm1fe.fsf@fastmail.fm> References: <871uuhm1fe.fsf@fastmail.fm> Message-ID: On Thu, Oct 13, 2011 at 3:05 PM, Brad Chapman wrote: > > Hi all; > Biopython's setup.py currently has an interactive question/answer > session to remind users to optionally install NumPy if it's not > present. This is useful for by-hand installations, but problematic with > automated installers. > > One useful feature of setuptools is the 'install_requires' attribute in > setup.py. This allows your programs to define the requirements and have > them automatically installed from PyPi. It's a great way to include > useful libraries without having to fret excessively about users > installing dependencies. > > Unfortunately if you use install_requires with Biopython, and NumPy is > not installed, automated scripts will get stuck in the question/answer > dialog. To resolve this issue, I wrote a small patch that adds NumPy to > Biopython's install_requires and skips the Q/A only in cases where it is > installed via pip or easy_install: > > https://github.com/chapmanb/biopython/commit/be53d850d721fc82af81bedcd9fb9034b0a2099b > > If someone is able to review this, it would be great to get it into > Biopython for the next release. > > Brad I can appreciate the usefulness of this, but don't know enough about pip and easy_install to comment on the implementation. Anyone else? 
Peter From eric.talevich at gmail.com Thu Oct 13 18:00:22 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 13 Oct 2011 14:00:22 -0400 Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs In-Reply-To: <871uuhm1fe.fsf@fastmail.fm> References: <871uuhm1fe.fsf@fastmail.fm> Message-ID: On Thu, Oct 13, 2011 at 10:05 AM, Brad Chapman wrote: > > Hi all; > Biopython's setup.py currently has an interactive question/answer > session to remind users to optionally install NumPy if it's not > present. This is useful for by-hand installations, but problematic with > automated installers. > > One useful feature of setuptools is the 'install_requires' attribute in > setup.py. This allows your programs to define the requirements and have > them automatically installed from PyPi. It's a great way to include > useful libraries without having to fret excessively about users > installing dependencies. > > Unfortunately if you use install_requires with Biopython, and NumPy is > not installed, automated scripts will get stuck in the question/answer > dialog. To resolve this issue, I wrote a small patch that adds NumPy to > Biopython's install_requires and skips the Q/A only in cases where it is > installed via pip or easy_install: > > > https://github.com/chapmanb/biopython/commit/be53d850d721fc82af81bedcd9fb9034b0a2099b > > If someone is able to review this, it would be great to get it into > Biopython for the next release. > > Hi Brad, Looks cool to me, except the sys.argv parsing gets a little gritty (understandably): Line 115: if dist_dir.find("egg-dist-tmp") >= 0: Could this be `if 'egg-dist-tmp' in dist_dir`? Line 118: if sys.argv in [["-c", "develop", "--no-deps"], ["-c", "egg_info"]]: Does pip allow rearranging arguments? Would `--no-deps -c develop` also be valid? If so, should that be added as a third item in the list-of-args? 
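For illustration only, a minimal sketch of an order-insensitive version of that check. The helper name `looks_like_automated_install` is hypothetical, the argument lists are copied from the lines quoted above, and whether pip ever reorders its arguments is not established here:

```python
# Argument lists copied from the check quoted above; nothing else is assumed
# about how pip or easy_install invoke setup.py.
KNOWN_ARG_LISTS = [
    ["-c", "develop", "--no-deps"],
    ["-c", "egg_info"],
]


def looks_like_automated_install(argv, dist_dir=""):
    """Hypothetical helper: guess whether an installer tool is driving setup.py."""
    # Substring test, equivalent to dist_dir.find("egg-dist-tmp") >= 0.
    if "egg-dist-tmp" in dist_dir:
        return True
    # Compare sorted copies so the order of the arguments does not matter.
    return sorted(argv) in [sorted(args) for args in KNOWN_ARG_LISTS]
```

With this form no third list entry would be needed for `--no-deps -c develop`, at the cost of also accepting orderings that may never occur in practice.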
-Eric From chapmanb at 50mail.com Fri Oct 14 10:00:37 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 14 Oct 2011 06:00:37 -0400 Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs In-Reply-To: References: <871uuhm1fe.fsf@fastmail.fm> Message-ID: <87hb3b51ve.fsf@fastmail.fm> Eric and Peter; Thanks much for taking a look at this patch. > Looks cool to me, except the sys.argv parsing gets a little gritty > (understandably): Absolutely. Unfortunately the python installation space is pretty messy. Neither pip nor easy_install gives any formal declaration so you have to resort to these hacks to infer that they are doing the install. Luckily I don't think any of these options are something people would do directly from the command line. > Line 115: > > if dist_dir.find("egg-dist-tmp") >= 0: > > Could this be `if 'egg-dist-tmp' in dist_dir`? > Line 118: > > if sys.argv in [["-c", "develop", "--no-deps"], > ["-c", "egg_info"]]: > > Does pip allow rearranging arguments? Would `--no-deps -c develop` also be > valid? > If so, should that be added as a third item in the list-of-args? Awesome, thanks for the suggestions. I checked both of these in. Thanks again, Brad From p.j.a.cock at googlemail.com Fri Oct 14 10:53:42 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Oct 2011 11:53:42 +0100 Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs In-Reply-To: <87hb3b51ve.fsf@fastmail.fm> References: <871uuhm1fe.fsf@fastmail.fm> <87hb3b51ve.fsf@fastmail.fm> Message-ID: On Fri, Oct 14, 2011 at 11:00 AM, Brad Chapman wrote: > > Awesome, thanks for the suggestions. I checked both of these in. > I'll test the branch today, and merge it to the trunk if it looks good on Python 2 / 3 / Jython / PyPy.
Peter From p.j.a.cock at googlemail.com Fri Oct 14 10:55:52 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Oct 2011 11:55:52 +0100 Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs In-Reply-To: References: <871uuhm1fe.fsf@fastmail.fm> <87hb3b51ve.fsf@fastmail.fm> Message-ID: On Fri, Oct 14, 2011 at 11:53 AM, Peter Cock wrote: > On Fri, Oct 14, 2011 at 11:00 AM, Brad Chapman wrote: >> >> Awesome, thanks for the suggestions. I checked both of these in. >> > > I'll test the branch today, and merge it to the trunk if it looks good > on Python 2 / 3 / Jython / PyPy. > $ jython setup.py install /Users/pjcock/jython2.5.2/Lib/distutils/dist.py:263: UserWarning: Unknown distribution option: 'install_requires' warnings.warn(msg) running install running build running build_py ... That's with Jython 2.5.2 under Mac OS X Snow Leopard. Same with pypy 1.6, $ pypy setup.py install /Users/pjcock/Downloads/Software/pypy-1.6/lib-python/modified-2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'install_requires' warnings.warn(msg) running install running build running build_py ... Can we avoid that warning? Peter From chapmanb at 50mail.com Fri Oct 14 12:26:06 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 14 Oct 2011 08:26:06 -0400 Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs In-Reply-To: References: <871uuhm1fe.fsf@fastmail.fm> <87hb3b51ve.fsf@fastmail.fm> Message-ID: <87ehyf4v4x.fsf@fastmail.fm> Peter; Thanks for testing this and helping with the merge > $ jython setup.py install > /Users/pjcock/jython2.5.2/Lib/distutils/dist.py:263: UserWarning: > Unknown distribution option: 'install_requires' > warnings.warn(msg) [...] > Can we avoid that warning? This is a warning from distutils, so you would also see this on regular ol' Python without setuptools installed. 
Likewise it should go away on jython or pypy if they have setuptools or distribute installed. Unfortunately I don't have a way around it since this is an argument to setup. Most modern installations should have setuptools and can take advantage of install_requires. If it's a problem we could use 'warnings' to ignore it. Brad From cmccoy at fhcrc.org Fri Oct 14 17:11:15 2011 From: cmccoy at fhcrc.org (Connor McCoy) Date: Fri, 14 Oct 2011 10:11:15 -0700 Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs Message-ID: Hi Brad, Eric, and Peter, Sorry to jump in. Regarding the install_requires warnings: If you're interested, you can include the distribute_setup.py file from http://python-distribute.org/distribute_setup.py in BioPython, and add a short conditional import: try: from setuptools import setup, find_packages except ImportError: import distribute_setup distribute_setup.use_setuptools() from setuptools import setup, find_packages Which will download and install distribute if it isn't available in the python installation; the remainder of the setup can assume setuptools is available. Sphinx (https://bitbucket.org/birkenfeld/sphinx/src/f1f641602bb2/setup.py) and some other projects use this. Connor From carlcrott at gmail.com Mon Oct 17 01:24:27 2011 From: carlcrott at gmail.com (carl crott) Date: Sun, 16 Oct 2011 21:24:27 -0400 Subject: [Biopython-dev] fixes on the tutorials Message-ID: So the tutorials I'm running through have some bugs in them ... would anyone like me to fix these?
tutorial 2.4.1 should be something like:

from Bio import SeqIO
handle = open("ls_orchid.fasta", "rU")
for seq_record in SeqIO.parse(handle, "fasta"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)
handle.close()

and tutorial 2.4.2:

from Bio import SeqIO
handle = open("ls_orchid.gbk", "rU")
for seq_record in SeqIO.parse(handle, "genbank"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)
handle.close()

From chapmanb at 50mail.com Mon Oct 17 01:29:49 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 16 Oct 2011 21:29:49 -0400
Subject: [Biopython-dev] NumPy dialog when Biopython installed from automated programs
In-Reply-To:
References:
Message-ID: <8739eso16a.fsf@fastmail.fm>

Connor;
Thanks for the idea on the auto-install of setuptools/distribute. I'm
open to this or sticking with the warning, whichever everyone prefers.
Traditionally the setup has tried to be lightweight so you could install
Biopython without anything else, but having distribute installed is
pretty useful so it might be nice to encourage this.

Brad

> Sorry to jump in. Regarding the install_requires warnings:
>
> If you're interested, you can include the distribute_setup.py file
> from http://python-distribute.org/distribute_setup.py in BioPython,
> and add a short conditional import:
>
> try:
>     from setuptools import setup, find_packages
> except ImportError:
>     import distribute_setup
>     distribute_setup.use_setuptools()
>     from setuptools import setup, find_packages
>
> Which will download and install distribute if it isn't available in
> the python installation; the remainder of the setup can assume
> setuptools is available. Sphinx
> (https://bitbucket.org/birkenfeld/sphinx/src/f1f641602bb2/setup.py)
> and some other projects use this.
From p.j.a.cock at googlemail.com Mon Oct 17 07:55:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 08:55:54 +0100
Subject: [Biopython-dev] fixes on the tutorials
In-Reply-To:
References:
Message-ID:

On Mon, Oct 17, 2011 at 2:24 AM, carl crott wrote:
> So the tutorials I'm running through have some bugs in them ...
>
> would anyone like me to fix these?
>

Hi Carl,

What's the bug?

>
> tutorial 2.4.1 should be something like:
>
> from Bio import SeqIO
> handle = open("ls_orchid.fasta", "rU")
> for seq_record in SeqIO.parse(handle, "fasta"):
>     print seq_record.id
>     print repr(seq_record.seq)
>     print len(seq_record)
> handle.close()
>

Your example above looks fine (and the tutorial used to say that),
but the current version is shorter:

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)

We could alternatively (now that we've dropped Python 2.4) open
the handle with a with statement. The same applies to the GenBank
example.

Perhaps you are using an old version of Biopython (where
Bio.SeqIO.parse(...) does not accept a filename)?

Could you clarify please,

Thanks,

Peter

From p.j.a.cock at googlemail.com Mon Oct 17 10:10:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 11:10:54 +0100
Subject: [Biopython-dev] [biopython-dev] SeqFeature comparison for equality
Message-ID:

Hi Joshua and everyone,

It looks like Joshua's email (below) got lost in the spam filter
(possibly due to the attachment). The core of his patch was as
follows (there were also lots of white space changes).

@@ -694,6 +714,15 @@ class FeatureLocation(object):
         for i in range(self._start, self._end):
             yield i

+    def __eq__(self, other):
+        """Compares a FeatureLocation for equality"""
+        if not isinstance(other, FeatureLocation):
+            return False
+        if self.start() == other.start() and \
+           self.end() == other.end():
+            return True
+        return False
+

@@ -255,6 +255,26 @@ class SeqFeature(object):
             qualifiers = dict(self.qualifiers.iteritems()),
             sub_features = [f._flip(length) for f in self.sub_features[::-1]])

+    def __eq__(self, other):
+        """Compare between this SeqFeature and other.
+
+        ref, ref_db and qualifiers are not needed for comparison"""
+        if not isinstance(other, SeqFeature):
+            return False
+        if (self.id != ""
+            and other.id != "" and
+            self.id == other.id):
+            return True # Can we trust this?
+        for x in ('location', 'type', 'strand', 'location_operator'):
+            if (getattr(self, x) and getattr(other, x) and \
+                getattr(self, x) != getattr(other, x)):
+                return False
+        for f in self.sub_features:
+            if f not in other.sub_features:
+                return False
+        else:
+            return True
+
     def extract(self, parent_sequence):
         """Extract feature sequence from the supplied parent sequence.

Note the patch will not apply to the trunk, perhaps it is
against the current release?

First (logically), is defining __eq__ for the FeatureLocation,
and second is defining __eq__ for the SeqFeature.

This hides the fact that we need to compare position objects,
e.g. is BeforePosition(5) == ExactPosition(5)?, the answer is
yes, which I have now clarified in the docstrings:
https://github.com/biopython/biopython/commit/55feea75f7ab55eac4ef4e320567d746ce41120a

Other than the fact that I think the ref and ref_db should be
checked when comparing locations, adding location comparison
seems like a good idea. Note that with the recent changes on
the trunk, the strand, ref and ref_db now belong to the
FeatureLocation not the SeqFeature.

Extending this to cover the SeqFeature leaves the question of the
ID, type, etc., and is fiddly, particularly the question of
annotation. These are essentially the same reasons why we don't
support SeqRecord equality.

Joshua - would you like to update your patch against the code
in github, just for the FeatureLocation __eq__ method, to include
the strand, ref and ref_db properties?

Thanks,

Peter

---------- Forwarded message ----------
From: "Joshua Ismael Haase Hernández"
To: biopython-dev at biopython.org
Date: Mon, 17 Oct 2011 01:06:17 -0500
Subject: [patch] SeqFeature comparison for equality

Hi there.

I was working on a testcase for a custom program which should
extract the same features I had planned.

Since SeqFeature lacks a comparison method, there is no easy way to test

for feature in test_gene.features:
    self.assertIn(feature, myparser(file).features)

So I added comparison methods and they work fine.

Patch attached.

My changes are under Biopython license.

From p.j.a.cock at googlemail.com Mon Oct 17 15:03:42 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 16:03:42 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

Hi Michiel,

Regarding code using Bio.File, which you asked about
deprecating last month:

http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009144.html

I objected at the time because I was using it for the
TogoWS code I was working on,

On Thu, Sep 8, 2011 at 4:25 PM, Peter Cock wrote:
On Wed, Sep 7, 2011 at 3:36 PM, Peter Cock wrote:
>>> If the server could be relied on to always give an
>>> HTTP error code this wouldn't be needed:
>>>
>>> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py
>>>
>
> ...
>
> [Some of those TogoWS checks are probably superfluous
> right now, I'm still polishing the error handling - some of
> which will rely on TogoWS itself catching more conditions]

I've updated my TogoWS to rely on the HTTP error codes,
and removed the heuristic error detection which required
Bio.File for the UndoHandle. That seems to be working fine
now.

That leaves Bio/SCOP/__init__.py as the only existing or
imminent code using Bio.File, so if we can sort that out,
we can deprecate Bio.File as you suggested.

Regards,

Peter

From anaryin at gmail.com Mon Oct 17 15:13:37 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 17 Oct 2011 17:13:37 +0200
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

Hey Peter, all,

Sorry to peek in. I was going over some code lately together with Eric and
he suggested I use Bio.File as it was done in plenty of Bio.*IO modules.

What is this deprecation about then?

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

2011/10/17 Peter Cock

> Hi Michiel,
>
> Regarding code using Bio.File, which you asked about
> deprecating last month:
>
> http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009144.html
>
> I objected at the time because I was using it for the
> TogoWS code I was working on,
>
> On Thu, Sep 8, 2011 at 4:25 PM, Peter Cock
> wrote:
> On Wed, Sep 7, 2011 at 3:36 PM, Peter Cock
> wrote:
> >>> If the server could be relied on to always give an
> >>> HTTP error code this wouldn't be needed:
> >>>
> >>>
> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py
> >>>
> >
> > ...
> >
> > [Some of those TogoWS checks are probably superfluous
> > right now, I'm still polishing the error handling - some of
> > which will rely on TogoWS itself catching more conditions]
>
> I've updated my TogoWS to rely on the HTTP error codes,
> and removed the heuristic error detection which required
> Bio.File for the UndoHandle. That seems to be working fine
> now.
>
> That leaves Bio/SCOP/__init__.py as the only existing or
> imminent code using Bio.File, so if we can sort that out,
> we can deprecate Bio.File as you suggested.
>
> Regards,
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From p.j.a.cock at googlemail.com Mon Oct 17 15:44:35 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 16:44:35 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

On Mon, Oct 17, 2011 at 4:13 PM, João Rodrigues wrote:
> Hey Peter, all,
> Sorry to peek in. I was going over some code lately together with Eric and
> he suggested I use Bio.File as it was done in plenty of Bio.*IO modules.
> What is this deprecation about then?
> Cheers,

Hi João,

Perhaps you misunderstood Eric, Bio.File is not used widely at all.
See Michiel's email at the start of this thread:

http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009144.html

Peter

From anaryin at gmail.com Mon Oct 17 16:10:40 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 17 Oct 2011 18:10:40 +0200
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

Hi Peter,

To be honest, I didn't see much of a point to use the module but for
consistency's sake.

I grep'ed Bio.File in my biopython dir and I got a few more modules
with Bio.File, don't know if you were aware.

Bio/Application/__init__.py:from Bio import File
Bio/Blast/NCBIStandalone.py:from Bio import File
Bio/PDB/parse_pdb_header.py:from Bio import File
Bio/Phylo/_io.py:from Bio import File
Bio/SCOP/__init__.py:    from Bio import File

Just wanting to clear my doubts about this, thanks!

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

2011/10/17 Peter Cock

> On Mon, Oct 17, 2011 at 4:13 PM, João Rodrigues wrote:
> > Hey Peter, all,
> > Sorry to peek in. I was going over some code lately together with Eric and
> > he suggested I use Bio.File as it was done in plenty of Bio.*IO modules.
> > What is this deprecation about then?
> > Cheers,
>
> Hi João,
>
> Perhaps you misunderstood Eric, Bio.File is not used widely at all.
> See Michiel's email at the start of this thread:
>
> http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009144.html
>
> Peter
>

From p.j.a.cock at googlemail.com Mon Oct 17 16:26:14 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 17:26:14 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

On Mon, Oct 17, 2011 at 5:10 PM, João Rodrigues wrote:
> Hi Peter,
> To be honest, I didn't see much of a point to use the module but for
> consistency's sake.

Michiel's point was [at that time] there was very little useful
code if any in Bio.File, so could we deprecate it?

> I grep'ed Bio.File in my biopython dir and I got a few more modules
> with Bio.File, don't know if you were aware.
>
> Bio/Application/__init__.py:from Bio import File
> Bio/Blast/NCBIStandalone.py:from Bio import File
> Bio/PDB/parse_pdb_header.py:from Bio import File
> Bio/Phylo/_io.py:from Bio import File
> Bio/SCOP/__init__.py:    from Bio import File
>
> Just wanting to clear my doubts about this, thanks!
> Cheers,

Oh - I remember now.
We recently added the as_handle context manager to
Bio.File, and that is a useful bit of functionality of general
interest.

At the time I had forgotten about Michiel's suggestion
we deprecate Bio.File, which is unfortunate, but we
can still change this before our next release.

So, should we keep Bio.File for as_handle (even if
everything else in Bio.File is to be deprecated), or
should we move the new as_handle functionality
somewhere else and deprecate all of Bio.File.

Thanks for double checking João,

Peter

From anaryin at gmail.com Mon Oct 17 17:21:28 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 17 Oct 2011 19:21:28 +0200
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

>
> At the time I had forgotten about Michiel's suggestion
> we deprecate Bio.File, which is unfortunate, but we
> can still change this before our next release.
>
> So, should we keep Bio.File for as_handle (even if
> everything else in Bio.File is to be deprecated), or
> should we move the new as_handle functionality
> somewhere else and deprecate all of Bio.File.
>

I think it doesn't make sense to keep the module for 5 lines of code.

    if isinstance(handleish, basestring):
        with open(handleish, mode) as fp:
            yield fp
    else:
        yield handleish

I'd either place them in __init__.py or just insert them in all Bio.*IO
modules wherever needed. If we had more snippets in common with all *IOs, it
would be valuable and understandable to have a separate module, but as is
it's a bit unnecessary IMHO.

>
> Thanks for double checking João,
>

No problem.

Cheers,

João

From hahj87 at gmail.com Mon Oct 17 17:57:53 2011
From: hahj87 at gmail.com (=?ISO-8859-1?Q?Joshua_Ismael_Haase_Hern=E1ndez?=)
Date: Mon, 17 Oct 2011 12:57:53 -0500
Subject: [Biopython-dev] [biopython-dev] SeqFeature comparison for equality
In-Reply-To:
References:
Message-ID:

On 17 October 2011 12:15, Peter Cock wrote:

> Hi Joshua,
>
> Could you CC the biopython-dev mailing list, unless you
> specifically want to discuss something in private?
>

Sorry about that, I thought I was answering to the mailing list.

>
> 2011/10/17 Joshua Ismael Haase Hernández :
> > I'm on it.
> >
> > Will add __eq__ to FeatureLocation on trunk.
>
> Great.
>
> In the short term, you can just work on it directly with a copy of the
> official repository and send me a patch (use git patch > file.patch)
>
> The "best" way is to fork biopython on github, and create your
> own branch with these changes.
>
> > I think BeforeLocation should check if the second is before,
> > After check if it is after, etc, and this can be done in locations.
> >
> > Before I implement those: do you agree?
> >
> > In that case, AbstractLocation instances
> > should check if ExactLocation instances are
> > inside their range, and AbstractLocation
> > instances to be exactly the same.
>
>

These positions would be the same:

OneOfPosition(5, 11, 15),
ExactPosition(11),
AfterPosition(4),
BeforePosition(16),
WithinPosition(5, 16),

> No. Having tried this myself, it is very complicated.
>

I think I'm missing something, why is it hard?
I see it as a listing of cases.

> Also, there are constraints with the Python language
> about equality, hashing and comparisons (e.g. for
> membership in lists, or use as dictionary keys).
>

I don't think anyone should use Features as dictionary keys,
they will use the Feature Id for that, but maybe someone wants a
set of features (which just now is like a list of all sequences)...

In which cases would that be a problem?
(I'm a biotechnology engineer, so I don't see all the caveats, and I don't
really have a deep understanding of how Python works)

> The current behaviour of simple comparison of
> the positions as an integer is at least simple.
>
> > About SeqFeature, I think they should be
> > the same if they share all locations.
>
> You don't care about feature type and ID? ;)
>

Maybe not, a comparison could skip iterating
the locations if we have the same type and id,
still not sure that's a good method (thus the comment
"# Can we trust this?" on my patch) but a feature
'CDS' is sometimes equivalent to feature 'mRNA',
in that case ID and type would both be different
in seqfeatures.

>
> Peter
>

From p.j.a.cock at googlemail.com Mon Oct 17 18:07:27 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 19:07:27 +0100
Subject: [Biopython-dev] [biopython-dev] SeqFeature comparison for equality
In-Reply-To:
References:
Message-ID:

2011/10/17 Joshua Ismael Haase Hernández :
> ...
>
> These positions would be the same:
>
> OneOfPosition(5, 11, 15),
> ExactPosition(11),
> AfterPosition(4),
> BeforePosition(16),
> WithinPosition(5, 16),

I don't understand what you are asking here. Those
positions do not look the same to me.

>>
>> No. Having tried this myself, it is very complicated.
>
> I think I'm missing something, why is it hard?
> I see it as a listing of cases.

Well, try it and write lots of unit tests, and I'll review it.

>>
>> Also, there are constraints with the Python language
>> about equality, hashing and comparisons (e.g. for
>> membership in lists, or use as dictionary keys).
>
> I don't think anyone should use Features as dictionary keys,
> they will use the Feature Id for that, but maybe someone wants a
> set of features (which just now is like a list of all sequences)...
>
> In which cases would that be a problem?
> (I'm a biotechnology engineer, so I don't see all the caveats, and I don't
> really have a deep understanding of how Python works)

Using positions as dictionary keys seems reasonable.

Using a SeqFeature as a key is not possible as they
are mutable objects.

>> The current behaviour of simple comparison of
>> the positions as an integer is at least simple.
>>
>> > About SeqFeature, I think they should be
>> > the same if they share all locations.
>>
>> You don't care about feature type and ID? ;)
>
> Maybe not, a comparison could skip iterating
> the locations if we have the same type and id,
> still not sure that's a good method (thus the comment
> "# Can we trust this?" on my patch) but a feature
> 'CDS' is sometimes equivalent to feature 'mRNA',
> in that case ID and type would both be different
> in seqfeatures.

A gene, mRNA and CDS might all have the same
position, but they are different features.

Peter

From hahj87 at gmail.com Mon Oct 17 18:27:19 2011
From: hahj87 at gmail.com (=?ISO-8859-1?Q?Joshua_Ismael_Haase_Hern=E1ndez?=)
Date: Mon, 17 Oct 2011 13:27:19 -0500
Subject: [Biopython-dev] [biopython-dev] SeqFeature comparison for equality
In-Reply-To:
References:
Message-ID:

On 17 October 2011 13:07, Peter Cock wrote:

> 2011/10/17 Joshua Ismael Haase Hernández :
> > ...
> >
> > These positions would be the same:
> >
> > OneOfPosition(5, 11, 15),
> > ExactPosition(11),
> > AfterPosition(4),
> > BeforePosition(16),
> > WithinPosition(5, 16),
>
> I don't understand what you are asking here. Those
> positions do not look the same to me.
>
>

They are not *exactly* the same, but besides
AfterPosition and BeforePosition,

ExactPosition(11) is included in OneOfPosition(5, 11, 15),
ExactPosition(11) is after AfterPosition(4)
ExactPosition(11) is before BeforePosition(16)
ExactPosition(11) is included in WithinPosition(5, 16)

All positions in OneOfPosition are before BeforePosition,
after AfterPosition, within WithinPosition, and include
ExactPosition.

All positions in WithinPosition are after AfterPosition,
before BeforePosition.

BeforePosition and AfterPosition can't be equal.

How should I name the TestCases?

From p.j.a.cock at googlemail.com Mon Oct 17 19:03:15 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 20:03:15 +0100
Subject: [Biopython-dev] [biopython-dev] SeqFeature comparison for equality
In-Reply-To:
References:
Message-ID:

2011/10/17 Joshua Ismael Haase Hernández :
>
>
> On 17 October 2011 13:07, Peter Cock wrote:
>>
>> 2011/10/17 Joshua Ismael Haase Hernández :
>> > ...
>> >
>> > These positions would be the same:
>> >
>> > OneOfPosition(5, 11, 15),
>> > ExactPosition(11),
>> > AfterPosition(4),
>> > BeforePosition(16),
>> > WithinPosition(5, 16),
>>
>> I don't understand what you are asking here. Those
>> positions do not look the same to me.
>>
>
> They are not *exactly* the same, but besides
> AfterPosition and BeforePosition,
> ExactPosition(11) is included in OneOfPosition(5, 11, 15),
> ExactPosition(11) is after AfterPosition(4)
> ExactPosition(11) is before BeforePosition(16)
> ExactPosition(11) is included in WithinPosition(5, 16)
> All positions in OneOfPosition are before BeforePosition,
> after AfterPosition, within WithinPosition, and include
> ExactPosition.
> All positions in WithinPosition are after AfterPosition,
> before BeforePosition.
> BeforePosition and AfterPosition can't be equal.
>

It might help if you wrote these out explicitly,
e.g. currently:

>>> from Bio.SeqFeature import *
>>> a = BeforePosition(10)
>>> b = AfterPosition(10)
>>> a == b == 10
True

Currently BeforePosition and AfterPosition act like
the integer position for comparison etc. I find this
reasonable given we have to treat them as the
integer for things like extracting the sequence.

> How should I name the TestCases?
>

Something like test_SeqFeature.py and using unittest.
Most existing tests in this area are in doctests and
test_SeqIO_feature.py

Peter

From andrea at biocomp.unibo.it Tue Oct 18 12:59:05 2011
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Tue, 18 Oct 2011 14:59:05 +0200 (CEST)
Subject: [Biopython-dev] SeqFeature comparison for equality
In-Reply-To:
References:
Message-ID:

Hi,
I don't know if this can help,
but I've been subclassing seqfeature and seqrecord objects to assert
equalities.
I've attached the very simple code for the seqfeature equality
Handling complex location equalities with a given set of rules could be
misleading.
a feature starting in position 11 is different, for me, from one located
at position 12.

Andrea

-------------- next part --------------
A non-text attachment was scrubbed...
Name: seqfeature_eq.py
Type: text/x-python-script
Size: 1505 bytes
Desc: not available
URL:

From p.j.a.cock at googlemail.com Tue Oct 18 13:20:34 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 18 Oct 2011 14:20:34 +0100
Subject: [Biopython-dev] SeqFeature comparison for equality
In-Reply-To:
References:
Message-ID:

On Tue, Oct 18, 2011 at 1:59 PM, Andrea Pierleoni wrote:
> Hi,
> I don't know if this can help,
> but I've been subclassing seqfeature and seqrecord objects to assert
> equalities.
> I've attached the very simple code for the seqfeature equality
> Handling complex location equalities with a given set of rules could be
> misleading.
> a feature starting in position 11 is different, for me, from one located
> at position 12.
>
> Andrea

That looks reasonable for basic SeqFeature comparison,
although comparing the annotations in the qualifiers dict
is debatable (as with SeqRecord object's annotation).

Given the way join locations (etc) are currently handled,
it would be important to also compare the sub-features.

I think it would be more practical to first (and perhaps
only) implement equality testing for FeatureLocation
(checking start, end, strand, ref and ref_db), then you
can compare the location of a SeqFeature easily with:
f1.location == f2.location.
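
[Editorial sketch, not part of the original thread.] To make the proposal concrete, a location-only equality test could be sketched as below. `SimpleLocation` is a hypothetical stand-in written for this illustration, not Biopython's FeatureLocation; it assumes fuzzy positions have already been reduced to plain integers, as the thread describes, and the `ref`/`ref_db` values used are made up.

```python
class SimpleLocation:
    """Hypothetical stand-in for a feature location (not Biopython's class)."""

    def __init__(self, start, end, strand=None, ref=None, ref_db=None):
        self.start = int(start)  # positions compared as plain integers
        self.end = int(end)
        self.strand = strand
        self.ref = ref
        self.ref_db = ref_db

    def _key(self):
        # Every field that takes part in the comparison, in one tuple
        return (self.start, self.end, self.strand, self.ref, self.ref_db)

    def __eq__(self, other):
        if not isinstance(other, SimpleLocation):
            return NotImplemented  # let Python try the other operand
        return self._key() == other._key()

    def __hash__(self):
        # Kept consistent with __eq__; assumes the fields are not changed
        # while the object sits in a set or is used as a dictionary key
        return hash(self._key())


a = SimpleLocation(5, 16, strand=+1)
b = SimpleLocation(5, 16, strand=+1)
c = SimpleLocation(5, 16, strand=-1)
d = SimpleLocation(5, 16, strand=+1, ref="XYZ", ref_db="example")
```

Comparing one tuple of fields keeps `__eq__` and `__hash__` in step, which matters for the list-membership and dictionary-key constraints mentioned earlier in the thread.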

Peter

From carlcrott at gmail.com Tue Oct 18 16:18:39 2011
From: carlcrott at gmail.com (carl crott)
Date: Tue, 18 Oct 2011 12:18:39 -0400
Subject: [Biopython-dev] fixes on the tutorials
In-Reply-To:
References:
Message-ID:

Peter and other devs,

I'm deeply interested in any kind of HMM applications ... As I'm not quite a
biologist if you guys wanted to 'sic me' on any particular bug related to
these let me know .. however as far as the GIT stuff .. that would be more
of the control for updates and merging all the code that you guys work on
separately.

toodles!
-Carl

On Tue, Oct 18, 2011 at 5:36 AM, Peter Cock wrote:

> On Mon, Oct 17, 2011 at 2:34 PM, Peter Cock
> wrote:
> > ...
> >
> > P.S. Don't forget to CC the mailing list ;)
>
> Apologies for posting that to the wrong development mailing list
> (samtools rather than biopython), I need to be more careful with
> autocomplete.
>
> Peter
>

--
Carl Crott
Web Applications Engineer
www.black-glass.com
412-610-0600

From mjldehoon at yahoo.com Wed Oct 19 02:39:53 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 18 Oct 2011 19:39:53 -0700 (PDT)
Subject: [Biopython-dev] Bio.File
In-Reply-To:
Message-ID: <1318991993.91732.YahooMailClassic@web161201.mail.bf1.yahoo.com>

Hi Peter,

> That leaves Bio/SCOP/__init__.py as the only existing or
> imminent code using Bio.File, so if we can sort that out,
> we can deprecate Bio.File as you suggested.

In Bio/SCOP/__init__.py, Bio.File.UndoHandle is used in the _open function,
which is an internal function used in the "search" function in Bio.SCOP. The
UndoHandle is used to wrap a handle returned by urllib.urlopen.

This search function returns a handle to data in HTML format. I don't think
we have a parser for it. This suggests that there is no specific purpose for
UndoHandle in Bio.SCOP._open. So I would suggest to just remove the
UndoHandle from Bio.SCOP._open and return the urllib.urlopen handle directly.

Any objections?

--Michiel.
--- On Mon, 10/17/11, Peter Cock wrote: > From: Peter Cock > Subject: Re: [Biopython-dev] Bio.File > To: "Michiel de Hoon" > Cc: biopython-dev at biopython.org > Date: Monday, October 17, 2011, 11:03 AM > Hi Michiel, > > Regarding code using Bio.File, which you asked about > deprecating last month: > http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009144.html > > I objected at the time because I was using it for the > TogoWS code I was working on, > > On Thu, Sep 8, 2011 at 4:25 PM, Peter Cock > wrote: > On Wed, Sep 7, 2011 at 3:36 PM, Peter Cock > wrote: > >>> If the server could be relied on to always > give an > >>> HTTP error code this wouldn't be needed: > >>> > >>> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py > >>> > > > > ... > > > > [Some of those TogoWS checks are probably superfluous > > right now, I'm still polishing the error handling - > some of > > which will rely on TogoWS itself catching more > conditions] > > I've updated my TogoWS to rely on the HTTP error codes, > and removed the heuristic error detection which required > Bio.File for the UndoHandle. That seems to be working fine > now. > > That leaves Bio/SCOP/__init__.py as the only existing or > imminent code using Bio.File, so if we can sort that out, > we can deprecate Bio.File as you suggested. > > Regards, > > Peter > From mjldehoon at yahoo.com Wed Oct 19 02:46:33 2011 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 18 Oct 2011 19:46:33 -0700 (PDT) Subject: [Biopython-dev] Bio.File In-Reply-To: Message-ID: <1318992393.44240.YahooMailClassic@web161204.mail.bf1.yahoo.com> I agree that it doesn't make sense to have a separate module for this. Even if we put it in Bio/__init__.py, people are likely to forget about it, and we will end up with some modules that use this code in Bio/__init__.py and other modules that copy this code in their source code. As this code is very short, I would just copy it into the modules that use it. 
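For reference, the short helper under discussion in this thread could be wrapped up as a context manager roughly as follows. This is only a sketch: the name as_handle and the helper count_lines are assumptions made for illustration, and it is written for Python 3 (hence str rather than the basestring used in the quoted snippet).

```python
import contextlib
import io
import os
import tempfile


@contextlib.contextmanager
def as_handle(handleish, mode="r"):
    """Yield an open handle for a filename, or pass an existing handle through."""
    if isinstance(handleish, str):
        # We opened the file, so we are responsible for closing it.
        with open(handleish, mode) as fp:
            yield fp
    else:
        # The caller owns the handle; leave it open afterwards.
        yield handleish


def count_lines(handleish):
    """Toy parser entry point accepting either a filename or a handle."""
    with as_handle(handleish) as handle:
        return sum(1 for _ in handle)


# Usage: the same function works for a path and for an in-memory handle.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(">seq1\nACGT\n")
path = tmp.name
from_name = count_lines(path)
from_handle = count_lines(io.StringIO(">seq1\nACGT\n"))
os.remove(path)
```

Whether this lives in one shared module or is copied into each parser, the behaviour to keep uniform is the ownership rule: files opened from a filename are closed on exit, while handles passed in by the caller are left open.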
Best,
--Michiel.

--- On Mon, 10/17/11, João Rodrigues wrote:

I think it doesn't make sense to keep the module for 5 lines of code.

    if isinstance(handleish, basestring):
        with open(handleish, mode) as fp:
            yield fp
    else:
        yield handleish

I'd either place them in __init__.py or just insert them in all Bio.*IO modules wherever needed. If we had more snippets in common with all *IOs, it would be valuable and understandable to have a separate module, but as is it's a bit unnecessary IMHO.

From p.j.a.cock at googlemail.com Wed Oct 19 08:49:27 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 19 Oct 2011 09:49:27 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To: <1318991993.91732.YahooMailClassic@web161201.mail.bf1.yahoo.com>
References: <1318991993.91732.YahooMailClassic@web161201.mail.bf1.yahoo.com>
Message-ID:

On Wed, Oct 19, 2011 at 3:39 AM, Michiel de Hoon wrote:
> Hi Peter,
>
>> That leaves Bio/SCOP/__init__.py as the only existing or
>> imminent code using Bio.File, so if we can sort that out,
>> we can deprecate Bio.File as you suggested.
>
> In Bio/SCOP/__init__.py, Bio.File.UndoHandle is used in the _open
> function, which is an internal function used in the "search" function
> in Bio.SCOP. The UndoHandle is used to wrap a handle returned
> by urllib.urlopen.

Should we change that to use urllib2 for better error handling, as in Bio.Entrez's _open?

> This search function returns a handle to data in HTML format.
> I don't think we have a parser for it. This suggests that there is
> no specific purpose for UndoHandle in Bio.SCOP._open.

I wonder if that is a sign of URL rot, it would make more sense to get plain text back. Sadly there were no unit tests for this at all until now, and I don't yet do anything with the handle other than confirm we get one!
https://github.com/biopython/biopython/commit/10b94a7b5611edde5fe05f95406d927e5a6a02d9 > So I would suggest to just remove the UndoHandle from > Bio.SCOP._open and return the urllib.urlopen handle directly. > > Any objections? Sounds fine. Peter From p.j.a.cock at googlemail.com Wed Oct 19 08:53:25 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Oct 2011 09:53:25 +0100 Subject: [Biopython-dev] Bio.File In-Reply-To: <1318992393.44240.YahooMailClassic@web161204.mail.bf1.yahoo.com> References: <1318992393.44240.YahooMailClassic@web161204.mail.bf1.yahoo.com> Message-ID: 2011/10/19 Michiel de Hoon > > I agree that it doesn't make sense to have a separate module for this. For just the one little function, maybe not. I suspect we may want more "File related" things like this for Python 3, what with text vs binary handles and so on, in which case keeping Bio/File.py is sensible. > Even if we put it in Bio/__init__.py, people are likely to forget about > it, and we will end up with some modules that use this code in > Bio/__init__.py and other modules that copy this code in their > source code. As this code is very short, I would just copy it into > the modules that use it. It may be short, but duplicating this function all over the place seems like a very bad idea. I think we should just be vigilant in making sure it is used uniformly wherever we want to accept either a handle or a filename. Perhaps some of the historically handle-only parsers should start using it now? Peter From anaryin at gmail.com Wed Oct 19 11:46:26 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Oct 2011 13:46:26 +0200 Subject: [Biopython-dev] Bio.File In-Reply-To: References: <1318992393.44240.YahooMailClassic@web161204.mail.bf1.yahoo.com> Message-ID: Hey Peter, > For just the one little function, maybe not. 
> I suspect we may want more "File related" things like this for Python 3,
> what with text vs binary handles and so on, in which case keeping
> Bio/File.py is sensible.
>

What kind of "things" are we talking about here? Could they be anticipated?

>
> It may be short, but duplicating this function all over the place
> seems like a very bad idea. I think we should just be vigilant in
> making sure it is used uniformly wherever we want to accept
> either a handle or a filename. Perhaps some of the historically
> handle-only parsers should start using it now?
>

Duplicating is not a beautiful solution, I must agree, but keeping a module and adding an import statement in every parser for only 5 lines isn't either. I suggest we keep Bio.File, deprecating all the other functions, and meanwhile look at which changes we could include due to Py3.

From p.j.a.cock at googlemail.com Wed Oct 19 12:28:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 19 Oct 2011 13:28:03 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To:
References: <1318992393.44240.YahooMailClassic@web161204.mail.bf1.yahoo.com>
Message-ID:

On Wed, Oct 19, 2011 at 12:46 PM, João Rodrigues wrote:
> Hey Peter,
>
>> For just the one little function, maybe not. I suspect we may want
>> more "File related" things like this for Python 3, what with text vs
>> binary handles and so on, in which case keeping Bio/File.py is
>> sensible.
>
> What kind of "things" are we talking about here? Could they be
> anticipated?

For instance, in Python 3 it might be useful for parsing text files efficiently to use binary mode (i.e. byte strings not unicode) but also have universal newlines (which I think happens for you automatically in Python 3 for text mode, i.e. unicode). Surprisingly open(filename, "rbU") is accepted in Python 3, but it acts like "rb", typical binary read mode.

>> It may be short, but duplicating this function all over the place
>> seems like a very bad idea.
I think we should just be vigilant in >> making sure it is used uniformly wherever we want to accept >> either a handle or a filename. Perhaps some of the historically >> handle-only parsers should start using it now? > > Duplicating is not a beautiful solution I must agree, but keeping > a module and adding an import statement in every parser for > only 5 lines isn't neither. > I suggest we keep Bio.File, deprecating all the other functions, and > meanwhile look at which changes we could include due to Py3. Yes, that's what I am suggesting. Peter From mjldehoon at yahoo.com Sat Oct 22 12:17:58 2011 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 22 Oct 2011 05:17:58 -0700 (PDT) Subject: [Biopython-dev] Bio.File In-Reply-To: Message-ID: <1319285878.88223.YahooMailClassic@web161206.mail.bf1.yahoo.com> OK, done. Best, --Michiel --- On Wed, 10/19/11, Peter Cock wrote: > From: Peter Cock > Subject: Re: [Biopython-dev] Bio.File > To: "Michiel de Hoon" > Cc: biopython-dev at biopython.org > Date: Wednesday, October 19, 2011, 4:49 AM > On Wed, Oct 19, 2011 at 3:39 AM, > Michiel de Hoon > wrote: > > Hi Peter, > > > >> That leaves Bio/SCOP/__init__.py as the only > existing or > >> imminent code using Bio.File, so if we can sort > that out, > >> we can deprecate Bio.File as you suggested. > > > > In Bio/SCOP/__init__.py, Bio.File.UndoHandle is used > in the _open > > function, which is an internal function used in the > "search" function > > in Bio.SCOP. The UndoHandle is used to wrap a handle > returned > > by urllib.urlopen. > > Should we change that to use urllib2 for better error > handling, > as in Bio.Entrez's _open? > > > This search function returns a handle to data in HTML > format. > > I don't think we have a parser for it. This suggests > that there is > > no specific purpose for UndoHandle in Bio.SCOP._open. > > I wonder if that is a sign of URL rot, it would make more > sense > to get plain text back. 
> Sadly there were no unit tests for this at all until now, and I don't
> yet do anything with the handle other than confirm we get one!
>
> https://github.com/biopython/biopython/commit/10b94a7b5611edde5fe05f95406d927e5a6a02d9
>
> > So I would suggest to just remove the UndoHandle from
> > Bio.SCOP._open and return the urllib.urlopen handle directly.
> >
> > Any objections?
>
> Sounds fine.
>
> Peter
>

From p.j.a.cock at googlemail.com Wed Oct 26 11:11:57 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 26 Oct 2011 12:11:57 +0100
Subject: [Biopython-dev] [Biopython] Pairwise alignment - is it a generic function?
In-Reply-To:
References:
Message-ID:

On Wed, Oct 26, 2011 at 12:02 PM, João Rodrigues wrote:
> Hey Peter,
> Thanks for the answer. How do I pass the matrix and which format should it
> be on? Is there an example I could read?
> João [...] Rodrigues
> http://nmr.chem.uu.nl/~joao

Not that I know of, but adding one to the docstrings and test_pairwise2.py would be great. I think you use it with a score matrix as a dictionary from Bio.SubsMat.MatrixInfo

Peter

From eric.talevich at gmail.com Wed Oct 26 13:27:17 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 26 Oct 2011 09:27:17 -0400
Subject: [Biopython-dev] [Biopython] Pairwise alignment - is it a generic function?
In-Reply-To:
References:
Message-ID:

On Wed, Oct 26, 2011 at 7:11 AM, Peter Cock wrote:
> On Wed, Oct 26, 2011 at 12:02 PM, João Rodrigues wrote:
> > Hey Peter,
> > Thanks for the answer. How do I pass the matrix and which format should it
> > be on? Is there an example I could read?
> > João [...] Rodrigues
> > http://nmr.chem.uu.nl/~joao
>
> Not that I know of, but adding one to the docstrings and test_pairwise2.py
> would be great.
> I think you use it with a score matrix as a dictionary from
> Bio.SubsMat.MatrixInfo
>
> Peter
>

Here's an example:

    from Bio import pairwise2, SeqIO
    from Bio.SubsMat.MatrixInfo import blosum62

    # pairwise2 works with raw strings, not SeqRecords,
    # so take the .seq of each record before converting to str
    seq1 = str(SeqIO.read("seq1.fa", "fasta").seq)
    seq2 = str(SeqIO.read("seq2.fa", "fasta").seq)
    results = pairwise2.align.globalds(seq1, seq2, blosum62, -10, -0.5)
    # Returns a list of alignments, each a tuple:
    # (seqA, seqB, score, begin, end)
    best_score = results[0][2]

From anaryin at gmail.com Wed Oct 26 13:31:29 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 26 Oct 2011 15:31:29 +0200
Subject: [Biopython-dev] [Biopython] Pairwise alignment - is it a generic function?
In-Reply-To:
References:
Message-ID:

Hello all,

Coming back after lunch... I managed to load a matrix using this:

    from Bio import pairwise2
    from Bio.SubsMat import MatrixInfo as m
    #print dir(m)
    matrix = m.blosum60
    pairwise2.align.localdx(seqA, seqB, matrix)

Thanks a lot for the help, it was simple after all, just a bit hard to start with..

From redmine at redmine.open-bio.org Thu Oct 27 04:55:53 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 27 Oct 2011 04:55:53 +0000
Subject: [Biopython-dev] [Biopython - Bug #3308] (New) SeqIO FastaIO: Blank Descriptor causes Indes Out of Range
Message-ID:

Issue #3308 has been reported by Darren Cullerne.
----------------------------------------
Bug #3308: SeqIO FastaIO: Blank Descriptor causes Indes Out of Range
https://redmine.open-bio.org/issues/3308

Author: Darren Cullerne
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL:

Entering a FASTA sequence with a blank descriptor:

    ">"
    "ACTAGTACTAGATCAGACTACAGTACAGAGAGGACATCTATACTACGAGAGACATACTACTCAGCATACGATAC"

Causes the following error:

    File "C:\Python27\lib\site-packages\Bio\SeqIO\__init__.py", line 532, in parse
      for r in i:
    File "C:\Python27\lib\site-packages\Bio\SeqIO\FastaIO.py", line 49, in FastaIterator
      id = descr.split()[0]
    IndexError: list index out of range

Please let me know if there is any further information you require.

Thanks,

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Thu Oct 27 14:03:42 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 27 Oct 2011 14:03:42 +0000
Subject: [Biopython-dev] [Biopython - Bug #3309] (New) GenBank Scanner expects sequence lines to start at position 9
Message-ID:

Issue #3309 has been reported by Liam Childs.

----------------------------------------
Bug #3309: GenBank Scanner expects sequence lines to start at position 9
https://redmine.open-bio.org/issues/3309

Author: Liam Childs
Status: New
Priority: Normal
Assignee:
Category:
Target version: 1.57
URL:

Some programs (e.g. Vector NTI and Lasergene) produce GenBank files where the sequences start at an index on the line other than index 9. I don't know how tightly defined the GenBank file format is, but if the indent for the start of the sequence can be variable, it seems to me there is a simple fix.
Current version (Bio/GenBank/Scanner.py:904):

    line = self.line
    ... 15 lines
    if len(line) > 9 and line[9:10]!=' ':
        raise ValueError("Sequence line mal-formed, '%s'"% line)
    seq_lines.append(line[idx + 1:]) #remove spaces later

Simple fix 1 (variable per file):

    line = self.line
    idx = line.find('1') + 1
    ... 15 lines
    if len(line) > idx and line[idx:idx + 1]!=' ':
        raise ValueError("Sequence line mal-formed, '%s'"% line)
    seq_lines.append(line[idx + 1:]) #remove spaces later

The index can be obtained in any number of ways, this was the simplest I could think of off the top of my head. If sequences are allowed to start at a position other than '1', then maybe a regular expression should be used instead.

From p.j.a.cock at googlemail.com Thu Oct 27 14:46:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 27 Oct 2011 15:46:08 +0100
Subject: [Biopython-dev] [Biopython] Pairwise alignment - is it a generic function?
In-Reply-To:
References:
Message-ID:

On Wed, Oct 26, 2011 at 2:31 PM, João Rodrigues wrote:
> Hello all,
> Coming back after lunch...
> I managed to load a matrix using this:
>
> from Bio import pairwise2
> from Bio.SubsMat import MatrixInfo as m
> #print dir(m)
> matrix = m.blosum60
> pairwise2.align.localdx(seqA, seqB, matrix)
>
> Thanks a lot for the help, it was simple after all, just a bit hard to start
> with..

Hi João,

Could you write a little documentation for the pairwise2 docstring? Just something short based on the above example would be great (ideally as a doctest).
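To illustrate what "a score matrix as a dictionary" means in this thread, here is a small self-contained sketch. It does not use pairwise2 or Bio.SubsMat at all: the matrix values are toy numbers, the function names (pair_score, global_score) are made up for this example, and the linear gap penalty is an assumption. As far as I understand, the MatrixInfo matrices are plain dictionaries keyed by residue pairs with only one triangle stored, which is why the lookup below tries both key orders.

```python
def pair_score(matrix, a, b):
    """Look up a residue pair, trying both key orders (half-matrix storage)."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def global_score(seq_a, seq_b, matrix, gap=-4):
    """Needleman-Wunsch score with a linear gap penalty (score only)."""
    # prev[j] holds the best score for seq_a[:i-1] against seq_b[:j].
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        curr = [i * gap]
        for j, b in enumerate(seq_b, start=1):
            curr.append(max(
                prev[j - 1] + pair_score(matrix, a, b),  # align a with b
                prev[j] + gap,                           # a against a gap
                curr[j - 1] + gap,                       # b against a gap
            ))
        prev = curr
    return prev[-1]


# Toy half-matrix: each unordered pair appears once.
toy_matrix = {
    ("A", "A"): 4, ("C", "C"): 9, ("G", "G"): 6,
    ("A", "C"): 0, ("A", "G"): 0, ("C", "G"): -3,
}

identical = global_score("ACG", "ACG", toy_matrix)
with_gap = global_score("ACG", "AG", toy_matrix)
```

A docstring example for pairwise2 itself would presumably pass one of the MatrixInfo dictionaries where toy_matrix is used here, as in the snippet quoted above.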
Thanks,

Peter

From anaryin at gmail.com Thu Oct 27 14:52:25 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 27 Oct 2011 16:52:25 +0200
Subject: [Biopython-dev] [Biopython] Pairwise alignment - is it a generic function?
In-Reply-To:
References:
Message-ID:

Sure thing. The docstring is actually pretty explicit, it's just missing the part that you can get the matrices from SubsMat. Or at least, not that clear. I'll go over it this weekend, maybe earlier.

Best,

João

From p.j.a.cock at googlemail.com Fri Oct 28 16:15:36 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 28 Oct 2011 17:15:36 +0100
Subject: [Biopython-dev] Fwd: [Utilities-announce] Upcoming Release of NCBI EFetch version 2.0
In-Reply-To:
References:
Message-ID:

Hi all,

We may need to update Bio.Entrez for EFetch v2.0 soon, although at first glance there is nothing that will obviously cause trouble...

Peter

---------- Forwarded message ----------
From:
Date: Fri, Oct 28, 2011 at 4:15 PM
Subject: [Utilities-announce] Upcoming Release of NCBI EFetch version 2.0
To: NLM/NCBI List utilities-announce

Upcoming Release of EFetch version 2.0

In November 2011 NCBI plans to release version 2.0 of EFetch. The major changes and updates are as follows:

- EFetch now supports the following databases: biosample, biosystems and sra
- EFetch now has defined default values for &retmode and &rettype for all supported databases (please see Table 1 for all supported values of these parameters)
- EFetch no longer supports &retmode=html; requests containing &retmode=html will return data using the default &retmode value for the specified database (&db)
- EFetch requests including &rettype=docsum will return XML data equivalent to ESummary output

Details about EFetch can be found at http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

An updated, complete listing of supported &rettype and &retmode values can be found at http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1/?report=objectonly

Release notes about this and future releases can be found at http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.Release_Notes

Please write to info at ncbi.nlm.nih.gov if you have any questions about these changes.

_______________________________________________
Utilities-announce mailing list
http://www.ncbi.nlm.nih.gov/mailman/listinfo/utilities-announce

From redmine at redmine.open-bio.org Fri Oct 28 23:45:53 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 28 Oct 2011 23:45:53 +0000
Subject: [Biopython-dev] [Biopython - Feature #3310] (New) HMMER parser
Message-ID:

Issue #3310 has been reported by J M.

----------------------------------------
Feature #3310: HMMER parser
https://redmine.open-bio.org/issues/3310

Author: J M
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL:

This is a parser for the output of hmmsearch from the HMMER package. Given the output of the hmmsearch, this program can retrieve information for each of the alignments including the expected values, the starting and ending positions of each alignment, as well as insert, deletion and mismatch information for each alignment.

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Sat Oct 29 02:00:07 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 29 Oct 2011 02:00:07 +0000
Subject: [Biopython-dev] [Biopython - Bug #3311] (New) GFF parser fails to intelligently break lines
Message-ID:

Issue #3311 has been reported by gahoo lee.

----------------------------------------
Bug #3311: GFF parser fails to intelligently break lines
https://redmine.open-bio.org/issues/3311

Author: gahoo lee
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL:

Moved from BioStar: http://biostar.stackexchange.com/questions/13651/gff-parsing-in-python-is-not-so-perfect

I use BCBio.GFF to parse chr01.gff3 (ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.1/chr01.dir/chr01.gff3) and all.gff3 (ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.1/all.dir/all.gff3). But things didn't work out as I expected. Here's the code:

    from BCBio import GFF

    limits = dict(gff_type = ["gene","mRNA","CDS"])
    gff_handle = open('chr01.gff3')
    for rec in GFF.parse(gff_handle,target_lines=1000,limit_info=limits):
        #Chromosome seq level
        for gene_feature in rec.features:
            #gene level
            for mRNA_feature in gene_feature.sub_features:
                #mRNA level
                print mRNA_feature.type
                print mRNA_feature.qualifiers['Alias']

And I got:

    Traceback (most recent call last):
      File "R:\Untitled 1.py", line 14, in
        print mRNA_feature.qualifiers['Alias']
    KeyError: 'Alias'

And the 'type' is "CDS", which is not correct. When parsing without target_lines=1000 everything is OK. But parsing all.gff3 ran into the same problem. Maybe all.gff3 is too large to parse. The problem might be that the parser did not recognise the entry boundary correctly.
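One way to picture the suspected boundary problem: if a file is cut into fixed-size batches of lines, a batch can end between a gene and its mRNA or CDS children. The sketch below is not BCBio.GFF's algorithm; it is a hypothetical illustration (the function names and the toy lines are made up) of batching that only breaks before a feature with no Parent= attribute, and it assumes child features always follow their parent in the file.

```python
def has_parent(line):
    """Return True if a GFF3 feature line carries a Parent= attribute."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return False
    return any(field.strip().startswith("Parent=")
               for field in columns[8].split(";"))


def chunk_by_top_level(lines, target_lines):
    """Yield lists of lines, breaking only before a parentless feature."""
    chunk = []
    for line in lines:
        is_feature = bool(line.strip()) and not line.startswith("#")
        starts_group = is_feature and not has_parent(line)
        # Only close the batch once it is large enough AND a new
        # top-level feature is about to start.
        if starts_group and len(chunk) >= target_lines:
            yield chunk
            chunk = []
        chunk.append(line)
    if chunk:
        yield chunk


def make_line(ftype, attributes):
    """Build a toy nine-column GFF3 line (illustrative values only)."""
    return "\t".join(
        ["Chr1", "example", ftype, "1", "100", ".", "+", ".", attributes]
    ) + "\n"


example = [
    make_line("gene", "ID=g1"),
    make_line("mRNA", "ID=m1;Parent=g1"),
    make_line("CDS", "Parent=m1"),
    make_line("gene", "ID=g2"),
    make_line("mRNA", "ID=m2;Parent=g2"),
    make_line("CDS", "Parent=m2"),
]
chunks = list(chunk_by_top_level(example, target_lines=2))
```

With target_lines=2 a naive split would separate each CDS from its mRNA; here every batch starts at a gene and keeps its children together, at the cost of batches that can exceed the target size.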