From bugzilla-daemon at portal.open-bio.org Sat Aug 1 16:46:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 1 Aug 2009 16:46:38 -0400 Subject: [Biopython-dev] [Bug 2894] New: Jython List difference causes failed assertion in CondonTable Fix+Patch Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2894 Summary: Jython List difference causes failed assertion in CondonTable Fix+Patch Product: Biopython Version: 1.51b Platform: Other OS/Version: Other Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: kellrott at ucsd.edu Different list behaviour in Jython causes an assertion to fail because the last two elements of the produced list are swapped. Haven't taken the time to figure out whether this is caused by sloppy list usage or Jython list weirdness. At this point, will assume that list order doesn't matter and simply expand the assertion to allow both cases... list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values) Python : ['TGA', 'TAA', 'TAG', 'TAR', 'TRA'] Jython : ['TGA', 'TAA', 'TAG', 'TRA', 'TAR'] NOTE: Fixing this bug causes setup.py to fail (java.lang.ClassFormatError: Invalid method Code length) because it exposes previously untested bugs *** biopython-1.51b_orig/Bio/Data/CodonTable.py 2009-05-08 14:20:19.000000000 -0700 --- biopython-1.51b/Bio/Data/CodonTable.py 2009-08-01 13:30:46.000000000 -0700 *************** *** 615,621 **** assert list_ambiguous_codons(['TAG', 'TGA'],IUPACData.ambiguous_dna_values) == ['TAG', 'TGA'] assert list_ambiguous_codons(['TAG', 'TAA'],IUPACData.ambiguous_dna_values) == ['TAG', 'TAA', 'TAR'] assert list_ambiguous_codons(['UAG', 'UAA'],IUPACData.ambiguous_rna_values) == ['UAG', 'UAA', 'UAR'] ! assert list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values) == ['TGA', 'TAA', 'TAG', 'TAR', 'TRA'] # Forward translation is "onto", that is, any given codon always maps # to the same protein, or it doesn't map at all. Thus, I can build --- 615,623 ---- assert list_ambiguous_codons(['TAG', 'TGA'],IUPACData.ambiguous_dna_values) == ['TAG', 'TGA'] assert list_ambiguous_codons(['TAG', 'TAA'],IUPACData.ambiguous_dna_values) == ['TAG', 'TAA', 'TAR'] assert list_ambiguous_codons(['UAG', 'UAA'],IUPACData.ambiguous_rna_values) == ['UAG', 'UAA', 'UAR'] ! #Jython BUG? For some order Jython swaps the order of the last two elements... ! assert list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values) == ['TGA', 'TAA', 'TAG', 'TAR', 'TRA'] or\ ! list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values) == ['TGA', 'TAA', 'TAG', 'TRA', 'TAR'] # Forward translation is "onto", that is, any given codon always maps # to the same protein, or it doesn't map at all. Thus, I can build -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
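The swap described above is the classic symptom of building a list from a set: set iteration order is implementation dependent, so CPython and Jython can legitimately disagree. A minimal, self-contained illustration (stand-in code, not the real list_ambiguous_codons internals) of why an order-insensitive comparison is the more robust assertion:

    # Stand-in for the kind of set-to-list conversion done when collecting
    # ambiguous codons; the real code lives in Bio/Data/CodonTable.py.
    def codons_from_set(codons):
        return list(set(codons))

    expected = ['TGA', 'TAA', 'TAG', 'TAR', 'TRA']
    produced = codons_from_set(expected)

    # Order-sensitive check -- can fail on Jython even though the contents match:
    #     assert produced == expected
    # Order-insensitive check -- passes on any Python implementation:
    assert sorted(produced) == sorted(expected)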
From bugzilla-daemon at portal.open-bio.org Sat Aug 1 17:16:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 1 Aug 2009 17:16:48 -0400 Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed assertion in CondonTable Fix+Patch In-Reply-To: Message-ID: <200908012116.n71LGmgG031493@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2894 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-01 17:16 EST ------- (In reply to comment #0) > Different list behaviour in Jython causes an assertion to fail because the last two > elements of the produced list are swapped. Haven't taken the time to figure out > whether this is caused by sloppy list usage or Jython list weirdness. ... Are you using Biopython 1.51b, or the latest code from CVS/github? This sounds like a duplicate of Bug 2887 (set order is Python implementation dependent). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Aug 1 22:46:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 1 Aug 2009 22:46:47 -0400 Subject: [Biopython-dev] [Bug 2895] New: Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2895 Summary: Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch Product: Biopython Version: 1.51b Platform: Other OS/Version: Other Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: kellrott at ucsd.edu BugsThisDependsOn: 2891,2892,2893,2894 Jython is limited by the JVM's method size restrictions; overly large methods cause JVM exceptions (java.lang.ClassFormatError: Invalid method Code length ...). The Bio.Restriction.Restriction_Dictionary module defines too much data in the base method; by breaking the defined dicts into pieces held in separate methods and then merging them, the code compiles correctly in Jython.
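In outline, the workaround in the patch below follows this pattern (a minimal sketch with a couple of made-up slices of the data, not the real Restriction_Dictionary contents): keep each dict literal small enough to stay under the JVM's per-method bytecode limit, then merge the pieces when the module is imported.

    def _rest_dict_part1():
        # first slice of the data, small enough for the JVM code-length limit
        return {'AarI': {'charac': (11, 8, None, None, 'CACCTGC')}}

    def _rest_dict_part2():
        # further slices follow the same pattern
        return {'BbvCI': {'charac': (2, -2, None, None, 'CCTCAGC')}}

    # merge the pieces back into the single dict the rest of the module expects
    rest_dict = {}
    for part in (_rest_dict_part1, _rest_dict_part2):
        rest_dict.update(part())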
Patch: 11,12c11,14 < rest_dict = \ < {'AarI': {'charac': (11, 8, None, None, 'CACCTGC'), --- > > > def RestDict1(): > return {'AarI': {'charac': (11, 8, None, None, 'CACCTGC'), 1503,1504c1505,1508 < 'suppl': ('I',)}, < 'BbvCI': {'charac': (2, -2, None, None, 'CCTCAGC'), --- > 'suppl': ('I',)} } > > def RestDict2(): > return { 'BbvCI': {'charac': (2, -2, None, None, 'CCTCAGC'), 3500c3504,3508 < 'suppl': ('X',)}, --- > 'suppl': ('X',)} } > > > def RestDict3(): > return { 4497c4505,4508 < 'suppl': ('I',)}, --- > 'suppl': ('I',)} } > > def RestDict4(): > return { 5494,5495c5505,5508 < 'suppl': ('E', 'G', 'I', 'M', 'N', 'V')}, < 'DrdI': {'charac': (7, -7, None, None, 'GACNNNNNNGTC'), --- > 'suppl': ('E', 'G', 'I', 'M', 'N', 'V')} } > > def RestDict5(): > return { 'DrdI': {'charac': (7, -7, None, None, 'GACNNNNNNGTC'), 6479c6492,6495 < 'suppl': ('N',)}, --- > 'suppl': ('N',)} } > > def RestDict6(): > return { 7194,7195c7210,7214 < 'suppl': ('N',)}, < 'Hpy8I': {'charac': (3, -3, None, None, 'GTNNAC'), --- > 'suppl': ('N',)} } > > > def RestDict7(): > return { 'Hpy8I': {'charac': (3, -3, None, None, 'GTNNAC'), 8491c8510,8513 < 'suppl': ()}, --- > 'suppl': ()} } > > def RestDict8(): > return { 9608c9630,9634 < 'suppl': ('F',)}, --- > 'suppl': ('F',)} } > > > def RestDict9(): > return { 11992,11993c12018,12051 < suppliers = \ < {'A': ('Amersham Pharmacia Biotech', --- > > > rest_dict = {} > tmp = RestDict1() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict2() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict3() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict4() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict5() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict6() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict7() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict8() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict9() > for a in tmp: > rest_dict[a] = tmp[a] > > > def Suppliers(): > return {'A': ('Amersham Pharmacia Biotech', 13626,13627c13684,13692 < typedict = \ < {'type145': (('NonPalindromic', --- > > > suppliers = Suppliers() > > > > > def TypeDict(): > return {'type145': (('NonPalindromic', 14498a14564,14567 > > typedict = TypeDict() > > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Aug 1 22:46:49 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 1 Aug 2009 22:46:49 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200908020246.n722knhV005000@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2895 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sat Aug 1 22:46:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 1 Aug 2009 22:46:50 -0400 Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch In-Reply-To: Message-ID: <200908020246.n722koqM005006@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2892 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2895 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Aug 1 22:46:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 1 Aug 2009 22:46:51 -0400 Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch In-Reply-To: Message-ID: <200908020246.n722kpGh005015@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2893 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2895 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Aug 1 22:46:52 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 1 Aug 2009 22:46:52 -0400 Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed assertion in CondonTable Fix+Patch In-Reply-To: Message-ID: <200908020246.n722kq8g005021@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2894 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2895 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Mon Aug 3 10:57:59 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 3 Aug 2009 10:57:59 -0400 Subject: [Biopython-dev] GSoC Weekly Update 11: PhyloXML for Biopython Message-ID: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> Hi all, Previously (July 27-31) I: - Added the remaining checks for restricted tokens - Modified the tree, parser and writer for phyloXML 1.10 support -- it validates now, and unit tests pass. PhyloXML 1.00 validation breaks, but that won't affect anyone except BioPerl, and they said they can deal with it on their end - Changed how the Parser and Writer classes work to resemble other Biopython parser classes more closely - Picked standard attributes for BaseTree's Tree and Node objects (informed by PhyloDB, though the names are slightly different); added properties to PhyloXML's Clade to mimic both types - Made SeqRecord conversion actually work (with reasonable round-tripping capability); added a unit test - Changed __str__ methods to not include the object's class name if there's another representative label to use (e.g. 
name) -- that's easy enough to add in the caller - Sorted out the TreeIO read/parse/write API and added some support for the Newick format, as recommended by Peter on biopython-dev - Split some "plumbing" (depth_first_search) off from the Tree.find() method. Since there are a lot of potentially useful methods to have on phylogenetic tree objects, I think it's best to distinguish between "porcelain" (specific, easy-to-use methods for common operations) and "plumbing" (generalized or low-level methods/algorithms that porcelain can rely on) in the Tree class in Bio.Tree.BaseTree. - Started a function for networkx export. The edges are screwy right now, so I haven't checked it in yet. This week (Aug. 3-7) I will: Scan the code base for lingering TODO/ENH/XXX comments Discuss merging back upstream Work on enhancements (time permitting): - Clean up the Parser class a bit more, to resemble Writer - Finish networkx export - Port common methods to Bio.Tree.BaseTree (from Bio.Nexus.Trees and other packages) Run automated testing: - Re-run performance benchmarks - Run tests and benchmarks on alternate platforms - Check epydoc's generated API documentation and fix docstrings Update wiki documentation with new features: - Tree: base classes, find() etc., - TreeIO: 'phyloxml', 'nexus', 'newick' wrappers; PhyloXMLIO extras; warn that Nexus/Newick wrappers don't return Bio.Tree objects yet - PhyloXML: singular properties, improved str() Remarks: - Most of the work done this week and last, shuffling base classes and adding various checks, actually made the I/O functions a little slower. I don't think this will be a big deal, and the changes were necessary, but it's still a little disappointing. - The networkx export will look pretty cool. After exporting a Biopython tree to a networkx graph, it takes a couple more imports and commands to draw the tree to the screen or a file. Would anyone find it handy to have a short function in Bio.Tree or Bio.Graphics to go straight from a tree to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe graphviz) - I have to admit this: I don't know anything about BioSQL. How would I use and test the PhyloDB extension, and what's involved in writing a Biopython interface for it? Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From krother at rubor.de Mon Aug 3 11:11:15 2009 From: krother at rubor.de (Kristian Rother) Date: Mon, 03 Aug 2009 17:11:15 +0200 Subject: [Biopython-dev] RFC: RNAStructure package for BioPython Message-ID: <4A76FE13.6050203@rubor.de> Hi, We have created a lot of code that works on RNA structures in Poznan, Poland. There are some jewels that I consider useful and mature enough to meet a wider audience. I'd be interested in refactoring and packaging them as an RNAStructure package and contributing it to BioPython. I just discussed the possibilities with Magdalena Musielak & Tomasz Puton who wrote & tested significant portions of the code. They came up with a list of 'most wanted' Use Cases: - Calculate RNA base pairs - Generate RNA secondary structures from 3D structures - Recognize pseudoknots - Recognize modified nucleotides in RNA 3D structures. - Superimpose two RNA molecules. The existing code massively uses Bio.PDB already, and has few dependencies apart from that. Any comments on how this kind of functionality would fit into BioPython are welcome.
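To give a flavour of how the superposition use case could sit on top of Bio.PDB, here is a rough sketch using the existing Superimposer (the file names are made up, and real code would pair residues far more carefully than this):

    from Bio.PDB import PDBParser, Superimposer

    parser = PDBParser()
    # hypothetical input files -- any two RNA structures with matching residues
    fixed = parser.get_structure("rna1", "rna1.pdb")
    moving = parser.get_structure("rna2", "rna2.pdb")

    def backbone_p_atoms(structure):
        # collect the backbone phosphorus atoms of the first model, chain by chain
        atoms = []
        for chain in structure[0]:
            for residue in chain:
                if residue.has_id("P"):
                    atoms.append(residue["P"])
        return atoms

    # set_atoms() expects two equal-length atom lists (one matched pair per residue)
    sup = Superimposer()
    sup.set_atoms(backbone_p_atoms(fixed), backbone_p_atoms(moving))
    print "RMSD over P atoms:", sup.rms
    # apply the rotation/translation (here just to the P atoms; real code would
    # transform every atom of the moving structure)
    sup.apply(backbone_p_atoms(moving))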
Best Regards, Kristian Rother www.rubor.de Structural Bioinformatics Group UAM Poznan From bugzilla-daemon at portal.open-bio.org Mon Aug 3 12:28:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 3 Aug 2009 12:28:39 -0400 Subject: [Biopython-dev] [Bug 2896] New: BLAST XML parser: stripped leading/trailing spaces in Hsp_midline Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2896 Summary: BLAST XML parser: stripped leading/trailing spaces in Hsp_midline Product: Biopython Version: 1.50 Platform: All OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: volkmer at mpi-cbg.de Parsing an XML output file from NCBI BLAST (blastp with complexity filters on) omits leading/trailing spaces in the hsp match line: hsp.query u'XXXXPSPTSLATSHPPLSSMSPYMTI------PQQYLYISKIRSKLSQCALT-RHHH-RELDLRKMV' hsp.match u'P+ T L S PPL S+S + PQ+ L+ + R+K+ + + RHHH R LDL ++V' This makes it more awkward to evaluate the alignment. It would be best if query, subject and alignment always had the same length. The BLAST XML output file at least has the correct Hsp_midline:
<Hsp_qseq>XXXXPSPTSLATSHPPLSSMSPYMTI------PQQYLYISKIRSKLSQCALT-RHHH-RELDLRKMV</Hsp_qseq>
<Hsp_hseq>EFFEPAITGLYYS-PPLFSVSRLTGLLHLLERPQETLF-TNYRNKIKRLDIPLRHHHIRHLDLEQLV</Hsp_hseq>
<Hsp_midline>    P+ T L  S PPL S+S    +      PQ+ L+ +  R+K+ +  +  RHHH R LDL ++V</Hsp_midline>
And as the plaintext parser gives the complete alignment line it would be nice to get the same behaviour. Thanks, Michael -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Aug 3 13:20:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 3 Aug 2009 13:20:24 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908031720.n73HKOFr019079@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-03 13:20 EST ------- Could you attach a complete XML file we could use for a unit test please? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Aug 3 16:48:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 3 Aug 2009 21:48:49 +0100 Subject: [Biopython-dev] Deprecating Bio.Fasta? In-Reply-To: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> References: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> Message-ID: <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com> On 22 June 2009, I wrote: > ... > I'd like to officially deprecate Bio.Fasta for the next release (Biopython > 1.51), which means you can continue to use it for a couple more > releases, but at import time you will see a warning message. See also: > http://biopython.org/wiki/Deprecation_policy > > Would this cause anyone any problems? If you are still using Bio.Fasta, > it would be interesting to know if this is just some old code that hasn't > been updated, or if there is some stronger reason for still using it. No one replied, so I plan to make this change in CVS shortly, meaning that Bio.Fasta will be deprecated in Biopython 1.51, i.e. it will still work but will trigger a deprecation warning at import. Please speak up ASAP if this concerns you. Thanks, Peter From chapmanb at 50mail.com Mon Aug 3 18:38:47 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 3 Aug 2009 18:38:47 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11: PhyloXML for Biopython In-Reply-To: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> Message-ID: <20090803223847.GM8112@sobchak.mgh.harvard.edu> Hi Eric; Thanks for the update. Things are looking in great shape as we get towards the home stretch. > - Most of the work done this week and last, shuffling base classes and > adding various checks, actually made the I/O functions a little slower. > I don't think this will be a big deal, and the changes were necessary, > but it's still a little disappointing. The unfortunate influence of generalization. I think the adjustment to the generalized Tree is a big win and gives a solid framework for any future phylogenetic modules. I don't know what the numbers are but as long as performance is reasonable, few people will complain. This is always something to go back around on if it becomes a hangup in the future. > - The networkx export will look pretty cool. 
After exporting a Biopython > tree to a networkx graph, it takes a couple more imports and commands to > draw the tree to the screen or a file. Would anyone find it handy to have > a short function in Bio.Tree or Bio.Graphics to go straight from a tree > to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe graphviz) Awesome. Looking forward to seeing some trees that come out of this. It's definitely worthwhile to formalize the functionality to go straight from a tree to png or pdf. This will add some more localized dependencies, so I'm torn as to whether it would be best as a utility function or an example script. Peter might have an opinion here. Either way, this would be really useful as a cookbook example with a final figure. Being able to produce something pretty is a good way to convince people to store trees in a reasonable format like PhyloXML. > - I have to admit this: I don't know anything about BioSQL. How would I use > and test the PhyloDB extension, and what's involved in writing a > Biopython interface for it? BioSQL and the PhyloDB extension are a set of relational database tables. Looking at the SVN logs, it appears as if the main work on PhyloDB has occurred on PostgreSQL with the MySQL tables perhaps lagging behind, so my suggestion is to start with PostgreSQL. Hilmar, please feel free to correct me here. The schemas are available from SVN: http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/sql You'd want biosqldb-pg.sql and presumably also biosqldb-views-pg.sql for BioSQL and biosql-phylodb-pg.sql and biosql-phylodata-pg.sql. The Biopython docs are pretty nice on this -- you create the empty tables: http://biopython.org/wiki/BioSQL#PostgreSQL From there you should be able to browse to get a sense of what is there. In terms of writing an interface, the first step is loading the data where you can mimic what is done with SeqIO and BioSQL: http://biopython.org/wiki/BioSQL#Loading_Sequences_into_a_database Pass the database an iterator of trees and they are stored. The second step is retrieving and querying persisted trees. Here you would want TreeDB objects that act like standard trees, but retrieve information from the database on demand. Here are Seq/SeqRecord models in BioSQL: http://github.com/biopython/biopython/tree/master/BioSQL/BioSeq.py So it's a bit of an extended task. Time frames being what they are, any steps in this direction are useful. If you haven't played with BioSQL before, it's worth a look for your own interest. The underlying key/value model is really flexible and kind of models RDF triplets. I've used BioSQL here recently as the backend for a web app that differs a bit from the standard GenBank-like thing, and found it very flexible. Again, great stuff.
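For concreteness, the SeqIO-based loading pattern referred to above looks roughly like this (a sketch only -- the connection details and file name are made up, and a tree interface would substitute an iterator of trees plus the PhyloDB tables):

    from Bio import SeqIO
    from BioSQL import BioSeqDatabase

    # made-up connection details; use whichever driver/database you set up
    server = BioSeqDatabase.open_database(driver="psycopg2", user="biosql",
                                          passwd="", host="localhost",
                                          db="biosqldb")
    db = server.new_database("test_seqs", description="Trial upload")

    handle = open("example.gbk")            # placeholder GenBank file
    count = db.load(SeqIO.parse(handle, "genbank"))
    handle.close()
    server.adaptor.commit()
    print "Loaded %i records" % count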
Let me know if I can add to any of that, Brad From bugzilla-daemon at portal.open-bio.org Tue Aug 4 04:45:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 4 Aug 2009 04:45:03 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908040845.n748j36R015856@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 ------- Comment #2 from volkmer at mpi-cbg.de 2009-08-04 04:45 EST ------- Created an attachment (id=1353) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1353&action=view) blastp xml sample -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Tue Aug 4 08:32:39 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 4 Aug 2009 08:32:39 -0400 Subject: [Biopython-dev] RFC: RNAStructure package for BioPython In-Reply-To: <4A76FE13.6050203@rubor.de> References: <4A76FE13.6050203@rubor.de> Message-ID: <20090804123239.GN8112@sobchak.mgh.harvard.edu> Hi Kristian; > We have created a lot of code that works on RNA structures in Poznan, > Poland. There are some jewels that I consider useful and mature enough > to meet a wider audience. I'd be interested in refactorizing and > packaging them as a RNAStructure package and contribute it to BioPython. This sounds great. I don't know enough about the area to comment directly on your use cases -- my experience is limited to folding structures with RNAFold and the like -- but it sounds like a solid feature set. > I just discussed the possibilities with Magdalena Musielak & Tomasz > Puton who wrote & tested significant portions of the code. They came up > with a list of 'most wanted' Use Cases: > > - Calculate RNA base pairs > - Generate RNA secondary structures from 3D structures > - Recognize pseudoknots > - Recognize modified nucleotides in RNA 3D structures. > - Superimpose two RNA molecules. > > The existing code massively uses Bio.PDB already, and has little > dependancies apart from that. You may also want to have a look at PyCogent, which has wrappers and parsers for several command line programs involved with RNA structure, along with a representation of RNA secondary structure: http://pycogent.svn.sourceforge.net/viewvc/pycogent/trunk/cogent/struct/rna2d.py?view=markup It would be great to complement this functionality, and interact with PyCogent where feasible. We could offer more specific suggestions as you get rolling with this and there is code to review. Glad to have you interested, Brad From tiagoantao at gmail.com Tue Aug 4 11:29:36 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 4 Aug 2009 16:29:36 +0100 Subject: [Biopython-dev] 1.52 Message-ID: <6d941f120908040829g6531804dpe51e9f24720dab78@mail.gmail.com> Hi, I am currently working on the implementation of Genepop support on Bio.PopGen. Genepop support will allow calculation of basic frequentist statistics. This is the biggest addition to Bio.PopGen and makes the module useful for a wide range of applications. In fact I never tried to publicize Bio.PopGen in the population genetics community, but with this addon, that will change. The status is as follows: 1. Code done 90% done. Check http://github.com/tiagoantao/biopython/tree/genepop 2. Test code around 30% coverage 3. 
Documentation 50% Check http://biopython.org/wiki/PopGen_dev_Genepop for a tutorial under development. This will be ready for 1.52. And I would like to make the code available after the Summer vacation. And 1.52 is what this mail is about ;) I remember Peter writing about 1.52 being ad-hoc scheduled for fall. I have September blocked with work, but I managed to have October clear mostly just for this. So my request is: if there is more or less a Fall release please don't schedule it for the first week in the Fall (which is still in September) ;) . Mid-October or somewhere around that time would be good. Thanks a lot, Tiago -- "A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin From matzke at berkeley.edu Tue Aug 4 13:01:34 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 04 Aug 2009 10:01:34 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> Message-ID: <4A78696E.8010808@berkeley.edu> Hi all, update: Major improvements/fixes: - removed any reliance on lagrange tree module, refactored all phylogeny code to use the revised Bio.Nexus.Tree module - tree functions put in TreeSum (tree summary) class - added functions for calculating phylodiversity measures, including necessary subroutines like subsetting trees, randomly selecting tips from a larger pool - Code dealing with GBIF xml output completely refactored into the following classes: * ObsRecs (observation records & search results/summary) * ObsRec (an individual observation record) * XmlString (functions for cleaning xml returned by Gbif) * GbifXml (extension of capabilities for ElementTree xml trees, parsed from GBIF xml returns) - another suggestion implemented: dependencies on tempfiles eliminated by using cStringIO (temporary file-like strings, not stored as temporary files) file_str objects instead - another suggestion implemented: the _open method from biopython's ncbi www functionality has been copied & modified so that it is now a method of ObsRecs, and doesn't contain NCBI-specific defaults etc. (it does still include a 3-second waiting time between GBIF requests, figuring that is good practice). - function to download large numbers of records in increments implemented as method of ObsRecs. This week: - Put GIS functions in a class (easy), allowing each ObsRec to be classified into an area (easy) - Improve extraction of data from GBIF xmltree -- my Utricularia "practice XML file" didn't have problems, but with running online searches, I am discovering some fields are not always filled in, etc. This shouldn't be too hard, using the GbifXml xmltree searching functions, and including defaults for exceptions. - Function for converting points to KML for Google Earth display. Code uploaded here: http://github.com/nmatzke/biopython/commits/Geography -- ==================================================== Nicholas J. Matzke Ph.D.
Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From matzke at berkeley.edu Tue Aug 4 14:28:33 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 04 Aug 2009 11:28:33 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <58AA6396-760D-40BB-B07A-EF22282E78D5@duke.edu> References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <58AA6396-760D-40BB-B07A-EF22282E78D5@duke.edu> Message-ID: <4A787DD1.40301@berkeley.edu> Hilmar Lapp wrote: > > On Aug 4, 2009, at 1:01 PM, Nick Matzke wrote: > >> * ObsRecs (observation records & search results/summary) >> * ObsRec (an individual observation record) > > > I'll let the Biopython folks make the call on this, but in general I'd > recommend to everyone trying to write reusable code to spell out names, > especially non-local names. > > The days in which the length of a variable or class name was somehow > limited or affected the speed of a program are definitely over since > more than a decade. I know the temptation is big to save on a few > keystrokes every time you have to type the name, but the time that you > will cause your fellow programmers who will later try to understand your > code is vastly greater. What prevents me from thinking that ObsRec is a > class for an obsolete recording? Good point, this is easy to fix, I will put it on the list. Cheers! Nick > > Just my $0.02 :-) > > -hilmar -- ==================================================== Nicholas J. Matzke Ph.D. 
Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From biopython at maubp.freeserve.co.uk Tue Aug 4 14:44:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 4 Aug 2009 19:44:29 +0100 Subject: [Biopython-dev] RFC: RNAStructure package for BioPython In-Reply-To: <4A76FE13.6050203@rubor.de> References: <4A76FE13.6050203@rubor.de> Message-ID: <320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com> On Mon, Aug 3, 2009 at 4:11 PM, Kristian Rother wrote: > Hi, > > We have created a lot of code that works on RNA structures in Poznan, > Poland. There are some jewels that I consider useful and mature enough to > meet a wider audience. I'd be interested in refactorizing and packaging them > as a RNAStructure package and contribute it to BioPython. I remember we talked about this briefly at BOSC/ISMB - it sounds good. Did you get a chance to talk to Thomas Hamelryck about this? > I just discussed the possibilities with Magdalena Musielak & Tomasz Puton > who wrote & tested significant portions of the code. They came up with a > list of 'most wanted' Use Cases: > > - Calculate RNA base pairs > - Generate RNA secondary structures from 3D structures > - Recognize pseudoknots > - Recognize modified nucleotides in RNA 3D structures. > - Superimpose two RNA molecules. > > The existing code massively uses Bio.PDB already, and has little > dependancies apart from that. > > Any comments how this kind of functionality would fit into BioPython are > welcome. I see you have already started a github branch, which is great: http://github.com/krother/biopython/tree/rol Am I right in thinking all of this code is for 3D RNA work? Maybe that might give a good module name... Bio.RNA3D? Or Bio.PDB.RNA? Did you have something in mind? Peter P.S. Who won the ISMB Art and Science Exhibition prize? http://www.iscb.org/ismbeccb2009/artscience.php From biopython at maubp.freeserve.co.uk Tue Aug 4 15:29:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 4 Aug 2009 20:29:47 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? 
In-Reply-To: <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> <20090708130649.GY17086@sobchak.mgh.harvard.edu> <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> Message-ID: <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com> On Thu, Jul 9, 2009 at 10:18 AM, Peter wrote: > On Wed, Jul 8, 2009 at 2:06 PM, Brad Chapman wrote: >> How about adding a function like "run_arguments" to the >> commandlines that returns the commandline as a list. > > That would be a simple alternative to my vague idea "Maybe we > can make the command line wrapper object more list like to make > subprocess happy without needing to create a string?", which may > not be possible. Either way, this will require a bit of work on the > Bio.Application parameter objects... By defining an __iter__ method, we can make the Biopython application wrapper object sufficiently list-like that it can be passed directly to subprocess. I think I have something working (only tested on Linux so far), at least for the case where none of the arguments have spaces or quotes in them. If this works, it should make things a little easier in that we don't have to do str(cline), and also I think it avoids the OS-specific behaviour of the shell argument as Brad noted earlier: >> This avoids the shell nastiness with the argument list, is as >> simple as it gets with subprocess, and gives users an easy >> path to getting stdout, stderr and the return codes. i.e. I am hoping we can replace this: child = subprocess.Popen(str(cline), shell=(sys.platform!="win32"), ...) with just: child = subprocess.Popen(cline, ...) where the "..." represents any messing about with stdin, stdout and stderr. Peter From chapmanb at 50mail.com Tue Aug 4 18:27:31 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 4 Aug 2009 18:27:31 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <4A78696E.8010808@berkeley.edu> References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> Message-ID: <20090804222731.GA12604@sobchak.mgh.harvard.edu> Hi Nick; Thanks for the update -- great to see things moving along. > - removed any reliance on lagrange tree module, refactored all phylogeny > code to use the revised Bio.Nexus.Tree module Awesome -- glad this worked for you. Are the lagrange_* files in Bio.Geography still necessary? If not, we should remove them from the repository to clean things up. More generally, it would be really helpful if we could do a bit of housekeeping on the repository. The Geography namespace has a lot of things in it which belong in different parts of the tree: - The test code should move to the 'Tests' directory as a set of test_Geography* files that we can use for unit testing the code.
- Similarly there are a lot of data files in there which appear to be test-related; these could move to Tests/Geography - What is happening with the Nodes_v2 and Treesv2 files? They look like duplicates of the Nexus Nodes and Trees with some changes. Could we roll those changes into the main Nexus code to avoid duplication? > - Code dealing with GBIF xml output completely refactored into the > following classes: > > * ObsRecs (observation records & search results/summary) > * ObsRec (an individual observation record) > * XmlString (functions for cleaning xml returned by Gbif) > * GbifXml (extension of capabilities for ElementTree xml trees, parsed > from GBIF xml returns) I agree with Hilmar -- the user classes would probably benefit from expanded naming. There is an art to naming to get them somewhere between the hideous RidiculouslyLongNamesWithEverythingSpecified names and short truncated names. Specifically, you've got a lot of filler in the names -- dbfUtils, geogUtils, shpUtils. The Utils probably doesn't tell the user much and makes all of the names sort of blend together, just as the Rec/Recs pluralization hides a quite large difference in what the classes hold. Something like Observation and ObservationSearchResult would make it clear immediately what they do and the information they hold. > This week: What are your thoughts on documentation? As a naive user of these tools without much experience with the formats, I could offer better feedback if I had an idea of the public APIs and how they are expected to be used. Moreover, cookbook and API documentation is something we will definitely need to integrate into Biopython. How does this fit in your timeline for the remaining weeks? Thanks again. Hope this helps, Brad From hlapp at gmx.net Tue Aug 4 19:34:26 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 4 Aug 2009 19:34:26 -0400 Subject: [Biopython-dev] RFC: RNAStructure package for BioPython In-Reply-To: <320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com> References: <4A76FE13.6050203@rubor.de> <320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com> Message-ID: On Aug 4, 2009, at 2:44 PM, Peter wrote: > P.S. Who won the ISMB Art and Science Exhibition prize? > http://www.iscb.org/ismbeccb2009/artscience.php Guess who - Kristian did :-) -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From krother at rubor.de Wed Aug 5 04:07:12 2009 From: krother at rubor.de (Kristian Rother) Date: Wed, 05 Aug 2009 10:07:12 +0200 Subject: [Biopython-dev] RFC: RNAStructure package for BioPython Message-ID: <4A793DB0.5000805@rubor.de> Hi Peter, I remember we talked about this briefly at BOSC/ISMB - it sounds good. Did you get a chance to talk to Thomas Hamelryck about this? We talked on ISMB, but no details yet. Am I right in thinking all of this code is for 3D RNA work? Maybe that might give a good module name... Bio.RNA3D? Or Bio.PDB.RNA? Did you have something in mind? I was thinking of 'RNAStructure' - I also like 'RNA' as long as it does not violate any claims. P.S. Who won the ISMB Art and Science Exhibition prize?
http://www.iscb.org/ismbeccb2009/artscience.php The winning picture can be found here: http://www.rubor.de/twentycharacters_en.html Best Regards, Kristian From biopython at maubp.freeserve.co.uk Wed Aug 5 04:15:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 5 Aug 2009 09:15:36 +0100 Subject: [Biopython-dev] RFC: RNAStructure package for BioPython In-Reply-To: References: <4A76FE13.6050203@rubor.de> <320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com> Message-ID: <320fb6e00908050115y612d89b2h757f5aa59fbb99ed@mail.gmail.com> On Wed, Aug 5, 2009 at 12:34 AM, Hilmar Lapp wrote: > > On Aug 4, 2009, at 2:44 PM, Peter wrote: > >> P.S. Who won the ISMB Art and Science Exhibition prize? >> http://www.iscb.org/ismbeccb2009/artscience.php > > Guess who - Kristian did :-) > > -hilmar Ha! That's cool. Congratulations Kristian! Peter From biopython at maubp.freeserve.co.uk Wed Aug 5 06:29:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 5 Aug 2009 11:29:45 +0100 Subject: [Biopython-dev] Deprecating Bio.Fasta? In-Reply-To: <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com> References: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com> Message-ID: <320fb6e00908050329m44fa2596ife06917306ae44ab@mail.gmail.com> On Mon, Aug 3, 2009 at 9:48 PM, Peter wrote: > On 22 June 2009, I wrote: >> ... >> I'd like to officially deprecate Bio.Fasta for the next release (Biopython >> 1.51), which means you can continue to use it for a couple more >> releases, but at import time you will see a warning message. See also: >> http://biopython.org/wiki/Deprecation_policy >> ... > > No one replied, so I plan to make this change in CVS shortly, meaning > that Bio.Fasta will be deprecated in Biopython 1.51, i.e. it will still work > but will trigger a deprecation warning at import. > > Please speak up ASAP if this concerns you. I've just committed the deprecation of Bio.Fasta to CVS. This could be reverted if anyone has a compelling reason (and tells us before we do the final release of Biopython 1.51). The docstring for Bio.Fasta should cover the typical situations for moving from Bio.Fasta to Bio.SeqIO, but please feel free to ask on the mailing list if you have a more complicated bit of old code that needs to be ported. Thanks, Peter From bugzilla-daemon at portal.open-bio.org Wed Aug 5 07:29:41 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Aug 2009 07:29:41 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908051129.n75BTf8i026537@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-05 07:29 EST ------- Thanks for the sample XML file. I could reproduce this, and I think I have fixed it. hsp.query, hsp.match and hsp.sbjct should all be the same length. Previously, at the end of each tag our XML parser strips the leading/trailing white space from the tag's value before processing it. In the case of Hsp_midline this is a very bad idea. However, the reason it did this was that the way the current tag value was built up wasn't context aware. In this particular case, there was white space outside tags like Hsp_midline, which really belongs to the parent tag (Hsp), but was wrongly being combined. Would you be able to test this please?
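One quick way to check is to re-parse the attached file with the updated module and confirm that the three alignment strings line up -- a rough sketch only (the file name below is just a placeholder for the attachment):

    from Bio.Blast import NCBIXML

    handle = open("blastp_sample.xml")   # placeholder name for the attached XML file
    for record in NCBIXML.parse(handle):
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                # query, match and sbjct should now all have the same length
                assert len(hsp.query) == len(hsp.match) == len(hsp.sbjct)
    handle.close()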
All you really need to try this is the new Bio/Blast/NCBIXML.py file (CVS revision 1.23). It might be easiest just to update to the latest code in CVS (or on github), but I could attach the file here if you like. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Aug 5 09:13:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Aug 2009 09:13:40 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908051313.n75DDeFt031305@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 ------- Comment #4 from volkmer at mpi-cbg.de 2009-08-05 09:13 EST ------- Hi Peter, could you please attach the file? The latest version of NCBIXML.py I get from cvs at code.open-bio.org still seems to be from April 2009. When I try to specify revision 1.23 I get a checkout warning and no file. Or is there a testing branch for this? Michael -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Aug 5 09:27:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Aug 2009 09:27:45 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908051327.n75DRjjg031915@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-05 09:27 EST ------- Created an attachment (id=1357) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1357&action=view) Updated version of NCBIXML.py as in CVS revision 1.23 (In reply to comment #4) > Hi Peter, > > could you please attach the file? Sure. > The latest version of NCBIXML.py I get from cvs at code.open-bio.org > still seems to be from April 2009. When I try to specify revision > 1.23 I get a checkout warning and no file. Or is there a testing > branch for this? Using code.open-bio.org (or its various aliases like cvs.biopython.org) actually gives you access to a read only mirror of the real CVS data, which is on dev.open-bio.org (for use by those with commit rights). I'm not sure exactly how often the public mirror is updated, but I would guess hourly. I would guess if you try again later it would work, but in the meantime I have attached the new file to this bug. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Wed Aug 5 18:31:31 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 5 Aug 2009 18:31:31 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11: PhyloXML for Biopython In-Reply-To: <20090803223847.GM8112@sobchak.mgh.harvard.edu> References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> <20090803223847.GM8112@sobchak.mgh.harvard.edu> Message-ID: <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com> On Mon, Aug 3, 2009 at 6:38 PM, Brad Chapman wrote: > Hi Eric; > Thanks for the update. 
Things are looking in great shape as we get > towards the home stretch. > > > - Most of the work done this week and last, shuffling base classes > and > > adding various checks, actually made the I/O functions a little > slower. > > I don't think this will be a big deal, and the changes were > necessary, > > but it's still a little disappointing. > > The unfortunate influence of generalization. I think the adjustment > to the generalized Tree is a big win and gives a solid framework for > any future phylogenetic modules. I don't know what the numbers are > but as long as performance is reasonable, few people will complain. > This is always something to go back around on if it becomes a hangup > in the future. > The complete unit test suite used to take about 4.5 seconds, and now it takes 5.8 seconds, though I've added a few more tests since then. I don't think it will feel like it's hanging for most operations, besides parsing or searching a huge tree. > - The networkx export will look pretty cool. After exporting a > Biopython > > tree to a networkx graph, it takes a couple more imports and > commands to > > draw the tree to the screen or a file. Would anyone find it handy > to have > > a short function in Bio.Tree or Bio.Graphics to go straight from a > tree > > to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe > graphviz) > > Awesome. Looking forward to seeing some trees that come out of this. > It's definitely worthwhile to formalize the functionality to go > straight from a tree to png or pdf. This will add some more > localized dependencies, so I'm torn as to whether it would be best > as a utility function or an example script. Peter might have an > opinion here. > > Either way, this would be really useful as a cookbook example with a > final figure. Being able to produce some pretty is a good way to > convince people to store trees in a reasonable format like PhyloXML. > OK, it works now but the resulting trees look a little odd. The options needed to get a reasonable tree representation are fiddly, so I made draw_graphviz() a separate function that basically just handles the RTFM work (not trivial), while the graph export still happens in to_networkx(). Here are a few recipes and a taste of each dish. The matplotlib engine seems usable for interactive exploration, albeit cluttered -- I can't hide the internal clade identifiers since graphviz needs unique labels, though maybe I could make them less prominent. Drawing directly to PDF gets cluttered for big files, and if you stray from the default settings (I played with it a bit to get it right), it can look surreal. There would still be some benefit to having a reportlab-based tree module in Bio.Graphics, and maybe one day I'll get around to that. $ ipython -pylab from Bio import Tree, TreeIO apaf = TreeIO.read('apaf.xml', 'phyloxml') Tree.draw_graphviz(apaf) # http://etal.myweb.uga.edu/phylo-nx-apaf.png Tree.draw_graphviz(apaf, 'apaf.pdf') # http://etal.myweb.uga.edu/apaf.pdf Tree.draw_graphviz(apaf, 'apaf.png', format='png', prog='dot') # http://etal.myweb.uga.edu/apaf.png -- why it's best to leave the defaults alone Thoughts: the internal node labels could be clear instead of red; if a node doesn't have a name, it could check its taxonomy attribute to see if anything's there; there's probably a way to make pygraphviz understand distinct nodes that happen to have the same label, although I haven't found it yet. Is PDF a good default format, or would PNG or PostScript be better? 
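On the duplicate-label issue: pygraphviz itself only requires the node names to be unique -- the displayed label is free to repeat or be blank. A stand-alone sketch of that idea (not the Bio.Tree API; it just assumes clade objects exposing .name and .clades, as the phyloXML classes do):

    import pygraphviz as pgv

    def clade_to_agraph(root):
        # build a directed graph with unique node names but free-form labels
        graph = pgv.AGraph(directed=True)
        def add(clade, parent_key=None):
            key = str(id(clade))                         # unique node name
            graph.add_node(key, label=clade.name or "")  # label may repeat or be blank
            if parent_key is not None:
                graph.add_edge(parent_key, key)
            for child in clade.clades:
                add(child, key)
        add(root)
        return graph

    # g = clade_to_agraph(apaf.clade)  # assuming the phylogeny exposes its root clade
    # g.layout(prog="dot")
    # g.draw("apaf.png")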
> - I have to admit this: I don't know anything about BioSQL. How would > I use > > and test the PhyloDB extension, and what's involved in writing a > > Biopython interface for it? > > BioSQL and the PhyloDB extension are a set of relational database > tables. Looking at the SVN logs, it appears as if the main work on > PhyloDB has occurred on PostgreSQL with the MySQL tables perhaps > lagging behind, so my suggestion is to start with PostgreSQL. > Hilmar, please feel free to correct me here. > > [...] > > So it's a bit of an extended task. Time frames being what they are, > any steps in this direction are useful. If you haven't played with > BioSQL before, it's worth a look for your own interest. The underlying > key/value model is really flexible and kind of models RDF triplets. I've > used BioSQL here recently as the backend for a web app that differs a > bit from the standard GenBank like thing, and found it very flexible. > > I think I've seen that app, but I thought it was backed by AppEngine. Neat stuff. I will learn BioSQL for my own benefit, but I don't think there's enough time left in GSoC for me to add a useful PhyloDB adapter to Biopython. So that, along with refactoring Nexus.Trees to use Bio.Tree.BaseTree, would be a good project to continue with in the fall, at a slower pace and with more discussion along the way. Cheers, Eric From bugzilla-daemon at portal.open-bio.org Thu Aug 6 03:56:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 6 Aug 2009 03:56:25 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908060756.n767uPk1031552@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 ------- Comment #6 from volkmer at mpi-cbg.de 2009-08-06 03:56 EST ------- (In reply to comment #3) > I could reproduce this, I think I have fixed > it. > hsp.query, hsp.match and hsp.sbjct should all be the same length. > > Previously, at the end of each tag our XML parser strips the leading/trailing > white space from the tag's value before processing it. In the case of > Hsp_midline this is a very bad idea. Ok, the fix seems to solve the problem. Well I guess the only time when this problem appears is when you have filtered/masked residues at the beginning/end of the query hsp. Otherwise the hsp would just start with the first match and end with the last one. Thanks, Michael -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Aug 6 04:03:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 6 Aug 2009 04:03:03 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908060803.n76833YJ032257@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-06 04:03 EST ------- (In reply to comment #6) > > Ok, the fix seems to solve the problem. > Great - I'm marking this bug as fixed, thanks for your time reporting and then testing this. 
> Well I guess the only time when this problem appears is when you have > filtered/masked residues at the beginning/end of the query hsp. Otherwise > the hsp would just start with the first match and end with the last one. I suspect there are other situations it might happen, but the fix is general. Cheers, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Aug 6 04:06:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 6 Aug 2009 09:06:43 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11: PhyloXML for Biopython In-Reply-To: <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com> References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> <20090803223847.GM8112@sobchak.mgh.harvard.edu> <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com> Message-ID: <320fb6e00908060106h5b10442djea3f52fe9827108f@mail.gmail.com> On Wed, Aug 5, 2009 at 11:31 PM, Eric Talevich wrote: > OK, it works now but the resulting trees look a little odd. The options > needed to get a reasonable tree representation are fiddly, so I made > draw_graphviz() a separate function that basically just handles the RTFM > work (not trivial), while the graph export still happens in to_networkx(). > > Here are a few recipes and a taste of each dish. The matplotlib engine seems > usable for interactive exploration, albeit cluttered -- I can't hide the > internal clade identifiers since graphviz needs unique labels, though maybe > I could make them less prominent. ... Graphviv does need unique names, and the node labels default to the node name - but you can override this and use a blank label if you want. How are you calling Graphviz? There are several Python wrappers out there, or you could just write a dot file directly and call the graphviz command line tools. Peter From eric.talevich at gmail.com Thu Aug 6 08:47:47 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 6 Aug 2009 08:47:47 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11: PhyloXML for Biopython In-Reply-To: <320fb6e00908060106h5b10442djea3f52fe9827108f@mail.gmail.com> References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> <20090803223847.GM8112@sobchak.mgh.harvard.edu> <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com> <320fb6e00908060106h5b10442djea3f52fe9827108f@mail.gmail.com> Message-ID: <3f6baf360908060547r8f299dao413b3657966fe9f4@mail.gmail.com> On Thu, Aug 6, 2009 at 4:06 AM, Peter wrote: > On Wed, Aug 5, 2009 at 11:31 PM, Eric Talevich > wrote: > > > OK, it works now but the resulting trees look a little odd. The options > > needed to get a reasonable tree representation are fiddly, so I made > > draw_graphviz() a separate function that basically just handles the RTFM > > work (not trivial), while the graph export still happens in > to_networkx(). > > > > Here are a few recipes and a taste of each dish. The matplotlib engine > seems > > usable for interactive exploration, albeit cluttered -- I can't hide the > > internal clade identifiers since graphviz needs unique labels, though > maybe > > I could make them less prominent. ... > > Graphviv does need unique names, and the node labels default to the > node name - but you can override this and use a blank label if you want. > How are you calling Graphviz? 
There are several Python wrappers out > there, or you could just write a dot file directly and call the graphviz > command line tools. > I'm using the networkx and pygraphviz wrappers, since networkx already partly wraps pygraphviz. The direct networkx->matplotlib rendering engine figures out the associations correctly when I pass a LabeledDiGraph instance, using Clade objects as nodes and the str() representation as the label -- so networkx.draw(tree) shows a tree with the internal nodes all labeled as "Clade". But networkx.draw_graphviz(tree), while otherwise working the same as the other networkx drawing functions, seems to convert nodes to strings earlier, and then treats all "Clade" strings as the same node. Surely there's a way to fix this through the networkx or pygraphviz API, but I couldn't figure it out yesterday from the documentation and source code. I'll poke at it some more today and try using blank labels. Thanks, Eric From chapmanb at 50mail.com Thu Aug 6 09:14:42 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 6 Aug 2009 09:14:42 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11: PhyloXML for Biopython In-Reply-To: <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com> References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> <20090803223847.GM8112@sobchak.mgh.harvard.edu> <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com> Message-ID: <20090806131442.GG12604@sobchak.mgh.harvard.edu> Hi Eric; > OK, it works now but the resulting trees look a little odd. The options > needed to get a reasonable tree representation are fiddly, so I made > draw_graphviz() a separate function that basically just handles the RTFM > work (not trivial), while the graph export still happens in to_networkx(). > > Here are a few recipes and a taste of each dish. The matplotlib engine seems > usable for interactive exploration, albeit cluttered -- I can't hide the > internal clade identifiers since graphviz needs unique labels, though maybe > I could make them less prominent. Drawing directly to PDF gets cluttered for > big files, and if you stray from the default settings (I played with it a > bit to get it right), it can look surreal. There would still be some benefit > to having a reportlab-based tree module in Bio.Graphics, and maybe one day > I'll get around to that. This is great start. I remember pygraphviz and the networkx representation being a bit finicky last I used it. In the end, I ended up making a pygraphviz AGraph directly. Either way, if you can remove the unneeded labels and change colorization as you suggested, this is a great quick visualizations of trees. Something reportlab based that looks like biologists expect a phylogenetic tree to look would also be very useful. There is a benefit in familiarity of display. Building something generally usable like that is a longer term project. > I think I've seen that app, but I thought it was backed by AppEngine. Neat > stuff. I will learn BioSQL for my own benefit, but I don't think there's > enough time left in GSoC for me to add a useful PhyloDB adapter to > Biopython. So that, along with refactoring Nexus.Trees to use > Bio.Tree.BaseTree, would be a good project to continue with in the fall, at > a slower pace and with more discussion along the way. Yes, the AppEngine display is also BioSQL on the backend; I ported over some of the tables to the object representation used in AppEngine. 
I also have used the relational schema in work projects -- it generally is just a good place to get started. Agreed on the timelines for GSoC. We'd be very happy to have you continue that on those projects into the fall. Both are very useful additions to the great work you've already done. Brad From biopython at maubp.freeserve.co.uk Thu Aug 6 10:39:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 6 Aug 2009 15:39:33 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> <20090708130649.GY17086@sobchak.mgh.harvard.edu> <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com> Message-ID: <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com> On Tue, Aug 4, 2009 at 8:29 PM, Peter wrote: > On Thu, Jul 9, 2009 at 10:18 AM, Peter wrote: >> On Wed, Jul 8, 2009 at 2:06 PM, Brad Chapman wrote: >>> How about adding a function like "run_arguments" to the >>> commandlines that returns the commandline as a list. >> >> That would be a simple alternative to my vague idea "Maybe we >> can make the command line wrapper object more list like to make >> subprocess happy without needing to create a string?", which may >> not be possible. Either way, this will require a bit of work on the >> Bio.Application parameter objects... > > By defining an __iter__ method, we can make the Biopython > application wrapper object sufficiently list-like that it can be > passed directly to subprocess. I think I have something working > (only tested on Linux so far), at least for the case where none > of the arguments have spaces or quotes in them. The current Bio.Application code works around generating command line strings, and works fine cross platform. Making the Bio.Application objects "list like" and getting this to work cross platform isn't looking easy. Spaces on Windows are causing me big headaches. Switching to lists of arguments appears to work fine on Unix (specifically tested on Linux and Mac OS X), but things are more complicated Windows. Basically using an array/list of arguments is normal on Unix, but on Windows things get passed as strings. The upshot is different Windows tools (or libraries used to compile them) have to parse their command line string themselves, so different tools do it differently. The result is you *may* need to adopt different spaces/quotes escaping for different command line tools on Windows. Now, if you give subprocess a list, on Windows it must first be turned into a string, before subprocess can use the Windows API to run it. The subprocess function list2cmdline does this, but the conventions it follows are not universal. I have examples of working command line strings for ClustalW and PRANK where both the executable and some of the arguments have spaces in them. It seems the quoting I was using to make ClustalW (or PRANK) happy cannot be achieved via subprocess.list2cmdline (and I suspect this applies to other tools too). I will try and look into this further. 
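For the Unix case, the mechanism itself is simple enough to show with a toy object -- this is only a stand-in to illustrate the idea, not the real Bio.Application wrapper code:

import subprocess

class DemoCommandline(object):
    # Hypothetical wrapper: on Unix, subprocess calls list() on any
    # non-string command, so an object that yields its own tokens via
    # __iter__ can be passed to Popen just like a list of arguments.
    def __init__(self, program, *arguments):
        self.program = program
        self.arguments = list(arguments)
    def __iter__(self):
        yield self.program
        for arg in self.arguments:
            yield arg

cline = DemoCommandline("echo", "hello", "world")
child = subprocess.Popen(cline, stdout=subprocess.PIPE)
print(child.communicate()[0])

No shell quoting is involved on that route, which is exactly why it sidesteps the spaces-in-filenames problem on Unix.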
However, even if it is possible, I don't think we can implement the list approach in time for Biopython 1.51, as there are just too many potential pitfalls. I have in the meantime extended the command line tool unit tests somewhat to include more examples with spaces in the filenames [I'm beginning to think replacing Bio.Application.generic_run with a simpler helper function would be easier in the short term, continuing to just using a string with subprocess, but haven't given up yet.] Peter From biopython at maubp.freeserve.co.uk Thu Aug 6 11:48:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 6 Aug 2009 16:48:12 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> <20090708130649.GY17086@sobchak.mgh.harvard.edu> <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com> <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com> Message-ID: <320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com> On Thu, Aug 6, 2009 at 3:39 PM, Peter wrote: > Now, if you give subprocess a list, on Windows it must first be turned > into a string, before subprocess can use the Windows API to run it. > The subprocess function list2cmdline does this, but the conventions it > follows are not universal. > > I have examples of working command line strings for ClustalW and PRANK > where both the executable and some of the arguments have spaces in > them. It seems the quoting I was using to make ClustalW (or PRANK) > happy cannot be achieved via subprocess.list2cmdline (and I suspect > this applies to other tools too). e.g. This is a valid and working command line for PRANK, which works both at the command line, or in Python via subprocess when given as a string: C:\repository\biopython\Tests>"C:\Program Files\prank.exe" -d=Quality/example.fasta -o="temp with space" -f=11 -convert Now, breaking up the arguments according to the description given in the subprocess.list2cmdline docstring, I think the arguments are: "C:\Program Files\prank.exe" -d=Quality/example.fasta -o="temp with space" -f=11 -convert Of these, the middle guy causes problems. By my reading of the subprocess.list2cmdline docstring this is valid: >> 2) A string surrounded by double quotation marks is >> interpreted as a single argument, regardless of white >> space or pipe characters contained within. A quoted >> string can be embedded in an argument. The example -o="temp with space" is a string surrounded by double quotes, "temp with space", embedded in an argument. Unfortunately, giving these five strings to subprocess.list2cmdline results in a mess as it never checks to see if the arguments are already quoted (as we have done for the program name and also the output filename base). We can pass the program name in without the quotes, and list2cmdline will do the right thing. But there is no way for the -o argument to be handled that I can see. This may be a bug in subprocess.list2cmdline, but it is certainly a real limitation in my opinion. 
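The behaviour is easy to reproduce by feeding those same tokens to list2cmdline (program path left unquoted, since list2cmdline adds its own quotes there, but the -o token keeping its embedded quotes as above):

import subprocess

args = ['C:\\Program Files\\prank.exe',
        '-d=Quality/example.fasta',
        '-o="temp with space"',
        '-f=11',
        '-convert']
# list2cmdline wraps any token containing whitespace in double quotes
# and backslash-escapes double quotes already inside it, so the
# pre-quoted -o=... token gets re-escaped instead of passed through.
print(subprocess.list2cmdline(args))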
So, it would appear that (on Windows) making our command line wrappers act like lists (by defining __iter__) will not work in general. The other approach which would allow our command line wrappers to be passed directly to subprocess is to make them more string like - but the subprocess code checks for string command lines using isinstance(args, types.StringTypes) which means we would have to subclass str (or unicode). I'm not sure if this can be made to work yet... Peter From biopython at maubp.freeserve.co.uk Thu Aug 6 12:05:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 6 Aug 2009 17:05:24 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> <20090708130649.GY17086@sobchak.mgh.harvard.edu> <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com> <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com> <320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com> Message-ID: <320fb6e00908060905i4a326327t504385ec55b0230c@mail.gmail.com> On Thu, Aug 6, 2009 at 4:48 PM, Peter wrote: > The other approach which would allow our command line wrappers > to be passed directly to subprocess is to make them more string > like - but the subprocess code checks for string command lines > using isinstance(args, types.StringTypes) which means we would > have to subclass str (or unicode). I'm not sure if this can be made > to work yet... Thinking about it a bit more, str and unicode are immutable objects, but we want the command line wrapper to be mutable (e.g. to add, change or remove parameters and arguments). So it won't work. Going back to my the original email, we could replace Bio.Application.generic_run instead: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006344.html > > Possible helper functions that come to mind are: > (a) Returns the return code (integer) only. This would basically > be a cross-platform version of os.system using the subprocess > module internally. > (b) Returns the return code (integer) plus the stdout and stderr > (which would have to be StringIO handles, with the data in > memory). This would be a direct replacement for the current > Bio.Application.generic_run function. > (c) Returns the stdout (and stderr) handles. This basically is > recreating a deprecated Python popen*() function, which seems > silly. Or we just declare both Bio.Application.generic_run and ApplicationResult obsolete, and simply recommend using subprocess with str(cline) as before. Would someone like to proof read (and test) the tutorial in CVS where I switched all the generic_run usage to subprocess? Peter From biopython at maubp.freeserve.co.uk Sat Aug 8 07:14:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 8 Aug 2009 12:14:18 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? 
In-Reply-To: <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <20090728220943.GJ68751@sobchak.mgh.harvard.edu> <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com> Message-ID: <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com> On Wed, Jul 29, 2009 at 8:43 AM, Peter wrote: > On Tue, Jul 28, 2009 at 11:09 PM, Brad Chapman wrote: >> Extending this to AlignIO and TreeIO as Eric suggested is >> also great. > > Whatever we do for Bio.SeqIO, we can follow the same pattern > for Bio.AlignIO etc. > >> So +1 from me, >> Brad > > And we basically had a +0 from Michiel, and a +1 from Eric. > And I like the idea but am not convinced we need it. Maybe > we should put the suggestion forward on the main discussion > list for debate? I've stuck a branch up on github which (thus far) simply defines the Bio.SeqIO.convert and Bio.AlignIO.convert functions. Adding optimised code can come later. http://github.com/peterjc/biopython/commits/convert Right now (based on the other thread), I've experimented with making the convert functions accept either handles or filenames. This will make the convert function even more of a convenience wrapper, in addition to its role as a standardised API to allow file format specific optimisations. Taking handles and/or filenames does rather complicate things, and not just for remembering to close the handles. There are issues like should we silently replace any existing output file (I went for yes), and should the output file be deleted if the conversion fails part way (I went for no)? Dealing with just handles would free us from all these considerations. You could even consider using Python's temporary file support to write the file to a temp location, and only at the end move it to the desired location. However that is getting far too complicated for my liking (and may runs into permissions issues on Unix). If anyone wants to do this, they can do it explicitly in the calling script. How does this look so far? Peter From biopython at maubp.freeserve.co.uk Sat Aug 8 15:41:20 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 8 Aug 2009 20:41:20 +0100 Subject: [Biopython-dev] Unit tests for deprecated modules? In-Reply-To: <320fb6e00808190704p4d19eb27if2927466a27f9b2a@mail.gmail.com> References: <320fb6e00808190352sd6437e0qb2898e39b15287b3@mail.gmail.com> <48AACE23.3050107@biologie.uni-kl.de> <320fb6e00808190704p4d19eb27if2927466a27f9b2a@mail.gmail.com> Message-ID: <320fb6e00908081241r1b23498du43fe19a6cc349c97@mail.gmail.com> Last year we talked about what to do with the unit tests for deprecated modules, http://lists.open-bio.org/pipermail/biopython-dev/2008-August/004137.html On Tue, Aug 19, 2008, Peter wrote: > Are there any strong views about when to remove unit tests for > deprecated modules? I can see two main approaches: > > (a) Remove the unit test when the code is deprecated, as this avoids > warning messages from the test suite. > (b) Remove the unit test only when the deprecated code is actually > removed, as continuing to test the code will catch any unexpected > breakage of the deprecated code. > > I lean towards (b), but wondered what other people think. > > Peter On Tue, Aug 19, 2008, Michiel de Hoon wrote: > I would say (a). In my opinion, deprecated means that the module > is in essence no longer part of Biopython; we just keep it around > to give people time to change. 
Also, deprecation warnings distract > from real warnings and errors in the unit tests, are likely to confuse > users, and give the impression that Biopython is not clean. I don't > remember a case where we had to resurrect a deprecated module, > so we may as well remove the unit test right away. > > --Michiel On Tue, Aug 19, 2008, Frank Kauff wrote: > I favor option a. Deprecated modules are no longer under development, > so there's not much need for a unit test. A failed test would probably > not trigger any action anyway, because nobody's going to do much > bugfixing in deprecated modules. > > Frank So, what we agreed last year was to remove tests for deprecated modules. This issue has come up again with the deprecation of Bio.Fasta, and the question of what to do with test_Fasta.py I'd like to suggest a third option: Keep the tests for deprecated modules, but silence the deprecation warning. e.g. make can test_Fasta.py silence the Bio.Fasta deprecation warning. Hiding the warning would prevent the likely user confusion on running the test suite (an issue Michiel pointed out last year). Keeping the test will prevent us accidentally breaking Bio.Fasta during the phasing out period. Any thoughts? Peter From biopython at maubp.freeserve.co.uk Sat Aug 8 15:50:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 8 Aug 2009 20:50:47 +0100 Subject: [Biopython-dev] Unit tests for deprecated modules? In-Reply-To: <320fb6e00908081241r1b23498du43fe19a6cc349c97@mail.gmail.com> References: <320fb6e00808190352sd6437e0qb2898e39b15287b3@mail.gmail.com> <48AACE23.3050107@biologie.uni-kl.de> <320fb6e00808190704p4d19eb27if2927466a27f9b2a@mail.gmail.com> <320fb6e00908081241r1b23498du43fe19a6cc349c97@mail.gmail.com> Message-ID: <320fb6e00908081250j189ba590o5cd9c6e98f596193@mail.gmail.com> On Sat, Aug 8, 2009 at 8:41 PM, Peter wrote: > Last year we talked about what to do with the unit tests for deprecated modules, > http://lists.open-bio.org/pipermail/biopython-dev/2008-August/004137.html > ... > I'd like to suggest a third option: Keep the tests for deprecated > modules, but silence the deprecation warning. e.g. make > test_Fasta.py silence the Bio.Fasta deprecation warning. I've done that in CVS as a proof of principle, replacing: from Bio import Fasta with: import warnings warnings.filterwarnings("ignore", category=DeprecationWarning) from Bio import Fasta warnings.resetwarnings() There may be a more elegant way to do this, but it works. Peter From bugzilla-daemon at portal.open-bio.org Mon Aug 10 09:43:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 10 Aug 2009 09:43:15 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200908101343.n7ADhF4c020240@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1303 is|0 |1 obsolete| | ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-10 09:43 EST ------- (From update of attachment 1303) This file is already a tiny bit out of date - I've started working on this on a git branch. 
http://github.com/peterjc/biopython/commits/sff See also James Casbon's parser, also on github: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006456.html http://github.com/jamescasbon/biopython/tree/sff It looks like we could try and merge the two. James' code looks like it doesn't need seek/tell, which means it should work on any input handle (not just an open file). Note neither parser yet copes with paired end data (and I have not yet found any test files to work on). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Aug 10 12:46:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 17:46:16 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <20090728220943.GJ68751@sobchak.mgh.harvard.edu> <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com> <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com> Message-ID: <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com> On Sat, Aug 8, 2009 at 12:14 PM, Peter wrote: > I've stuck a branch up on github which (thus far) simply defines > the Bio.SeqIO.convert and Bio.AlignIO.convert functions. > Adding optimised code can come later. > > http://github.com/peterjc/biopython/commits/convert There is now a new file Bio/SeqIO/_convert.py on this branch, and a few optimised conversions have been done. In particular GenBank/EMBL to FASTA, any FASTQ to FASTA, and inter-conversion between any of the three FASTQ formats. In terms of speed, this new code takes under a minute to convert a 7 million short read FASTQ file to another FASTQ variant, or to a (line wrapped) FASTA file. In comparison, using Bio.SeqIO parse/write takes over five minutes. In terms of code organisation within Bio/SeqIO/_convert.py I am (as with Bio.SeqIO etc for parsing and writing) just using a dictionary of functions, keyed on the format names. Initially, as you can tell from the code history, I was thinking about having each sub-function potentially dealing with more than one conversion (e.g. GenBank to anything not needing features), but have removed this level of complication in the most recent commit. The current Bio/SeqIO/_convert.py file actually looks very long and complicated - but if you ignore the doctests (which I would probably more to a dedicated unit test), it isn't that much code at all. Would anyone like to try this out? Peter From eric.talevich at gmail.com Mon Aug 10 13:44:31 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 10 Aug 2009 13:44:31 -0400 Subject: [Biopython-dev] GSoC Weekly Update 12: PhyloXML for Biopython Message-ID: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com> Hi folks, Previously I (Aug. 3-7): - Refactored the PhyloXML parser somewhat, to behave more like the other Biopython parsers and also handle 'other' elements better - Reorganized Bio.Tree a bit, generalizing the Tree base class and improving BaseTree-PhyloXML interoperability - Worked on networkx export and graphviz display - Added some more tests (thanks, Diana!) - Added TreeIO.convert(), to match the AlignIO and SeqIO modules Next week (Aug. 
10-14) I will: - Update the wiki documentation - Fix any surprises that come up during testing Automated testing: - Check unit tests for complete coverage - Re-run performance benchmarks - Run tests and benchmarks on alternate platforms - Check epydoc's generated API documentation Remarks: - Performance of the I/O functions is close to what it was before, in the best of times; parsing Taxonomy nodes incrementally seems to have helped. - Drawing trees with Graphviz is still ugly. Hopefully I can fix it this week, but if not, I'll probably do it after GSoC because I like pretty things. - Presumably, any discussion of merging with Biopython will have to wait until after the biopython-1.51 release. I'll be around. For GSoC requirements, I'm planning on just dumping the Bio.Tree and Bio.TreeIO modules along with the unit test suite as standalone files, rather than as a patch set since the last upstream revision I pulled was just a random untagged one around the time of the last beta release. Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From matzke at berkeley.edu Mon Aug 10 16:23:15 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 10 Aug 2009 13:23:15 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <20090804222731.GA12604@sobchak.mgh.harvard.edu> References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> Message-ID: <4A8081B3.2080600@berkeley.edu> Hi all...updates... Summary: Major focus is getting the GBIF access/search/parse module into "done"/submittable shape. This primarily requires getting the documentation and testing up to biopython specs. I have a fair bit of documentation and testing, need advice (see below) for specifics on what it should look like. Brad Chapman wrote: > Hi Nick; > Thanks for the update -- great to see things moving along. > >> - removed any reliance on lagrange tree module, refactored all phylogeny >> code to use the revised Bio.Nexus.Tree module > > Awesome -- glad this worked for you. Are the lagrange_* files in > Bio.Geography still necessary? If not, we should remove them from > the repository to clean things up. Ah, they had been deleted locally but it took an extra command to delete on git. Done. > > More generally, it would be really helpful if we could do a bit of > housekeeping on the repository. The Geography namespace has a lot of > things in it which belong in different parts of the tree: > > - The test code should move to the 'Tests' directory as a set of > test_Geography* files that we can use for unit testing the code. OK, I will do this. Should I try and figure out the unittest stuff? I could use a simple example of what this is supposed to look like. > - Similarly there are a lot of data files in there which are > appear to be test related; these could move to Tests/Geography Will do. > - What is happening with the Nodes_v2 and Treesv2 files? They look > like duplicates of the Nexus Nodes and Trees with some changes. 
> Could we roll those changes into the main Nexus code to avoid > duplication? Yeah, these were just copies with your bug fix, and with a few mods I used to track crashes. Presumably I don't need these with after a fresh download of biopython. >> - Code dealing with GBIF xml output completely refactored into the >> following classes: >> >> * ObsRecs (observation records & search results/summary) >> * ObsRec (an individual observation record) >> * XmlString (functions for cleaning xml returned by Gbif) >> * GbifXml (extention of capabilities for ElementTree xml trees, parsed >> from GBIF xml returns. > > I'm agreed with Hilmar -- the user classes would probably benefit from expanded > naming. There is a art to naming to get them somewhere between the hideous > RidicuouslyLongNamesWithEverythingSpecified names and short truncated names. > Specifically, you've got a lot of filler in the names -- dbfUtils, > geogUtils, shpUtils. The Utils probably doesn't tell the user much > and makes all of the names sort of blend together, just as the Rec/Recs > pluralization hides a quite large difference in what the classes hold. Will work on this, these should be made part of the GbifObservationRecord() object or be accessed by it, basically they only exist to classify lat/long points into user-specified areas. > Something like Observation and ObservationSearchResult would make it > clear immediately what they do and the information they hold. Agreed, here is a new scheme for the names (changes already made): ============= class GbifSearchResults(): GbifSearchResults is a class for holding a series of GbifObservationRecord records, and processing them e.g. into classified areas. Also can hold a GbifDarwincoreXmlString record (the raw output returned from a GBIF search) and a GbifXmlTree (a class for holding/processing the ElementTree object returned by parsing the GbifDarwincoreXmlString). class GbifObservationRecord(): GbifObservationRecord is a class for holding an individual observation at an individual lat/long point. class GbifDarwincoreXmlString(str): GbifDarwincoreXmlString is a class for holding the xmlstring returned by a GBIF search, & processing it to plain text, then an xmltree (an ElementTree). GbifDarwincoreXmlString inherits string methods from str (class String). class GbifXmlTree(): gbifxml is a class for holding and processing xmltrees of GBIF records. ============= ...description of methods below... > >> This week: > > What are your thoughts on documentation? As a naive user of these > tools without much experience with the formats, I could offer better > feedback if I had an idea of the public APIs and how they are > expected to be used. Moreover, cookbook and API documentation is something > we will definitely need to integrate into Biopython. How does this fit > in your timeline for the remaining weeks? The API is really just the interface with GBIF. I think developing a cookbook entry is pretty easy, I assume you want something like one of the entries in the official biopython cookbook? Re: API documentation...are you just talking about the function descriptions that are typically in """ """ strings beneath the function definitions? I've got that done. Again, if there is more, an example of what it should look like would be useful. Documentation for the GBIF stuff below. ============ gbif_xml.py Functions for accessing GBIF, downloading records, processing them into a class, and extracting information from the xmltree in that class. 
class GbifObservationRecord(Exception): pass class GbifObservationRecord(): GbifObservationRecord is a class for holding an individual observation at an individual lat/long point. __init__(self): This is an instantiation class for setting up new objects of this class. latlong_to_obj(self, line): Read in a string, read species/lat/long to GbifObservationRecord object This can be slow, e.g. 10 seconds for even just ~1000 records. parse_occurrence_element(self, element): Parse a TaxonOccurrence element, store in OccurrenceRecord fill_occ_attribute(self, element, el_tag, format='str'): Return the text found in matching element matching_el.text. find_1st_matching_subelement(self, element, el_tag, return_element): Burrow down into the XML tree, retrieve the first element with the matching tag. record_to_string(self): Print the attributes of a record to a string class GbifDarwincoreXmlString(Exception): pass class GbifDarwincoreXmlString(str): GbifDarwincoreXmlString is a class for holding the xmlstring returned by a GBIF search, & processing it to plain text, then an xmltree (an ElementTree). GbifDarwincoreXmlString inherits string methods from str (class String). __init__(self, rawstring=None): This is an instantiation class for setting up new objects of this class. fix_ASCII_lines(self, endline=''): Convert each line in an input string into pure ASCII (This avoids crashes when printing to screen, etc.) _fix_ASCII_line(self, line): Convert a single string line into pure ASCII (This avoids crashes when printing to screen, etc.) _unescape(self, text): # Removes HTML or XML character references and entities from a text string. @param text The HTML (or XML) source text. @return The plain text, as a Unicode string, if necessary. source: http://effbot.org/zone/re-sub.htm#unescape-html _fix_ampersand(self, line): Replaces "&" with "&" in a string; this is otherwise not caught by the unescape and unicodedata.normalize functions. class GbifXmlTreeError(Exception): pass class GbifXmlTree(): gbifxml is a class for holding and processing xmltrees of GBIF records. __init__(self, xmltree=None): This is an instantiation class for setting up new objects of this class. print_xmltree(self): Prints all the elements & subelements of the xmltree to screen (may require fix_ASCII to input file to succeed) print_subelements(self, element): Takes an element from an XML tree and prints the subelements tag & text, and the within-tag items (key/value or whatnot) _element_items_to_dictionary(self, element_items): If the XML tree element has items encoded in the tag, e.g. key/value or whatever, this function puts them in a python dictionary and returns them. extract_latlongs(self, element): Create a temporary pseudofile, extract lat longs to it, return results as string. Inspired by: http://www.skymind.com/~ocrow/python_string/ (Method 5: Write to a pseudo file) _extract_latlong_datum(self, element, file_str): Searches an element in an XML tree for lat/long information, and the complete name. Searches recursively, if there are subelements. file_str is a string created by StringIO in extract_latlongs() (i.e., a temp filestr) extract_all_matching_elements(self, start_element, el_to_match): Returns a list of the elements, picking elements by TaxonOccurrence; this should return a list of elements equal to the number of hits. _recursive_el_match(self, element, el_to_match, output_list): Search recursively through xmltree, starting with element, recording all instances of el_to_match. 
find_to_elements_w_ancs(self, el_tag, anc_el_tag): Burrow into XML to get an element with tag el_tag, return only those el_tags underneath a particular parent element parent_el_tag xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag, match_el_list): Recursively burrows down to find whatever elements with el_tag exist inside a parent_el_tag. create_sub_xmltree(self, element): Create a subset xmltree (to avoid going back to irrelevant parents) _xml_burrow_up(self, element, anc_el_tag, found_anc): Burrow up xml to find anc_el_tag _xml_burrow_up_cousin(element, cousin_el_tag, found_cousin): Burrow up from element of interest, until a cousin is found with cousin_el_tag _return_parent_in_xmltree(self, child_to_search_for): Search through an xmltree to get the parent of child_to_search_for _return_parent_in_element(self, potential_parent, child_to_search_for, returned_parent): Search through an XML element to return parent of child_to_search_for find_1st_matching_element(self, element, el_tag, return_element): Burrow down into the XML tree, retrieve the first element with the matching tag extract_numhits(self, element): Search an element of a parsed XML string and find the number of hits, if it exists. Recursively searches, if there are subelements. class GbifSearchResults(Exception): pass class GbifSearchResults(): GbifSearchResults is a class for holding a series of GbifObservationRecord records, and processing them e.g. into classified areas. __init__(self, gbif_recs_xmltree=None): This is an instantiation class for setting up new objects of this class. print_records(self): Print all records in tab-delimited format to screen. print_records_to_file(self, fn): Print the attributes of a record to a file with filename fn latlongs_to_obj(self): Takes the string from extract_latlongs, puts each line into a GbifObservationRecord object. Return a list of the objects Functions devoted to accessing/downloading GBIF records access_gbif(self, url, params): Helper function to access various GBIF services choose the URL ("url") from here: http://data.gbif.org/ws/rest/occurrence params are a dictionary of key/value pairs "self._open" is from Bio.Entrez.self._open, online here: http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#self._open Get the handle of results (looks like e.g.: > ) (open with results_handle.read() ) _get_hits(self, params): Get the actual hits that are be returned by a given search (this allows parsing & gradual downloading of searches larger than e.g. 1000 records) It will return the LAST non-none instance (in a standard search result there should be only one, anyway). get_xml_hits(self, params): Returns hits like _get_hits, but returns a parsed XML tree. get_record(self, key): Given the key, get a single record, return xmltree for it. get_numhits(self, params): Get the number of hits that will be returned by a given search (this allows parsing & gradual downloading of searches larger than e.g. 1000 records) It will return the LAST non-none instance (in a standard search result there should be only one, anyway). xmlstring_to_xmltree(self, xmlstring): Take the text string returned by GBIF and parse to an XML tree using ElementTree. Requires the intermediate step of saving to a temporary file (required to make ElementTree.parse work, apparently) tempfn = 'tempxml.xml' fh = open(tempfn, 'w') fh.write(xmlstring) fh.close() get_all_records_by_increment(self, params, inc): Download all of the records in stages, store in list of elements. Increments of e.g. 
100 to not overload server extract_occurrences_from_gbif_xmltree_list(self, gbif_xmltree): Extract all of the 'TaxonOccurrence' elements to a list, store them in a GbifObservationRecord. _paramsdict_to_string(self, params): Converts the python dictionary of search parameters into a text string for submission to GBIF _open(self, cgi, params={}): Function for accessing online databases. Modified from: http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html Helper function to build the URL and open a handle to it (PRIVATE). Open a handle to GBIF. cgi is the URL for the cgi script to access. params is a dictionary with the options to pass to it. Does some simple error checking, and will raise an IOError if it encounters one. This function also enforces the "three second rule" to avoid abusing the GBIF servers (modified after NCBI requirement). ============ > > Thanks again. Hope this helps, > Brad Very much, thanks!! Nick -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From matzke at berkeley.edu Mon Aug 10 16:25:10 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 10 Aug 2009 13:25:10 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <4A8081B3.2080600@berkeley.edu> References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> Message-ID: <4A808226.5020302@berkeley.edu> PS: Evidence of interest in this GBIF functionality already, see fwd below... PPS: Commit with updates names, deleted old files here: http://github.com/nmatzke/biopython/commits/Geography -------- Original Message -------- Subject: Re: biogeopython Date: Fri, 07 Aug 2009 16:34:26 -0700 From: Nick Matzke Reply-To: matzke at berkeley.edu Organization: Dept. Integ. 
Biology, UC Berkeley To: James Pringle References: <4A7C6DEE.1000305 at berkeley.edu> Coolness, let me know how it works for you, feedback appreciated at this stage. Cheers! Nick James Pringle wrote: > Thanks! > Jamie > > On Fri, Aug 7, 2009 at 2:09 PM, Nick Matzke > wrote: > > Hi Jamie! > > It's still under development, eventually it will be a biopython > module, but what I've got should do exactly what you need. > > Just take the files from the most recent commit here: > http://github.com/nmatzke/biopython/commits/Geography > > ...and run test_gbif_xml.py to get the idea, it will search on a > taxon name, count/download all hits, parse the xml to a set of > record objects, output each record to screen or tab-delimited file, > etc. > > Cheers! > Nick > > > > > > James Pringle wrote: > > Dear Mr. Matzke-- > > I am an oceanographer at the University of New Hampshire, and > with my colleagues John Wares and Jeb Byers am looking at the > interaction of ocean circulation and species ranges. As part > of that effort, I am using GBIF data, and was looking at your > Summer-of-Code project. I want to start from a species name > and get lat/long of occurance data. Is you toolbox in usable > shape (I am an ok pythonista)? What is the best way to download > a tested version of it (I can figure out how to get code from > CVS/GIT, etc, so I am just looking for a pointer to a stable-ish > tree)? > > Cheers, > & Thanks > Jamie Pringle > > > -- > ==================================================== > Nicholas J. Matzke > Ph.D. Candidate, Graduate Student Researcher > Huelsenbeck Lab > Center for Theoretical Evolutionary Genomics > 4151 VLSB (Valley Life Sciences Building) > Department of Integrative Biology > University of California, Berkeley > > Lab websites: > http://ib.berkeley.edu/people/lab_detail.php?lab=54 > http://fisher.berkeley.edu/cteg/hlab.html > Dept. personal page: > http://ib.berkeley.edu/people/students/person_detail.php?person=370 > Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html > Lab phone: 510-643-6299 > Dept. fax: 510-643-6264 > Cell phone: 510-301-0179 > Email: matzke at berkeley.edu > > Mailing address: > Department of Integrative Biology > 3060 VLSB #3140 > Berkeley, CA 94720-3140 > > ----------------------------------------------------- > "[W]hen people thought the earth was flat, they were wrong. When > people thought the earth was spherical, they were wrong. But if you > think that thinking the earth is spherical is just as wrong as > thinking the earth is flat, then your view is wronger than both of > them put together." > > Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical > Inquirer, 14(1), 35-44. Fall 1989. > http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm > ==================================================== > > Nick Matzke wrote: > Hi all...updates... > > Summary: Major focus is getting the GBIF access/search/parse module into > "done"/submittable shape. This primarily requires getting the > documentation and testing up to biopython specs. I have a fair bit of > documentation and testing, need advice (see below) for specifics on what > it should look like. > > > Brad Chapman wrote: >> Hi Nick; >> Thanks for the update -- great to see things moving along. >> >>> - removed any reliance on lagrange tree module, refactored all >>> phylogeny code to use the revised Bio.Nexus.Tree module >> >> Awesome -- glad this worked for you. Are the lagrange_* files in >> Bio.Geography still necessary? 
If not, we should remove them from >> the repository to clean things up. > > > Ah, they had been deleted locally but it took an extra command to delete > on git. Done. > >> >> More generally, it would be really helpful if we could do a bit of >> housekeeping on the repository. The Geography namespace has a lot of >> things in it which belong in different parts of the tree: >> >> - The test code should move to the 'Tests' directory as a set of >> test_Geography* files that we can use for unit testing the code. > > OK, I will do this. Should I try and figure out the unittest stuff? I > could use a simple example of what this is supposed to look like. > > >> - Similarly there are a lot of data files in there which are >> appear to be test related; these could move to Tests/Geography > > Will do. > >> - What is happening with the Nodes_v2 and Treesv2 files? They look >> like duplicates of the Nexus Nodes and Trees with some changes. >> Could we roll those changes into the main Nexus code to avoid >> duplication? > > Yeah, these were just copies with your bug fix, and with a few mods I > used to track crashes. Presumably I don't need these with after a fresh > download of biopython. > > > >>> - Code dealing with GBIF xml output completely refactored into the >>> following classes: >>> >>> * ObsRecs (observation records & search results/summary) >>> * ObsRec (an individual observation record) >>> * XmlString (functions for cleaning xml returned by Gbif) >>> * GbifXml (extention of capabilities for ElementTree xml trees, >>> parsed from GBIF xml returns. >> >> I'm agreed with Hilmar -- the user classes would probably benefit from >> expanded >> naming. There is a art to naming to get them somewhere between the >> hideous RidicuouslyLongNamesWithEverythingSpecified names and short >> truncated names. >> Specifically, you've got a lot of filler in the names -- dbfUtils, >> geogUtils, shpUtils. The Utils probably doesn't tell the user much >> and makes all of the names sort of blend together, just as the >> Rec/Recs pluralization hides a quite large difference in what the >> classes hold. > > Will work on this, these should be made part of the > GbifObservationRecord() object or be accessed by it, basically they only > exist to classify lat/long points into user-specified areas. > >> Something like Observation and ObservationSearchResult would make it >> clear immediately what they do and the information they hold. > > > Agreed, here is a new scheme for the names (changes already made): > > ============= > class GbifSearchResults(): > > GbifSearchResults is a class for holding a series of > GbifObservationRecord records, and processing them e.g. into classified > areas. > > Also can hold a GbifDarwincoreXmlString record (the raw output returned > from a GBIF search) and a GbifXmlTree (a class for holding/processing > the ElementTree object returned by parsing the GbifDarwincoreXmlString). > > > > class GbifObservationRecord(): > > GbifObservationRecord is a class for holding an individual observation > at an individual lat/long point. > > > > class GbifDarwincoreXmlString(str): > > GbifDarwincoreXmlString is a class for holding the xmlstring returned by > a GBIF search, & processing it to plain text, then an xmltree (an > ElementTree). > > GbifDarwincoreXmlString inherits string methods from str (class String). > > > > class GbifXmlTree(): > gbifxml is a class for holding and processing xmltrees of GBIF records. > ============= > > ...description of methods below... 
> > >> >>> This week: >> >> What are your thoughts on documentation? As a naive user of these >> tools without much experience with the formats, I could offer better >> feedback if I had an idea of the public APIs and how they are >> expected to be used. Moreover, cookbook and API documentation is >> something we will definitely need to integrate into Biopython. How >> does this fit in your timeline for the remaining weeks? > > The API is really just the interface with GBIF. I think developing a > cookbook entry is pretty easy, I assume you want something like one of > the entries in the official biopython cookbook? > > Re: API documentation...are you just talking about the function > descriptions that are typically in """ """ strings beneath the function > definitions? I've got that done. Again, if there is more, an example > of what it should look like would be useful. > > Documentation for the GBIF stuff below. > > ============ > gbif_xml.py > Functions for accessing GBIF, downloading records, processing them into > a class, and extracting information from the xmltree in that class. > > > class GbifObservationRecord(Exception): pass > class GbifObservationRecord(): > GbifObservationRecord is a class for holding an individual observation > at an individual lat/long point. > > > __init__(self): > > This is an instantiation class for setting up new objects of this class. > > > > latlong_to_obj(self, line): > > Read in a string, read species/lat/long to GbifObservationRecord object > This can be slow, e.g. 10 seconds for even just ~1000 records. > > > parse_occurrence_element(self, element): > > Parse a TaxonOccurrence element, store in OccurrenceRecord > > > fill_occ_attribute(self, element, el_tag, format='str'): > > Return the text found in matching element matching_el.text. > > > > find_1st_matching_subelement(self, element, el_tag, return_element): > > Burrow down into the XML tree, retrieve the first element with the > matching tag. > > > record_to_string(self): > > Print the attributes of a record to a string > > > > > > > > class GbifDarwincoreXmlString(Exception): pass > > class GbifDarwincoreXmlString(str): > GbifDarwincoreXmlString is a class for holding the xmlstring returned by > a GBIF search, & processing it to plain text, then an xmltree (an > ElementTree). > > GbifDarwincoreXmlString inherits string methods from str (class String). > > > > __init__(self, rawstring=None): > > This is an instantiation class for setting up new objects of this class. > > > > fix_ASCII_lines(self, endline=''): > > Convert each line in an input string into pure ASCII > (This avoids crashes when printing to screen, etc.) > > > _fix_ASCII_line(self, line): > > Convert a single string line into pure ASCII > (This avoids crashes when printing to screen, etc.) > > > _unescape(self, text): > > # > Removes HTML or XML character references and entities from a text string. > > @param text The HTML (or XML) source text. > @return The plain text, as a Unicode string, if necessary. > source: http://effbot.org/zone/re-sub.htm#unescape-html > > > _fix_ampersand(self, line): > > Replaces "&" with "&" in a string; this is otherwise > not caught by the unescape and unicodedata.normalize functions. > > > > > > > > class GbifXmlTreeError(Exception): pass > class GbifXmlTree(): > gbifxml is a class for holding and processing xmltrees of GBIF records. > > __init__(self, xmltree=None): > > This is an instantiation class for setting up new objects of this class. 
> > > print_xmltree(self): > > Prints all the elements & subelements of the xmltree to screen (may require > fix_ASCII to input file to succeed) > > > print_subelements(self, element): > > Takes an element from an XML tree and prints the subelements tag & text, > and > the within-tag items (key/value or whatnot) > > > _element_items_to_dictionary(self, element_items): > > If the XML tree element has items encoded in the tag, e.g. key/value or > whatever, this function puts them in a python dictionary and returns > them. > > > extract_latlongs(self, element): > > Create a temporary pseudofile, extract lat longs to it, > return results as string. > > Inspired by: http://www.skymind.com/~ocrow/python_string/ > (Method 5: Write to a pseudo file) > > > > > _extract_latlong_datum(self, element, file_str): > > Searches an element in an XML tree for lat/long information, and the > complete name. Searches recursively, if there are subelements. > > file_str is a string created by StringIO in extract_latlongs() (i.e., a > temp filestr) > > > > extract_all_matching_elements(self, start_element, el_to_match): > > Returns a list of the elements, picking elements by TaxonOccurrence; > this should > return a list of elements equal to the number of hits. > > > > _recursive_el_match(self, element, el_to_match, output_list): > > Search recursively through xmltree, starting with element, recording all > instances of el_to_match. > > > find_to_elements_w_ancs(self, el_tag, anc_el_tag): > > Burrow into XML to get an element with tag el_tag, return only those > el_tags underneath a particular parent element parent_el_tag > > > xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag, > match_el_list): > > Recursively burrows down to find whatever elements with el_tag exist > inside a parent_el_tag. > > > > create_sub_xmltree(self, element): > > Create a subset xmltree (to avoid going back to irrelevant parents) > > > > _xml_burrow_up(self, element, anc_el_tag, found_anc): > > Burrow up xml to find anc_el_tag > > > > _xml_burrow_up_cousin(element, cousin_el_tag, found_cousin): > > Burrow up from element of interest, until a cousin is found with > cousin_el_tag > > > > > _return_parent_in_xmltree(self, child_to_search_for): > > Search through an xmltree to get the parent of child_to_search_for > > > > _return_parent_in_element(self, potential_parent, child_to_search_for, > returned_parent): > > Search through an XML element to return parent of child_to_search_for > > > find_1st_matching_element(self, element, el_tag, return_element): > > Burrow down into the XML tree, retrieve the first element with the > matching tag > > > > > extract_numhits(self, element): > > Search an element of a parsed XML string and find the > number of hits, if it exists. Recursively searches, > if there are subelements. > > > > > > > > > > > > > class GbifSearchResults(Exception): pass > > class GbifSearchResults(): > > GbifSearchResults is a class for holding a series of > GbifObservationRecord records, and processing them e.g. into classified > areas. > > > > __init__(self, gbif_recs_xmltree=None): > > This is an instantiation class for setting up new objects of this class. > > > > print_records(self): > > Print all records in tab-delimited format to screen. > > > > > print_records_to_file(self, fn): > > Print the attributes of a record to a file with filename fn > > > > latlongs_to_obj(self): > > Takes the string from extract_latlongs, puts each line into a > GbifObservationRecord object. 
> > Return a list of the objects > > > Functions devoted to accessing/downloading GBIF records > access_gbif(self, url, params): > > Helper function to access various GBIF services > > choose the URL ("url") from here: > http://data.gbif.org/ws/rest/occurrence > > params are a dictionary of key/value pairs > > "self._open" is from Bio.Entrez.self._open, online here: > http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#self._open > > Get the handle of results > (looks like e.g.: object at 0x48117f0>> ) > > (open with results_handle.read() ) > > > _get_hits(self, params): > > Get the actual hits that are be returned by a given search > (this allows parsing & gradual downloading of searches larger > than e.g. 1000 records) > > It will return the LAST non-none instance (in a standard search result > there > should be only one, anyway). > > > > > get_xml_hits(self, params): > > Returns hits like _get_hits, but returns a parsed XML tree. > > > > > get_record(self, key): > > Given the key, get a single record, return xmltree for it. > > > > get_numhits(self, params): > > Get the number of hits that will be returned by a given search > (this allows parsing & gradual downloading of searches larger > than e.g. 1000 records) > > It will return the LAST non-none instance (in a standard search result > there > should be only one, anyway). > > > xmlstring_to_xmltree(self, xmlstring): > > Take the text string returned by GBIF and parse to an XML tree using > ElementTree. > Requires the intermediate step of saving to a temporary file (required > to make > ElementTree.parse work, apparently) > > > > tempfn = 'tempxml.xml' > fh = open(tempfn, 'w') > fh.write(xmlstring) > fh.close() > > > > > > get_all_records_by_increment(self, params, inc): > > Download all of the records in stages, store in list of elements. > Increments of e.g. 100 to not overload server > > > > extract_occurrences_from_gbif_xmltree_list(self, gbif_xmltree): > > Extract all of the 'TaxonOccurrence' elements to a list, store them in a > GbifObservationRecord. > > > > _paramsdict_to_string(self, params): > > Converts the python dictionary of search parameters into a text > string for submission to GBIF > > > > _open(self, cgi, params={}): > > Function for accessing online databases. > > Modified from: > http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html > > Helper function to build the URL and open a handle to it (PRIVATE). > > Open a handle to GBIF. cgi is the URL for the cgi script to access. > params is a dictionary with the options to pass to it. Does some > simple error checking, and will raise an IOError if it encounters one. > > This function also enforces the "three second rule" to avoid abusing > the GBIF servers (modified after NCBI requirement). > ============ > > >> >> Thanks again. Hope this helps, >> Brad > > Very much, thanks!! > Nick > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. 
fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From biopython at maubp.freeserve.co.uk Mon Aug 10 16:49:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 21:49:29 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <4A8081B3.2080600@berkeley.edu> References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> Message-ID: <320fb6e00908101349o46162d5n4b91819895c32f8f@mail.gmail.com> On Mon, Aug 10, 2009 at 9:23 PM, Nick Matzke wrote: > Hi all...updates... > > Summary: Major focus is getting the GBIF access/search/parse module into > "done"/submittable shape. ?This primarily requires getting the documentation > and testing up to biopython specs. ?I have a fair bit of documentation and > testing, need advice (see below) for specifics on what it should look like. > >> - The test code should move to the 'Tests' directory as a set of >> ?test_Geography* files that we can use for unit testing the code. > > OK, I will do this. ?Should I try and figure out the unittest stuff? ?I > could use a simple example of what this is supposed to look like. You can either go for "unittest" based tests (generally better, but more of a learning curve - but useful for any python project), or our own Biopython specific "print and compare" tests (basically sample scripts with their expected output). Read the tests chapter in the Biopython Tutorial if you haven't already. (And if you think anything could be clearer, or you spot a typo, let us know please - feedback would be great). Peter From matzke at berkeley.edu Mon Aug 10 17:10:26 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 10 Aug 2009 14:10:26 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <320fb6e00908101349o46162d5n4b91819895c32f8f@mail.gmail.com> References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> <320fb6e00908101349o46162d5n4b91819895c32f8f@mail.gmail.com> Message-ID: <4A808CC2.6000308@berkeley.edu> Peter wrote: > On Mon, Aug 10, 2009 at 9:23 PM, Nick Matzke wrote: >> Hi all...updates... >> >> Summary: Major focus is getting the GBIF access/search/parse module into >> "done"/submittable shape. 
This primarily requires getting the documentation >> and testing up to biopython specs. I have a fair bit of documentation and >> testing, need advice (see below) for specifics on what it should look like. >> >>> - The test code should move to the 'Tests' directory as a set of >>> test_Geography* files that we can use for unit testing the code. >> OK, I will do this. Should I try and figure out the unittest stuff? I >> could use a simple example of what this is supposed to look like. > > You can either go for "unittest" based tests (generally better, but more > of a learning curve - but useful for any python project), or our own > Biopython specific "print and compare" tests (basically sample scripts > with their expected output). > > Read the tests chapter in the Biopython Tutorial if you haven't already. > (And if you think anything could be clearer, or you spot a typo, let us > know please - feedback would be great). Thanks! Nick > > Peter > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From biopython at maubp.freeserve.co.uk Tue Aug 11 08:19:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 11 Aug 2009 13:19:25 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <20090728220943.GJ68751@sobchak.mgh.harvard.edu> <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com> <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com> <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com> Message-ID: <320fb6e00908110519k313d6d34g40502fd2578326e1@mail.gmail.com> On Mon, Aug 10, 2009 at 5:46 PM, Peter wrote: > In terms of speed, this new code takes under a minute to > convert a 7 million short read FASTQ file to another FASTQ > variant, or to a (line wrapped) FASTA file. In comparison, > using Bio.SeqIO parse/write takes over five minutes. If anyone is interested in the details, here I am using a 7 million entry FASTQ file of short reads (length 36bp) from a Solexa FASTQ format file (downloaded from the NCBI and then converted from the Sanger FASTQ format). 
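(If anyone wants to recreate that test file, the preparation step was nothing clever - just the reverse conversion done with Bio.SeqIO, something along these lines, where the input filename is only a placeholder for whatever you downloaded from the NCBI:

#prepare_solexa_testfile.py
#Rough sketch, needs Biopython 1.50 or later; run once to make the test input.
from Bio import SeqIO
in_handle = open("SRR001666_1.fastq") #Sanger FASTQ from the NCBI (placeholder name)
out_handle = open("SRR001666_1.fastq_solexa", "w")
SeqIO.write(SeqIO.parse(in_handle, "fastq"), out_handle, "fastq-solexa")
out_handle.close()
in_handle.close()

It is the same parse/write approach as the timed scripts at the end of this email.)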
I'm timing conversion from Solexa to Sanger FASTQ as it is a more common operation, and I can include the MAQ script for comparison. I pipe the output via grep and word count as a check on the conversion. Using a (patched) version of MAQ's fq_all2std.pl we get about 4 mins: $ time perl ../biopython/Tests/Quality/fq_all2std.pl sol2std SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l 7047668 real 3m58.978s user 4m13.475s sys 0m3.705s And using a patched version of EMBOSS 6.1.0 (without the optimisations Peter Rice has mentioned), we get 3m42s. $ time seqret -filter -sformat fastq-solexa -osformat fastq-sanger < SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l 7047668 real 3m41.625s user 3m56.753s sys 0m4.091s Using the latest Biopython in CVS (or the git master branch), with Bio.SeqIO.parse/write, takes about twice this, 7m11s: $ time python biopython_solexa2sanger.py < SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l 7047668 real 7m10.706s user 7m27.597s sys 0m3.850s This is at least a marked improvement over Biopython 1.51b with Bio.SeqIO.parse/write, which took about 17 minutes! The bad news is while the Bio.SeqIO FASTQ read/write in CVS is faster than in Biopython 1.51b, it is also much less elegant. I'm think once I've finished adding test cases (and probably after 1.51 is out) it might be worth while trying to make it more beautiful without sacrificing too much of the speed gain. Now to the good news, using my github branch with the convert function we get a massive reduction to under a minute (52s): $ time python convert_solexa2sanger.py < SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l 7047668 real 0m51.618s user 1m7.735s sys 0m3.162s We have a winner! Assuming of course there are no mistakes ;) In fact, these measurements are a little misleading because I am including grep (to check the record count) and the output isn't actually going to disk. Doing the grep on its own takes about 15s: $ time grep "^@SRR" SRR001666_1.fastq_solexa | wc -l 7047668 real 0m15.318s user 0m17.890s sys 0m1.087s However, if you actually output to a file the disk speed itself becomes important when the conversion is this fast: $ time python convert_solexa2sanger.py < SRR001666_1.fastq_solexa > temp.fastq real 1m3.448s user 0m49.672s sys 0m4.826s $ time seqret -filter -sformat fastq-solexa -osformat fastq-sanger < SRR001666_1.fastq_solexa > temp.fastq real 3m55.086s user 3m39.548s sys 0m5.998s $ time perl ../biopython/Tests/Quality/fq_all2std.pl sol2std SRR001666_1.fastq_solexa > temp.fastq real 4m10.245s user 3m54.880s sys 0m5.085s $ time python ../biopython/Tests/Quality/biopython_solexa2sanger.py < SRR001666_1.fastq_solexa > temp.fastq real 7m27.879s user 7m9.084s sys 0m6.008s Nevertheless, the Bio.SeqIO.convert(...) function still wins for now. Peter For those interested, here are the tiny little Biopython scripts I'm using: # biopython_solexa2sanger.py #FASTQ conversion using Bio.SeqIO, needs Biopython 1.50 or later. import sys from Bio import SeqIO records = SeqIO.parse(sys.stdin, "fastq-solexa") SeqIO.write(records, sys.stdout, "fastq") and: #convert_solexa2sanger.py #High performance FASTQ conversion using Bio.SeqIO.convert(...) #function likely to be in Biopython 1.52 onwards. 
import sys from Bio import SeqIO SeqIO.convert(sys.stdin, "fastq-solexa", sys.stdout, "fastq") From chapmanb at 50mail.com Tue Aug 11 09:10:19 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 11 Aug 2009 09:10:19 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <4A8081B3.2080600@berkeley.edu> References: <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> Message-ID: <20090811131019.GW12604@sobchak.mgh.harvard.edu> Hi Nick; > Summary: Major focus is getting the GBIF access/search/parse module into > "done"/submittable shape. This primarily requires getting the > documentation and testing up to biopython specs. I have a fair bit of > documentation and testing, need advice (see below) for specifics on what > it should look like. Awesome. Thanks for working on the cleanup for this. > OK, I will do this. Should I try and figure out the unittest stuff? I > could use a simple example of what this is supposed to look like. In addition to Peter's pointers, here is a simple example from a small thing I wrote: http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py You can copy/paste the unit test part to get a base, and then replace the t_* functions with your own real tests. Simple scripts that generate consistent output are also fine; that's the print and compare approach. > > - What is happening with the Nodes_v2 and Treesv2 files? They look > > like duplicates of the Nexus Nodes and Trees with some changes. > > Could we roll those changes into the main Nexus code to avoid > > duplication? > > Yeah, these were just copies with your bug fix, and with a few mods I > used to track crashes. Presumably I don't need these with after a fresh > download of biopython. Cool. It would be great if we could weed these out as well. > The API is really just the interface with GBIF. I think developing a > cookbook entry is pretty easy, I assume you want something like one of > the entries in the official biopython cookbook? Yes, that would work great. What I was thinking of are some examples where you provide background and motivation: Describe some useful information you want to get from GBIF, and then show how to do it. This is definitely the most useful part as it gives people working examples to start with. From there they can usually browse the lower level docs or code to figure out other specific things. > Re: API documentation...are you just talking about the function > descriptions that are typically in """ """ strings beneath the function > definitions? I've got that done. Again, if there is more, an example > of what it should look like would be useful. That looks great for API level docs. You are right on here; for this week I'd focus on the cookbook examples and cleanup stuff. My other suggestion would be to rename these to follow Biopython conventions, something like: gbif_xml -> GbifXml shpUtils -> ShapefileUtils geogUtils -> GeographyUtils dbfUtils -> DbfUtils The *Utils might have underscores if they are not intended to be called directly. 
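One more thought on the unittest question while I'm at it: the skeleton you need is really small. Something like this, where every name below is a placeholder rather than your actual code:

import unittest

class GbifBasicTests(unittest.TestCase):
    def test_parse_saved_search(self):
        """Parse a small saved GBIF XML file and check what comes back."""
        #Placeholder: open a file kept under Tests/Geography/ and assert
        #on the number of records, a species name, a lat/long value, etc.
        self.assertEqual(len("GBIF"), 4)

if __name__ == "__main__":
    unittest.main()

Saved as Tests/test_Geography_something.py the Biopython test runner should pick it up automatically. Note that plain unittest.main() only collects methods named test_*, so if you copy the t_* style from my adaptor_trim.py example, keep its runner boilerplate as well.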
Thanks for all your hard work, Brad From chapmanb at 50mail.com Tue Aug 11 09:20:57 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 11 Aug 2009 09:20:57 -0400 Subject: [Biopython-dev] GSoC Weekly Update 12: PhyloXML for Biopython In-Reply-To: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com> References: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com> Message-ID: <20090811132057.GX12604@sobchak.mgh.harvard.edu> Hi Eric; All sounds great -- looks like you are in good shape for finishing things up this week. Really great work. > - Presumably, any discussion of merging with Biopython will have to wait > until after the biopython-1.51 release. I'll be around. For GSoC > requirements, I'm planning on just dumping the Bio.Tree and Bio.TreeIO > modules along with the unit test suite as standalone files, rather than > as a patch set since the last upstream revision I pulled was just a > random untagged one around the time of the last beta release. We were discussing a release at the end of this week or over the weekend. I think we should roll this in soon after that so anyone can get it from the main trunk. I don't see any major issues with integrating it. How did you like the Git/GitHub experience? One thing we should push after this release is moving over to that as the official repository. Since you have been doing full time Git work this summer, your experience will be really helpful. I still rely on CVS as a bit of a crutch, but should learn to do things fully in Git. Brad From biopython at maubp.freeserve.co.uk Tue Aug 11 12:13:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 11 Aug 2009 17:13:58 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00908060905i4a326327t504385ec55b0230c@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> <20090708130649.GY17086@sobchak.mgh.harvard.edu> <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com> <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com> <320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com> <320fb6e00908060905i4a326327t504385ec55b0230c@mail.gmail.com> Message-ID: <320fb6e00908110913x6cfe7826xa683a6dc130da26e@mail.gmail.com> On Thu, Aug 6, 2009 at 5:05 PM, Peter wrote: > Or we just declare both Bio.Application.generic_run and > ApplicationResult obsolete, and simply recommend using > subprocess with str(cline) as before. Would someone like to > proof read (and test) the tutorial in CVS where I switched all > the generic_run usage to subprocess? > I've just marked Bio.Application.generic_run and ApplicationResult as obsolete in CVS. I am content to wait for a consensus about any replacement for generic_run once more people have tried using subprocess directly. Peter From biopython at maubp.freeserve.co.uk Tue Aug 11 12:44:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 11 Aug 2009 17:44:11 +0100 Subject: [Biopython-dev] Drafting announcement for Biopython 1.51? Message-ID: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> Hi David & John, Would either of you be able to draft a release announcement for Biopython 1.51? We're aiming for the end of this week... touch wood. 
I'm pretty sure the NEWS and DEPRECATED files are up to date (if anyone can spot any omissions, please let us know), these try and summarise changes for each release. Unless you have CVS or git installed, the easiest way to read these files is currently from the github website: http://github.com/biopython/biopython/tree/master Thanks, Peter P.S. Don't be afraid to repeat things from the Biopython 1.51 beta announcement: http://news.open-bio.org/news/2009/06/biopython-151-beta-released/ http://lists.open-bio.org/pipermail/biopython-announce/2009-June/000057.html From eric.talevich at gmail.com Tue Aug 11 14:50:02 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 11 Aug 2009 14:50:02 -0400 Subject: [Biopython-dev] GSoC Weekly Update 12: PhyloXML for Biopython In-Reply-To: <20090811132057.GX12604@sobchak.mgh.harvard.edu> References: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com> <20090811132057.GX12604@sobchak.mgh.harvard.edu> Message-ID: <3f6baf360908111150q495e541bv405b25f0d74127fd@mail.gmail.com> On Tue, Aug 11, 2009 at 9:20 AM, Brad Chapman wrote: > > How did you like the Git/GitHub experience? One thing we should push > after this release is moving over to that as the official > repository. Since you have been doing full time Git work this > summer, your experience will be really helpful. I still rely on CVS > as a bit of a crutch, but should learn to do things fully in Git. > > I liked it a lot! I've spent some time with Subversion, Bazaar, Mercurial and Git now, and I'm confident that Git was the right choice for Biopython. My commit history shows a quick flurry of activity on each of the past few Fridays -- that's from a couple days of exploration toward the end of the week, then repeated calls to "git add -i" to pick out the parts that are worth keeping. I'm careful with git-rebase, but "git commit --amend" gets a fair amount of use. I could add a section on the Biopython wiki's GitUsage page, called something like "Managing Commits", giving some examples of this. GitHub has been down briefly a few times. It was only a problem because it happened on Monday mornings, when I wanted to push an updated README to my public fork at the same time as my weekly update e-mail to this list. Having a mirror on GitHub is great for getting started with Biopython development, but I'm still unclear on how changes should propagate back upstream after Biopython switches from CVS to Git. Pull requests? Core devs pushing to a central Git repository on OBF servers? Maybe the BioRuby folks have advice; if this has been settled on biopython-dev, I've missed it. Anyway. To create the final patch tarball next Monday for GSoC, I believe the right incantation looks like this: git format-patch -o gsoc-phyloxml master...phyloxml tar czf gsoc-phyloxml.tgz gsoc-phyloxml That's cleaner than I expected it to be. Neat. Cheers, Eric From winda002 at student.otago.ac.nz Wed Aug 12 01:47:13 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Wed, 12 Aug 2009 17:47:13 +1200 Subject: [Biopython-dev] Drafting announcement for Biopython 1.51? In-Reply-To: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> Message-ID: <4A825761.10106@student.otago.ac.nz> Peter wrote: > Hi David & John, > > Would either of you be able to draft a release announcement for > Biopython 1.51? We're aiming for the end of this week... touch wood. 
> We'll definitely aim to have something for the list to check out in the next 24hrs. I guess the main points are all the Cool New Stuff from the beta being in a stable release for the first time, FASTQ has been shown to play nicely with across a bunch of projects and Application.generic_run() is now on the deprecation path? On that note, would it be useful to have a cookbook example or even a blog-post ready to go showing a few of the ways one might use subprocess to run commands defined with Biopython? I'm happy to put something together that others can evaluate. Cheers, David From biopython at maubp.freeserve.co.uk Wed Aug 12 05:49:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 12 Aug 2009 10:49:50 +0100 Subject: [Biopython-dev] Drafting announcement for Biopython 1.51? In-Reply-To: <4A825761.10106@student.otago.ac.nz> References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> <4A825761.10106@student.otago.ac.nz> Message-ID: <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> On Wed, Aug 12, 2009 at 6:47 AM, David Winter wrote: > > Peter wrote: >> >> Hi David & John, >> >> Would either of you be able to draft a release announcement for >> Biopython 1.51? We're aiming for the end of this week... touch wood. > > We'll definitely aim to have something for the list to check out in the next > 24hrs. I guess the main points are all the Cool New Stuff from the beta > being in a stable release for the first time, FASTQ has been shown to play > nicely with across a bunch of projects and Application.generic_run() is now > on the deprecation path? Historically we haven't made a big thing about deprecations in the release announcements. Maybe we should - in which case also note that Bio.Fasta has finally been deprecated. > On that note, would it be useful to have a cookbook example or even a > blog-post ready to go showing a few of the ways one might use subprocess to > run commands defined with Biopython? I'm happy to put something together > that others can evaluate. The tutorial has several examples at the end of the chapter on alignments (because lots of the wrappers at the moment are for alignment tools). I've just updated the copy online to the current version from CVS (dated 10 August 2009). If you can spot any errors in the next couple of days we can get them fixed before the release. Peter From biopython at maubp.freeserve.co.uk Wed Aug 12 08:54:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 12 Aug 2009 13:54:15 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> Message-ID: <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> On Thu, Jul 23, 2009 at 10:34 AM, Peter wrote: > On Wed, Jul 22, 2009 at 9:51 PM, James Casbon wrote: >> I don't think there is much in it really. 
?You have a factored >> BinaryFile class, I have classes for the components of the SFF file. >> Both are based around struct. I have now written a third variant (loosely based on Jose's code). This is just a single generator function (also based on struct). Right now it is a slightly long function, but it can be refactored easily enough. Is also a lot faster than Jose's code which is a big plus point for large files. See: http://github.com/peterjc/biopython/tree/sff I haven't compared my new code against yours for speed yet James, because your parser didn't like my large SFF file. You have hard coded it to expect read names of length 14, and 400 flows per read. I have some data from Sanger where the read names are length 14, but there are 800 flows per read. Having the two reference parsers to look at was educational, so thank you both (James and Jose) for sharing your code. I now understand the SFF file format much better, and am now confident I could design an indexer to provide dictionary like access to it - a possible addition to Bio.SeqIO - see this thread: http://lists.open-bio.org/pipermail/biopython/2009-June/005312.html > Jose's code uses seek/tell which means it has to have a handle > to an actual file. He also used binary read mode - I'm not sure if > this was essential or not. Binary more was not essential - opening an SFF file in default mode also seemed to work fine with Jose's code. > James' code seems to make a single pass though the file handle, > without using seek/tell to jump about. I think this is nicer, as it is > consistent with the other SeqIO parsers, and should work on > more types of handles (e.g. from gzip, StringIO, or even a > network connection). I've also avoided using seek/tell in my rewrite. > It looks like you (James) construct Seq objects using the full > untrimmed sequence as is. I was undecided on if trimmed or > untrimmed should be the default, but the idea of some kind of > masked or trimmed Seq object had come up on the mailing list > which might be useful here (and in contig alignments). i.e. > something which acts like a Seq object giving the trimmed > sequence, but which also contains the full sequence and trim > positions. I'm still thinking about this. One simplistic option (as used on my branch) would be to have two input formats in Bio.SeqIO, one untrimmed and one trimmed, e.g. "sff" and "sff-trim". Peter From winda002 at student.otago.ac.nz Wed Aug 12 20:32:55 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 13 Aug 2009 12:32:55 +1200 Subject: [Biopython-dev] Draft announcement for Biopython 1.51 In-Reply-To: <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> <4A825761.10106@student.otago.ac.nz> <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> Message-ID: <4A835F37.5040907@student.otago.ac.nz> Hi all, here is a draft announcement to go out when 1.51 is built and ready to go. Comments and corrections are very welcome (should we keep the deprecation paragraph in?) I've also added a draft post to the OBF blog with this text marked up with links and ready to go, hopefully that way whoever builds the release can just ask someone with an account there (Brad and Peter at least) to push post once everything is ready. 
++ We are pleased to announce the release of Biopython 1.51.This new stable release enhances version 1.50 (released in April) by extending the functionality of existing modules, adding a set of application wrappers for popular alignment programs and fixing a number of minor bugs. In particular, the SeqIO module can now write genbank files that include features and deal with FASTQ files created by Illumina 1.3+. Support for this format allows interconversion between FASTQ files using Sloexa, Sanger and Ilumina quality scores and has been validated against the the BioPerl and EMBOSS implementations of this format. Biopython 1.51 is the first stable release to include the Align.Applications module which allows users to define command line wrappers for popular alignment programs including ClustalW, Muscle and T-Coffee. ?? This new release also spells the beginning of the end for some of Biopython's older tools. Bio.Fasta and the application tools ApplicationResult and generic_run() have been marked as deprecated which means they can still be imported but doing who warn the user that these functions will be removed in the future. Bio.Fasta has been superseded by SeqIO's support for the Fasta format while we now suggest using the subprocess module from the Python Standard Library to call applications - use of this module is extensively documented in section 6.3 of the Biopython Tutorial and Cookbook. ?? As always the Tutorial and Cookbook has been updated to document the other changes made since the last release. Thank you to everyone who tested our 1.51 beta or submitted bugs since out last stable release and to all of our contributors Sources and Windows Installer for the new release are available from the downloads page. ++ From winda002 at student.otago.ac.nz Wed Aug 12 20:37:12 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 13 Aug 2009 12:37:12 +1200 Subject: [Biopython-dev] Drafting announcement for Biopython 1.51? In-Reply-To: <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> <4A825761.10106@student.otago.ac.nz> <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> Message-ID: <4A836038.1060609@student.otago.ac.nz> >> On that note, would it be useful to have a cookbook example or even a >> blog-post ready to go showing a few of the ways one might use subprocess to >> run commands defined with Biopython? I'm happy to put something together >> that others can evaluate. >> > > The tutorial has several examples at the end of the chapter on > alignments (because lots of the wrappers at the moment are for > alignment tools). I've just updated the copy online to the current > version from CVS (dated 10 August 2009). If you can spot any > errors in the next couple of days we can get them fixed before > the release. > > Peter > > OK, I had only looked at the doc strings (my editor chokes on long text files and I don't have anything to set Tex docs with) so didn't know that existed. That looks really good (and the feeding output into handles bit is pretty wizardly!) Cheers, David From biopython at maubp.freeserve.co.uk Thu Aug 13 06:00:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 11:00:49 +0100 Subject: [Biopython-dev] Drafting announcement for Biopython 1.51? 
In-Reply-To: <4A836038.1060609@student.otago.ac.nz> References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> <4A825761.10106@student.otago.ac.nz> <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> <4A836038.1060609@student.otago.ac.nz> Message-ID: <320fb6e00908130300x3b4f1eb7m7711b76e0e03fd8a@mail.gmail.com> On Thu, Aug 13, 2009 at 1:37 AM, David Winter wrote: > > OK, I had only looked at the doc strings (my editor chokes on long > text files and I don't have anything to set Tex docs with) so didn't > know that existed. TeX or LaTeX files are just plain text with some magic markup e.g. \emph{text to emphasise}. Any decent text editor should be able to load them, and some will even colour code things. Even if you don't understand the markup, most of the time you can actually read the raw files directly and understand them. But yeah, the PDF or HTML output is what most people will want to look at ;) > That looks really good (and the feeding output into handles > bit is pretty wizardly!) Yeah - it is pretty cool. Sadly not all command line tools will accept input via stdin, so this kind of thing isn't always possible. Peter From biopython at maubp.freeserve.co.uk Thu Aug 13 06:10:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 11:10:44 +0100 Subject: [Biopython-dev] Draft announcement for Biopython 1.51 In-Reply-To: <4A835F37.5040907@student.otago.ac.nz> References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> <4A825761.10106@student.otago.ac.nz> <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> <4A835F37.5040907@student.otago.ac.nz> Message-ID: <320fb6e00908130310n8efa09dv81963277e607da52@mail.gmail.com> Thanks for the first draft David, On Thu, Aug 13, 2009 at 1:32 AM, David Winter wrote: > In particular, the SeqIO module can now write genbank files that include > features and deal with FASTQ files created by Illumina 1.3+. Support for > this format allows interconversion between FASTQ files using Sloexa, Sanger > and Ilumina quality scores and has been validated against the the BioPerl > and EMBOSS implementations of this format. Typo: Sloexa -> Solexa. I would probably rephrase the rest a little, there are some subtleties with 3 container formats but only 2 scoring systems... In particular, the SeqIO module can now write GenBank with features, and deal with FASTQ files created by Illumina 1.3+. Support for this format allows interconversion between FASTQ files using the Sanger, Solexa or Illumina 1.3+ FASTQ variants, using conventions agreed with the BioPerl and EMBOSS projects. [BioPerl and EMBOSS are still working on the FASTQ variants, so we haven't actually got everything cross validated yet.] > ?? > This new release also spells the beginning of the end for some of > Biopython's older tools. Bio.Fasta and the application tools > ApplicationResult and generic_run() have been marked as deprecated which > means they can still be imported but doing who warn the user that these > functions will be removed in the future. Bio.Fasta has been superseded by > SeqIO's support for the Fasta format while we now suggest using the > subprocess module from the Python Standard Library to call applications - > use of this module is extensively documented in section 6.3 of the Biopython > Tutorial and Cookbook. > ?? I would omit that, or at least cut it down a lot. It might also be worth mentioning we no longer include Martel/Mindy, and thus don't have any dependence on mxTextTools. 
Also we don't support Python 2.3 anymore. P.S. I try and avoid referring to sections of the Tutorial by number, as these often change from release to release. Thanks, Peter From biopython at maubp.freeserve.co.uk Thu Aug 13 09:02:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 14:02:17 +0100 Subject: [Biopython-dev] [Biopython] Trimming adaptors sequences In-Reply-To: <20090813124432.GB90165@sobchak.mgh.harvard.edu> References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> <20090810131650.GP12604@sobchak.mgh.harvard.edu> <320fb6e00908121621m25e37f20pbd8e5e01c26b13a7@mail.gmail.com> <20090813124432.GB90165@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908130602n607add6fme67f7934234a5540@mail.gmail.com> On Thu, Aug 13, 2009 at 1:44 PM, Brad Chapman wrote: >> However, if you just want speed AND you really want to have a FASTQ >> input file, try the underlying Bio.SeqIO.QualityIO.FastqGeneralIterator >> parser which gives plain strings, and handle the output yourself. Working >> directly with Python strings is going to be faster than using Seq and >> SeqRecord objects. You can even opt for outputting FASTQ files - as >> long as you leave the qualities as an encoded string, you can just slice >> that too. The downside is the code will be very specific. e.g. something >> along these lines: >> >> from Bio.SeqIO.QualityIO import FastqGeneralIterator >> in_handle = open(input_fastq_filename) >> out_handle = open(output_fastq_filename, "w") >> for title, seq, qual in FastqGeneralIterator(in_handle) : >> ? ? #Do trim logic here on the string seq >> ? ? if trim : >> ? ? ? ? seq = seq[start:end] >> ? ? ? ? qual = qual[start:end] # kept as ASCII string! >> ? ? #Save the (possibly trimmed) FASTQ record: >> ? ? out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) >> out_handle.close() >> in_handle.close() > > Nice -- I will have to play with this. I hadn't dug into the current > SeqRecord slicing code at all but I wonder if there is a way to keep > the SeqRecord interface but incorporate some of these speed ups > for common cases like this FASTQ trimming. I suggest we continue this on the dev mailing list (this reply is cross posted), as it is starting to get rather technical. When you really care about speed, any object creation becomes an issue. Right now for *any* record we have at least the following objects being created: SeqRecord, Seq, two lists (for features and dbxrefs), two dicts (for annotation and the per letter annotation), and the restricted dict (for per letter annotations), and at least four strings (sequence, id, name and description). Perhaps some lazy instantiation might be worth exploring... for example make dbxref, features, annotations or letter_annotations into properties where the underlying object isn't created unless accessed. [Something to try after Biopython 1.51 is out?] I would guess (but haven't timed it) that for trimming FASTQ SeqRecords, a bit part of the overhead is that we are using Python lists of integers (rather than just a string) for the scores. So sticking with the current SeqRecord object as is, one speed up we could try would be to leave the FASTQ quality string as an encoded string (rather than turning into integer quality scores, and back again on output). It would be a hack, but adding this as another SeqIO format name, e.g. "fastq-raw" or "fastq-ascii", might work. We'd still need a new letter_annotations key, say "fastq_qual_ascii". This idea might work, but it does seem ugly. 
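To make the lazy instantiation idea a little more concrete, I mean the usual property trick - this is only a sketch of the pattern, not a patch against the real SeqRecord:

class LazyAnnotationsMixin(object):
    """Toy example: the annotations dict is only built on first access."""
    _annotations = None
    @property
    def annotations(self):
        #Create the empty dict the first time anybody asks for it
        if self._annotations is None:
            self._annotations = {}
        return self._annotations

A record which gets parsed, trimmed and written out without anyone touching .annotations (and likewise .features, .dbxrefs and .letter_annotations if they got the same treatment) would then skip creating those objects altogether. Whether the saving justifies the extra indirection is something to measure rather than guess.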
Peter From biopython at maubp.freeserve.co.uk Thu Aug 13 13:33:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 18:33:41 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <1250099579.4a83017b4e97c@webmail.upv.es> References: <200904161146.28203.jblanca@btc.upv.es> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <1250099579.4a83017b4e97c@webmail.upv.es> Message-ID: <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> [Jose - you didn't CC the list with your reply] On Wed, Aug 12, 2009 at 6:52 PM, Blanca Postigo Jose Miguel wrote: > > Hi: > > I just love free software :) It's great to watch how the code is being improved > by the work of so many people. I hope to get some time to get a look at the > latest sff reader. You'll probably be interested to know I've made some excellent progress with the (optional) SFF index block. I note that the specifications (both on the NCBI page and in the Roche manual) appear to suggest that the index block could appear in the middle of the the read data. However, in all the examples I have looked at, the index is actually at the end. http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#sff Sadly the format of the index isn't documented, but I think I have reverse engineered the format that Roche SFF files are using. In a slight twist of the specification they are actually using the index bock for both XML meta data AND and index of the read offsets. This will dovetail nicely with the indexing support in Bio.SeqIO which I am working on for Biopython 1.52, branch on github. I expect to have fast random access to reads in an SFF file very soon. See http://github.com/peterjc/biopython/tree/convert >> > It looks like you (James) construct Seq objects using the full >> > untrimmed sequence as is. I was undecided on if trimmed or >> > untrimmed should be the default, but the idea of some kind of >> > masked or trimmed Seq object had come up on the mailing list >> > which might be useful here (and in contig alignments). i.e. >> > something which acts like a Seq object giving the trimmed >> > sequence, but which also contains the full sequence and trim >> > positions. >> >> I'm still thinking about this. One simplistic option (as used on >> my branch) would be to have two input formats in Bio.SeqIO, >> one untrimmed and one trimmed, e.g. "sff" and "sff-trim". > > I think that some way to mask the SeqRecord or Seq object > would be great. It would be useful for many tasks, not just this > one. Sure - if we can come up with a suitable design... 
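Just to give that discussion something concrete to poke at, the kind of thing I have in mind is a thin wrapper holding the full sequence plus the clip points - a rough sketch only, deliberately not touching the real Seq class:

class TrimmedSeq(object):
    """Toy wrapper: behaves like the trimmed read but keeps the full data.

    str() and len() give the clipped region, while full_seq, trim_start
    and trim_end remain available for anyone wanting the raw read.
    """
    def __init__(self, full_seq, trim_start=0, trim_end=None):
        self.full_seq = full_seq
        self.trim_start = trim_start
        if trim_end is None:
            trim_end = len(full_seq)
        self.trim_end = trim_end

    def __str__(self):
        return str(self.full_seq)[self.trim_start:self.trim_end]

    def __len__(self):
        return self.trim_end - self.trim_start

Whether something like this should subclass Seq, or whether the trim points really belong on the SeqRecord instead, is exactly the sort of design question to settle first.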
Peter From biopython at maubp.freeserve.co.uk Thu Aug 13 13:38:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 18:38:43 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <1250099579.4a83017b4e97c@webmail.upv.es> <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> Message-ID: <320fb6e00908131038v567ed86fjb775d810fb69e7d@mail.gmail.com> Peter wrote: > > Sadly the format of the index isn't documented, but I think I have > reverse engineered the format that Roche SFF files are using. In a > slight twist of the specification they are actually using the index bock > for both XML meta data AND and index of the read offsets. I'm not the first to notice this, see for example the Celera Assembler looks in a Roche SFF file's XML meta data to determine how the quality scores were called: http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Roche_454_Platforms Peter From jblanca at btc.upv.es Fri Aug 14 02:01:42 2009 From: jblanca at btc.upv.es (Blanca Postigo Jose Miguel) Date: Fri, 14 Aug 2009 08:01:42 +0200 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00908131456o2103578cs3ed1130c622307a2@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <1250099579.4a83017b4e97c@webmail.upv.es> <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> <1250192416.4a846c2045f94@webmail.upv.es> <320fb6e00908131456o2103578cs3ed1130c622307a2@mail.gmail.com> Message-ID: <1250229702.4a84fdc6c403a@webmail.upv.es> Mensaje citado por Peter : > On Thu, Aug 13, 2009 at 8:40 PM, Blanca Postigo Jose > Miguel wrote: > > > >> This will dovetail nicely with the indexing support in Bio.SeqIO > >> which I am working on for Biopython 1.52, branch on github. > >> I expect to have fast random access to reads in an SFF file > >> very soon. See http://github.com/peterjc/biopython/tree/convert > > > > I've written some code to solve a similar problem. Maybe you > > could take a look to it. It's in the classes FileIndex and > > FileSequenceIndex at: > > > > > http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/biolib_seqio_utils.py > > > > Did you see this thread? > http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html > > The coding style is quite different, but it looks the essential idea > is the same - we both scan the file to find each record, and use > a dictionary to record the offset. Interestingly you and Peio also > keeps the record's length in the dictionary, which will double the > memory requirements - for something you don't actually need. > > Peter > > P.S. You can forward or CC this back to the list if you like. 
We keep the record length to be able to return the record without having to scan the file again. Jose Blanca From biopython at maubp.freeserve.co.uk Fri Aug 14 05:36:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Aug 2009 10:36:31 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <1250229702.4a84fdc6c403a@webmail.upv.es> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <1250099579.4a83017b4e97c@webmail.upv.es> <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> <1250192416.4a846c2045f94@webmail.upv.es> <320fb6e00908131456o2103578cs3ed1130c622307a2@mail.gmail.com> <1250229702.4a84fdc6c403a@webmail.upv.es> Message-ID: <320fb6e00908140236v547ea056g965c7b7cd61d555c@mail.gmail.com> On Fri, Aug 14, 2009 at 7:01 AM, Blanca Postigo Jose Miguel wrote: > >> The coding style is quite different, but it looks the essential idea >> is the same - we both scan the file to find each record, and use >> a dictionary to record the offset. Interestingly you and Peio also >> keeps the record's length in the dictionary, which will double the >> memory requirements - for something you don't actually need. > > We keep the record length to be able to return the record without > having to scan the file again. If you want to be able to extract the raw record, that makes sense. It is still a trade off between memory usage and speed of access, and depending on your requirements either way makes sense. For Bio.SeqIO, I want to parse the raw record on access via the key in order to return a SeqRecord, so I have no need to keep the raw record length in memory. I'm using this github branch: http://github.com/peterjc/biopython/commits/index Peter From biopython at maubp.freeserve.co.uk Fri Aug 14 07:57:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Aug 2009 12:57:26 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <1250099579.4a83017b4e97c@webmail.upv.es> <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> Message-ID: <320fb6e00908140457k66e747dep881abf0b044ab9c1@mail.gmail.com> On Thu, Aug 13, 2009 at 6:33 PM, Peter wrote: > > You'll probably be interested to know I've made some excellent progress > with the (optional) SFF index block. I note that the specifications (both > on the NCBI page and in the Roche manual) appear to suggest that the > index block could appear in the middle of the the read data. However, > in all the examples I have looked at, the index is actually at the end. > > http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#sff > > Sadly the format of the index isn't documented, but I think I have > reverse engineered the format that Roche SFF files are using. 
In > a slight twist of the specification they are actually using the index > block for both XML meta data AND an index of the read offsets. > > This will dovetail nicely with the indexing support in Bio.SeqIO > which I am working on for Biopython 1.52, branch on github. > I expect to have fast random access to reads in an SFF file > very soon. See http://github.com/peterjc/biopython/tree/convert Sorry, wrong branch - my "index" branch has the indexing (as well as SFF files and the Bio.SeqIO.convert() functionality): http://github.com/peterjc/biopython/tree/index I've got this code working nicely for reading or indexing SFF files. Testing with a 2GB SFF file with 660808 Roche 454 reads, using the Roche index I can load this in under 3 seconds and retrieve any single record almost instantly. If the index is missing (or not in the expected format) I have to scan the file to build my own index, and that takes about 11 seconds - which is still fine :) Peter From biopython at maubp.freeserve.co.uk Fri Aug 14 08:00:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Aug 2009 13:00:15 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> Message-ID: <320fb6e00908140500n56e7ccbcl7123099b8de06ccf@mail.gmail.com> On Wed, Aug 12, 2009 at 1:54 PM, Peter wrote: > >> Jose's code uses seek/tell which means it has to have a handle >> to an actual file. He also used binary read mode - I'm not sure if >> this was essential or not. > > Binary mode was not essential - opening an SFF file in default > mode also seemed to work fine with Jose's code. Having worked on this more, default mode or binary mode are fine. However, as you might expect, you can't use Python's universal read lines mode when parsing SFF files. Peter From biopython at maubp.freeserve.co.uk Fri Aug 14 09:25:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Aug 2009 14:25:43 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <1250252775.4a8557e7d9ae4@webmail.upv.es> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <1250099579.4a83017b4e97c@webmail.upv.es> <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> <1250192416.4a846c2045f94@webmail.upv.es> <320fb6e00908131456o2103578cs3ed1130c622307a2@mail.gmail.com> <1250229702.4a84fdc6c403a@webmail.upv.es> <320fb6e00908140236v547ea056g965c7b7cd61d555c@mail.gmail.com> <1250252775.4a8557e7d9ae4@webmail.upv.es> Message-ID: <320fb6e00908140625v5f5bd338qc081e0e5091df9bf@mail.gmail.com> Jose wrote: >>> We keep the record length to be able to return the record without >>> having to scan the file again. Peter wrote: >> If you want to be able to extract the raw record, that makes sense. 
>> It is still a trade off between memory usage and speed of access, >> and depending on your requirements either way makes sense. >> >> For Bio.SeqIO, I want to parse the raw record on access via the >> key in order to return a SeqRecord, so I have no need to keep >> the raw record length in memory. I'm using this github branch: >> http://github.com/peterjc/biopython/commits/index Jose wrote: > We want the raw record because we plan to use this FileIndex on several > different files, not just for sequences. In fact you have an example on how to > use it for sequences in SequenceFileIndex, a class that uses the general > FileIndex. I think that this FileIndex class will be able even to index xml > files. This is the motivation for the design. I see - that makes sense. Peter From biopython at maubp.freeserve.co.uk Fri Aug 14 11:20:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Aug 2009 16:20:21 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <20090728220943.GJ68751@sobchak.mgh.harvard.edu> <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com> <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com> <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com> Message-ID: <320fb6e00908140820uf86603bh408dc93f99a3641a@mail.gmail.com> On Mon, Aug 10, 2009 at 5:46 PM, Peter wrote: > On Sat, Aug 8, 2009 at 12:14 PM, Peter wrote: >> I've stuck a branch up on github which (thus far) simply defines >> the Bio.SeqIO.convert and Bio.AlignIO.convert functions. >> Adding optimised code can come later. >> >> http://github.com/peterjc/biopython/commits/convert > > There is now a new file Bio/SeqIO/_convert.py on this > branch, and a few optimised conversions have been done. > In particular GenBank/EMBL to FASTA, any FASTQ to > FASTA, and inter-conversion between any of the three > FASTQ formats. > > The current Bio/SeqIO/_convert.py file actually looks very > long and complicated - but if you ignore the doctests (which > I would probably move to a dedicated unit test), it isn't that > much code at all. I have now moved all the test code to a new unit test file, test_SeqIO_convert.py, and think this code is ready for public testing/review, with a the aim of inclusion in Biopython 1.52 (i.e. it can wait until after 1.51 is done). I would still need to add this to the tutorial, but that won't take very long. Peter From bugzilla-daemon at portal.open-bio.org Fri Aug 14 11:23:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 Aug 2009 11:23:14 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200908141523.n7EFNExJ014906@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-14 11:23 EST ------- (In reply to comment #2) > (From update of attachment 1303 [details]) > This file is already a tiny bit out of date - I've started working on this > on a git branch. > > http://github.com/peterjc/biopython/commits/sff Actually, I got rid of that branch after merging it into my work on Bio.SeqIO indexing. I can now parse the Roche SFF index, allowing fast random access to the reads. 
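In use the idea looks roughly like this - note the function name and details could still change before this is merged, and the file and read names below are just made-up examples:

from Bio import SeqIO
reads = SeqIO.index("454Reads.sff", "sff") #example filename
print len(reads)                           #number of reads in the file
record = reads["E3MFGYR02JWQ7T"]           #example read name, parsed on demand
print record.id, len(record.seq)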
See: http://github.com/peterjc/biopython/commits/index http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006603.html Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Fri Aug 14 17:08:32 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 14 Aug 2009 17:08:32 -0400 Subject: [Biopython-dev] Biopython 1.51 code freeze Message-ID: <20090814210832.GL90165@sobchak.mgh.harvard.edu> Hey all; I'll be doing the 1.51 release this weekend, so am declaring an official code freeze until things get finished. If you have any last minute bugs or issues please check them in this evening; otherwise no more CVS commits until 1.51 is officially rolled and announced. Like, um, go outside this weekend or something. David -- thanks for writing up the release announcement. Everyone -- thanks for all your hard work on getting things ready for the release. After this is rolled we should be able to start checking in new functionality for 1.52 and beyond. Have a great weekend, Brad From biopython at maubp.freeserve.co.uk Sat Aug 15 08:09:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 15 Aug 2009 13:09:39 +0100 Subject: [Biopython-dev] Biopython 1.51 code freeze In-Reply-To: <20090814210832.GL90165@sobchak.mgh.harvard.edu> References: <20090814210832.GL90165@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908150509m322bfbd5yc55ab67b2af733a@mail.gmail.com> On Fri, Aug 14, 2009 at 10:08 PM, Brad Chapman wrote: > Hey all; > I'll be doing the 1.51 release this weekend, so am declaring an > official code freeze until things get finished. If you have any last > minute bugs or issues please check them in this evening; otherwise > no more CVS commits until 1.51 is officially rolled and announced. > Like, um, go outside this weekend or something. Cool - now that it has stopped raining, I might do that ;) Peter From tiagoantao at gmail.com Sat Aug 15 14:05:40 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sat, 15 Aug 2009 19:05:40 +0100 Subject: [Biopython-dev] Biopython 1.51 code freeze In-Reply-To: <20090814210832.GL90165@sobchak.mgh.harvard.edu> References: <20090814210832.GL90165@sobchak.mgh.harvard.edu> Message-ID: <6d941f120908151105l7144f806ub43b6aa761ed22a8@mail.gmail.com> Outside in this case means 37C and planting trees under heavy sun (with a short break for checking email on my mobile behind a shadow). Congratz on 1.51. I intend to start checking in new functionality in around 2 weeks. If someone wants to have a look at the code that is on git(genepop branch) and criticize, feel free. back to the trees now. 2009/8/14, Brad Chapman : > Hey all; > I'll be doing the 1.51 release this weekend, so am declaring an > official code freeze until things get finished. If you have any last > minute bugs or issues please check them in this evening; otherwise > no more CVS commits until 1.51 is officially rolled and announced. > Like, um, go outside this weekend or something. > > David -- thanks for writing up the release announcement. > > Everyone -- thanks for all your hard work on getting things ready for > the release. After this is rolled we should be able to start checking in > new functionality for 1.52 and beyond. 
> > Have a great weekend, > Brad > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Sent from my mobile device "A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin From chapmanb at 50mail.com Sun Aug 16 20:48:26 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 16 Aug 2009 20:48:26 -0400 Subject: [Biopython-dev] Biopython 1.51 status Message-ID: <20090817004826.GA4221@kunkel> Hey all; 1.51 is all checked and prepped and ready to go. However, I don't appear to have a user account on portal.open-bio.org, so can't transfer the new tarballs and api over there. Peter, you had mentioned you could do the windows installers. When you do those, could you also transfer over these tarballs and stick them in the right places: http://chapmanb.50mail.com/biopython-1.51.tar.gz http://chapmanb.50mail.com/biopython-1.51.zip http://chapmanb.50mail.com/api.tar.gz If you can do that I'll update the website and send out announcements in the morning. Thanks much. 1.51 on the way, Brad From biopython at maubp.freeserve.co.uk Mon Aug 17 05:04:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 10:04:16 +0100 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <20090817004826.GA4221@kunkel> References: <20090817004826.GA4221@kunkel> Message-ID: <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> On Mon, Aug 17, 2009 at 1:48 AM, Brad Chapman wrote: > Hey all; > 1.51 is all checked and prepped and ready to go. However, I don't > appear to have a user account on portal.open-bio.org, so can't > transfer the new tarballs and api over there. Right - your old account probably has had its password reset or something - do you want to contact the OBF or should I? > Peter, you had mentioned you could do the windows installers. > When you do those, could you also transfer over these tarballs > and stick them in the right places: > > http://chapmanb.50mail.com/biopython-1.51.tar.gz > http://chapmanb.50mail.com/biopython-1.51.zip > http://chapmanb.50mail.com/api.tar.gz Will do... > If you can do that I'll update the website and send out > announcements in the morning. Thanks much. Give me an hour or so ;) Peter From biopython at maubp.freeserve.co.uk Mon Aug 17 06:01:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 11:01:47 +0100 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> References: <20090817004826.GA4221@kunkel> <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> Message-ID: <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> On Mon, Aug 17, 2009 at 10:04 AM, Peter wrote: >> If you can do that I'll update the website and send out >> announcements in the morning. Thanks much. > > Give me an hour or so ;) OK, all uploaded, including the new tutorial. I also did the wiki (as it was simple for me to get the new file sizes), and added version 1.51 to bugzilla (not sure if you have the relevant permissions there or not - could you check?). Over to you now Brad for the release announcements (OBF blog, email) and PyPi, http://pypi.python.org/pypi/biopython/ and anything else on the list.
Thanks, Peter From chapmanb at 50mail.com Mon Aug 17 08:16:18 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 17 Aug 2009 08:16:18 -0400 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> References: <20090817004826.GA4221@kunkel> <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> Message-ID: <20090817121618.GD12768@sobchak.mgh.harvard.edu> Peter; Thanks for the help with this. Everything else is all finished up -- news posted and message sent to the lists. The announcement e-mail only needs to be approved on biopython-announce. I wrote a message to open-bio support to get my password reset on portal, so hopefully we'll get that all sorted. It's great to have this out. Thanks again to everyone for all the hard work, Brad > On Mon, Aug 17, 2009 at 10:04 AM, Peter wrote: > >> If you can do that I'll update the website and send out > >> announcements in the morning. Thanks much. > > > > Give me an hour or so ;) > > OK, all uploaded, including the new tutorial. I also did the wiki > (as it was simple for me to get the new file sizes), and added > version 1.51 to bugzilla (not sure if you have the relevent > permissions there or not - could you check?). > > Over to you now Brad for the release announcements (OBF > blog, email) and PyPi, http://pypi.python.org/pypi/biopython/ > and anything else on the list. > > Thanks, > > Peter From biopython at maubp.freeserve.co.uk Mon Aug 17 08:17:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 13:17:53 +0100 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <20090817121618.GD12768@sobchak.mgh.harvard.edu> References: <20090817004826.GA4221@kunkel> <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> <20090817121618.GD12768@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908170517s7b1a6fb4h37bd2d22046dc3a@mail.gmail.com> On Mon, Aug 17, 2009 at 1:16 PM, Brad Chapman wrote: > Peter; > Thanks for the help with this. Everything else is all finished up -- > news posted and message sent to the lists. The announcement e-mail > only needs to be approved on biopython-announce. Done. > I wrote a message to open-bio support to get my password reset on portal, > so hopefully we'll get that all sorted. Cool. > It's great to have this out. Thanks again to everyone for all the hard > work, > Brad And thank you Brad :) Peter From biopython at maubp.freeserve.co.uk Mon Aug 17 08:43:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 13:43:01 +0100 Subject: [Biopython-dev] Moving from CVS to git Message-ID: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> Hi all, Now that Biopython 1.51 is out (thanks Brad), we should discuss finally moving from CVS to git. This was something we talked about at BOSC/ISMB 2009, but not everyone was there. We have two main options: (a) Move from CVS (on the OBF servers) to github. All our developers will need to get github accounts, and be added as "collaborators" to the existing github repository. I would want a mechanism in place to backup the repository to the OBF servers (Bartek already has something that should work). (b) Move from CVS to git (on the OBF servers). All our developers can continue to use their existing OBF accounts. Bartek's existing scripts could be modified to push the updates from this OBF git repository onto github. 
In either case, there will be some "plumbing" work required, for example I'd like to continue to offer a recent source code dump at http://biopython.open-bio.org/SRC/biopython/ etc. Given we don't really seem to have the expertise "in house" to run an OBF git server ourselves right now, option (a) is simplest, and as I recall those of us at BOSC where OK with this plan. Assuming we go down this route (CVS to github), everyone with an existing CVS account should setup a github account if they want to continue to have commit access (e.g. Frank, Iddo). I would suggest that initially you get used to working with git and github BEFORE trying anything directly on what would be the "official" repository. It took me a while and I'm still learning ;) Is this agreeable? Are there any other suggestions? [Once this is settled, we can talk about things like merge requests and if they should be accompanied by a Bugzilla ticket or not.] Peter From eric.talevich at gmail.com Mon Aug 17 10:02:02 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 17 Aug 2009 10:02:02 -0400 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> References: <20090817004826.GA4221@kunkel> <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> Message-ID: <3f6baf360908170702y7fde75b5l491ddf0cba0143e3@mail.gmail.com> On Mon, Aug 17, 2009 at 6:01 AM, Peter wrote: > On Mon, Aug 17, 2009 at 10:04 AM, Peter > wrote: > >> If you can do that I'll update the website and send out > >> announcements in the morning. Thanks much. > > > > Give me an hour or so ;) > > OK, all uploaded, including the new tutorial. I also did the wiki > (as it was simple for me to get the new file sizes), and added > version 1.51 to bugzilla (not sure if you have the relevent > permissions there or not - could you check?). > > Over to you now Brad for the release announcements (OBF > blog, email) and PyPi, http://pypi.python.org/pypi/biopython/ > and anything else on the list. > > Thanks, > > Peter > Great to see the release went smoothly! I'm probably being impatient here, but was a tag created for v1.51 final? I don't see it in GitHub yet, and it's been slightly over an hour since the last push. Thanks, Eric From chapmanb at 50mail.com Mon Aug 17 10:17:58 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 17 Aug 2009 10:17:58 -0400 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <3f6baf360908170702y7fde75b5l491ddf0cba0143e3@mail.gmail.com> References: <20090817004826.GA4221@kunkel> <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> <3f6baf360908170702y7fde75b5l491ddf0cba0143e3@mail.gmail.com> Message-ID: <20090817141758.GE12768@sobchak.mgh.harvard.edu> Hi Eric; > Great to see the release went smoothly! I'm probably being impatient here, > but was a tag created for v1.51 final? I don't see it in GitHub yet, and > it's been slightly over an hour since the last push. It was tagged last evening as biopython-151: > cvs log setup.py | head RCS file: /home/repository/biopython/biopython/setup.py,v Working file: setup.py head: 1.171 branch: locks: strict access list: symbolic names: biopython-151: 1.171 biopython-151b: 1.168 Maybe there is an issue with tags pushing to Git. Bartek and Peter were discussing this, but I don't remember the ultimate conclusion. 
Brad From bartek at rezolwenta.eu.org Mon Aug 17 10:29:31 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 17 Aug 2009 16:29:31 +0200 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <20090817141758.GE12768@sobchak.mgh.harvard.edu> References: <20090817004826.GA4221@kunkel> <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> <3f6baf360908170702y7fde75b5l491ddf0cba0143e3@mail.gmail.com> <20090817141758.GE12768@sobchak.mgh.harvard.edu> Message-ID: <8b34ec180908170729g6c13333dk98d722cdb1d54bf0@mail.gmail.com> On Mon, Aug 17, 2009 at 4:17 PM, Brad Chapman wrote: > Hi Eric; > >> Great to see the release went smoothly! I'm probably being impatient here, >> but was a tag created for v1.51 final? I don't see it in GitHub yet, and >> it's been slightly over an hour since the last push. > > Maybe there is an issue with tags pushing to Git. Bartek and Peter > were discussing this, but I don't remember the ultimate conclusion. The ultimate conclusion will be reached when we move to github... ;) But for now, I'll just need to convert this tag manually. Just give me a few hours Bartek From bartek at rezolwenta.eu.org Mon Aug 17 11:07:32 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 17 Aug 2009 17:07:32 +0200 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> Message-ID: <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> On Mon, Aug 17, 2009 at 2:43 PM, Peter wrote: > Hi all, > > Given we don't really seem to have the expertise "in house" > to run an OBF git server ourselves right now, option (a) is > simplest, and as I recall those of us at BOSC where OK > with this plan. > > Assuming we go down this route (CVS to github), everyone > with an existing CVS account should setup a github account > if they want to continue to have commit access (e.g. Frank, > Iddo). I would suggest that initially you get used to working > with git and github BEFORE trying anything directly on what > would be the "official" repository. It took me a while and I'm > still learning ;) > > Is this agreeable? Are there any other suggestions? > > [Once this is settled, we can talk about things like merge > requests and if they should be accompanied by a Bugzilla > ticket or not.] > Hi All, I absolutely agree here with Peter, i.e. I would suggest we move now from CVS to a git branch hosted on github. Since I'm more involved in the technical setup we currently have, I'd also add a few more technical arguments for this move: - While current setup is working, it is suboptimal because there is an extra conversion step both for accepting changes done by people in git (git to CVS) and propagating releases (CVS to github). - Once we move to git as our version control system, we need to have a "master" branch which will be easily available for viewing and branching: we can't do it now on open-bio servers (it requires git installation and some server-side scripts to have a browseable repository), also "moving" to github is easier because it actually requires no physical action, we just need to stop updating CVS. 
- If anyone has fears of depending on github, I think it's much less of a problem than with CVS, moving our "master" branch from github to somewhere else is very easy and does not require any action on the side of github, we just post the branch somewhere, and start pushing there (you can find a list of possible hosting solutions here: http://git.or.cz/gitwiki/GitHosting) - Regarding the backups of the github branch: I'm already doing this. If you have a shell account on dev.open-bio.org, you can get the current git branch of biopython from /home/bartek/git_branch (location subject to change), so this would require no additional work, although it would be optimal to actually install git on the open-bio server, so that the updating script can be run from there. If we had that, we could actually hook it up directly to github, so that instead of running once an hour, it would be run after each push to the branch (http://github.com/guides/post-receive-hooks) To summarize, I'm ready to switch off the part of my script which is updating the github branch from CVS. For now, I would leave the part that is making backups of the github branch on the open-bio server (via rsync). Once we have git installed on dev.open-bio, I can hook it up to notifications from github. cheers Bartek From biopython at maubp.freeserve.co.uk Mon Aug 17 11:52:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 16:52:36 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> Message-ID: <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> On Mon, Aug 17, 2009 at 4:07 PM, Bartek Wilczynski wrote: > > Hi All, > > I absolutely agree here with Peter, i.e. I would suggest we move now from > CVS to a git branch hosted on github. > > Since I'm more involved in the technical setup we currently have, I'd also add > a few more technical arguments for this move: > > - While current setup is working, it is suboptimal because there is an extra > conversion step both for accepting changes done by people in git (git to CVS) > and propagating releases (CVS to github). Yeah - this works but for anything non-trivial it would be a pain. > - Once we move to git as our version control system, we need to have a > "master" branch which will be easily available for viewing and branching: > we can't do it now on open-bio servers (it requires git installation and > some server-side scripts to have a browseable repository), ... My impression from talking to OBF guys is if we really want to we can do this, but it requires us (Biopython) to take care of installing and running git on an OBF machine. > ... also "moving" to github is easier because it actually > requires no physical action, we just need to stop updating CVS. Yes - this is the big plus of "option (a)" over "option (b)" in my earlier email. > - If anyone has fears of depending on github, I think it's much less > of a problem than with CVS, moving our "master" branch from > github to somewhere else is very easy and does not require any > action on the side of github, we just post the branch somewhere, > and start pushing there (you can find a list of possible hosting > solutions here: http://git.or.cz/gitwiki/GitHosting) Yes, it is good to know we won't be tied to github (unless we start using more of the tools they offer on top of git itself).
> - Regarding the backups of the github branch: I'm already doing this. > If you have a shell account on dev.open-bio.org, you can get the > current ?git branch of biopython from /home/bartek/git_branch >?(location subject to change), so this would require no additional > work, Yes - that is what I was hinting at in my email (trying to be brief). > ... although it would be optimal, to actually install git on open-bio > server, so that the updating script can be run from there. Yes. Even something as simple as a cron job running on an OBF server would satisfy me from a back up point of view. > If we had that, we could actually hook it up directly to github, > so that instead of running once in an hour, it would be run >?after each push to the branch (http://github.com/guides/post-receive-hooks) More complex, but worth considering. > To summarize, I'm ready to switch off the part of my script which is > updating the gihub branch from CVS. Good. We'll also want to ask the OBF admins to make CVS read only once we move. > For now, I would leave the part that is making backups of github > branch on open-bio server (via rsync). That would be my plan for the short term. We can then talk to the OBF server admins about how we can do this better. > Once we have git installed on dev.open-bio, I can hook it > up to notifications from github. If we go to the trouble of installing git on the OBF servers ;) Peter From mhampton at d.umn.edu Mon Aug 17 11:42:39 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 17 Aug 2009 10:42:39 -0500 (CDT) Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: References: Message-ID: Hi, I am preparing biopython-1.51 for inclusion as an optional package for Sage (www.sagemath.org). I ran the test suite and got 8 errors; I am not sure if these are all expected. The KDTree ones I have seen before, but some look new. My test log is available at: http://sage.math.washington.edu/home/mhampton/biopython-1.51-testlog.txt in case anyone wants to take a look. Biopython has been available in Sage for several years as an optional package, but I would like to make it a standard component. This has become much more likely since the clean-up of Numeric and mx-texttools dependencies. I think the only real issue is setting up some testing during the Sage package installation, which is my motivation for really understanding the test failures. Cheers, Marshall Hampton Department of Mathematics and Statistics University of Minnesota, Duluth From biopython at maubp.freeserve.co.uk Mon Aug 17 12:28:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 17:28:27 +0100 Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: References: Message-ID: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> On Mon, Aug 17, 2009 at 4:42 PM, Marshall Hampton wrote: > > Hi, > > I am preparing biopython-1.51 for inclusion as an optional package for Sage > (www.sagemath.org). ?I ran the test suite and got 8 errors; I am not sure if > these are all expected. I wouldn't have expected any failures. > The KDTree ones I have seen before, but some look new. ?My test log is > available at: > > http://sage.math.washington.edu/home/mhampton/biopython-1.51-testlog.txt > > in case anyone wants to take a look. 
This one should be simple: test_EMBOSS.py ValueError: Disagree on file ig IntelliGenetics/VIF_mase-pro.txt in genbank format: 16 vs 1 records This is a known regression in EMBOSS 6.1.0 which will be fixed in their next release. Can you check this by running embossversion? The others are all ImportErrors (e.g. cannot import name _CKDTree) I rather suspect you are running the test suite BEFORE compiling the C extensions, and that this may similarly affect Bio.Restriction. > Biopython has been available in Sage for several years as an optional > package, but I would like to make it a standard component. This has > has become much more likely since the clean-up of Numeric and > mx-texttools dependencies. Cool. > I think the only real issue is setting up some testing during the Sage > package installation, which is my motivation for really understanding > the test failures. I don't know anything about your test framework, but surely other packages (e.g. NumPy) have a similar requirement (compile before test) so this should be fixable. Regards, Peter From biopython at maubp.freeserve.co.uk Mon Aug 17 12:35:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 17:35:53 +0100 Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> Message-ID: <320fb6e00908170935v1919a5b3h3497e12f3156a477@mail.gmail.com> On Mon, Aug 17, 2009 at 5:28 PM, Peter wrote: > > The others are all ImportErrors (e.g. cannot import name _CKDTree) > I rather suspect you are running the test suite BEFORE compiling > the C extensions, and that this may similarly affect Bio.Restriction. Also this line is interesting - it suggest you have not installed NumPy, or not told sage is is a dependency? test_Cluster ... skipping. If you want to use Bio.Cluster, install NumPy first and then reinstall Biopython P.S. Why does this page talk about Biopython version "4.2b"? http://wiki.sagemath.org/Sage_Spkg_Tracking Peter From matzke at berkeley.edu Mon Aug 17 15:48:33 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 17 Aug 2009 12:48:33 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <20090811131019.GW12604@sobchak.mgh.harvard.edu> References: <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> <20090811131019.GW12604@sobchak.mgh.harvard.edu> Message-ID: <4A89B411.4090501@berkeley.edu> Pencils down update: I have uploaded the relevant test scripts and data files to git, and deleted old loose files. http://github.com/nmatzke/biopython/commits/Geography Here is a simple draft tutorial: http://biopython.org/wiki/BioGeography#Tutorial Strangely, while working on the tutorial I discovered that I did something somewhere in the last revision that is messing up the parsing of automatically downloaded records from GBIF, I am tracking this down currently and will upload as soon as I find it. I would like to thank everyone for the opportunity to participate in GSoC, and to thank everyone for their help. 
For me, this summer turned into more of a "growing from a scripter to a programmer" summer than I expected initially. As a result I spent a more time refactoring and retracing my steps than I figured. However I think the resulting main product, a GBIF interface and associated tools, is much better than it would have been without the advice & encouragement of Brad, Hilmar, etc. I will be using this for my own research and will continue developing it. Cheers! Nick Brad Chapman wrote: > Hi Nick; > >> Summary: Major focus is getting the GBIF access/search/parse module into >> "done"/submittable shape. This primarily requires getting the >> documentation and testing up to biopython specs. I have a fair bit of >> documentation and testing, need advice (see below) for specifics on what >> it should look like. > > Awesome. Thanks for working on the cleanup for this. > >> OK, I will do this. Should I try and figure out the unittest stuff? I >> could use a simple example of what this is supposed to look like. > > In addition to Peter's pointers, here is a simple example from a > small thing I wrote: > > http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py > > You can copy/paste the unit test part to get a base, and then > replace the t_* functions with your own real tests. > > Simple scripts that generate consistent output are also fine; that's > the print and compare approach. > >>> - What is happening with the Nodes_v2 and Treesv2 files? They look >>> like duplicates of the Nexus Nodes and Trees with some changes. >>> Could we roll those changes into the main Nexus code to avoid >>> duplication? >> Yeah, these were just copies with your bug fix, and with a few mods I >> used to track crashes. Presumably I don't need these with after a fresh >> download of biopython. > > Cool. It would be great if we could weed these out as well. > >> The API is really just the interface with GBIF. I think developing a >> cookbook entry is pretty easy, I assume you want something like one of >> the entries in the official biopython cookbook? > > Yes, that would work great. What I was thinking of are some examples > where you provide background and motivation: Describe some useful > information you want to get from GBIF, and then show how to do it. > This is definitely the most useful part as it gives people working > examples to start with. From there they can usually browse the lower > level docs or code to figure out other specific things. > >> Re: API documentation...are you just talking about the function >> descriptions that are typically in """ """ strings beneath the function >> definitions? I've got that done. Again, if there is more, an example >> of what it should look like would be useful. > > That looks great for API level docs. You are right on here; for this > week I'd focus on the cookbook examples and cleanup stuff. > > My other suggestion would be to rename these to follow Biopython > conventions, something like: > > gbif_xml -> GbifXml > shpUtils -> ShapefileUtils > geogUtils -> GeographyUtils > dbfUtils -> DbfUtils > > The *Utils might have underscores if they are not intended to be > called directly. > > Thanks for all your hard work, > Brad > -- ==================================================== Nicholas J. Matzke Ph.D. 
Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From mhampton at d.umn.edu Mon Aug 17 16:46:42 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 17 Aug 2009 15:46:42 -0500 (CDT) Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: <320fb6e00908170935v1919a5b3h3497e12f3156a477@mail.gmail.com> References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> <320fb6e00908170935v1919a5b3h3497e12f3156a477@mail.gmail.com> Message-ID: On Mon, 17 Aug 2009, Peter wrote: > On Mon, Aug 17, 2009 at 5:28 PM, Peter wrote: >> >> The others are all ImportErrors (e.g. cannot import name _CKDTree) >> I rather suspect you are running the test suite BEFORE compiling >> the C extensions, and that this may similarly affect Bio.Restriction. > > Also this line is interesting - it suggest you have not installed NumPy, > or not told sage is is a dependency? > test_Cluster ... skipping. If you want to use Bio.Cluster, install > NumPy first and then reinstall Biopython Numpy is included in Sage, so I guess there is some sort of path problem. I'll give it another look. > P.S. Why does this page talk about Biopython version "4.2b"? > http://wiki.sagemath.org/Sage_Spkg_Tracking > > Peter > I have no idea, that was simply wrong. I have corrected that wiki page. Thanks for the feedback! Marshall Hampton From mhampton at d.umn.edu Mon Aug 17 17:25:28 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 17 Aug 2009 16:25:28 -0500 (CDT) Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> Message-ID: On Mon, 17 Aug 2009, Peter wrote: > This one should be simple: test_EMBOSS.py > ValueError: Disagree on file ig IntelliGenetics/VIF_mase-pro.txt in > genbank format: 16 vs 1 records > This is a known regression in EMBOSS 6.1.0 which will be fixed > in their next release. Can you check this by running embossversion? My emboss version is 6.1.0, so that explains that. 
After copying the Tests folder from the source to my site-packages directory, most of the errors go away, except for the one mentioned above and this one: ERROR: test_SeqIO_online ---------------------------------------------------------------------- Traceback (most recent call last): File "run_tests.py", line 248, in runTest suite = unittest.TestLoader().loadTestsFromName(name) File "/Users/mh/sagestuff/sage-4.1/local/lib/python/unittest.py", line 576, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "test_SeqIO_online.py", line 62, in record = SeqIO.read(handle, format) # checks there is exactly one record File "/Users/mh/sagestuff/sage-4.1/local/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", line 485, in read raise ValueError("No records found in handle") ValueError: No records found in handle ...not sure what the problem might be with that. -Marshall From mhampton at d.umn.edu Mon Aug 17 17:31:43 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 17 Aug 2009 16:31:43 -0500 (CDT) Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> Message-ID: I hope this isn't too much email, I can just post to the dev list if you'd like. Anyway, I manually ran my last test failure, test_SeqIO_online.py, and when I do that everything looks OK: thorn:16:28:30:site-packages: sage -python Tests/test_SeqIO_online.py Checking Bio.ExPASy.get_sprot_raw() - Fetching O23729 Got MAPAMEEIRQAQRAEGPAA...GAE [5Y08l+HJRDIlhLKzFEfkcKd1dkM] len 394 Checking Bio.Entrez.efetch() - Fetching X52960 from genome as fasta Got TGGCTCGAACTGACTAGAA...GCT [Ktxz0HgMlhQmrKTuZpOxPZJ6zGU] len 248 - Fetching X52960 from genome as gb Got TGGCTCGAACTGACTAGAA...GCT [Ktxz0HgMlhQmrKTuZpOxPZJ6zGU] len 248 - Fetching 6273291 from nucleotide as fasta Got TATACATTAAAGGAGGGGG...AGA [bLhlq4mEFJOoS9PieOx4nhGnjAQ] len 902 - Fetching 6273291 from nucleotide as gb Got TATACATTAAAGGAGGGGG...AGA [bLhlq4mEFJOoS9PieOx4nhGnjAQ] len 902 - Fetching 16130152 from protein as fasta Got MKVKVLSLLVPALLVAGAA...YQF [fCjcjMFeGIrilHAn6h+yju267lg] len 367 - Fetching 16130152 from protein as gb Got MKVKVLSLLVPALLVAGAA...YQF [fCjcjMFeGIrilHAn6h+yju267lg] len 367 Not sure where to go from here, but it seems that things are basically working correctly. -Marshall Hampton On Mon, 17 Aug 2009, Marshall Hampton wrote: > > On Mon, 17 Aug 2009, Peter wrote: >> This one should be simple: test_EMBOSS.py >> ValueError: Disagree on file ig IntelliGenetics/VIF_mase-pro.txt in >> genbank format: 16 vs 1 records >> This is a known regression in EMBOSS 6.1.0 which will be fixed >> in their next release. Can you check this by running embossversion? > > My emboss version is 6.1.0, so that explains that. 
> > After copying the Tests folder from the source to my site-packages directory, > most of the errors go away, except for the one mentioned above and this one: > > ERROR: test_SeqIO_online > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "run_tests.py", line 248, in runTest > suite = unittest.TestLoader().loadTestsFromName(name) > File "/Users/mh/sagestuff/sage-4.1/local/lib/python/unittest.py", line 576, > in loadTestsFromName > module = __import__('.'.join(parts_copy)) > File "test_SeqIO_online.py", line 62, in > record = SeqIO.read(handle, format) # checks there is exactly one record > File > "/Users/mh/sagestuff/sage-4.1/local/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", > line 485, in read > raise ValueError("No records found in handle") > ValueError: No records found in handle > > ...not sure what the problem might be with that. > > -Marshall > From biopython at maubp.freeserve.co.uk Mon Aug 17 17:37:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 22:37:05 +0100 Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> Message-ID: <320fb6e00908171437y1c4565f8jb1a19a369d389357@mail.gmail.com> On Mon, Aug 17, 2009 at 10:25 PM, Marshall Hampton wrote: > > After copying the Tests folder from the source to my site-packages > directory, most of the errors go away, Well that does suggest some sort of path issue, but moving the test directory around that isn't a very good solution. > except for the one mentioned above and this one: Assuming the "one mentioned above" was the EMBOSS one, fine. > ERROR: test_SeqIO_online > ---------------------------------------------------------------------- > Traceback (most recent call last): > ?File "run_tests.py", line 248, in runTest > ? ?suite = unittest.TestLoader().loadTestsFromName(name) > ?File "/Users/mh/sagestuff/sage-4.1/local/lib/python/unittest.py", line 576, > in loadTestsFromName > ? ?module = __import__('.'.join(parts_copy)) > ?File "test_SeqIO_online.py", line 62, in > ? ?record = SeqIO.read(handle, format) # checks there is exactly one record > ?File > "/Users/mh/sagestuff/sage-4.1/local/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", > line 485, in read > ? ?raise ValueError("No records found in handle") > ValueError: No records found in handle > > ...not sure what the problem might be with that. That is an online test using the NCBI's web services. This could be a transient failure due to the network. Peter From biopython at maubp.freeserve.co.uk Mon Aug 17 17:43:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 22:43:06 +0100 Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> Message-ID: <320fb6e00908171443g7b9fe780h1024d2f584be7b18@mail.gmail.com> On Mon, Aug 17, 2009 at 10:31 PM, Marshall Hampton wrote: > > I hope this isn't too much email, I can just post to the dev list if > you'd like. 
Doing it on the mailing list is fine, I'd read it either way ;) >?Anyway, I manually ran my last test failure, test_SeqIO_online.py, > and when I do that everything looks OK: > > thorn:16:28:30:site-packages: sage -python Tests/test_SeqIO_online.py > Checking Bio.ExPASy.get_sprot_raw() > - Fetching O23729 > ?Got MAPAMEEIRQAQRAEGPAA...GAE [5Y08l+HJRDIlhLKzFEfkcKd1dkM] len 394 > Checking Bio.Entrez.efetch() > - Fetching X52960 from genome as fasta > ?Got TGGCTCGAACTGACTAGAA...GCT [Ktxz0HgMlhQmrKTuZpOxPZJ6zGU] len 248 > - Fetching X52960 from genome as gb > ?Got TGGCTCGAACTGACTAGAA...GCT [Ktxz0HgMlhQmrKTuZpOxPZJ6zGU] len 248 > - Fetching 6273291 from nucleotide as fasta > ?Got TATACATTAAAGGAGGGGG...AGA [bLhlq4mEFJOoS9PieOx4nhGnjAQ] len 902 > - Fetching 6273291 from nucleotide as gb > ?Got TATACATTAAAGGAGGGGG...AGA [bLhlq4mEFJOoS9PieOx4nhGnjAQ] len 902 > - Fetching 16130152 from protein as fasta > ?Got MKVKVLSLLVPALLVAGAA...YQF [fCjcjMFeGIrilHAn6h+yju267lg] len 367 > - Fetching 16130152 from protein as gb > ?Got MKVKVLSLLVPALLVAGAA...YQF [fCjcjMFeGIrilHAn6h+yju267lg] len 367 > > Not sure where to go from here, but it seems that things are basically > working correctly. > > -Marshall Hampton That fits with it being a transient network issue. Some of our units tests like Tests/test_SeqIO_online.py are simple "print and compare" scripts, which are intended to be run via the run_tests.py script to validate their output. You can try this: sage -python Tests/run_tests.py test_SeqIO_online.py Or, manually compare that output to the expected output in file Tests/ouput/test_SeqIO_online - but it looks fine to me by eye. Peter From dalke at dalkescientific.com Mon Aug 17 17:36:41 2009 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 17 Aug 2009 23:36:41 +0200 Subject: [Biopython-dev] old Martel release Message-ID: Hi all, Does anyone here have a copy of my *old* Martel code? Something from the pre-1.0 days? I can't find it anywhere, and it looks like I did things back then on the biopython.org machines. An example URL was: http://www.biopython.org/~dalke/Martel/Martel-0.5.tar.gz I'm specifically looking for the molfile format I developed. That was 9 years ago and several machines back in time. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Aug 17 17:40:11 2009 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 17 Aug 2009 23:40:11 +0200 Subject: [Biopython-dev] old Martel release In-Reply-To: References: Message-ID: <0528A9A7-4DE9-4078-819F-4FD342B8D88D@dalkescientific.com> On Aug 17, 2009, at 11:36 PM, Andrew Dalke wrote: > Does anyone here have a copy of my *old* Martel code? Ha! archive.org has it. Didn't think they kept .tar.gz files, but they do! Andrew dalke at dalkescientific.com From eric.talevich at gmail.com Mon Aug 17 17:47:22 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 17 Aug 2009 17:47:22 -0400 Subject: [Biopython-dev] GSoC final update: PhyloXML for Biopython Message-ID: <3f6baf360908171447i2e3c592em5960269600e80f1b@mail.gmail.com> Hi all, Here's a final changelog for Aug. 10-14: - Added a 'terminal' argument to the find() method on BaseTree.Tree, for filtering internal/external nodes. This makes get_leaf_nodes() a trivial function, and total_branch_length is pretty simple too. - Updated the example phyloXML files to v1.10 schema-compliant copies from phyloxml.org; couple bug fixes. - Removed the project's README.rst file, so Bio/PhyloXML/ is no longer controlled by Git. 
I'll merge any useful information from there into the Biopython wiki documentation. - Pulled the Biopython 1.51 release into my master branch, and merged that into the phyloxml branch, so this branch (and the required GSoC patch tarball) will apply cleanly to the publicly released Biopython 1.51 source tree. - Documented most of what's been done on the Biopython wiki: http://www.biopython.org/wiki/PhyloXML http://www.biopython.org/wiki/TreeIO http://www.biopython.org/wiki/Tree *Future plans* There are a few tangential projects that deserve more attention over the next few months, and I'm going to create separate Git branches for each of them, to make it easier to share: - Port the Newick tree parser and methods from Bio.Newick to Bio.Tree and TreeIO. - Improve the graph drawing and networkx integration - BioSQL adapter between Bio.Tree.BaseTree and PhyloDB tables - Possibly, play with other tree representations -- nested-set, as PhyloDB does, and relationship matrix, which could bring NumPy into play (in a separate Bio.Tree.Matrix module) Finally, massive thanks to Brad and Christian for mentoring, Hilmar for overseeing the whole project, Peter and the Biopython folks for their guidance, and the various BioPerl monks and BioRubyists who shared their wisdom. All the best, Eric https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From biopython at maubp.freeserve.co.uk Mon Aug 17 17:48:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 22:48:19 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <4A89B411.4090501@berkeley.edu> References: <20090708124841.GX17086@sobchak.mgh.harvard.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> <20090811131019.GW12604@sobchak.mgh.harvard.edu> <4A89B411.4090501@berkeley.edu> Message-ID: <320fb6e00908171448l296abbb8yb509893cfbaaaa24@mail.gmail.com> On Mon, Aug 17, 2009 at 8:48 PM, Nick Matzke wrote: > I would like to thank everyone for the opportunity to participate in GSoC, > and to thank everyone for their help. ?For me, this summer turned into more > of a "growing from a scripter to a programmer" summer than I expected > initially. ?As a result I spent a more time refactoring and retracing my > steps than I figured. ?However I think the resulting main product, a GBIF > interface and associated tools, is much better than it would have been > without the advice & encouragement of Brad, Hilmar, etc. ?I will be using > this for my own research and will continue developing it. That sounds like this has been a successful project, and from my Biopython point of view the bit about you planing to continue using and developing the code in your research is especially good news ;) Cheers! 
Peter From biopython at maubp.freeserve.co.uk Mon Aug 17 17:54:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 22:54:45 +0100 Subject: [Biopython-dev] old Martel release In-Reply-To: <0528A9A7-4DE9-4078-819F-4FD342B8D88D@dalkescientific.com> References: <0528A9A7-4DE9-4078-819F-4FD342B8D88D@dalkescientific.com> Message-ID: <320fb6e00908171454k267ed02djc982bf312b6285bb@mail.gmail.com> On Mon, Aug 17, 2009 at 10:40 PM, Andrew Dalke wrote: > On Aug 17, 2009, at 11:36 PM, Andrew Dalke wrote: >> >> ?Does anyone here have a copy of my *old* Martel code? > > Ha! archive.org has it. > > Didn't think they kept .tar.gz files, but they do! Lucky :) I don't know what is in it, but your dalke user account is still there on biopython.org - which would probably still have all the http://www.biopython.org/~dalke website content. I guess your password has expired or something. Give the OBF guys an email? You might have some other bits and pieces still there... Peter From mhampton at d.umn.edu Mon Aug 17 17:45:07 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 17 Aug 2009 16:45:07 -0500 (CDT) Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: <320fb6e00908171437y1c4565f8jb1a19a369d389357@mail.gmail.com> References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> <320fb6e00908171437y1c4565f8jb1a19a369d389357@mail.gmail.com> Message-ID: Yep, I tried again and the test_SeqIO_online was ok, so I guess it was a transient failure. I agree that copying my Tests folder isn't a great solution. I will try to increase my understanding of the biopython test framework - I am used to the Sage method of mainly using docstring tests. -Marshall On Mon, 17 Aug 2009, Peter wrote: > On Mon, Aug 17, 2009 at 10:25 PM, Marshall Hampton wrote: >> >> After copying the Tests folder from the source to my site-packages >> directory, most of the errors go away, > > Well that does suggest some sort of path issue, but moving the > test directory around that isn't a very good solution. > >> except for the one mentioned above and this one: > > Assuming the "one mentioned above" was the EMBOSS one, fine. > >> ERROR: test_SeqIO_online >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> ?File "run_tests.py", line 248, in runTest >> ? ?suite = unittest.TestLoader().loadTestsFromName(name) >> ?File "/Users/mh/sagestuff/sage-4.1/local/lib/python/unittest.py", line 576, >> in loadTestsFromName >> ? ?module = __import__('.'.join(parts_copy)) >> ?File "test_SeqIO_online.py", line 62, in >> ? ?record = SeqIO.read(handle, format) # checks there is exactly one record >> ?File >> "/Users/mh/sagestuff/sage-4.1/local/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", >> line 485, in read >> ? ?raise ValueError("No records found in handle") >> ValueError: No records found in handle >> >> ...not sure what the problem might be with that. > > That is an online test using the NCBI's web services. This could > be a transient failure due to the network. 
> > Peter > From biopython at maubp.freeserve.co.uk Mon Aug 17 17:57:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 22:57:40 +0100 Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> <320fb6e00908171437y1c4565f8jb1a19a369d389357@mail.gmail.com> Message-ID: <320fb6e00908171457w73ca3699y11dbe255cd2748df@mail.gmail.com> On Mon, Aug 17, 2009 at 10:45 PM, Marshall Hampton wrote: > > Yep, I tried again and the test_SeqIO_online was ok, so I guess it was a > transient failure. Good :) > I agree that copying my Tests folder isn't a great solution. ?I will try to > increase my understanding of the biopython test framework - I am > used to the Sage method of mainly using docstring tests. If it helps, there is a whole chapter in our tutorial, but most of this is aimed at people wanting to write unit tests for us. http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Please point out any typos or things that can be clarified. Thanks, Peter From bugzilla-daemon at portal.open-bio.org Tue Aug 18 06:01:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 Aug 2009 06:01:25 -0400 Subject: [Biopython-dev] [Bug 2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py In-Reply-To: Message-ID: <200908181001.n7IA1PWk030525@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2619 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-18 06:01 EST ------- Just to note the Ubuntu/Debian packages for Biopython list flex as a build dependency, and patch our setup.py file to re-enable the Bio.PDB.mmCIF.MMCIFlex extension. This is a neat solution until we can update our setup.py to detect flex on its own. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From hlapp at gmx.net Tue Aug 18 12:09:15 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 18 Aug 2009 12:09:15 -0400 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> Message-ID: On Aug 17, 2009, at 11:52 AM, Peter wrote: > My impression from talking to OBF guys is if we really want to we > can do this, but it require us (Biopython) to take care of installing > and running git on an OBF machine. That's how I would put it too. Moreover, if you as people who want this and know more about it already than anyone else among root-l can't be bothered to take the initiative to spearhead this on OBF servers, the argument that OBF "sysadmins" (which in essence is all of us who know how to do this) should do the work is a lot less strong than it might have to be. I.e., if you don't feel this would be time well invested for you, it is probably even less well invested for other OBFers. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Tue Aug 18 12:39:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 18 Aug 2009 17:39:23 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> Message-ID: <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> On Tue, Aug 18, 2009 at 5:09 PM, Hilmar Lapp wrote: > > On Aug 17, 2009, at 11:52 AM, Peter wrote: > >> My impression from talking to OBF guys is if we really want to we >> can do this, but it require us (Biopython) to take care of installing >> and running git on an OBF machine. > > That's how I would put it too. Moreover, if you as people who want this and > know more about it already than anyone else among root-l can't be bothered > to take the initiative to spearhead this on OBF servers, the argument that > OBF "sysadmins" (which in essence is all of us who know how to do this) > should do the work is a lot less strong than it might have to be. I.e., if > you don't feel this would be time well invested for you, it is probably even > less well invested for other OBFers. Sure. Right now I don't think anyone at Biopython knows exactly what would be involved in running a git server, and it would take some investment of time to get to that point. In the long term I think running git on an OBF machine would be a good idea, but I don't personally want to spend time learning how to do that right now. By using github, we don't have to invest a lot of upfront effort in configuring a git server right away. I think it makes sense to just move Biopython to github in the short term, in the medium term we can (expertise permitting) get a git mirror running on an OBF machine, and then set up other tools like the git equivalent of ViewCVS (and if need be then abandon github - we won't be locked into anything permanent). Peter From fkauff at biologie.uni-kl.de Wed Aug 19 03:36:45 2009 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Wed, 19 Aug 2009 09:36:45 +0200 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> Message-ID: <4A8BAB8D.1000109@biologie.uni-kl.de> Hi all, On 08/17/2009 02:43 PM, Peter wrote: > Hi all, > > Now that Biopython 1.51 is out (thanks Brad), we should > discuss finally moving from CVS to git. This was something > we talked about at BOSC/ISMB 2009, but not everyone was > there. We have two main options: > > (a) Move from CVS (on the OBF servers) to github. All our > developers will need to get github accounts, and be added > as "collaborators" to the existing github repository. I would > want a mechanism in place to backup the repository to the > OBF servers (Bartek already has something that should > work). > > I agree, this sounds at this point like the most feasible way to go. In the long run we can still reconsider running git on the OBF servers, but at this point running such a server is an additional amount of work that brings no additional benefit. Cheers, Frank > (b) Move from CVS to git (on the OBF servers).
All our > developers can continue to use their existing OBF accounts. > Bartek's existing scripts could be modified to push the > updates from this OBF git repository onto github. > > In either case, there will be some "plumbing" work required, > for example I'd like to continue to offer a recent source code > dump at http://biopython.open-bio.org/SRC/biopython/ etc. > > Given we don't really seem to have the expertise "in house" > to run an OBF git server ourselves right now, option (a) is > simplest, and as I recall those of us at BOSC where OK > with this plan. > > Assuming we go down this route (CVS to github), everyone > with an existing CVS account should setup a github account > if they want to continue to have commit access (e.g. Frank, > Iddo). I would suggest that initially you get used to working > with git and github BEFORE trying anything directly on what > would be the "official" repository. It took me a while and I'm > still learning ;) > > Is this agreeable? Are there any other suggestions? > > [Once this is settled, we can talk about things like merge > requests and if they should be accompanied by a Bugzilla > ticket or not.] > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > From matzke at berkeley.edu Wed Aug 19 04:56:59 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Wed, 19 Aug 2009 01:56:59 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <4A89B411.4090501@berkeley.edu> References: <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> <20090811131019.GW12604@sobchak.mgh.harvard.edu> <4A89B411.4090501@berkeley.edu> Message-ID: <4A8BBE5B.10705@berkeley.edu> OK, I nailed the bug, which was stemming from HTML links inside GBIF XML results which in some situations were screwing up parsing etc. So I've updated the tutorial to add the chunk about downloading an arbitrarily large number of records, in user-specified increments, with an appropriate time-delay between server requests. Also added a chunk on classifying records into user-specified geographic areas based on their latitude/longitude. Also updated the test scripts and test results files, and deleted some remaining loose/unnecessary files. Updated tutorial: http://biopython.org/wiki/BioGeography#Tutorial Github commits: http://github.com/nmatzke/biopython/commits/Geography I think I've reached a good stopping point for the moment, I welcome comments on the tutorial and/or on the prospects for turning this into an official biopython module, etc. Thanks again, and cheers! Nick Nick Matzke wrote: > Pencils down update: I have uploaded the relevant test scripts and data > files to git, and deleted old loose files. > http://github.com/nmatzke/biopython/commits/Geography > > Here is a simple draft tutorial: > http://biopython.org/wiki/BioGeography#Tutorial > > Strangely, while working on the tutorial I discovered that I did > something somewhere in the last revision that is messing up the parsing > of automatically downloaded records from GBIF, I am tracking this down > currently and will upload as soon as I find it. 
> > I would like to thank everyone for the opportunity to participate in > GSoC, and to thank everyone for their help. For me, this summer turned > into more of a "growing from a scripter to a programmer" summer than I > expected initially. As a result I spent a more time refactoring and > retracing my steps than I figured. However I think the resulting main > product, a GBIF interface and associated tools, is much better than it > would have been without the advice & encouragement of Brad, Hilmar, etc. > I will be using this for my own research and will continue developing it. > > Cheers! > Nick > > > Brad Chapman wrote: >> Hi Nick; >> >>> Summary: Major focus is getting the GBIF access/search/parse module >>> into "done"/submittable shape. This primarily requires getting the >>> documentation and testing up to biopython specs. I have a fair bit >>> of documentation and testing, need advice (see below) for specifics >>> on what it should look like. >> >> Awesome. Thanks for working on the cleanup for this. >> >>> OK, I will do this. Should I try and figure out the unittest stuff? >>> I could use a simple example of what this is supposed to look like. >> >> In addition to Peter's pointers, here is a simple example from a >> small thing I wrote: >> >> http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py >> >> You can copy/paste the unit test part to get a base, and then >> replace the t_* functions with your own real tests. >> >> Simple scripts that generate consistent output are also fine; that's >> the print and compare approach. >> >>>> - What is happening with the Nodes_v2 and Treesv2 files? They look >>>> like duplicates of the Nexus Nodes and Trees with some changes. >>>> Could we roll those changes into the main Nexus code to avoid >>>> duplication? >>> Yeah, these were just copies with your bug fix, and with a few mods I >>> used to track crashes. Presumably I don't need these with after a >>> fresh download of biopython. >> >> Cool. It would be great if we could weed these out as well. >> >>> The API is really just the interface with GBIF. I think developing a >>> cookbook entry is pretty easy, I assume you want something like one >>> of the entries in the official biopython cookbook? >> >> Yes, that would work great. What I was thinking of are some examples >> where you provide background and motivation: Describe some useful >> information you want to get from GBIF, and then show how to do it. >> This is definitely the most useful part as it gives people working >> examples to start with. From there they can usually browse the lower >> level docs or code to figure out other specific things. >> >>> Re: API documentation...are you just talking about the function >>> descriptions that are typically in """ """ strings beneath the >>> function definitions? I've got that done. Again, if there is more, >>> an example of what it should look like would be useful. >> >> That looks great for API level docs. You are right on here; for this >> week I'd focus on the cookbook examples and cleanup stuff. >> >> My other suggestion would be to rename these to follow Biopython >> conventions, something like: >> >> gbif_xml -> GbifXml >> shpUtils -> ShapefileUtils >> geogUtils -> GeographyUtils >> dbfUtils -> DbfUtils >> >> The *Utils might have underscores if they are not intended to be >> called directly. >> >> Thanks for all your hard work, >> Brad >> > -- ==================================================== Nicholas J. Matzke Ph.D. 
Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From bugzilla-daemon at portal.open-bio.org Wed Aug 19 05:29:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 Aug 2009 05:29:36 -0400 Subject: [Biopython-dev] [Bug 2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py In-Reply-To: Message-ID: <200908190929.n7J9TaR0006301@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2619 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-19 05:29 EST ------- (In reply to comment #5) > Just to note the Ubuntu/Debian packages for Biopython list flex as a build > dependency, and patch our setup.py file to re-enable the Bio.PDB.mmCIF.MMCIFlex > extension. This is a neat solution until we can update our setup.py to detect > flex on its own. > Alex Lancaster has kindly done the same for the latest Fedora RPM package (Biopython 1.51). See https://admin.fedoraproject.org/community/?package=python-biopython#package_maintenance -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bartek at rezolwenta.eu.org Wed Aug 19 05:45:20 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Wed, 19 Aug 2009 11:45:20 +0200 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> Message-ID: <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> Hi guys, On Tue, Aug 18, 2009 at 6:39 PM, Peter wrote: > On Tue, Aug 18, 2009 at 5:09 PM, Hilmar Lapp wrote: >> >> On Aug 17, 2009, at 11:52 AM, Peter wrote: >> >>> My impression from talking to OBF guys is if we really want to we >>> can do this, but it require us (Biopython) to take care of installing >>> and running git on an OBF machine. >> >> That's how I would put it too. 
Moreover, if you as people who want this and >> know more about it already than anyone else among root-l can't be bothered >> to take the initiative to spearhead this on OBF servers, the argument that >> OBF "sysadmins" (which in essence is all of us who know how to do this) >> should do the work is a lot less strong than it might have to be. I.e., if >> you don't feel this would be time well invested for you, it is probably even >> less well invested for other OBFers. > > Sure. Right now I don't think anyone at Biopython knows exactly > what would be involved in running a gitserver, and it would take > some investment of time to get to that point. > I think there is some grave misunderstanding here. There is nothing magical or difficult in installing git on OBF servers. It's just a package. There is no effort to be spearheaded by anyone. The command "yum install git" needs to be run by someone with root privileges. That's it. It's absolutely enough to allow people with obf developer accounts to use git for development. As for running a git-protocol-server, this is a bit more complicated and can be done in many more ways than with CVS. I don't think that anyone is expecting OBF to provide git repository hosting in a standardized way (currently only BioRuby uses git and they seem to be fine with github, similar for biopython) The importance of having git installed on OBF machines comes from the fact that it can be useful for many things even if we don't host the repository on OBF servers. Most importantly, for doing regular backups of git branch from github to OBF servers we need a machine with git installed. Currently it's my work machine, but I think it would be a much better setup if we could do it directly from an OBF machine. > In the long term I think running git on an OBF machine would be a > good idea, but I don't personally want to spend time learning how to > do that right now. By using github, we don't have to invest a lot of > upfront effort in configuring a git server right away. > > I think it makes sense to just move Biopython to github in the short term, > in the medium term we can (expertise permitting) get a git mirror running > on an OBF machine, and then other tools like the git equivalent of > ViewCVS (and if need be then abandon github - we won't be locked > into anything permanent). > I don't quite understand what do you mean by "running git". Once we have git installed, you can use push and pull over ssh to a branch sitting on OBF machine. We can also make the mirror available for people (read-only) through http (just place the repo in a directory published with apache, no extra software required), But I don't think this makes much sense if we actually want to use collaborative features of github. In my opinion this would only bring confusion: either we make the github branch official or not. The most difficult part is the "viewCVS" replacement. There is the gitweb.cgi script, which is (in my opinion) inferior to github interface. Installing it wouldn't be difficult (it's CGI) so we could do it, but is it better than github here? I'm not sure.. (you can see how it would look on a slightly out-of-date biopython branch on my machine: http://83.243.39.60/cgi-bin/gitweb.cgi?p=biopython.git;a=summary ) To summarize, I think that the only thing we really need from OBF is to have git installed (Hilmar, can you help with this? I tried to even compile it on dev.open-bio.org but there it depends on multiple libraries and I gave up...) 
best regards Bartek From biopython at maubp.freeserve.co.uk Wed Aug 19 05:58:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 19 Aug 2009 10:58:12 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> Message-ID: <320fb6e00908190258u2262a3b9s256dff5db38ddd41@mail.gmail.com> On Wed, Aug 19, 2009 at 10:45 AM, Bartek Wilczynski wrote: >> Sure. Right now I don't think anyone at Biopython knows exactly >> what would be involved in running a gitserver, and it would take >> some investment of time to get to that point. > > I think there is some grave misunderstanding here. You have certainly clarified a few things for me ;) > There is nothing magical or difficult in installing git on OBF > servers. It's ?just a package. There is no effort to be spearheaded > by anyone. The command "yum install git" needs to be run by > someone with root privileges. That's it. It's absolutely enough > to allow people with obf developer accounts to use git for > development. Oh. That is less complicated than I realised - assuming all the existing dev accounts have SSH access. > As for running a git-protocol-server, this is a bit more complicated > and can be done in many more ways than with CVS. I don't think > that anyone is expecting OBF to provide git repository hosting in > a standardized way (currently only BioRuby uses git and they > seem to be fine with github, similar for biopython) > > The importance of having git installed on OBF machines comes > from the fact that it can be useful for many things even if we don't > host the repository on OBF servers. I had been assuming we would also need the git-protocol-server, and to mess about with the firewall and perhaps webserver, but if I understand you correctly even *just* the core git tool running on the OBF would be useful (even if just for backups). So let's try and do that... > ... > To summarize, I think that the only thing we really need from OBF is > to have git installed Any of the OBF server admins should be able to install the git *package* for us (this should be trivial as long as the Linux OS is fairly up to date). We should probably ask via a support request on on the root-l mailing list... let's just give Hilmar a chance to reply first. Peter From hlapp at gmx.net Wed Aug 19 18:17:20 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 19 Aug 2009 18:17:20 -0400 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> Message-ID: <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> On Aug 19, 2009, at 5:45 AM, Bartek Wilczynski wrote: > To summarize, I think that the only thing we really need from OBF is > to have git installed > (Hilmar, can you help with this? 
I tried to even compile it on > dev.open-bio.org but there it depends on multiple libraries and I > gave up...) Post to root-l (copied here, for convenience) and ask if someone can set you up with the necessary privileges, assuming that you are volunteering to do the installation? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Thu Aug 20 06:01:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 11:01:54 +0100 Subject: [Biopython-dev] Deprecating Bio.AlignAce and Bio.MEME? Message-ID: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com> Hi Bartek, With the introduction of Bio.Motif, we declared Bio.AlignAce and Bio.MEME as obsolete as of release 1.50 in the DEPRECATED file. I note we didn't update the module docstrings themselves to make this more prominent. Do you think we can officially deprecate Bio.AlignAce and Bio.MEME for the next release (i.e. put this in their docstrings and issue deprecation warnings)? Peter From barwil at gmail.com Thu Aug 20 06:10:23 2009 From: barwil at gmail.com (Bartek Wilczynski) Date: Thu, 20 Aug 2009 12:10:23 +0200 Subject: [Biopython-dev] Deprecating Bio.AlignAce and Bio.MEME? In-Reply-To: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com> References: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com> Message-ID: <8b34ec180908200310ue02430fsbd18116f3389bf89@mail.gmail.com> On Thu, Aug 20, 2009 at 12:01 PM, Peter wrote: > Hi Bartek, > > With the introduction of Bio.Motif, we declared Bio.AlignAce and > Bio.MEME as obsolete as of release 1.50 in the DEPRECATED file. I note > we didn't update the module docstrings themselves to make this more > prominent. > > Do you think we can officially deprecate Bio.AlignAce and Bio.MEME for > the next release (i.e. put this in their docstrings and issue > deprecation warnings)? I think so. Should I change something in the docstrings? Bartek From biopython at maubp.freeserve.co.uk Thu Aug 20 06:20:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 11:20:30 +0100 Subject: [Biopython-dev] Deprecating Bio.AlignAce and Bio.MEME? In-Reply-To: <8b34ec180908200310ue02430fsbd18116f3389bf89@mail.gmail.com> References: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com> <8b34ec180908200310ue02430fsbd18116f3389bf89@mail.gmail.com> Message-ID: <320fb6e00908200320k6d8902e0r4742a92a5956b1ed@mail.gmail.com> On Thu, Aug 20, 2009 at 11:10 AM, Bartek Wilczynski wrote: >> Do you think we can officially deprecate Bio.AlignAce and Bio.MEME for >> the next release (i.e. put this in their docstrings and issue >> deprecation warnings)? > > I think so. ?Should I change something in the docstrings? > The start of the module docstring should be a one line description of the module - just include "(DEPRECATED)" at the end. 
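For illustration only, the top of such a module might end up looking roughly like this (the exact wording is just an example; the warning itself, via the warnings module, is the second step that comes up a little later in this thread):

    """Parsing AlignACE output files (DEPRECATED).

    This module has been superseded by Bio.Motif.
    """
    import warnings
    warnings.warn("Bio.AlignAce is deprecated; please use Bio.Motif instead.",
                  DeprecationWarning)
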
Then it will show up nicely in the API docs: http://biopython.org/DIST/docs/api/ If you look at that page you should be able to see entries like this: * Bio.Fasta: Utilities for working with FASTA-formatted sequences (DEPRECATED) * Bio.FilteredReader: Code for more fancy file handles (OBSOLETE) Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 07:28:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 12:28:46 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO Message-ID: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> Hi all, You may recall a thread back in June with Cedar Mckay (cc'd - not sure if he follows the dev list or not) about indexing large sequence files - specifically FASTA files but any sequential file format. I posted some rough code which did this building on Bio.SeqIO: http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html I have since generalised this, and have something which I think would be ready for merging into the main trunk for wider testing. The code is on github on my "index" branch at the moment, http://github.com/peterjc/biopython/commits/index This would add a new function to Bio.SeqIO, provisionally called indexed_dict, which takes two arguments: filename and format name (e.g. "fasta"), plus an optional alphabet. This will return a dictionary like object, using SeqRecord identifiers as keys, and SeqRecord objects as values. There is (deliberately) no way to allow the user to choose a different keying mechanism (although I can see how to do this at a severe performance cost). As with the Bio.SeqIO.convert() function, the new addition of Bio.SeqIO.indexed_dict() will be the only public API change. Everything else is deliberately private, allowing us the freedom to change details if required in future. The essential idea is the same my post in June. Nothing about the existing SeqIO framework is changed (so this won't break anything). For each file we scan though it looking for new records, not the file offset, and extract the identifier string. These are stored in a normal (private) Python dictionary. On requesting a record, we seek to the appropriate offset and parse the data into a SeqRecord. For simple file formats we can do this by calling Bio.SeqIO.parse(). For complex file formats (such as SFF files, or anything else with important information in a header), the implementation is a little more complicated - but we can provide the same API to the user. Note that the indexing step does not fully parse the file, and thus may ignore corrupt/invalid records. Only when (if) they are accessed will this trigger a parser error. This is a shame, but means the indexing can (in general) be done very fast. I am proposing to merge all of this (except the SFF file support), but would welcome feedback (even after a merger). I already have basic unit tests, covering the following SeqIO file formats: "ace", "embl", "fasta", "fastq" (all three variants), "genbank"/"gb", "ig", "phd", "pir", and "swiss" (plus "sff" but I don't think that parser is ready to be checked in yet). 
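To make the idea concrete, here is a deliberately simplified sketch for plain FASTA files only (a toy illustration, not the code on the branch): while building the index each record identifier is mapped to a file offset, and a record is only actually parsed when it is looked up.

    from Bio import SeqIO

    class SimpleFastaIndex(object):
        """Toy offset-based index for a FASTA file (illustration only)."""
        def __init__(self, filename):
            self._handle = open(filename)
            self._offsets = {}
            while True:
                offset = self._handle.tell()
                line = self._handle.readline()
                if not line:
                    break
                if line.startswith(">"):
                    # Use the first word after ">" as the key, like record.id
                    self._offsets[line[1:].split(None, 1)[0]] = offset

        def __len__(self):
            return len(self._offsets)

        def __getitem__(self, key):
            # Seek back to the stored offset and parse just this one record
            self._handle.seek(self._offsets[key])
            return SeqIO.parse(self._handle, "fasta").next()

The real code on the branch generalises this per file format (each format gets a small "mini parser" that only extracts the identifier string) and exposes it through a read-only dictionary-like interface, as discussed further down this thread.
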
An example using the new code, this takes just a few seconds to index this 238MB GenBank file, and record access is almost instant: >>> from Bio import SeqIO >>> gb_dict = SeqIO.indexed_dict("gbpln1.seq", "gb") >>> len(gb_dict) 59918 >>> gb_dict.keys()[:5] ['AB246540.1', 'AB038764.1', 'AB197776.1', 'AB036027.1', 'AB161026.1'] >>> record = gb_dict["AB433451.1"] >>> print record.id, len(record), len(record.features) AB433451.1 590 2 And using a 1.3GB FASTQ file, indexing is about a minute, and again, record access is almost instant: >>> from Bio import SeqIO >>> fq_dict = SeqIO.indexed_dict("SRR001666_1.fastq", "fastq") >>> len(fq_dict) 7047668 >>> fq_dict.keys()[:4] ['SRR001666.2320093', 'SRR001666.2320092', 'SRR001666.1250635', 'SRR001666.2354360'] >>> record = fq_dict["SRR001666.2765432"] >>> print record.id, record.seq SRR001666.2765432 CTGGCGGCGGTGCTGGAAGGACTGACCCGCGGCATC Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 08:24:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 13:24:34 +0100 Subject: [Biopython-dev] Deprecating Bio.AlignAce and Bio.MEME? In-Reply-To: <8b34ec180908200450r15823d18q87a8cbfccbdc9b13@mail.gmail.com> References: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com> <8b34ec180908200310ue02430fsbd18116f3389bf89@mail.gmail.com> <320fb6e00908200320k6d8902e0r4742a92a5956b1ed@mail.gmail.com> <8b34ec180908200450r15823d18q87a8cbfccbdc9b13@mail.gmail.com> Message-ID: <320fb6e00908200524k126ca330n86c3e8516777113c@mail.gmail.com> On Thu, Aug 20, 2009 at 12:50 PM, Bartek Wilczynski wrote: > > On Thu, Aug 20, 2009 at 12:20 PM, Peter wrote: > >> The start of the module docstring should be a one line description of >> the module - just include "(DEPRECATED)" at the end. Then it will >> show up nicely in the API docs: http://biopython.org/DIST/docs/api/ > > Done. Should be in CVS now. Sorry I was unclear - I was only talking about the docstrings. In addition we need to actually issue a deprecation warning (via the warnings module), and update the DEPRECATED file in the root folder. I've done this in CVS - sorry for any confusion. I've also tried to clarify the procedure on the wiki, http://biopython.org/wiki/Deprecation_policy If you can add a couple of examples to the AlignAce and MEME module docstrings showing a short example using the deprecated module, and the equivalent using Bio.Motif, that would be great. Thanks, Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 08:43:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 13:43:07 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> Message-ID: <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> On Thu, Aug 20, 2009 at 10:14 AM, Bartek Wilczynski wrote: > > Hi all, > > As the biopython project is moving now its development from CVS to > git, it would be very helpful for us if git software was installed on > dev.open-bio.org machine. 
> > The most convenient for us would be if someone with root privileges on > this machine would install the package (it's in the centos > repository). I can also do the installation myself, as suggested by > Hilmar (assuming I get the permissions required for package > installation, account=bartek). Bartek - do you think we need git on any of the other OBF machines in addition to dev.open-bio.org (current IP 207.154.17.71)? However, I'd like to have http://biopython.org/SRC/biopython kept up to date (also available via www.biopython.org and biopython.open-bio.org - these are all the same machine, IP 207.154.17.70). It might be easiest to do that with git installed on that machine too - or do you think it would be simpler to push the latest files from dev.open-bio.org instead? There is also the public CVS server, cvs.biopython.org aka cvs.open-bio.org (IP 207.154.17.75) but I doubt we will need to worry about that one in future. Peter From bartek at rezolwenta.eu.org Thu Aug 20 09:06:53 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 20 Aug 2009 15:06:53 +0200 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> Message-ID: <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> On Thu, Aug 20, 2009 at 2:43 PM, Peter wrote: > Bartek - do you think we need git on any of the other OBF machines > in addition to dev.open-bio.org (current IP 207.154.17.71)? What we _need_ is a single machine, where we can run scripts from cron and where git is installed. That's why I requested the installation on dev.open-bio machine (it happens to be the only one I have an account on). The idea is to run something from cron and pull from github to have a backup copy of an up-to-date branch. The scripts can (after each update) push to other machines. > > However, I'd like to have http://biopython.org/SRC/biopython > kept up to date (also available via www.biopython.org and > biopython.open-bio.org - these are all the same machine, > IP 207.154.17.70). It might be easiest to do that with git > installed on that machine too - or do you think it would be > simpler to push the latest files from dev.open-bio.org instead? There is no need for git on the www-server machine if we only want to publish the code, or a read-only git branch over http for download. I think it's easier to have a single place where cron jobs are run. However, If we wanted to hook the scripts to github notifications rather than to cron, then we need some way to trigger scripts by a hit to a webpage, in which case it _might_ be easier to set things up on the machine with a web server. But I think we should be fine with the machinery running on the dev. machine. There is one remaining issue: We would need to have some directory where the branch would be kept. Currently it sits in my home directory whic probably should be changed to something like /home/biopython/git_branch. 
I am in biopython group, but currently /home/biopython does not even allow me to see /home/biopython, not to mention writing into it. I think it would be the best to set the scripts to run as biopython user. > > There is also the public CVS server, cvs.biopython.org aka > cvs.open-bio.org (IP 207.154.17.75) but I doubt we will need > to worry about that one in future. Certainly. I don't think we need to worry about this one. Bartek From biopython at maubp.freeserve.co.uk Thu Aug 20 09:24:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 14:24:15 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> Message-ID: <320fb6e00908200624i43b650e5q8478b0b5c12af67b@mail.gmail.com> On Thu, Aug 20, 2009 at 2:06 PM, Bartek Wilczynski wrote: > On Thu, Aug 20, 2009 at 2:43 PM, Peter wrote: > >> Bartek - do you think we need git on any of the other OBF machines >> in addition to dev.open-bio.org (current IP 207.154.17.71)? > > What we _need_ is a single machine, where we can run scripts from cron > and where git is installed. That's why I requested the installation on > dev.open-bio machine (it happens to be the only one I have an account > on). The idea is to run something from cron and pull from github to > have a backup copy of an up-to-date branch. The scripts can (after > each update) push to other machines. > > ... > > There is no need for git on the www-server machine if we only want to > publish the code, or a read-only git branch over http for download. I > think it's easier to have a single place where cron jobs are run. So just push a dump of the latest code to http://biopython.org/SRC/biopython or push fresh epydoc api docs to http://biopython.org/DIST/docs/api-live/ or whatever from dev.open-bio.org. That sounds fine to me. > There is one remaining issue: We would need to have some directory > where the branch would be kept. Currently it sits in my home directory > which probably should be changed to something like > /home/biopython/git_branch. I am in biopython group, but currently > /home/biopython does not even allow me to see /home/biopython, not to > mention writing into it. I think it would be the best to set the > scripts to run as biopython user. Yes - we'll need some OBF admin input there... 
Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 09:28:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 14:28:22 +0100 Subject: [Biopython-dev] [Root-l] Moving from CVS to git In-Reply-To: References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> Message-ID: <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com> On Thu, Aug 20, 2009 at 2:15 PM, Chris Fields wrote: > > I would be interested in that as well. > > It appears dev.open-bio.org has apt (there is an /etc/apt directory), but > I'm failing to find apt-get in my PATH. ?Haven't installed on it yet, but a > packaged version would probably be easier. If we can have a packaged version of git on dev.open-bio.org from the Linux distro, that would be easiest (especially for keeping it up to date). > Also, are we planning ro mirrors on portal for anon access, or should we > (ab)use github for that purpose? ?To me a ro mirror sorta defeats the > purpose of git... For Biopython we plan to use github (initially at least) for committing changes. This will also allow anonymous access. A public OBF read only mirror of a git repository is still useful for people to clone from, and keep the local copy up to date - plus as a backup for if/when github is congested or unavailable. But not essential. Peter From dag at sonsorol.org Thu Aug 20 09:41:55 2009 From: dag at sonsorol.org (Chris Dagdigian) Date: Thu, 20 Aug 2009 09:41:55 -0400 Subject: [Biopython-dev] [Root-l] Moving from CVS to git In-Reply-To: <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com> Message-ID: Git is now installed via 'yum' on dev.open-bio.org Regards, Chris On Aug 20, 2009, at 9:28 AM, Peter wrote: > On Thu, Aug 20, 2009 at 2:15 PM, Chris Fields > wrote: >> >> I would be interested in that as well. >> >> It appears dev.open-bio.org has apt (there is an /etc/apt >> directory), but >> I'm failing to find apt-get in my PATH. Haven't installed on it >> yet, but a >> packaged version would probably be easier. > > If we can have a packaged version of git on dev.open-bio.org from the > Linux distro, that would be easiest (especially for keeping it up to > date). > >> Also, are we planning ro mirrors on portal for anon access, or >> should we >> (ab)use github for that purpose? To me a ro mirror sorta defeats the >> purpose of git... > > For Biopython we plan to use github (initially at least) for > committing > changes. This will also allow anonymous access. > > A public OBF read only mirror of a git repository is still useful > for people to > clone from, and keep the local copy up to date - plus as a backup for > if/when github is congested or unavailable. But not essential. 
> > Peter > > _______________________________________________ > Root-l mailing list > Root-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/root-l From mjldehoon at yahoo.com Thu Aug 20 09:58:03 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 20 Aug 2009 06:58:03 -0700 (PDT) Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> Message-ID: <818036.61284.qm@web62408.mail.re1.yahoo.com> I just have two suggestions: Since indexed_dict returns a dictionary-like object, it may make sense for the _IndexedSeqFileDict to inherit from a dict. Another issue is whether we can fold indexed_dict and to_dict into one. Right now we have def to_dict(sequences, key_function=None) : def indexed_dict(filename, format, alphabet=None) : What if we have a single function "dictionary" that can take sequences, a handle, or a filename, and optionally the format, alphabet, key_function, and a parameter "indexed" that indicates if the file should be indexed or kept into memory? Or something like that. Otherwise, the code looks really nice. Thanks! --Michiel --- On Thu, 8/20/09, Peter wrote: > From: Peter > Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO > To: "Biopython-Dev Mailing List" > Cc: "Cedar McKay" > Date: Thursday, August 20, 2009, 7:28 AM > Hi all, > > You may recall a thread back in June with Cedar Mckay (cc'd > - not > sure if he follows the dev list or not) about indexing > large sequence > files - specifically FASTA files but any sequential file > format. I posted > some rough code which did this building on Bio.SeqIO: > http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html > > I have since generalised this, and have something which I > think > would be ready for merging into the main trunk for wider > testing. > The code is on github on my "index" branch at the moment, > http://github.com/peterjc/biopython/commits/index > > This would add a new function to Bio.SeqIO, provisionally > called > indexed_dict, which takes two arguments: filename and > format > name (e.g. "fasta"), plus an optional alphabet. This will > return a > dictionary like object, using SeqRecord identifiers as > keys, and > SeqRecord objects as values. There is (deliberately) no way > to > allow the user to choose a different keying mechanism > (although > I can see how to do this at a severe performance cost). > > As with the Bio.SeqIO.convert() function, the new addition > of > Bio.SeqIO.indexed_dict() will be the only public API > change. > Everything else is deliberately private, allowing us the > freedom > to change details if required in future. > > The essential idea is the same my post in June. Nothing > about > the existing SeqIO framework is changed (so this won't > break > anything). For each file we scan though it looking for new > records, > not the file offset, and extract the identifier string. > These are stored > in a normal (private) Python dictionary. On requesting a > record, we > seek to the appropriate offset and parse the data into a > SeqRecord. > For simple file formats we can do this by calling > Bio.SeqIO.parse(). > > For complex file formats (such as SFF files, or anything > else with > important information in a header), the implementation is a > little > more complicated - but we can provide the same API to the > user. > > Note that the indexing step does not fully parse the file, > and > thus may ignore corrupt/invalid records. 
Only when (if) > they are > accessed will this trigger a parser error. This is a shame, > but > means the indexing can (in general) be done very fast. > > I am proposing to merge all of this (except the SFF file > support), > but would welcome feedback (even after a merger). I > already > have basic unit tests, covering the following SeqIO file > formats: > "ace", "embl", "fasta", "fastq" (all three variants), > "genbank"/"gb", > "ig", "phd", "pir", and "swiss" (plus "sff" but I don't > think that > parser is ready to be checked in yet). > > An example using the new code, this takes just a few > seconds > to index this 238MB GenBank file, and record access is > almost > instant: > > >>> from Bio import SeqIO > >>> gb_dict = SeqIO.indexed_dict("gbpln1.seq", > "gb") > >>> len(gb_dict) > 59918 > >>> gb_dict.keys()[:5] > ['AB246540.1', 'AB038764.1', 'AB197776.1', 'AB036027.1', > 'AB161026.1'] > >>> record = gb_dict["AB433451.1"] > >>> print record.id, len(record), > len(record.features) > AB433451.1 590 2 > > And using a 1.3GB FASTQ file, indexing is about a minute, > and > again, record access is almost instant: > > >>> from Bio import SeqIO > >>> fq_dict = > SeqIO.indexed_dict("SRR001666_1.fastq", "fastq") > >>> len(fq_dict) > 7047668 > >>> fq_dict.keys()[:4] > ['SRR001666.2320093', 'SRR001666.2320092', > 'SRR001666.1250635', > 'SRR001666.2354360'] > >>> record = fq_dict["SRR001666.2765432"] > >>> print record.id, record.seq > SRR001666.2765432 CTGGCGGCGGTGCTGGAAGGACTGACCCGCGGCATC > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Thu Aug 20 10:13:00 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 15:13:00 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <818036.61284.qm@web62408.mail.re1.yahoo.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <818036.61284.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e00908200713q3fddd010x8f355260bb98b063@mail.gmail.com> On Thu, Aug 20, 2009 at 2:58 PM, Michiel de Hoon wrote: > > I just have two suggestions: > > Since indexed_dict returns a dictionary-like object, it may make sense > for the _IndexedSeqFileDict to inherit from a dict. We'd have to override things like values() to prevent explosions in memory, and just give a not implemented exception. But yes, good point. > Another issue is whether we can fold indexed_dict and to_dict into one. > Right now we have > > def to_dict(sequences, key_function=None) : > > def indexed_dict(filename, format, alphabet=None) : > > What if we have a single function "dictionary" that can take sequences, a > handle, or a filename, and optionally the format, alphabet, key_function, > and a parameter "indexed" that indicates if the file should be indexed or > kept into memory? Or something like that. I wondered about this, but there are a couple of important differences between my file indexer, and the existing to_dict function. For the Bio.SeqIO.to_dict() function, the optional key_function argument maps a SeqRecord to the desired index (by default the record's id is used). Supporting a key_function for indexing files in the same way would mean every single record in the file must be parsed into a SeqRecord while building the index. 
This is possible, but would really really slow things down - and while I considered it, I don't like this idea at all. Instead each format indexer has essentially got a "mini parser" which just extracts the id string, so things are much much faster. Also, the to_dict function can be used on any sequences - not just from a file. They could be a list of SeqRecords, or a generator expression filtering output from Bio.SeqIO.parse(). Anything at all really. Finally I had better explain my thoughts on indexing and handles versus filenames. For the SeqIO (and AlignIO etc) parsers, and handle which supports the basic read/readline/iteration functionality can be used. For the indexed_dict() function as written, we need to keep the handle open for as long as the dictionary is kept in memory. We also must have a handle which supports seek and tell (e.g. not a urllib handle, or compressed files). Finally, the mode the file was opened in can be important (e.g. for SFF files universal read lines mode must not be used). So while indexed_dict could take a file handle (instead of a filename) there are a lot of provisos. I felt just taking a filename was the simplest solution here. > Otherwise, the code looks really nice. Thanks! Great - thanks for your comments. Peter From bartek at rezolwenta.eu.org Thu Aug 20 10:19:50 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 20 Aug 2009 16:19:50 +0200 Subject: [Biopython-dev] [Root-l] Moving from CVS to git In-Reply-To: References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com> Message-ID: <8b34ec180908200719k74ebe1ccqa9cdf61684963997@mail.gmail.com> On Thu, Aug 20, 2009 at 3:41 PM, Chris Dagdigian wrote: > > Git is now installed via 'yum' on dev.open-bio.org > Wonderful, thanks a lot. Bartek From biopython at maubp.freeserve.co.uk Thu Aug 20 10:42:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 15:42:35 +0100 Subject: [Biopython-dev] [Root-l] Moving from CVS to git In-Reply-To: <28E30EC3-5E07-4AB1-A60A-155A5D179223@illinois.edu> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com> <28E30EC3-5E07-4AB1-A60A-155A5D179223@illinois.edu> Message-ID: <320fb6e00908200742u2e0fbc16v3f17f1e00b13f634@mail.gmail.com> On Thu, Aug 20, 2009 at 3:24 PM, Chris Fields wrote: > > Thanks Chris D! ?Not sure, but can we view repos on dev similar to portal > (via gitweb or similar)? ?Or should we mirror these over to portal for that > purpose? > > chris Again, this falls into the nice to have in the medium/long term, but not essential in the short term (for Biopython to move from CVS to git). We can manage with the github web interface for history etc. 
Peter From dag at sonsorol.org Thu Aug 20 11:30:31 2009 From: dag at sonsorol.org (Chris Dagdigian) Date: Thu, 20 Aug 2009 11:30:31 -0400 Subject: [Biopython-dev] [Root-l] Moving from CVS to git In-Reply-To: <28E30EC3-5E07-4AB1-A60A-155A5D179223@illinois.edu> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com> <28E30EC3-5E07-4AB1-A60A-155A5D179223@illinois.edu> Message-ID: <7A8D8D5E-4C39-4713-B92F-3A384374DCAC@sonsorol.org> Sure, just need informed advice on the 'best' packages to install and possibly some install help if I get stuck somewhere. -Chris On Aug 20, 2009, at 10:24 AM, Chris Fields wrote: > Thanks Chris D! Not sure, but can we view repos on dev similar to > portal (via gitweb or similar)? Or should we mirror these over to > portal for that purpose? > > chris > > On Aug 20, 2009, at 8:41 AM, Chris Dagdigian wrote: > >> >> Git is now installed via 'yum' on dev.open-bio.org >> >> Regards, >> Chris >> >> >> On Aug 20, 2009, at 9:28 AM, Peter wrote: >> >>> On Thu, Aug 20, 2009 at 2:15 PM, Chris >>> Fields wrote: >>>> >>>> I would be interested in that as well. >>>> >>>> It appears dev.open-bio.org has apt (there is an /etc/apt >>>> directory), but >>>> I'm failing to find apt-get in my PATH. Haven't installed on it >>>> yet, but a >>>> packaged version would probably be easier. >>> >>> If we can have a packaged version of git on dev.open-bio.org from >>> the >>> Linux distro, that would be easiest (especially for keeping it up >>> to date). >>> >>>> Also, are we planning ro mirrors on portal for anon access, or >>>> should we >>>> (ab)use github for that purpose? To me a ro mirror sorta defeats >>>> the >>>> purpose of git... >>> >>> For Biopython we plan to use github (initially at least) for >>> committing >>> changes. This will also allow anonymous access. >>> >>> A public OBF read only mirror of a git repository is still useful >>> for people to >>> clone from, and keep the local copy up to date - plus as a backup >>> for >>> if/when github is congested or unavailable. But not essential. >>> >>> Peter >>> >>> _______________________________________________ >>> Root-l mailing list >>> Root-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/root-l >> From biopython at maubp.freeserve.co.uk Thu Aug 20 12:19:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 17:19:19 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00908200713q3fddd010x8f355260bb98b063@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <818036.61284.qm@web62408.mail.re1.yahoo.com> <320fb6e00908200713q3fddd010x8f355260bb98b063@mail.gmail.com> Message-ID: <320fb6e00908200919o6721161bie98951da2e89af9c@mail.gmail.com> Peter wrote: > Michiel wrote: >> >> I just have two suggestions: >> >> Since indexed_dict returns a dictionary-like object, it may make sense >> for the _IndexedSeqFileDict to inherit from a dict. > > We'd have to override things like values() to prevent explosions in memory, > and just give a not implemented exception. But yes, good point. 
Done on github - I also had to override all the writeable dict methods like pop and clear which don't make sense here. The code for the class is now a bit longer, but is certainly more dict-like. I also had to implement __str__ and __repr__ to do something I think is useful and sensible. Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 14:07:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 19:07:38 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00908200919o6721161bie98951da2e89af9c@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <818036.61284.qm@web62408.mail.re1.yahoo.com> <320fb6e00908200713q3fddd010x8f355260bb98b063@mail.gmail.com> <320fb6e00908200919o6721161bie98951da2e89af9c@mail.gmail.com> Message-ID: <320fb6e00908201107u4c09fd7dj1bcc60ceabe0ecf9@mail.gmail.com> On Thu, Aug 20, 2009 at 5:19 PM, Peter wrote: > > Done on github - I also had to override all the writeable dict methods like > pop and clear which don't make sense here. The code for the class is now > a bit longer, but is certainly more dict-like. I also had to implement __str__ > and __repr__ to do something I think is useful and sensible. > I have checked this new indexing functionality into CVS, but the github branch is still there for the SFF file support (parsing and indexing). We can of course still easily tweak the naming or the public side of the API. In the meantime I'll think about updating the tutorial... Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 14:11:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 19:11:16 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> Message-ID: <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com> On Thu, Aug 20, 2009 at 2:06 PM, Bartek Wilczynski wrote: > On Thu, Aug 20, 2009 at 2:43 PM, Peter wrote: > >> Bartek - do you think we need git on any of the other OBF machines >> in addition to dev.open-bio.org (current IP 207.154.17.71)? > > What we _need_ is a single machine, where we can run scripts from cron > and where git is installed. That's why I requested the installation on > dev.open-bio machine (it happens to be the only one I have an account > on). The idea is to run something from cron and pull from github to > have a backup copy of an up-to-date branch. The scripts can (after > each update) push to other machines. Bartek, now that Chris D has kindly installed git on dev.open-bio.org, can you look into backing up our github repository onto dev.open-bio.org? Initially just running a cron job using your own user account should be fine. 
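As a sketch of what such a cron job could do (the repository URL and paths here are illustrative assumptions, not the agreed setup), something along these lines, run hourly, would keep a bare backup copy in sync with github:

    # backup_github.py - toy backup script meant to be run from cron.
    # The URL and paths below are examples only.
    import os
    import subprocess

    REPO_URL = "git://github.com/biopython/biopython.git"
    MIRROR = os.path.expanduser("~/biopython-backup.git")

    if not os.path.isdir(MIRROR):
        # First run: create a bare clone to act as the backup copy
        subprocess.check_call(["git", "clone", "--bare", REPO_URL, MIRROR])
    else:
        # Later runs: force-update all branch refs from github
        subprocess.check_call(["git", "--git-dir=" + MIRROR, "fetch",
                               REPO_URL, "+refs/heads/*:refs/heads/*"])

    # Example crontab entry (hourly; the path is hypothetical):
    # 0 * * * * python /home/username/backup_github.py
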
Thanks, Peter From bartek at rezolwenta.eu.org Thu Aug 20 17:07:06 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 20 Aug 2009 23:07:06 +0200 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com> Message-ID: <8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com> On Thu, Aug 20, 2009 at 8:11 PM, Peter wrote: > Bartek, now that Chris D has kindly installed git on dev.open-bio.org, > can you look into backing up our github repository onto dev.open-bio.org? > Initially just running a cron job using your own user account should be fine. I've only quickly tested git, and I was able to pull from github with no problems. I will try porting thew scripts from my machine to dev.open-bio tomorrow. In the meantime, I've checked that biopython account on dev.open-bio machine is assigned to Brad Marshall. I haven't seen him posting to the list lately. Does anyone have the access to this account? cheers Barte -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From bugzilla-daemon at portal.open-bio.org Fri Aug 21 08:26:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 08:26:45 -0400 Subject: [Biopython-dev] [Bug 2867] Bio.PDB.PDBList.update_pdb calls invalid os.cmd In-Reply-To: Message-ID: <200908211226.n7LCQjX7025910@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2867 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 08:26 EST ------- I'm going to assume the attempted fix worked (included with Biopython 1.51 final), and close this bug. Please reopen it if there is still a problem. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 08:52:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 08:52:24 -0400 Subject: [Biopython-dev] [Bug 2544] Bio.GenBank and SeqFeature improvements In-Reply-To: Message-ID: <200908211252.n7LCqOOt026458@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2544 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 08:52 EST ------- (In reply to comment #5) > > I'm leaving this bug open for defining __repr__ for the > Bio.SeqFeature.Reference object ... ONLY. > Done in CVS, marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 09:07:23 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 09:07:23 -0400 Subject: [Biopython-dev] [Bug 2711] GenomeDiagram.py: write() and write_to_string() are inefficient and don't check inputs In-Reply-To: Message-ID: <200908211307.n7LD7NoU026962@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2711 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|REOPENED |RESOLVED Resolution| |FIXED ------- Comment #28 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 09:07 EST ------- (In reply to comment #27) > So the only remaining issue is a unit test involving at least checks for > the presence of renderPM due to versions of reportlab less than 2.2. Added test_GraphicsBitmaps.py to CVS which will make sure we can output a bitmap, and flag renderPM as a missing (optional) dependency if not found. Marking issue as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 09:11:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 09:11:48 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200908211311.n7LDBmUT027199@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #26 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 09:11 EST ------- Marking this old bug as fixed, given the work around. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 09:24:56 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 09:24:56 -0400 Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq + SeqRecord objects / define __contains__ method In-Reply-To: Message-ID: <200908211324.n7LDOuP7027608@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2853 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 09:24 EST ------- (In reply to comment #3) > Patch for Seq object checked in. > > Leaving bug open for possible similar addition to the SeqRecord object. > Done in Bio/SeqRecord.py CVS revision 1.43, marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 09:24:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 09:24:59 -0400 Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string? In-Reply-To: Message-ID: <200908211324.n7LDOxRW027624@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 Bug 2351 depends on bug 2853, which changed state. Bug 2853 Summary: Support the "in" keyword with Seq + SeqRecord objects / define __contains__ method http://bugzilla.open-bio.org/show_bug.cgi?id=2853 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 09:55:58 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 09:55:58 -0400 Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO In-Reply-To: Message-ID: <200908211355.n7LDtwvC028668@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2865 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 09:55 EST ------- I've checked in a slightly revised version of Cymon's patch to allow Bio.SeqIO to write "phd" files. Checking in Tests/test_SeqIO_QualityIO.py; /home/repository/biopython/biopython/Tests/test_SeqIO_QualityIO.py,v <-- test_SeqIO_QualityIO.py new revision: 1.14; previous revision: 1.13 done Checking in Tests/output/test_SeqIO; /home/repository/biopython/biopython/Tests/output/test_SeqIO,v <-- test_SeqIO new revision: 1.51; previous revision: 1.50 done Checking in Bio/SeqIO/__init__.py; /home/repository/biopython/biopython/Bio/SeqIO/__init__.py,v <-- __init__.py new revision: 1.58; previous revision: 1.57 done Checking in Bio/SeqIO/PhdIO.py; /home/repository/biopython/biopython/Bio/SeqIO/PhdIO.py,v <-- PhdIO.py new revision: 1.8; previous revision: 1.7 done Cymon - could you double check this please? 
I made one change regarding the filename/record description, and also noticed that you hadn't rounded the Solexa scores to the nearest integer value after they were converted to PHRED scores. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 10:21:42 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 10:21:42 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200908211421.n7LELgY3029289@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 10:21 EST ------- This should be fixed in CVS (pushed to github hourly), although I used a slightly different style to break up the long test methods. Please reopen this bug if the problem persists. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 10:21:44 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 10:21:44 -0400 Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch In-Reply-To: Message-ID: <200908211421.n7LELi5g029306@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2895 Bug 2895 depends on bug 2891, which changed state. Bug 2891 Summary: Jython test_NCBITextParser fix+patch http://bugzilla.open-bio.org/show_bug.cgi?id=2891 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 10:21:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 10:21:47 -0400 Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch In-Reply-To: Message-ID: <200908211421.n7LELll4029321@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2893 Bug 2893 depends on bug 2891, which changed state. Bug 2891 Summary: Jython test_NCBITextParser fix+patch http://bugzilla.open-bio.org/show_bug.cgi?id=2891 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
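A note on the Solexa score rounding mentioned in the Bug 2865 comment above: the snippet below is purely illustrative (it is not the committed PhdIO patch) and assumes the standard Solexa-to-PHRED mapping, with rounding to the nearest integer as PHD output expects.

from math import log10

def solexa_to_phred(solexa_quality):
    # Standard conversion from Solexa (odds based) to PHRED qualities.
    return 10 * log10(10 ** (solexa_quality / 10.0) + 1)

# Round after converting, since PHD files store integer quality values:
print int(round(solexa_to_phred(-5)))  # prints 1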
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 10:21:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 10:21:50 -0400 Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch In-Reply-To: Message-ID: <200908211421.n7LELobS029336@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2892 Bug 2892 depends on bug 2891, which changed state. Bug 2891 Summary: Jython test_NCBITextParser fix+patch http://bugzilla.open-bio.org/show_bug.cgi?id=2891 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From dmikewilliams at gmail.com Sun Aug 23 13:47:53 2009 From: dmikewilliams at gmail.com (Mike Williams) Date: Sun, 23 Aug 2009 13:47:53 -0400 Subject: [Biopython-dev] how to determine BioPython version number Message-ID: Hi there. About a year ago a message was posted that suggested using Martel.__version__ to determine the BioPython version number. A couple of weeks ago the draft announcement for BioPython 1.51 said that Martel is no longer included. If Martel is no longer included, is there some other way for a program to determine the version number of BioPython that is installed? I tried searching for this, but found nothing relevant. Mike Below are snippets from the two messages referred to above: subject: [Biopython-dev] determining the version Peter biopython at maubp.freeserve.co.uk Wed Sep 24 17:12:24 EDT 2008 > Somewhat related to this, what is the appropriate way to find the version of > BioPython installed within Python? So I'm not the only person to have wondered about this. For now, I can only suggest an ugly workaround: import Martel print Martel.__version__ Since Biopython 1.45, by convention the Martel version has been incremented to match that of Biopython. Of course, in a few releases' time we probably won't be including Martel any more. On Thu, Aug 13, 2009 at 6:10 AM, Peter wrote: subject: [Biopython-dev] Draft announcement for Biopython 1.51 ... we no longer include Martel/Mindy, and thus don't have any dependence on mxTextTools. From biopython at maubp.freeserve.co.uk Sun Aug 23 15:58:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 23 Aug 2009 20:58:07 +0100 Subject: [Biopython-dev] how to determine BioPython version number In-Reply-To: References: Message-ID: <320fb6e00908231258w73f9d38fo9b726fa2fb7dcec@mail.gmail.com> On Sun, Aug 23, 2009 at 6:47 PM, Mike Williams wrote: > > Hi there. About a year ago a message was posted that suggested using > Martel.__version__ to determine the BioPython version number. You are looking at an old thread, and missed what happened since; try: import Bio Bio.__version__ I think this deserves a FAQ entry in the next release of the tutorial... The Martel version "trick" was a workaround for determining the version which worked for a few moderately old versions of Biopython (prior to us adding Bio.__version__).
Peter From dmikewilliams at gmail.com Sun Aug 23 16:14:51 2009 From: dmikewilliams at gmail.com (Mike Williams) Date: Sun, 23 Aug 2009 16:14:51 -0400 Subject: [Biopython-dev] how to determine BioPython version number In-Reply-To: <320fb6e00908231258w73f9d38fo9b726fa2fb7dcec@mail.gmail.com> References: <320fb6e00908231258w73f9d38fo9b726fa2fb7dcec@mail.gmail.com> Message-ID: On Sun, Aug 23, 2009 at 3:58 PM, Peter wrote: > You are looking at an old thread, and missed what happened since; try: > > import Bio > Bio.__version__ > > I think this deserves a FAQ entry in the next release of the tutorial... > > The Martel version "trick" was a workaround for determining the > version which worked for a few moderately old versions of Biopython > (prior to us adding Bio.__version__). > Thanks Peter. I had two problems: looking at an old thread, and having older versions of BioPython (1.48 and 1.49 on Fedora 10 and 11). The method you supplied works fine with the 1.51b version I just got from CVS. Mike From bugzilla-daemon at portal.open-bio.org Mon Aug 24 11:15:44 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Aug 2009 11:15:44 -0400 Subject: [Biopython-dev] [Bug 2904] New: Interface for Novoalign Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2904 Summary: Interface for Novoalign Product: Biopython Version: 1.51 Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: osvaldo.zagordi at bsse.ethz.ch Hi, I wrote an interface for the short sequence alignment program Novoalign (www.novocraft.com). All I did was to modify the interface for Muscle. I might cover some other aligners in the near future. Hope it's useful to someone. Best, Osvaldo -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Aug 24 11:16:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Aug 2009 11:16:48 -0400 Subject: [Biopython-dev] [Bug 2904] Interface for Novoalign In-Reply-To: Message-ID: <200908241516.n7OFGm98032344@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2904 ------- Comment #1 from osvaldo.zagordi at bsse.ethz.ch 2009-08-24 11:16 EST ------- Created an attachment (id=1361) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1361&action=view) Interface to run novoalign (www.novocraft.com) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Aug 24 11:21:07 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Aug 2009 11:21:07 -0400 Subject: [Biopython-dev] [Bug 2905] New: Short read alignment format Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2905 Summary: Short read alignment format Product: Biopython Version: 1.51 Platform: All OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: osvaldo.zagordi at bsse.ethz.ch Hi again, is there any plan to develop some parsers for alignment of short reads?
There are a lot of formats around, and the most serious proposal for a format I've seen is SAM (http://samtools.sourceforge.net/). I should start writing something to parse this output soon. Any suggestions on where to start (in order not to depend on some module that will soon be obsolete)? Thanks, Osvaldo -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Aug 24 21:34:02 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Aug 2009 21:34:02 -0400 Subject: [Biopython-dev] [Bug 2907] New: When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2907 Summary: When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' Product: Biopython Version: 1.51b Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: david.wyllie at ndm.ox.ac.uk When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' not 'bp', although the .seq.alphabet is set (correctly, I think) to generic_dna. The background here is that we're annotating some viral genomes computationally (however, the annotation isn't necessary for the problem here, see below) and then writing the output to .gb format. After this we load the file using LaserGene (a commercial sequence editing program) to have a look at it etc. This doesn't work terribly well because of the 'aa' designation in the header line. Apart from this, the export seems OK. I'm using a git download from mid-June 09. Here is an example which illustrates this: # load dependencies from Bio import Entrez from Bio import SeqIO from Bio import SeqRecord from Bio.Alphabet import generic_protein, generic_dna # get a sequence from Genbank print "going to recover a sequence from genbank...." ifh = Entrez.efetch(db="nucleotide",id="DQ923122",rettype="gb") # parse the file handle recordlist=[] print "OK, got the records from genbank, parsing ..." for record in SeqIO.parse(ifh, "genbank"): recordlist.append(record) ifh.close() # write it to a file for thisrecord in recordlist: # confirm it's dna assert (type(thisrecord.seq.alphabet)==type(generic_dna)), "We are supposed to be dealing with a DNA sequence, but we aren't, can't continue." # write to gb ofn=thisrecord.id+".gb" print "Writing thisrecord to ",ofn ofh=open(ofn,"w") SeqIO.write([thisrecord], ofh, "gb") ofh.close() exit() # top lines of the genbank file read as follows # #LOCUS DQ923122 34250 aa DNA VRL 01-JAN-1980 #DEFINITION Human adenovirus 52 isolate T03-2244, complete genome. #ACCESSION DQ923122 #VERSION DQ923122.2 GI:124375632 #KEYWORDS #SOURCE Human adenovirus 52 # ORGANISM Human adenovirus 52 # Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Mastadenovirus; # unclassified Human adenoviruses #FEATURES Location/Qualifiers # source 1..34250 # /country="USA" # /isolate="T03-2244" # /mol_type="genomic DNA" # /organism="Human adenovirus 52" # /db_xref="taxon:332179 Thank you for any advice you have to offer.
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Aug 24 21:36:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Aug 2009 21:36:48 -0400 Subject: [Biopython-dev] [Bug 2907] When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' In-Reply-To: Message-ID: <200908250136.n7P1amWC017814@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2907 ------- Comment #1 from david.wyllie at ndm.ox.ac.uk 2009-08-24 21:36 EST ------- Created an attachment (id=1362) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1362&action=view) test case, which is the same as that pasted into the message -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Aug 24 21:37:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Aug 2009 21:37:48 -0400 Subject: [Biopython-dev] [Bug 2907] When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' In-Reply-To: Message-ID: <200908250137.n7P1bm9c017839@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2907 ------- Comment #2 from david.wyllie at ndm.ox.ac.uk 2009-08-24 21:37 EST ------- Created an attachment (id=1363) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1363&action=view) example of the genbank file written -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Aug 25 05:40:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 25 Aug 2009 05:40:29 -0400 Subject: [Biopython-dev] [Bug 2907] When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' In-Reply-To: Message-ID: <200908250940.n7P9eT7w001376@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2907 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-25 05:40 EST ------- Hi David, I spotted this (aa/bp mix up in the LOCUS line) after the beta was out, and it should already be fixed in Biopython 1.51 final. Please update and retest, and if there is still a problem please reopen this bug. Thanks! Note that unless I was going to modify the annotation (which the background use case suggests you are), I would save the raw GenBank record from Entrez directly to disk (since parsing it and then writing it back out with SeqIO isn't yet perfect - e.g. the date in the LOCUS line). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
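For reference, the suggestion in comment #3 (saving the raw GenBank record from Entrez straight to disk instead of round-tripping it through SeqIO) can be done along these lines; this is just a sketch, reusing the same efetch call as the original report:

from Bio import Entrez

handle = Entrez.efetch(db="nucleotide", id="DQ923122", rettype="gb")
output = open("DQ923122.gb", "w")
output.write(handle.read())
output.close()
handle.close()

This keeps the record exactly as NCBI supplied it (LOCUS date and all), which matters when no re-annotation is needed.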
From bugzilla-daemon at portal.open-bio.org Tue Aug 25 06:09:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 25 Aug 2009 06:09:50 -0400 Subject: [Biopython-dev] [Bug 2907] When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' In-Reply-To: Message-ID: <200908251009.n7PA9o4T002461@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2907 ------- Comment #4 from david.wyllie at ndm.ox.ac.uk 2009-08-25 06:09 EST ------- thank you - this is indeed fixed in the latest git version. Best wishes David (In reply to comment #3) > Hi David, > > I spotted this (aa/bp mix up in the LOCUS line) after the beta was out, and it > should already be fixed in Biopython 1.51 final. Please update and retest, and > if there is still a problem please reopen this bug. Thanks! > > Note that unless I was going to modify the annotation (which the background use > case suggests you are), I would save the raw GenBank record from Entrez > directly to disk (since parsing it and then writing it back out with SeqIO > isn't yet perfect - e.g. the date in the LOCUS line). > > Peter > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Aug 25 06:33:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 11:33:56 +0100 Subject: [Biopython-dev] Command line wrappers for assembly tools Message-ID: <320fb6e00908250333m6dc4b8eew475dc309f1e3ddb4@mail.gmail.com> Hi all, Osvaldo Zagordi has recently offered a Bio.Application style command line wrapper for Novoalign (a commercial short read aligner from Novocraft), see enhancement Bug 2904, and the Novocraft website: http://bugzilla.open-bio.org/show_bug.cgi?id=2904 http://www.novocraft.com/products.html Note that Novocraft do offer a trial/evaluation version, but I have no idea what the terms and conditions are, and I personally do not have access to the commercial tool (e.g. for testing the wrapper). Nevertheless, this would be a nice addition to Biopython. I personally would like to have wrappers for some of the "off instrument" applications from Roche 454 (e.g. the Newbler assembler, read mapper and perhaps their SFF tools), which I have been using. These are Linux only (which is a pain as Windows and Mac OS X are out), but Roche seem relatively relaxed about making the software available to any academics using their sequencer (I'd suggest anyone interested contact your local sequencing centre for this). While some of these tools would fit under Bio.Align.Applications, does creating a similar collection at Bio.Sequencing.Applications make more sense? For example, the Roche sffinfo tool isn't in itself a alignment application - but it is related to DNA sequencing. Peter From mjldehoon at yahoo.com Tue Aug 25 06:41:20 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 25 Aug 2009 03:41:20 -0700 (PDT) Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) In-Reply-To: Message-ID: <54938.41623.qm@web62405.mail.re1.yahoo.com> I did (3) and (4) below, and I added a __str__ method but I didn't touch the other print functions (2). For (1), maybe a better way is to subclass the SeqMat class for each of the matrix types instead of storing the matrix type in self.mat_type. Any comments or objections (especially Iddo)? 
--Michiel. --- On Sat, 7/25/09, Iddo Friedberg wrote: > I'm the author of subsmat IIRC. > Everything sounds good, but I would not make 2.6 changes > that will break on 2.5. Ubuntu still uses 2.5 and I imagine > other linux distros do too. > 1) The matrix types (NOTYPE = 0, ACCREP = 1, OBSFREQ = 2, > SUBS = 3, EXPFREQ = 4, LO = 5) are now global variables (at > the level of Bio.SubsMat). I think that these should be > class variables of the Bio.SubsMat.SeqMat class. > > > > > 2) The print_mat method. It would be more Pythonic to use > __str__, __format__ for this, though the latter is only > available for Python versions >= 2.6. > > > > 3) The __sum__ method. I guess that this was intended to be > __add__? > > > > 4) The sum_letters attribute. To calculate the sum of all > values for a given letter, currently the following two > functions are involved: > > > > ? def all_letters_sum(self): > > ? ? ?for letter in self.alphabet.letters: > > ? ? ? ? self.sum_letters[letter] = > self.letter_sum(letter) > > > > ? def letter_sum(self,letter): > > ? ? ?assert letter in self.alphabet.letters > > ? ? ?sum = 0. > > ? ? ?for i in self.keys(): > > ? ? ? ? if letter in i: > > ? ? ? ? ? ?if i[0] == i[1]: > > ? ? ? ? ? ? ? sum += self[i] > > ? ? ? ? ? ?else: > > ? ? ? ? ? ? ? sum += (self[i] / 2.) > > ? ? ?return sum > > > > As you can see, the result is not returned, but stored in > an attribute called sum_letters. I suggest to replace this > with the following: > > > > ? ?def sum(self): > > ? ? ? ?result = {} > > ? ? ? ?for letter in self.alphabet.letters: > > ? ? ? ? ? ?result[letter] = 0.0 > > ? ? ? ?for pair, value in self: > > ? ? ? ? ? ?i1, i2 = pair > > ? ? ? ? ? ?if i1==i2: > > ? ? ? ? ? ? ? ?result[i1] += value > > ? ? ? ? ? ?else: > > ? ? ? ? ? ? ? ?result[i1] += value / 2 > > ? ? ? ? ? ? ? ?result[i2] += value / 2 > > ? ? ? ?return result > > > > so without storing the result in an attribute. > > > > > > Any comments, objections? > > > > --Michiel > > > > --- On Fri, 7/24/09, Michiel de Hoon > wrote: > > > > > From: Michiel de Hoon > > > Subject: Re: [Biopython-dev] Calculating motif scores > > > To: "Bartek Wilczynski" > > > Cc: biopython-dev at biopython.org > > > Date: Friday, July 24, 2009, 5:34 AM > > > > > > > As for the PWM being a separate class and used by > the > > > motif: > > > > I don't know. I'm using > Bio.SubsMat.FreqTable for > > > implementing > > > > frequency table, so I understand that the new > PWM > > > class would > > > > be basically a "smarter" FreqTable. > I'm not sure > > > whether it > > > > solves any problems... > > > > > > Wow, I didn't even know the Bio.SubsMat module > existed. > > > As we have several different but related modules > > > (Bio.Motif, Bio.SubstMat, Bio.Align), I think we > should > > > define the purpose and scope of each of these > modules. > > > Maybe a good way to start is the documentation. > Bio.SubsMat > > > is currently divided into two chapters (14.4 and > 16.2). I'll > > > have a look at this over the weekend to see if this > can be > > > cleaned up a bit. > > > > > > --Michiel. > > > > > > > > > ? ? ? 
> > > _______________________________________________ > > > Biopython-dev mailing list > > > Biopython-dev at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > > > > > > > > > > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From bartek at rezolwenta.eu.org Tue Aug 25 06:52:24 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 25 Aug 2009 12:52:24 +0200 Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) In-Reply-To: <54938.41623.qm@web62405.mail.re1.yahoo.com> References: <54938.41623.qm@web62405.mail.re1.yahoo.com> Message-ID: <8b34ec180908250352r259a310egbf19963cff43e099@mail.gmail.com> On Tue, Aug 25, 2009 at 12:41 PM, Michiel de Hoon wrote: > I did (3) and (4) below, and I added a __str__ method but I didn't touch the other print functions (2). > > For (1), maybe a better way is to subclass the SeqMat class for each of the matrix types instead of storing the matrix type in self.mat_type. Any comments or objections (especially Iddo)? > Hi, I don't have any objections here. Just for clarification: is it now in CVS or on some git branch? cheers Bartek From biopython at maubp.freeserve.co.uk Tue Aug 25 06:59:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 11:59:38 +0100 Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) In-Reply-To: <8b34ec180908250352r259a310egbf19963cff43e099@mail.gmail.com> References: <54938.41623.qm@web62405.mail.re1.yahoo.com> <8b34ec180908250352r259a310egbf19963cff43e099@mail.gmail.com> Message-ID: <320fb6e00908250359i2c35d8b0pe84d590a9527b8bb@mail.gmail.com> On Tue, Aug 25, 2009 at 11:52 AM, Bartek Wilczynski wrote: > I don't have any objections here. Just for clarification: is it now in > CVS or on some git branch? All on CVS still (and thus being pushed to gitgub). Do you want to give us a git status update on the other thread? http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006620.html Peter From bartek at rezolwenta.eu.org Tue Aug 25 07:58:05 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 25 Aug 2009 13:58:05 +0200 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com> <8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com> Message-ID: <8b34ec180908250458p22339d96uda0251eb29f031b6@mail.gmail.com> Hi all, Time for an update on how things are with git and biopython. On Thu, Aug 20, 2009 at 11:07 PM, Bartek Wilczynski wrote: > I've only quickly tested git, and I was able to pull from github with > no problems. I will try porting thew scripts from my machine to > dev.open-bio tomorrow. That works fine. I've set up a crontab script (/home/bartek/github_backup.sh) on dev.open-bio machine which fetches the current github branch and saves it to /home/bartek/biopython_from_github. 
Then it creates a "bare repository" (/home/bartek/biopython.git) which can be then used by others. If you have an shell account on the dev machine, you should be able to clone it over ssh with the following command: git clone ssh://_YOUR_USERNAME_ at dev.open-bio.org/~bartek/biopython.git if this is put into a directory accesible via http, one can also clone (anonymously) over http. I don't have an account on biopython www server, but I was able to put it on my server (just to check if it works). You can fetch it like this: git clone http://bartek.rezolwenta.eu.org/biopython.git In conclusion: it works. I would say, that the next important step is to decide when to stop commiting to CVS... I'm just waiting for a signal to terminate the updates from CVS to github and we are done. In the meantime, it would make sense to make it more stable which involves some technical details (mostly related to user accounts) Namely, we need to - set up these scripts on biopython account instead of my own (see below) - decide whether we want other things to be done by these scripts (generating src tarballs, etc) > > In the meantime, I've checked that biopython account on dev.open-bio > machine is assigned to Brad Marshall. I haven't seen him posting to > the list lately. Does anyone have the access to this account? This would come in handy now. Anybody knows how to access this account? cheers Bartek From biopython at maubp.freeserve.co.uk Tue Aug 25 08:13:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 13:13:24 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908250458p22339d96uda0251eb29f031b6@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com> <8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com> <8b34ec180908250458p22339d96uda0251eb29f031b6@mail.gmail.com> Message-ID: <320fb6e00908250513o54bad8beo43b5c82a84579120@mail.gmail.com> On Tue, Aug 25, 2009 at 12:58 PM, Bartek Wilczynski wrote: > Hi all, > > Time for an update on how things are with git and biopython. > > On Thu, Aug 20, 2009 at 11:07 PM, Bartek > Wilczynski wrote: >> I've only quickly tested git, and I was able to pull from github with >> no problems. I will try porting thew scripts from my machine to >> dev.open-bio tomorrow. > > That works fine. I've set up a crontab script > (/home/bartek/github_backup.sh) on dev.open-bio machine which fetches > the current github branch and saves it to > /home/bartek/biopython_from_github. Then it creates a "bare > repository" (/home/bartek/biopython.git) which can be then used by > others. If you have an shell account on the dev machine, you should be > able to clone it over ssh with the following command: > git clone ssh://_YOUR_USERNAME_ at dev.open-bio.org/~bartek/biopython.git Yes, that works for me (and thus in theory anyone with a dev account). > if this is put into a directory accesible via http, one can also clone > (anonymously) over http. I don't have an account on biopython www > server, but I was able to put it on my server (just to check if it > works). 
You can fetch it like this: > git clone http://bartek.rezolwenta.eu.org/biopython.git Excellent. We can ask the OBF to give you access to biopython.org (and Brad too since it would have helped when he did the recent release) which would help setting this stuff up [and see below] > In conclusion: it works. I would say, that the next important step is > to decide when to stop commiting to CVS... I'm just waiting for a > signal to terminate the updates from CVS to github and we are done. OK - so the basics are ready (backing up from github to an OBF machine). Good job. > In the meantime, it would make sense to make it more stable which > involves some technical details (mostly related to user accounts) > Namely, we need to > - set up these scripts on biopython account instead of my own (see below) > - decide whether we want other things to be done by these scripts > (generating src tarballs, etc) > >> In the meantime, I've checked that biopython account on dev.open-bio >> machine is assigned to Brad Marshall. I haven't seen him posting to >> the list lately. Does anyone have the access to this account? > > This would come in handy now. Anybody knows how to access this account? I have no idea who Brad Marshall is. We'll have to take this up with the OBF. I'll email you off list... Peter From biopython at maubp.freeserve.co.uk Tue Aug 25 09:23:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 14:23:15 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908250513o54bad8beo43b5c82a84579120@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com> <8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com> <8b34ec180908250458p22339d96uda0251eb29f031b6@mail.gmail.com> <320fb6e00908250513o54bad8beo43b5c82a84579120@mail.gmail.com> Message-ID: <320fb6e00908250623j19daa0cey429265f8c2bcb4ff@mail.gmail.com> >>> In the meantime, I've checked that biopython account on dev.open-bio >>> machine is assigned to Brad Marshall. I haven't seen him posting to >>> the list lately. Does anyone have the access to this account? >> >> This would come in handy now. Anybody knows how to access this account? > > I have no idea who Brad Marshall is. We'll have to take this up with > the OBF. I'll email you off list... Just for the record, on closer inspection, Brad Marshall has/had a separate account but it included "biopython" in the user's name. I presume he was another former contributor to the project. Peter From biopython at maubp.freeserve.co.uk Tue Aug 25 11:11:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 16:11:34 +0100 Subject: [Biopython-dev] Fwd: More FASTQ examples for cross project testing In-Reply-To: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com> References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com> Message-ID: <320fb6e00908250811r18aaec6fj6c2f0e40996fda0a@mail.gmail.com> Hi all, This was posted to the OBF cross project mailing list, but if any of you guys have some sample FASTQ data please consider sharing a small sample (e.g. the first ten reads). 
We would need this to be "no-strings attached" so that it could be used in any of the OBF projects under their assorted open source licences. In addition to the notes below, I would be interested in any FASTQ files from your local sequence centre, which may use their own conventions for the record title lines (e.g. record names). Thanks, Peter P.S. Rather than trying to send any attachments to the mailing list, please email me personally. ---------- Forwarded message ---------- From: Peter Date: Tue, Aug 25, 2009 at 12:24 PM Subject: More FASTQ examples for cross project testing To: open-bio-l at lists.open-bio.org Cc: Peter Rice , Chris Fields Hi all, I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl) off list about this plan. I'm going to co-ordinate putting together a set of valid FASTQ files for shared testing (to supplement the existing set of invalid FASTQ files already done and being used in Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon). What I have in mind is: XXX_original_YYY.fastq - sample input XXX_as_sanger.fastq - reference output XXX_as_solexa.fastq - reference output XXX_as_illumina.fastq - reference output where XXX is some name (e.g. wrapped1, wrapped2, shortreads, longreads, sanger_full_range, solexa_full_range ...) and YYY is the FASTQ variant (sanger, solexa or illumina) for the "input" file. For example, we might have: wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping, perhaps repeating the title on the plus lines wrapped1_as_sanger.fastq - The same data but using the consensus of no line wrapping and omitting the repeated title on the plus lines. wrapped1_as_solexa.fastq - As above, but converted to Solexa scores (ASCII offset 64), with capping at Solexa 62 (ASCII 126). wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII offset 64, with capping at PHRED 62 (ASCII 126). Here "wrapped1" would be a Sanger FASTQ file with some line wrapping (e.g. at 60 characters). I will include "sanger_full_range" which would cover all the valid PHRED scores from 0 to 93, and similarly for Solexa and Illumina files - these are important for testing the score conversions. I have some ideas for deliberately tricky (but valid) files which should properly test any parser. The point is we have "perhaps odd but valid" originals, plus the "cleaned up" versions (using the same FASTQ variant), and "cleaned up" versions in the other two FASTQ variants. Ideally asking Biopython/BioPerl/EMBOSS to convert the XXX_original_YYY.fastq files into any of the three FASTQ variants will give exactly the same results as the reference outputs. If anyone has any comments or suggestions please speak up (e.g. on my suggested naming conventions). Real life examples of FASTQ files anyone has had trouble parsing (even with 3rd party tools) would be particularly useful - although we'd probably want to cut down big example files in order to keep the dataset to a reasonable size. Thanks, Peter From biopython at maubp.freeserve.co.uk Wed Aug 26 07:36:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 26 Aug 2009 12:36:36 +0100 Subject: [Biopython-dev] [Biopython] Filtering SeqRecord feature list / nested SeqFeatures Message-ID: <320fb6e00908260436wbbd461bt205ada4fcc5c802c@mail.gmail.com> Hi all, I've retitled this thread (originally on the main list) to focus on the more general idea of filtering the SeqRecord feature list (as that has very little to do with SQLAlchemy) and how this interacts with nested SeqFeature objects.
On Wed, Aug 26, 2009, Peter wrote: > On Wed, Aug 26, 2009, Kyle Ellrott wrote: >> I've added a new database function lookupFeature to quickly search for >> sequence features without having to load all of them for any particular >> sequence. >> ... > > Interesting - and potentially useful if you are interested in just > part of the genome (e.g. an operon). > > Have you tested this on composite features (e.g. a join)? > Without looking into the details of your code this isn't clear. > > I wonder how well this would scale with a big BioSQL database > ... > > On the other hand, if all the record's features have already been > loaded into memory, there would just be thousands of locations > to look at - it might be quicker. > > This brings me to another idea for how this interface might work, > via the SeqRecord - how about adding a method like this: > > def filtered_features(self, start=None, end=None, type=None): > > Note I think it would also be nice to filter on the feature type (e.g. > CDS or gene). This method would return a sublist of the full > feature list (i.e. a list of those SeqFeature objects within the > range given, and of the appropriate type). This could initially > be implemented with a simple loop, but there would be scope > for building an index or something more clever. > > [Note we are glossing over some potentially ambiguous > cases with complex composite locations, where the "start" > and "end" may differ from the "span" of the feature.] > > The DBSeqRecord would be able to do the same (just inherit > the method), but you could try doing this via an SQL query, ... Brad, it occurred to me that this idea (a filtered_features method on the SeqRecord) might cause trouble with what I believe you have in mind for parsing GFF files into nested SeqFeatures. Is that still your plan? In particular, if you have, say, a CDS feature within a gene feature, and the user asked for all the CDS features, simply scanning the top level features list would miss it. Would it be safe to assume (or even enforce) that subfeatures are always *within* the location spanned by the parent feature? Even with this proviso, a daughter feature may still be small enough to pass a start/end filter, even if the parent feature is not. Again, scanning the top level features list would miss it. All these issues go away if we continue to treat the SeqRecord features list as a flat list, and only use the SeqFeature subfeatures list purely for storing composite locations (i.e. sub regions of the parent feature - not for true subfeatures). There are other downsides to using nested SubFeatures: it will probably require a lot of reworking of the GenBank output due to how composite features like joins are currently stored, and I haven't even looked at the BioSQL side of things. You may have looked at that already though, so I may just be worrying about nothing. Peter From eoc210 at googlemail.com Sun Aug 30 15:33:59 2009 From: eoc210 at googlemail.com (Ed Cannon) Date: Sun, 30 Aug 2009 20:33:59 +0100 Subject: [Biopython-dev] OBO2OWL parser / converter Message-ID: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com> Hi All, I would like to thank you guys for all your hard work and effort in making biopython a great piece of open software. I would also like to introduce myself: my name is Ed Cannon, and I am a postdoc at Cambridge University working in the fields of chemo/bioinformatics and semantic web technologies in the group of Peter Murray-Rust.
Since a fair amount of my work involves ontologies, I have written an open biomedical ontology (.obo) to web ontology language (.owl) converter. The resultant file can be loaded and used from Protege. I was wondering if this software would be of any interest to the biopython community? I have just sent a pull request to biopython on github. The code is located at my branch on my account: http://github.com/eoc21/biopython/tree/eoc21Branch. Thanks, Ed From krother at rubor.de Mon Aug 31 07:19:07 2009 From: krother at rubor.de (Kristian Rother) Date: Mon, 31 Aug 2009 13:19:07 +0200 Subject: [Biopython-dev] RNA module contributions Message-ID: <4A9BB1AB.1070608@rubor.de> Hi, to start work on RNA modules, I'd like to contribute some of our tested modules to BioPython. Before I place them into my GIT branch, it would be great to get some comments: Bio.RNA.SecStruc - represents a RNA secondary structures, - recognizing of SSEs (helix, loop, bulge, junction) - recognizing pseudoknots Bio.RNA.ViennaParser - parses RNA secondary structures in the Vienna format into SecStruc objects. Bio.RNA.BpseqParser - parses RNA secondary structures in the Bpseq format into SecStruc objects. Connected to RNA, but with a wider focus: Bio.???.ChemicalGroupFinder - identifies chemical groups (ribose, carboxyl, etc) in a molecule graph (place to be defined yet) There is a contribution from Bjoern Gruening as well: Bio.PDB.PDBMLParser - creates PDB.Structure objects from PDB-XML files. Comments and suggestions welcome! Best Regards, Kristian Rother From hlapp at gmx.net Mon Aug 31 08:17:43 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 31 Aug 2009 08:17:43 -0400 Subject: [Biopython-dev] OBO2OWL parser / converter In-Reply-To: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com> References: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com> Message-ID: <3AA994B7-B2FB-4D3B-A929-D6F5A9297BB2@gmx.net> Hi Ed - is your converter operating in a way that is congruent with (or even utilizing) the mapping and the converter provided by the NCBO and Berkeley Ontology projects? http://www.bioontology.org/wiki/index.php/OboInOwl:Main_Page If not, I'm not sure how beneficial it is for users to have multiple and possibly conflicting mappings. -hilmar On Aug 30, 2009, at 3:33 PM, Ed Cannon wrote: > Hi All, > > I would like to thank you guys for all your hard work and effort in > making > biopython a great piece of open software. > > I would also like to introduce myself, my name is Ed Cannon, I am a > postdoc > at Cambridge University working in the fields of chemo/ > bioinformatics and > semantic web technologies in the group of Peter Murray-Rust. > > Since a fair amount of my work involves ontologies, I have written > an open > biomedical ontology (.obo) to web ontology language (.owl) > converter. The > resultant file can be loaded and used from Protege. I was wondering > if this > software would be of any interest to the biopython community? I > have just > sent a pull request to biopython on github. The code is located at > my branch > on my account: http://github.com/eoc21/biopython/tree/eoc21Branch. 
> > Thanks, > > Ed > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Mon Aug 31 08:42:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 31 Aug 2009 13:42:52 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> Message-ID: <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> On Thu, Aug 20, 2009 at 12:28 PM, Peter wrote: > Hi all, > > You may recall a thread back in June with Cedar Mckay (cc'd - not > sure if he follows the dev list or not) about indexing large sequence > files - specifically FASTA files but any sequential file format. I posted > some rough code which did this building on Bio.SeqIO: > http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html The Bio.SeqIO.indexed_dict() functionality is in CVS/github now as I would like some wider testing. My earlier email explained the implementation approach, and gave some example code: http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006654.html This aims to solve a fairly narrow problem - dictionary-like random access to any record in a sequence file as a SeqRecord via the record id string as the key. It should work on any sequential file format, and can even work on binary SFF files (code on a branch in github still). Bio.SeqIO.to_dict() has always offered a very simple in memory solution (a python dictionary of SeqRecord objects) which is fine for small files (e.g. a few thousand FASTA entries), but won't scale much more than that. Using a BioSQL database would also allow random access to any SeqRecord (and not just by looking it up by the identifier), but I doubt it would scale well to 10s of millions of short read sequences. It is also non-trivial to install the DB itself, the schema and the Python bindings. The new Bio.SeqIO.indexed_dict() code offers a read only dictionary interface which does work for millions of reads. As implemented, there is still a memory bound as all the keys and their associated file offsets are held in memory. For example, a 7 million record FASTQ file taking 1.3GB on disk seems to need almost 700MB in memory (just a very crude measurement). Although clearly this is much more capable than the naive full dictionary in memory approach (which is out of the question here), this too could become a real bottle-neck before long. Biopython's old Martel/Mindy code used to build an on disk index, which avoided this memory constraint. However, we've removed that (due to mxTextTools breakage etc). In any case, it was also much much slower: http://lists.open-bio.org/pipermail/biopython/2009-June/005309.html Using a Bio.SeqIO.indexed_dict() like API, we could of course build an index file on disk to avoid this potential memory problem.
As Cedar suggested, this index file could be handled transparently (created and deleted automatically), or indeed could be explicitly persisted/reloaded to avoid re-indexing unnecessarily: http://lists.open-bio.org/pipermail/biopython/2009-June/005265.html Sticking to the narrow use case of (read only) random access to a sequence file, all we really need to store is the lookup table of keys (or their Python hash) and offsets in the original file. If they are fast enough, we might even be able to reuse the old Martel/ Mindy index file format... or the OBDA specification if that is still in use: http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html Another option (like the shelve idea we talked about last month) is to parse the sequence file with SeqIO, and serialise all the SeqRecord objects to disk, e.g. with pickle or some key/value database. This is potentially very complex (e.g. arbitrary Python objects in the annotation), and could lead to a very large "index" file on disk. On the other hand, some possible back ends would allow editing the database... which could be very useful. Brad - do you have any thoughts? I know you did some work with key/value indexers: http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/ Peter From chapmanb at 50mail.com Mon Aug 31 08:54:52 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 31 Aug 2009 08:54:52 -0400 Subject: [Biopython-dev] [Biopython] Filtering SeqRecord feature list / nested SeqFeatures In-Reply-To: <320fb6e00908260436wbbd461bt205ada4fcc5c802c@mail.gmail.com> References: <320fb6e00908260436wbbd461bt205ada4fcc5c802c@mail.gmail.com> Message-ID: <20090831125452.GA75451@sobchak.mgh.harvard.edu> Peter and Kyle; > I've retitled this thread (originally on the main list) to focus on the > more general idea of filtering SeqRecord feature list (as that has > very little to do with SQLAlchemy) and how this interact with > nested SeqFeature objects. Sorry to have missed this thread in real time; I was out of town last week. Generally, it is great we are focusing on standard queries and building up APIs to make them more intuitive. Nice. > Brad, it occurred to me this idea (a filtered_features method > on the SeqRecord) might cause trouble with what I believe you > have in mind for parsing GFF files into nested SeqFeatures. > Is that still your plan? Yes, that was still the idea although I haven't dug into it much beyond last time we discussed this. This is the direct translation of the GFF way of handling multiple transcripts and coding features, and seems like the intuitive way to handle the problem. > In particular, if you have save a CDS feature within a gene > feature, and the user asked for all the CDS features, simply > scanning the top level features list would miss it. I think we'll be okay here. With nesting everything would still be stored in the seqfeature table. The seqfeature_relationship table defines the nesting relationship but for the sake of queries all of the features can be treated as flat directly related to the bioentry of interest. Secondarily, you would need to reconstitute the nested relationship if that is of interest, but for the query example of "give me all features of this type in this region" you could return a simple flat iterator of them. > Would it be safe to assume (or even enforce) that subfeatures > are always *with* the location spanned by the parent feature? 
> Even with this proviso, a daughter feature may still be small > enough to pass a start/end filter, even if the parent feature > is not. Again, scanning the top level features list would miss > it. The within assumption makes sense to me here. There may be pathological cases that fall outside of this, but no examples are coming to mind right now. > There are other downsides to using nested SubFeatures, > it will probably require a lot of reworking of the GenBank > output due to how composite features like joins are > currently stored, and I haven't even looked at the BioSQL > side of things. You may have looked at that already > though, so I may just be worrying about nothing. Agreed. My thought was to prototype this with GFF and then think further about GenBank features. Initially, I just want to get the GFF parsing documented and in the Biopython repository, and then the BioSQL storage would be a logical next step. Brad From chapmanb at 50mail.com Mon Aug 31 08:58:54 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 31 Aug 2009 08:58:54 -0400 Subject: [Biopython-dev] Command line wrappers for assembly tools In-Reply-To: <320fb6e00908250333m6dc4b8eew475dc309f1e3ddb4@mail.gmail.com> References: <320fb6e00908250333m6dc4b8eew475dc309f1e3ddb4@mail.gmail.com> Message-ID: <20090831125854.GB75451@sobchak.mgh.harvard.edu> Hi all; > Osvaldo Zagordi has recently offered a Bio.Application style command line > wrapper for Novoalign (a commercial short read aligner from Novocraft), see > enhancement Bug 2904, and the Novocraft website: > http://bugzilla.open-bio.org/show_bug.cgi?id=2904 > http://www.novocraft.com/products.html Very nice. I've been meaning to play with Novoalign and have heard some good things. > While some of these tools would fit under Bio.Align.Applications, does > creating a similar collection at Bio.Sequencing.Applications make more > sense? For example, the Roche sffinfo tool isn't in itself a alignment > application - but it is related to DNA sequencing. I like the idea of a Sequencing namespace or at least something different than the current Align, which implicitly refers mostly to multiple alignment programs. Brad From biopython at maubp.freeserve.co.uk Mon Aug 31 09:11:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 31 Aug 2009 14:11:19 +0100 Subject: [Biopython-dev] Command line wrappers for assembly tools In-Reply-To: <20090831125854.GB75451@sobchak.mgh.harvard.edu> References: <320fb6e00908250333m6dc4b8eew475dc309f1e3ddb4@mail.gmail.com> <20090831125854.GB75451@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908310611i2ce6a639i550631cb47a02050@mail.gmail.com> On Mon, Aug 31, 2009 at 1:58 PM, Brad Chapman wrote: > Hi all; > >> Osvaldo Zagordi has recently offered a Bio.Application style command line >> wrapper for Novoalign (a commercial short read aligner from Novocraft), see >> enhancement Bug 2904, and the Novocraft website: >> http://bugzilla.open-bio.org/show_bug.cgi?id=2904 >> http://www.novocraft.com/products.html > > Very nice. I've been meaning to play with Novoalign and have heard > some good things. Cool. Do you think you'll be able to try that out, and test Osvaldo's wrapper at the same time? >> While some of these tools would fit under Bio.Align.Applications, does >> creating a similar collection at Bio.Sequencing.Applications make more >> sense? For example, the Roche sffinfo tool isn't in itself a alignment >> application - but it is related to DNA sequencing. 
> > I like the idea of a Sequencing namespace or at least something > different than the current Align, which implicitly refers mostly to > multiple alignment programs. That sounds like a plan then... Peter From biopython at maubp.freeserve.co.uk Mon Aug 31 09:15:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 31 Aug 2009 14:15:42 +0100 Subject: [Biopython-dev] [Biopython] Filtering SeqRecord feature list / nested SeqFeatures In-Reply-To: <20090831125452.GA75451@sobchak.mgh.harvard.edu> References: <320fb6e00908260436wbbd461bt205ada4fcc5c802c@mail.gmail.com> <20090831125452.GA75451@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908310615y23051634sbe6076fa9667296b@mail.gmail.com> On Mon, Aug 31, 2009 at 1:54 PM, Brad Chapman wrote: >> There are other downsides to using nested SubFeatures, >> it will probably require a lot of reworking of the GenBank >> output due to how composite features like joins are >> currently stored, and I haven't even looked at the BioSQL >> side of things. You may have looked at that already >> though, so I may just be worrying about nothing. > > Agreed. My thought was to prototype this with GFF and then > think further about GenBank features. Initially, I just want to > get the GFF parsing documented and in the Biopython > repository, and then the BioSQL storage would be a logical > next step. If (as Michiel and I suggested) your GFF parser returns some generic object (e.g. a GFF record class, or a tuple of basic python types including a dictionary of annotation), then yes, that can be checked in without side effects. However, if your code goes straight to SeqRecord and SeqFeature objects, we are going to have to deal with how BioSQL and the existing SeqIO output code will react (e.g. the GenBank output). Peter From chapmanb at 50mail.com Mon Aug 31 09:24:51 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 31 Aug 2009 09:24:51 -0400 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> Message-ID: <20090831132451.GD75451@sobchak.mgh.harvard.edu> Hi Peter; > The Bio.SeqIO.indexed_dict() functionality is in CVS/github now > as I would like some wider testing. My earlier email explained the > implementation approach, and gave some example code: > http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006654.html Sweet. I pulled this from your branch earlier for something I was doing at work and it's great stuff. My only suggestion would be to change the function name to make it clear it's an in memory index. This will clear us up for similar file based index functions. > Another option (like the shelve idea we talked about last month) > is to parse the sequence file with SeqIO, and serialise all the > SeqRecord objects to disk, e.g. with pickle or some key/value > database. This is potentially very complex (e.g. arbitrary Python > objects in the annotation), and could lead to a very large "index" > file on disk. On the other hand, some possible back ends would > allow editing the database... which could be very useful. My thought here was to use BioSQL and the SQLite mappings for serializing. We build off a tested and existing serialization, and also guide people into using BioSQL for larger projects. 
Essentially, we would build an API on top of existing BioSQL functionality that creates the index by loading the SQL and then pushes the parsed records into it. > Brad - do you have any thoughts? I know you did some work > with key/value indexers: > http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/ I've been using MongoDB (http://www.mongodb.org/display/DOCS/Home) extensively and it rocks; it's fast and scales well. The bit of work that is needed is translating objects into JSON representations. There are object mappers like MongoKit (http://bitbucket.org/namlook/mongokit/) that help with this. Connecting these thoughts together, a rough two step development plan would be: - Modify the underlying Biopython BioSQL representation to be object based, using SQLAlchemy. This is essentially what I'd suggested as a building block from Kyle's implementation. - Use this to provide object mappings for object-based stores, like MongoDB/MongoKit or Google App Engine. Brad From biopython at maubp.freeserve.co.uk Mon Aug 31 09:49:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 31 Aug 2009 14:49:40 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <20090831132451.GD75451@sobchak.mgh.harvard.edu> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> <20090831132451.GD75451@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> On Mon, Aug 31, 2009 at 2:24 PM, Brad Chapman wrote: > > Hi Peter; > >> The Bio.SeqIO.indexed_dict() functionality is in CVS/github now >> as I would like some wider testing. My earlier email explained the >> implementation approach, and gave some example code: >> http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006654.html > > Sweet. I pulled this from your branch earlier for something I was > doing at work and it's great stuff. Thanks :) What file formats where you working on, and how many records? > My only suggestion would be to > change the function name to make it clear it's an in memory index. > This will clear us up for similar file based index functions. True. Have got any bright ideas for a better name? While the index is in memory, the SeqRecord objects are not (unlike the original Bio.SeqIO.to_dict() function). Or we have one function Bio.SeqIO.indexed_dict() which can either use an in memory index, OR an on disk index, offering the same functionality. >> Another option (like the shelve idea we talked about last month) >> is to parse the sequence file with SeqIO, and serialise all the >> SeqRecord objects to disk, e.g. with pickle or some key/value >> database. This is potentially very complex (e.g. arbitrary Python >> objects in the annotation), and could lead to a very large "index" >> file on disk. On the other hand, some possible back ends would >> allow editing the database... which could be very useful. > > My thought here was to use BioSQL and the SQLite mappings for > serializing. We build off a tested and existing serialization, and > also guide people into using BioSQL for larger projects. > Essentially, we would build an API on top of existing BioSQL > functionality that creates the index by loading the SQL and then > pushes the parsed records into it. Using BioSQL in this way is a much more general tool than simply "indexing a sequence file". It feels like a sledgehammer to crack a nut. 
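For readers following the thread, the usage difference being discussed is roughly the following (a minimal sketch: the file name and record id are made up, and the lazy index is shown under the Bio.SeqIO.index() name this functionality later shipped as, rather than the indexed_dict() name used on the branch):

    from Bio import SeqIO

    # Existing behaviour: parse the whole file up front and keep every
    # SeqRecord in memory, keyed by record id.
    in_memory = SeqIO.to_dict(SeqIO.parse(open("example.fastq"), "fastq"))

    # Indexed behaviour: only identifiers and file offsets are held in
    # memory; each SeqRecord is re-parsed from disk when accessed.
    indexed = SeqIO.index("example.fastq", "fastq")
    record = indexed["read_0001"]  # parsed on demand (id assumed to exist)
    print(record.seq)
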
Also, do you expect it to scale well for 10 million plus short reads? It may do, but on the other hand it may not. You will also face the (file format specific but potentially significant) up front cost of parsing the full file in order to get the SeqRecord objects which are then mapped into the database. My new Bio.SeqIO.indexed_dict() code (whatever we call it) avoids this and the speed up is very nice (file format specific of course). Also while the current BioSQL mappings are "tried and tested", they don't cover everything, in particular per-letter-annotation such as a set of quality scores (something that needs addressing anyway, probably with JSON or XML serialisation). All the above make me lean towards a less ambitious target (read only dictionary access to a sequence file), which just requires having an (on disk) index of file offsets (which could be done with SQLite or anything else suitable). This choice could even be done on the fly at run time (e.g. we look at the size of the file to decide if we should use an in memory index or on disk - or start out in memory and if the number of records gets too big, switch to on disk). Peter From mjldehoon at yahoo.com Mon Aug 31 09:50:37 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 31 Aug 2009 06:50:37 -0700 (PDT) Subject: [Biopython-dev] Fw: Re: RNA module contributions Message-ID: <444088.91207.qm@web62408.mail.re1.yahoo.com> Forgot to forward this to the list. --- On Mon, 8/31/09, Michiel de Hoon wrote: > From: Michiel de Hoon > Subject: Re: [Biopython-dev] RNA module contributions > To: "Kristian Rother" > Date: Monday, August 31, 2009, 9:49 AM > Hi Kristian, > > As I am working in transcriptomics, I'll be happy to see > some more RNA modules in Biopython. Thanks! > Just one comment for now: > Recent parsers in Biopython use a function rather than a > class. > So instead of > > from Bio import ThisOrThatModule > handle = open("myinputfile") > parser = ThisOrThatModule.Parser() > record = parser.parse(handle) > > you would have > > from Bio import ThisOrThatModule > handle = open("myinputfile") > record = ThisOrThatModule.read(handle) > > This assumes that myinputfile contains only one record. If > you have input files with multiple records, you can use > > from Bio import ThisOrThatModule > handle = open("myinputfile") > records = ThisOrThatModule.parse(handle) > > where the parse function is a generator function. > > How about the following for the RNA module? > > from Bio import RNA > handle = open("myinputfile") > record = RNA.read(handle, format="vienna") > # or format="bpseq", as appropriate > > where record will be a Bio.RNA.SecStruc object. > > For consistency with other Biopython modules, you might > also consider to rename Bio.RNA.SecStruc as Bio.RNA.Record. > On the other hand, the name SecStruc is more informative, > and maybe some day there will be other kinds of records in > Bio.RNA. > > Thanks! > > --Michiel. > > --- On Mon, 8/31/09, Kristian Rother > wrote: > > > From: Kristian Rother > > Subject: [Biopython-dev] RNA module contributions > > To: "Biopython-Dev Mailing List" > > Date: Monday, August 31, 2009, 7:19 AM > > > > Hi, > > > > to start work on RNA modules, I'd like to contribute > some > > of our tested modules to BioPython. 
Before I place > them into > > my GIT branch, it would be great to get some > comments: > > > > Bio.RNA.SecStruc > > ???- represents a RNA secondary structures, > > ???- recognizing of SSEs (helix, loop, > > bulge, junction) > > ???- recognizing pseudoknots > > > > Bio.RNA.ViennaParser? ? ? > > ???- parses RNA secondary structures in the > > Vienna format into SecStruc objects. > > > > Bio.RNA.BpseqParser? ? ? ? ? - > > parses RNA secondary structures in the Bpseq format > into > > SecStruc objects. > > > > Connected to RNA, but with a wider focus: > > > > Bio.???.ChemicalGroupFinder > > ???- identifies chemical groups (ribose, > > carboxyl, etc) in a molecule graph (place to be > defined > > yet) > > > > There is a contribution from Bjoern Gruening as well: > > > > Bio.PDB.PDBMLParser > > ???- creates PDB.Structure objects from > > PDB-XML files. > > > > > > Comments and suggestions welcome! > > > > Best Regards, > > ???Kristian Rother > > > > > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > > > > From biopython at maubp.freeserve.co.uk Mon Aug 31 13:44:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 31 Aug 2009 18:44:44 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> <20090831132451.GD75451@sobchak.mgh.harvard.edu> <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> Message-ID: <320fb6e00908311044h24cd62d9n809582c7d32e5824@mail.gmail.com> On Mon, Aug 31, 2009 at 2:49 PM, Peter wrote: > All the above make me lean towards a less ambitious target > (read only dictionary access to a sequence file), which just > requires having an (on disk) index of file offsets (which could > be done with SQLite or anything else suitable). This choice > could even be done on the fly at run time (e.g. we look at the > size of the file to decide if we should use an in memory index > or on disk - or start out in memory and if the number of records > gets too big, switch to on disk). With the current code (in memory dictionary mapping keys to file offsets), the 7 million record FASTQ file (1.3GB on disk) required almost 700MB in memory. Indexing took about 1 min. This is probably OK for many potential uses. I just did a quick hack to use shelve (default settings) to hold the key to file offset mapping. RAM usage was about 10MB, the index file about 320MB (could have been a little more, my code cleaned up after itself), but indexing took about 12 minutes. http://github.com/peterjc/biopython/tree/index-shelve I also did a proof of principle implementation using SQLite to hold the key to file offset mapping. This also needed only about 10MB of RAM, the SQLite index file was about 400MB and indexing took about 8 minutes. Perhaps this can be sped up... http://github.com/peterjc/biopython/tree/index-sqlite On the bright side, these all work for all the previously supported indexable file formats, even SFF - which is pretty cool. The trade off of 1 minute and 700MB RAM (in memory) versus 8 minutes but only 10MB RAM (using SQLite) means neither solution will suit every use case. 
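For reference, the key-to-offset idea behind both experiments can be written in a few lines of plain sqlite3 (a sketch of the approach only, not the code on either branch; file and table names are made up, and FASTA is used because record starts are unambiguous):

    import sqlite3
    from Bio import SeqIO

    # Build a table mapping record id -> byte offset of its ">" line.
    con = sqlite3.connect("example.fasta.idx")
    con.execute("CREATE TABLE IF NOT EXISTS offsets"
                " (key TEXT PRIMARY KEY, offset INTEGER)")
    handle = open("example.fasta")
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(">"):
            con.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                        (line[1:].split(None, 1)[0], offset))
    con.commit()

    def fetch(key):
        # Seek to the stored offset and parse just that one record.
        (offset,) = con.execute("SELECT offset FROM offsets WHERE key=?",
                                (key,)).fetchone()
        handle.seek(offset)
        return next(SeqIO.parse(handle, "fasta"))
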
So unless the SQLite dict approach can be sped up, it may be worthwhile to support both this and the in memory index - although I haven't worked out how best to arrange my code to achieve this elegantly. Anyway, using SQLite like this seems workable (especially since for Python 2.5+ it is included in the standard library). Another option is the Berkeley DB library (especially if we can do this following the OBF OBDA standard for the index file), but while bsddb was included in Python 2.x it has been deprecated for Python 2.6+ and removed in Python 3.0+ It is still available as a third party install though... Peter
From bugzilla-daemon at portal.open-bio.org Sun Aug 2 02:46:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 1 Aug 2009 22:46:47 -0400 Subject: [Biopython-dev] [Bug 2895] New: Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2895 Summary: Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch Product: Biopython Version: 1.51b Platform: Other OS/Version: Other Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: kellrott at ucsd.edu BugsThisDependsOn: 2891,2892,2893,2894 Jython is limited to JVM method sizes, overly large methods cause JVM exceptions (java.lang.ClassFormatError: Invalid method Code length ...). The Bio.Restriction.Restriction_Dictionary module defines to much data in the base method, by breaking the defined dicts into pieces held in separate methods, then merging them, the code will correctly compile in Jython.
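For readers skimming the diff below, the change simply moves slices of the oversized dictionary literal into small helper functions and merges them back together at module level. Schematically (entries heavily abbreviated; the 'charac' tuples are copied from the patch, and the patch's key-by-key merge is equivalent to dict.update()):

    # Before: one huge literal, which Jython compiles into a single
    # oversized method.
    #     rest_dict = {'AarI': {...}, 'BbvCI': {...}, ...}

    # After: each helper returns a slice small enough for the JVM...
    def RestDict1():
        return {'AarI': {'charac': (11, 8, None, None, 'CACCTGC')}}

    def RestDict2():
        return {'BbvCI': {'charac': (2, -2, None, None, 'CCTCAGC')}}

    # ...and the module-level dictionary is rebuilt by merging the slices.
    rest_dict = {}
    for part in (RestDict1, RestDict2):
        rest_dict.update(part())

The public Bio.Restriction API is unchanged by this; only how the data tables are assembled internally differs. The actual patch follows.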
Patch: 11,12c11,14 < rest_dict = \ < {'AarI': {'charac': (11, 8, None, None, 'CACCTGC'), --- > > > def RestDict1(): > return {'AarI': {'charac': (11, 8, None, None, 'CACCTGC'), 1503,1504c1505,1508 < 'suppl': ('I',)}, < 'BbvCI': {'charac': (2, -2, None, None, 'CCTCAGC'), --- > 'suppl': ('I',)} } > > def RestDict2(): > return { 'BbvCI': {'charac': (2, -2, None, None, 'CCTCAGC'), 3500c3504,3508 < 'suppl': ('X',)}, --- > 'suppl': ('X',)} } > > > def RestDict3(): > return { 4497c4505,4508 < 'suppl': ('I',)}, --- > 'suppl': ('I',)} } > > def RestDict4(): > return { 5494,5495c5505,5508 < 'suppl': ('E', 'G', 'I', 'M', 'N', 'V')}, < 'DrdI': {'charac': (7, -7, None, None, 'GACNNNNNNGTC'), --- > 'suppl': ('E', 'G', 'I', 'M', 'N', 'V')} } > > def RestDict5(): > return { 'DrdI': {'charac': (7, -7, None, None, 'GACNNNNNNGTC'), 6479c6492,6495 < 'suppl': ('N',)}, --- > 'suppl': ('N',)} } > > def RestDict6(): > return { 7194,7195c7210,7214 < 'suppl': ('N',)}, < 'Hpy8I': {'charac': (3, -3, None, None, 'GTNNAC'), --- > 'suppl': ('N',)} } > > > def RestDict7(): > return { 'Hpy8I': {'charac': (3, -3, None, None, 'GTNNAC'), 8491c8510,8513 < 'suppl': ()}, --- > 'suppl': ()} } > > def RestDict8(): > return { 9608c9630,9634 < 'suppl': ('F',)}, --- > 'suppl': ('F',)} } > > > def RestDict9(): > return { 11992,11993c12018,12051 < suppliers = \ < {'A': ('Amersham Pharmacia Biotech', --- > > > rest_dict = {} > tmp = RestDict1() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict2() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict3() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict4() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict5() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict6() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict7() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict8() > for a in tmp: > rest_dict[a] = tmp[a] > tmp = RestDict9() > for a in tmp: > rest_dict[a] = tmp[a] > > > def Suppliers(): > return {'A': ('Amersham Pharmacia Biotech', 13626,13627c13684,13692 < typedict = \ < {'type145': (('NonPalindromic', --- > > > suppliers = Suppliers() > > > > > def TypeDict(): > return {'type145': (('NonPalindromic', 14498a14564,14567 > > typedict = TypeDict() > > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Aug 2 02:46:49 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 1 Aug 2009 22:46:49 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200908020246.n722knhV005000@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2895 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sun Aug 2 02:46:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 1 Aug 2009 22:46:50 -0400 Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch In-Reply-To: Message-ID: <200908020246.n722koqM005006@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2892 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2895 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Aug 2 02:46:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 1 Aug 2009 22:46:51 -0400 Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch In-Reply-To: Message-ID: <200908020246.n722kpGh005015@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2893 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2895 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Aug 2 02:46:52 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 1 Aug 2009 22:46:52 -0400 Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed assertion in CondonTable Fix+Patch In-Reply-To: Message-ID: <200908020246.n722kq8g005021@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2894 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2895 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Mon Aug 3 14:57:59 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 3 Aug 2009 10:57:59 -0400 Subject: [Biopython-dev] GSoC Weekly Update 11: PhyloXML for Biopython Message-ID: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> Hi all, Previously (July 27-31) I: - Added the remaining checks for restricted tokens - Modified the tree, parser and writer for phyloXML 1.10 support -- it validates now, and unit tests pass. PhyloXML 1.00 validation breaks, but that won't affect anyone except BioPerl, and they said they can deal with it on their end - Changed how the Parser and Writer classes work to resemble other Biopython parser classes more closely - Picked standard attributes for BaseTree's Tree and Node objects (informed by PhyloDB, though the names are slightly different); added properties to PhyloXML's Clade to mimic both types - Made SeqRecord conversion actually work (with reasonable round-tripping capability); added a unit test - Changed __str__ methods to not include the object's class name if there's another representative label to use (e.g. 
name) -- that's easy enough to add in the caller - Sorted out the TreeIO read/parse/write API and added some support for the Newick format, as recommended by Peter on biopython-dev - Split some "plumbing" (depth_first_search) off from the Tree.find() method. Since there are a lot of potentially useful methods to have on phylogenetic tree objects, I think it's best to distinguish between "porcelain" (specific, easy-to-use methods for common operations) and "plumbing" (generalized or low-level methods/algorithms that porcelain can rely on) in the Tree class in Bio.Tree.BaseTree. - Started a function for networkx export. The edges are screwy right now, so I haven't checked it in yet. This week (Aug. 3-7) I will: Scan the code base for lingering TODO/ENH/XXX comments Discuss merging back upstream Work on enhancements (time permitting): - Clean up the Parser class a bit more, to resemble Writer - Finish networkx export - Port common methods to Bio.Tree.BaseTree (from Bio.Nexus.Trees and other packages) Run automated testing: - Re-run performance benchmarks - Run tests and benchmarks on alternate platforms - Check epydoc's generated API documentation and fix docstrings Update wiki documentation with new features: - Tree: base classes, find() etc., - TreeIO: 'phyloxml', 'nexus', 'newick' wrappers; PhyloXMLIO extras; warn that Nexus/Newick wrappers don't return Bio.Tree objects yet - PhyloXML: singular properties, improved str() Remarks: - Most of the work done this week and last, shuffling base classes and adding various checks, actually made the I/O functions a little slower. I don't think this will be a big deal, and the changes were necessary, but it's still a little disappointing. - The networkx export will look pretty cool. After exporting a Biopython tree to a networkx graph, it takes a couple more imports and commands to draw the tree to the screen or a file. Would anyone find it handy to have a short function in Bio.Tree or Bio.Graphics to go straight from a tree to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe graphviz) - I have to admit this: I don't know anything about BioSQL. How would I use and test the PhyloDB extension, and what's involved in writing a Biopython interface for it? Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From krother at rubor.de Mon Aug 3 15:11:15 2009 From: krother at rubor.de (Kristian Rother) Date: Mon, 03 Aug 2009 17:11:15 +0200 Subject: [Biopython-dev] RFC: RNAStructure package for BioPython Message-ID: <4A76FE13.6050203@rubor.de> Hi, We have created a lot of code that works on RNA structures in Poznan, Poland. There are some jewels that I consider useful and mature enough to meet a wider audience. I'd be interested in refactorizing and packaging them as a RNAStructure package and contribute it to BioPython. I just discussed the possibilities with Magdalena Musielak & Tomasz Puton who wrote & tested significant portions of the code. They came up with a list of 'most wanted' Use Cases: - Calculate RNA base pairs - Generate RNA secondary structures from 3D structures - Recognize pseudoknots - Recognize modified nucleotides in RNA 3D structures. - Superimpose two RNA molecules. The existing code massively uses Bio.PDB already, and has little dependancies apart from that. Any comments how this kind of functionality would fit into BioPython are welcome. 
Best Regards, Kristian Rother www.rubor.de Structural Bioinformatics Group UAM Poznan From bugzilla-daemon at portal.open-bio.org Mon Aug 3 16:28:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 3 Aug 2009 12:28:39 -0400 Subject: [Biopython-dev] [Bug 2896] New: BLAST XML parser: stripped leading/trailing spaces in Hsp_midline Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2896 Summary: BLAST XML parser: stripped leading/trailing spaces in Hsp_midline Product: Biopython Version: 1.50 Platform: All OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: volkmer at mpi-cbg.de Parsing a XML output file from NCBI BLAST using blastp & complexity filters on omits leading/trailing spaces in the hsp match line: hsp.query u'XXXXPSPTSLATSHPPLSSMSPYMTI------PQQYLYISKIRSKLSQCALT-RHHH-RELDLRKMV' hsp.match u'P+ T L S PPL S+S + PQ+ L+ + R+K+ + + RHHH R LDL ++V' This makes it more awkward to evaluate the alignment. It would be the best when query, subject and alignment always have the same length. The BLAST XML output file at least has the correct Hsp_midline:
<Hsp_qseq>XXXXPSPTSLATSHPPLSSMSPYMTI------PQQYLYISKIRSKLSQCALT-RHHH-RELDLRKMV</Hsp_qseq>
<Hsp_hseq>EFFEPAITGLYYS-PPLFSVSRLTGLLHLLERPQETLF-TNYRNKIKRLDIPLRHHHIRHLDLEQLV</Hsp_hseq>
<Hsp_midline>    P+ T L  S PPL S+S    +      PQ+ L+ +  R+K+ +  +  RHHH R LDL ++V</Hsp_midline>
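The mismatch is easiest to see programmatically; a minimal check with the Bio.Blast XML parser (input file name assumed) fails on such output because the stripped mid-line no longer matches the query/subject length:

    from Bio.Blast import NCBIXML

    for record in NCBIXML.parse(open("blastp_output.xml")):
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                # With the stripping problem, hsp.match has lost its
                # leading/trailing spaces and is shorter than hsp.query.
                assert len(hsp.query) == len(hsp.match) == len(hsp.sbjct)
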
And as the plaintext parser gives the complete alignment line it would be nice to get the same behaviour. Thanks, Michael -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Aug 3 17:20:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 3 Aug 2009 13:20:24 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908031720.n73HKOFr019079@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-03 13:20 EST ------- Could you attach a complete XML file we could use for a unit test please? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Aug 3 20:48:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 3 Aug 2009 21:48:49 +0100 Subject: [Biopython-dev] Deprecating Bio.Fasta? In-Reply-To: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> References: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> Message-ID: <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com> On 22 June 2009, I wrote: > ... > I'd like to officially deprecate Bio.Fasta for the next release (Biopython > 1.51), which means you can continue to use it for a couple more > releases, but at import time you will see a warning message. See also: > http://biopython.org/wiki/Deprecation_policy > > Would this cause anyone any problems? If you are still using Bio.Fasta, > it would be interesting to know if this is just some old code that hasn't > been updated, or if there is some stronger reason for still using it. No one replied, so I plan to make this change in CVS shortly, meaning that Bio.Fasta will be deprecated in Biopython 1.51, i.e. it will still work but will trigger a deprecation warning at import. Please speak up ASAP if this concerns you. Thanks, Peter From chapmanb at 50mail.com Mon Aug 3 22:38:47 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 3 Aug 2009 18:38:47 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11: PhyloXML for Biopython In-Reply-To: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> Message-ID: <20090803223847.GM8112@sobchak.mgh.harvard.edu> Hi Eric; Thanks for the update. Things are looking in great shape as we get towards the home stretch. > - Most of the work done this week and last, shuffling base classes and > adding various checks, actually made the I/O functions a little slower. > I don't think this will be a big deal, and the changes were necessary, > but it's still a little disappointing. The unfortunate influence of generalization. I think the adjustment to the generalized Tree is a big win and gives a solid framework for any future phylogenetic modules. I don't know what the numbers are but as long as performance is reasonable, few people will complain. This is always something to go back around on if it becomes a hangup in the future. > - The networkx export will look pretty cool. 
After exporting a Biopython > tree to a networkx graph, it takes a couple more imports and commands to > draw the tree to the screen or a file. Would anyone find it handy to have > a short function in Bio.Tree or Bio.Graphics to go straight from a tree > to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe graphviz) Awesome. Looking forward to seeing some trees that come out of this. It's definitely worthwhile to formalize the functionality to go straight from a tree to png or pdf. This will add some more localized dependencies, so I'm torn as to whether it would be best as a utility function or an example script. Peter might have an opinion here. Either way, this would be really useful as a cookbook example with a final figure. Being able to produce some pretty is a good way to convince people to store trees in a reasonable format like PhyloXML. > - I have to admit this: I don't know anything about BioSQL. How would I use > and test the PhyloDB extension, and what's involved in writing a > Biopython interface for it? BioSQL and the PhyloDB extension are a set of relational database tables. Looking at the SVN logs, it appears as if the main work on PhyloDB has occurred on PostgreSQL with the MySQL tables perhaps lagging behind, so my suggestion is to start with PostgreSQL. Hilmar, please feel free to correct me here. The schemas are available from SVN: http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/sql You'd want biosqldb-pg.sql and presumably also biosqldb-views-pg.sql for BioSQL and biosql-phylodb-pg.sql and biosql-phylodata-pg.sql. The Biopython docs are pretty nice on this -- you create the empty tables: http://biopython.org/wiki/BioSQL#PostgreSQL >From there you should be able to browse to get a sense of what is there. In terms of writing an interface, the first step is loading the data where you can mimic what is done with SeqIO and BioSQL: http://biopython.org/wiki/BioSQL#Loading_Sequences_into_a_database Pass the database an iterator of trees and they are stored. Secondarily is retrieving and querying persisted trees. Here you would want TreeDB objects that act like standard trees, but retrieve information from the database on demand. Here are Seq/SeqRecord models in BioSQL: http://github.com/biopython/biopython/tree/master/BioSQL/BioSeq.py So it's a bit of an extended task. Time frames being what they are, any steps in this direction are useful. If you haven't played with BioSQL before, it's worth a look for your own interest. The underlying key/value model is really flexible and kind of models RDF triplets. I've used BioSQL here recently as the backend for a web app that differs a bit from the standard GenBank like thing, and found it very flexible. Again, great stuff. 
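For concreteness, the loading recipe referenced above comes down to only a few lines (a sketch: connection details and the input file are placeholders, and db.load() accepts any SeqRecord iterator):

    from BioSQL import BioSeqDatabase
    from Bio import SeqIO

    # Connection details are placeholders - see the wiki page linked above.
    server = BioSeqDatabase.open_database(driver="psycopg2", user="me",
                                          passwd="secret", host="localhost",
                                          db="bioseqdb")
    db = server.new_database("demo", description="test namespace")
    count = db.load(SeqIO.parse(open("example.gbk"), "genbank"))
    server.adaptor.commit()  # newer Biopython also offers server.commit()
    print("Loaded %i records" % count)
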
Let me know if I can add to any of that, Brad From bugzilla-daemon at portal.open-bio.org Tue Aug 4 08:45:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 4 Aug 2009 04:45:03 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908040845.n748j36R015856@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 ------- Comment #2 from volkmer at mpi-cbg.de 2009-08-04 04:45 EST ------- Created an attachment (id=1353) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1353&action=view) blastp xml sample -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Tue Aug 4 12:32:39 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 4 Aug 2009 08:32:39 -0400 Subject: [Biopython-dev] RFC: RNAStructure package for BioPython In-Reply-To: <4A76FE13.6050203@rubor.de> References: <4A76FE13.6050203@rubor.de> Message-ID: <20090804123239.GN8112@sobchak.mgh.harvard.edu> Hi Kristian; > We have created a lot of code that works on RNA structures in Poznan, > Poland. There are some jewels that I consider useful and mature enough > to meet a wider audience. I'd be interested in refactorizing and > packaging them as a RNAStructure package and contribute it to BioPython. This sounds great. I don't know enough about the area to comment directly on your use cases -- my experience is limited to folding structures with RNAFold and the like -- but it sounds like a solid feature set. > I just discussed the possibilities with Magdalena Musielak & Tomasz > Puton who wrote & tested significant portions of the code. They came up > with a list of 'most wanted' Use Cases: > > - Calculate RNA base pairs > - Generate RNA secondary structures from 3D structures > - Recognize pseudoknots > - Recognize modified nucleotides in RNA 3D structures. > - Superimpose two RNA molecules. > > The existing code massively uses Bio.PDB already, and has little > dependancies apart from that. You may also want to have a look at PyCogent, which has wrappers and parsers for several command line programs involved with RNA structure, along with a representation of RNA secondary structure: http://pycogent.svn.sourceforge.net/viewvc/pycogent/trunk/cogent/struct/rna2d.py?view=markup It would be great to complement this functionality, and interact with PyCogent where feasible. We could offer more specific suggestions as you get rolling with this and there is code to review. Glad to have you interested, Brad From tiagoantao at gmail.com Tue Aug 4 15:29:36 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 4 Aug 2009 16:29:36 +0100 Subject: [Biopython-dev] 1.52 Message-ID: <6d941f120908040829g6531804dpe51e9f24720dab78@mail.gmail.com> Hi, I am currently working on the implementation of Genepop support on Bio.PopGen. Genepop support will allow calculation of basic frequentist statistics. This is the biggest addition to Bio.PopGen and makes the module useful for a wide range of applications. In fact I never tried to publicize Bio.PopGen in the population genetics community, but with this addon, that will change. The status is as follows: 1. Code done 90% done. Check http://github.com/tiagoantao/biopython/tree/genepop 2. Test code around 30% coverage 3. 
Documentation 50% Check http://biopython.org/wiki/PopGen_dev_Genepop for a tutorial under development. This will be ready for 1.52. And I would like to make the code available after the Summer vacation. And it is about 1.52 that this mail is about ;) I remember Peter writing about 1.52 being ad-hoc scheduled for fall. I have September blocked with work, but I managed to have October clear mostly just for this. So my request is: if there is more or less a Fall release please don't schedule it for the first week in the Fall (which is still in September) ;) . Mid-October or somewhere around that time would be good. Thanks a lot, Tiago -- "A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin From matzke at berkeley.edu Tue Aug 4 17:01:34 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 04 Aug 2009 10:01:34 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> Message-ID: <4A78696E.8010808@berkeley.edu> Hi all, update: Major improvements/fixes: - removed any reliance on lagrange tree module, refactored all phylogeny code to use the revised Bio.Nexus.Tree module - tree functions put in TreeSum (tree summary) class - added functions for calculating phylodiversity measures, including necessary subroutines like subsetting trees, randomly selecting tips from a larger pool - Code dealing with GBIF xml output completely refactored into the following classes: * ObsRecs (observation records & search results/summary) * ObsRec (an individual observation record) * XmlString (functions for cleaning xml returned by Gbif) * GbifXml (extention of capabilities for ElementTree xml trees, parsed from GBIF xml returns. - another suggestion implemented: dependencies on tempfiles eliminated by using cStringIO (temporary file-like strings, not stored as temporary files) file_str objects instead - another suggestion implemented: the _open method from biopython's ncbi www functionality has been copied & modified so that it is now a method of ObsRecs, and doesn't contain NCBI-specific defaults etc. (it does still include a 3-second waiting time between GBIF requests, figuring that is good practice). - function to download large numbers of records in increments implemented as method of ObsRecs. This week: - Put GIS functions in a class (easy), allowing each ObsRec to be classified into an are (easy) - Improve extraction of data from GBIF xmltree -- my Utricularia "practice XML file" didn't have problems, but with running online searches, I am discovering some fields are not always filled in, etc. This shouldn't be too hard, using the GbifXml xmltree searching functions, and including defaults for exceptions. - Function for converting points to KML for Google Earth display. Code uploaded here: http://github.com/nmatzke/biopython/commits/Geography -- ==================================================== Nicholas J. Matzke Ph.D. 
Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From matzke at berkeley.edu Tue Aug 4 18:28:33 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 04 Aug 2009 11:28:33 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <58AA6396-760D-40BB-B07A-EF22282E78D5@duke.edu> References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <58AA6396-760D-40BB-B07A-EF22282E78D5@duke.edu> Message-ID: <4A787DD1.40301@berkeley.edu> Hilmar Lapp wrote: > > On Aug 4, 2009, at 1:01 PM, Nick Matzke wrote: > >> * ObsRecs (observation records & search results/summary) >> * ObsRec (an individual observation record) > > > I'll let the Biopython folks make the call on this, but in general I'd > recommend to everyone trying to write reusable code to spell out names, > especially non-local names. > > The days in which the length of a variable or class name was somehow > limited or affected the speed of a program are definitely over since > more than a decade. I know the temptation is big to save on a few > keystrokes every time you have to type the name, but the time that you > will cause your fellow programmers who will later try to understand your > code is vastly greater. What prevents me from thinking that ObsRec is a > class for an obsolete recording? Good point, this is easy to fix, I will put it on the list. Cheers! Nick > > Just my $0.02 :-) > > -hilmar -- ==================================================== Nicholas J. Matzke Ph.D. 
Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From biopython at maubp.freeserve.co.uk Tue Aug 4 18:44:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 4 Aug 2009 19:44:29 +0100 Subject: [Biopython-dev] RFC: RNAStructure package for BioPython In-Reply-To: <4A76FE13.6050203@rubor.de> References: <4A76FE13.6050203@rubor.de> Message-ID: <320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com> On Mon, Aug 3, 2009 at 4:11 PM, Kristian Rother wrote: > Hi, > > We have created a lot of code that works on RNA structures in Poznan, > Poland. There are some jewels that I consider useful and mature enough to > meet a wider audience. I'd be interested in refactorizing and packaging them > as a RNAStructure package and contribute it to BioPython. I remember we talked about this briefly at BOSC/ISMB - it sounds good. Did you get a chance to talk to Thomas Hamelryck about this? > I just discussed the possibilities with Magdalena Musielak & Tomasz Puton > who wrote & tested significant portions of the code. They came up with a > list of 'most wanted' Use Cases: > > - Calculate RNA base pairs > - Generate RNA secondary structures from 3D structures > - Recognize pseudoknots > - Recognize modified nucleotides in RNA 3D structures. > - Superimpose two RNA molecules. > > The existing code massively uses Bio.PDB already, and has little > dependancies apart from that. > > Any comments how this kind of functionality would fit into BioPython are > welcome. I see you have already started a github branch, which is great: http://github.com/krother/biopython/tree/rol Am I right in thinking all of this code is for 3D RNA work? Maybe that might give a good module name... Bio.RNA3D? Or Bio.PDB.RNA? Did you have something in mind? Peter P.S. Who won the ISMB Art and Science Exhibition prize? http://www.iscb.org/ismbeccb2009/artscience.php From biopython at maubp.freeserve.co.uk Tue Aug 4 19:29:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 4 Aug 2009 20:29:47 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? 
In-Reply-To: <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> <20090708130649.GY17086@sobchak.mgh.harvard.edu> <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> Message-ID: <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com> On Thu, Jul 9, 2009 at 10:18 AM, Peter wrote: > On Wed, Jul 8, 2009 at 2:06 PM, Brad Chapman wrote: >> How about adding a function like "run_arguments" to the >> commandlines that returns the commandline as a list. > > That would be a simple alternative to my vague idea "Maybe we > can make the command line wrapper object more list like to make > subprocess happy without needing to create a string?", which may > not be possible. Either way, this will require a bit of work on the > Bio.Application parameter objects... By defining an __iter__ method, we can make the Biopython application wrapper object sufficiently list-like that it can be passed directly to subprocess. I think I have something working (only tested on Linux so far), at least for the case where none of the arguments have spaces or quotes in them. If this works, it should make things a little easier in that we don't have to do str(cline), and also I think it avoids the OS specific behaviour of the shell argument as Brad noted earlier: >> This avoids the shell nastiness with the argument list, is as >> simple as it gets with subprocess, and gives users an easy >> path to getting stdout, stderr and the return codes. i.e. I am hoping we can replace this: child = subprocess.Popen(str(cline), shell(sys.platform!="win32"), ...) with just: child = subprocess.Popen(cline, ...) where the "..." represents any messing about with stdin, stdout and stderr. Peter From chapmanb at 50mail.com Tue Aug 4 22:27:31 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 4 Aug 2009 18:27:31 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <4A78696E.8010808@berkeley.edu> References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> Message-ID: <20090804222731.GA12604@sobchak.mgh.harvard.edu> Hi Nick; Thanks for the update -- great to see things moving along. > - removed any reliance on lagrange tree module, refactored all phylogeny > code to use the revised Bio.Nexus.Tree module Awesome -- glad this worked for you. Are the lagrange_* files in Bio.Geography still necessary? If not, we should remove them from the repository to clean things up. More generally, it would be really helpful if we could do a bit of housekeeping on the repository. The Geography namespace has a lot of things in it which belong in different parts of the tree: - The test code should move to the 'Tests' directory as a set of test_Geography* files that we can use for unit testing the code. 
- Similarly there are a lot of data files in there which are appear to be test related; these could move to Tests/Geography - What is happening with the Nodes_v2 and Treesv2 files? They look like duplicates of the Nexus Nodes and Trees with some changes. Could we roll those changes into the main Nexus code to avoid duplication? > - Code dealing with GBIF xml output completely refactored into the > following classes: > > * ObsRecs (observation records & search results/summary) > * ObsRec (an individual observation record) > * XmlString (functions for cleaning xml returned by Gbif) > * GbifXml (extention of capabilities for ElementTree xml trees, parsed > from GBIF xml returns. I'm agreed with Hilmar -- the user classes would probably benefit from expanded naming. There is a art to naming to get them somewhere between the hideous RidicuouslyLongNamesWithEverythingSpecified names and short truncated names. Specifically, you've got a lot of filler in the names -- dbfUtils, geogUtils, shpUtils. The Utils probably doesn't tell the user much and makes all of the names sort of blend together, just as the Rec/Recs pluralization hides a quite large difference in what the classes hold. Something like Observation and ObservationSearchResult would make it clear immediately what they do and the information they hold. > This week: What are your thoughts on documentation? As a naive user of these tools without much experience with the formats, I could offer better feedback if I had an idea of the public APIs and how they are expected to be used. Moreover, cookbook and API documentation is something we will definitely need to integrate into Biopython. How does this fit in your timeline for the remaining weeks? Thanks again. Hope this helps, Brad From hlapp at gmx.net Tue Aug 4 23:34:26 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 4 Aug 2009 19:34:26 -0400 Subject: [Biopython-dev] RFC: RNAStructure package for BioPython In-Reply-To: <320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com> References: <4A76FE13.6050203@rubor.de> <320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com> Message-ID: On Aug 4, 2009, at 2:44 PM, Peter wrote: > P.S. Who won the ISMB Art and Science Exhibition prize? > http://www.iscb.org/ismbeccb2009/artscience.php Guess who - Kristian did :-) -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From krother at rubor.de Wed Aug 5 08:07:12 2009 From: krother at rubor.de (Kristian Rother) Date: Wed, 05 Aug 2009 10:07:12 +0200 Subject: [Biopython-dev] RFC: RNAStructure package for BioPython Message-ID: <4A793DB0.5000805@rubor.de> Hi Peter, I remember we talked about this briefly at BOSC/ISMB - it sounds good. Did you get a chance to talk to Thomas Hamelryck about this? We talked on ISMB, but no details yet. Am I right in thinking all of this code is for 3D RNA work? Maybe that might give a good module name... Bio.RNA3D? Or Bio.PDB.RNA? Did you have something in mind? I was thinking of 'RNAStructure' - I also like 'RNA' as long as it does not violate any claims. P.S. Who won the ISMB Art and Science Exhibition prize? 
http://www.iscb.org/ismbeccb2009/artscience.php The winning picture can be found here: http://www.rubor.de/twentycharacters_en.html Best Regards, Kristian From biopython at maubp.freeserve.co.uk Wed Aug 5 08:15:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 5 Aug 2009 09:15:36 +0100 Subject: [Biopython-dev] RFC: RNAStructure package for BioPython In-Reply-To: References: <4A76FE13.6050203@rubor.de> <320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com> Message-ID: <320fb6e00908050115y612d89b2h757f5aa59fbb99ed@mail.gmail.com> On Wed, Aug 5, 2009 at 12:34 AM, Hilmar Lapp wrote: > > On Aug 4, 2009, at 2:44 PM, Peter wrote: > >> P.S. Who won the ISMB Art and Science Exhibition ?prize? >> http://www.iscb.org/ismbeccb2009/artscience.php > > Guess who - Kristian did :-) > > ? ? ? ?-hilmar Ha! That's cool. Congratulations Kristian! Peter From biopython at maubp.freeserve.co.uk Wed Aug 5 10:29:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 5 Aug 2009 11:29:45 +0100 Subject: [Biopython-dev] Deprecating Bio.Fasta? In-Reply-To: <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com> References: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com> Message-ID: <320fb6e00908050329m44fa2596ife06917306ae44ab@mail.gmail.com> On Mon, Aug 3, 2009 at 9:48 PM, Peter wrote: > On 22 June 2009, I wrote: >> ... >> I'd like to officially deprecate Bio.Fasta for the next release (Biopython >> 1.51), which means you can continue to use it for a couple more >> releases, but at import time you will see a warning message. See also: >> http://biopython.org/wiki/Deprecation_policy >> ... > > No one replied, so I plan to make this change in CVS shortly, meaning > that Bio.Fasta will be deprecated in Biopython 1.51, i.e. it will still work > but will trigger a deprecation warning at import. > > Please speak up ASAP if this concerns you. I've just committed the deprecation of Bio.Fasta to CVS. This could be reverted if anyone has a compelling reason (and tells us before we do the final release of Biopython 1.51). The docstring for Bio.Fasta should cover the typical situations for moving from Bio.Fasta to Bio.SeqIO, but please feel free to ask on the mailing list if you have a more complicated bit of old code that needs to be ported. Thanks, Peter From bugzilla-daemon at portal.open-bio.org Wed Aug 5 11:29:41 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Aug 2009 07:29:41 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908051129.n75BTf8i026537@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-05 07:29 EST ------- Thanks for the sample XML file. I could reproduce this, I think I have fixed it. hsp.query, hsp.match and hsp.sbjct should all be the same length. Previously, at the end of each tag our XML parser strips the leading/trailing white space from the tag's value before processing it. In the case of Hsp_midline this is a very bad idea. However, the reason it did this was that the way the current tag value was built up wasn't context aware. In particular case, there was white space outside tags like Hsp_midline, which really belong to the parent tag (Hsp), but was wrongly being combined. Would you be able to test this please? 
All you really need to try this is the new Bio/Blast/NCBIXML.py file (CVS revision 1.23). It might be easiest just to update to the latest code in CVS (or on github), but I could attach the file here if you like. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Aug 5 13:13:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Aug 2009 09:13:40 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908051313.n75DDeFt031305@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 ------- Comment #4 from volkmer at mpi-cbg.de 2009-08-05 09:13 EST ------- Hi Peter, could you please attach the file? The latest version of NCBIXML.py I get from cvs at code.open-bio.org still seems to be from April 2009. When I try to specify revision 1.23 I get a checkout warning and no file. Or is there a testing branch for this? Michael -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Aug 5 13:27:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Aug 2009 09:27:45 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908051327.n75DRjjg031915@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-05 09:27 EST ------- Created an attachment (id=1357) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1357&action=view) Updated version of NCBIXML.py as in CVS revision 1.23 (In reply to comment #4) > Hi Peter, > > could you please attach the file? Sure. > The latest version of NCBIXML.py I get from cvs at code.open-bio.org > still seems to be from April 2009. When I try to specify revision > 1.23 I get a checkout warning and no file. Or is there a testing > branch for this? Using code.open-bio.org (or its various aliases like cvs.biopython.org) actually gives you access to a read only mirror of the real CVS data, which is on dev.open-bio.org (for use by those with commit rights). I'm not sure exactly how often the public mirror is updated, but I would guess hourly. I would guess if you try again later it would work, but in the meantime I have attached the new file to this bug. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Wed Aug 5 22:31:31 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 5 Aug 2009 18:31:31 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11: PhyloXML for Biopython In-Reply-To: <20090803223847.GM8112@sobchak.mgh.harvard.edu> References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> <20090803223847.GM8112@sobchak.mgh.harvard.edu> Message-ID: <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com> On Mon, Aug 3, 2009 at 6:38 PM, Brad Chapman wrote: > Hi Eric; > Thanks for the update. 
Things are looking in great shape as we get > towards the home stretch. > > > - Most of the work done this week and last, shuffling base classes > and > > adding various checks, actually made the I/O functions a little > slower. > > I don't think this will be a big deal, and the changes were > necessary, > > but it's still a little disappointing. > > The unfortunate influence of generalization. I think the adjustment > to the generalized Tree is a big win and gives a solid framework for > any future phylogenetic modules. I don't know what the numbers are > but as long as performance is reasonable, few people will complain. > This is always something to go back around on if it becomes a hangup > in the future. > The complete unit test suite used to take about 4.5 seconds, and now it takes 5.8 seconds, though I've added a few more tests since then. I don't think it will feel like it's hanging for most operations, besides parsing or searching a huge tree. > - The networkx export will look pretty cool. After exporting a > Biopython > > tree to a networkx graph, it takes a couple more imports and > commands to > > draw the tree to the screen or a file. Would anyone find it handy > to have > > a short function in Bio.Tree or Bio.Graphics to go straight from a > tree > > to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe > graphviz) > > Awesome. Looking forward to seeing some trees that come out of this. > It's definitely worthwhile to formalize the functionality to go > straight from a tree to png or pdf. This will add some more > localized dependencies, so I'm torn as to whether it would be best > as a utility function or an example script. Peter might have an > opinion here. > > Either way, this would be really useful as a cookbook example with a > final figure. Being able to produce some pretty is a good way to > convince people to store trees in a reasonable format like PhyloXML. > OK, it works now but the resulting trees look a little odd. The options needed to get a reasonable tree representation are fiddly, so I made draw_graphviz() a separate function that basically just handles the RTFM work (not trivial), while the graph export still happens in to_networkx(). Here are a few recipes and a taste of each dish. The matplotlib engine seems usable for interactive exploration, albeit cluttered -- I can't hide the internal clade identifiers since graphviz needs unique labels, though maybe I could make them less prominent. Drawing directly to PDF gets cluttered for big files, and if you stray from the default settings (I played with it a bit to get it right), it can look surreal. There would still be some benefit to having a reportlab-based tree module in Bio.Graphics, and maybe one day I'll get around to that. $ ipython -pylab from Bio import Tree, TreeIO apaf = TreeIO.read('apaf.xml', 'phyloxml') Tree.draw_graphviz(apaf) # http://etal.myweb.uga.edu/phylo-nx-apaf.png Tree.draw_graphviz(apaf, 'apaf.pdf') # http://etal.myweb.uga.edu/apaf.pdf Tree.draw_graphviz(apaf, 'apaf.png', format='png', prog='dot') # http://etal.myweb.uga.edu/apaf.png -- why it's best to leave the defaults alone Thoughts: the internal node labels could be clear instead of red; if a node doesn't have a name, it could check its taxonomy attribute to see if anything's there; there's probably a way to make pygraphviz understand distinct nodes that happen to have the same label, although I haven't found it yet. Is PDF a good default format, or would PNG or PostScript be better? 
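One thing I might try next (just a sketch, not what draw_graphviz currently
does): build the AGraph directly with pygraphviz, using id(clade) as the
unique graphviz node name and an empty label for unnamed clades. This assumes
clades expose .name and .clades and that the root is reachable as tree.clade,
as in my current branch; the helper name here is made up.

import pygraphviz

def draw_blank_internals(tree, filename, prog='dot'):
    # Sketch: unique node identities via id(), blank labels for unnamed clades
    graph = pygraphviz.AGraph(directed=True)
    def add_clade(clade):
        node_id = str(id(clade))
        graph.add_node(node_id, label=(clade.name or ""))
        for child in clade.clades:
            add_clade(child)
            graph.add_edge(node_id, str(id(child)))
    add_clade(tree.clade)
    graph.layout(prog=prog)
    graph.draw(filename)  # output format inferred from extension, e.g. .png or .pdf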
> - I have to admit this: I don't know anything about BioSQL. How would > I use > > and test the PhyloDB extension, and what's involved in writing a > > Biopython interface for it? > > BioSQL and the PhyloDB extension are a set of relational database > tables. Looking at the SVN logs, it appears as if the main work on > PhyloDB has occurred on PostgreSQL with the MySQL tables perhaps > lagging behind, so my suggestion is to start with PostgreSQL. > Hilmar, please feel free to correct me here. > > [...] > > So it's a bit of an extended task. Time frames being what they are, > any steps in this direction are useful. If you haven't played with > BioSQL before, it's worth a look for your own interest. The underlying > key/value model is really flexible and kind of models RDF triplets. I've > used BioSQL here recently as the backend for a web app that differs a > bit from the standard GenBank like thing, and found it very flexible. > > I think I've seen that app, but I thought it was backed by AppEngine. Neat stuff. I will learn BioSQL for my own benefit, but I don't think there's enough time left in GSoC for me to add a useful PhyloDB adapter to Biopython. So that, along with refactoring Nexus.Trees to use Bio.Tree.BaseTree, would be a good project to continue with in the fall, at a slower pace and with more discussion along the way. Cheers, Eric From bugzilla-daemon at portal.open-bio.org Thu Aug 6 07:56:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 6 Aug 2009 03:56:25 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908060756.n767uPk1031552@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 ------- Comment #6 from volkmer at mpi-cbg.de 2009-08-06 03:56 EST ------- (In reply to comment #3) > I could reproduce this, I think I have fixed > it. > hsp.query, hsp.match and hsp.sbjct should all be the same length. > > Previously, at the end of each tag our XML parser strips the leading/trailing > white space from the tag's value before processing it. In the case of > Hsp_midline this is a very bad idea. Ok, the fix seems to solve the problem. Well I guess the only time when this problem appears is when you have filtered/masked residues at the beginning/end of the query hsp. Otherwise the hsp would just start with the first match and end with the last one. Thanks, Michael -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Aug 6 08:03:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 6 Aug 2009 04:03:03 -0400 Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped leading/trailing spaces in Hsp_midline In-Reply-To: Message-ID: <200908060803.n76833YJ032257@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2896 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-06 04:03 EST ------- (In reply to comment #6) > > Ok, the fix seems to solve the problem. > Great - I'm marking this bug as fixed, thanks for your time reporting and then testing this. 
> Well I guess the only time when this problem appears is when you have > filtered/masked residues at the beginning/end of the query hsp. Otherwise > the hsp would just start with the first match and end with the last one. I suspect there are other situations it might happen, but the fix is general. Cheers, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Aug 6 08:06:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 6 Aug 2009 09:06:43 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11: PhyloXML for Biopython In-Reply-To: <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com> References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> <20090803223847.GM8112@sobchak.mgh.harvard.edu> <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com> Message-ID: <320fb6e00908060106h5b10442djea3f52fe9827108f@mail.gmail.com> On Wed, Aug 5, 2009 at 11:31 PM, Eric Talevich wrote: > OK, it works now but the resulting trees look a little odd. The options > needed to get a reasonable tree representation are fiddly, so I made > draw_graphviz() a separate function that basically just handles the RTFM > work (not trivial), while the graph export still happens in to_networkx(). > > Here are a few recipes and a taste of each dish. The matplotlib engine seems > usable for interactive exploration, albeit cluttered -- I can't hide the > internal clade identifiers since graphviz needs unique labels, though maybe > I could make them less prominent. ... Graphviv does need unique names, and the node labels default to the node name - but you can override this and use a blank label if you want. How are you calling Graphviz? There are several Python wrappers out there, or you could just write a dot file directly and call the graphviz command line tools. Peter From eric.talevich at gmail.com Thu Aug 6 12:47:47 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 6 Aug 2009 08:47:47 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11: PhyloXML for Biopython In-Reply-To: <320fb6e00908060106h5b10442djea3f52fe9827108f@mail.gmail.com> References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> <20090803223847.GM8112@sobchak.mgh.harvard.edu> <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com> <320fb6e00908060106h5b10442djea3f52fe9827108f@mail.gmail.com> Message-ID: <3f6baf360908060547r8f299dao413b3657966fe9f4@mail.gmail.com> On Thu, Aug 6, 2009 at 4:06 AM, Peter wrote: > On Wed, Aug 5, 2009 at 11:31 PM, Eric Talevich > wrote: > > > OK, it works now but the resulting trees look a little odd. The options > > needed to get a reasonable tree representation are fiddly, so I made > > draw_graphviz() a separate function that basically just handles the RTFM > > work (not trivial), while the graph export still happens in > to_networkx(). > > > > Here are a few recipes and a taste of each dish. The matplotlib engine > seems > > usable for interactive exploration, albeit cluttered -- I can't hide the > > internal clade identifiers since graphviz needs unique labels, though > maybe > > I could make them less prominent. ... > > Graphviv does need unique names, and the node labels default to the > node name - but you can override this and use a blank label if you want. > How are you calling Graphviz? 
There are several Python wrappers out > there, or you could just write a dot file directly and call the graphviz > command line tools. > I'm using the networkx and pygraphviz wrappers, since networkx already partly wraps pygraphviz. The direct networkx->matplotlib rendering engine figures out the associations correctly when I pass a LabeledDiGraph instance, using Clade objects as nodes and the str() representation as the label -- so networkx.draw(tree) shows a tree with the internal nodes all labeled as "Clade". But networkx.draw_graphviz(tree), while otherwise working the same as the other networkx drawing functions, seems to convert nodes to strings earlier, and then treats all "Clade" strings as the same node. Surely there's a way to fix this through the networkx or pygraphviz API, but I couldn't figure it out yesterday from the documentation and source code. I'll poke at it some more today and try using blank labels. Thanks, Eric From chapmanb at 50mail.com Thu Aug 6 13:14:42 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 6 Aug 2009 09:14:42 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11: PhyloXML for Biopython In-Reply-To: <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com> References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com> <20090803223847.GM8112@sobchak.mgh.harvard.edu> <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com> Message-ID: <20090806131442.GG12604@sobchak.mgh.harvard.edu> Hi Eric; > OK, it works now but the resulting trees look a little odd. The options > needed to get a reasonable tree representation are fiddly, so I made > draw_graphviz() a separate function that basically just handles the RTFM > work (not trivial), while the graph export still happens in to_networkx(). > > Here are a few recipes and a taste of each dish. The matplotlib engine seems > usable for interactive exploration, albeit cluttered -- I can't hide the > internal clade identifiers since graphviz needs unique labels, though maybe > I could make them less prominent. Drawing directly to PDF gets cluttered for > big files, and if you stray from the default settings (I played with it a > bit to get it right), it can look surreal. There would still be some benefit > to having a reportlab-based tree module in Bio.Graphics, and maybe one day > I'll get around to that. This is great start. I remember pygraphviz and the networkx representation being a bit finicky last I used it. In the end, I ended up making a pygraphviz AGraph directly. Either way, if you can remove the unneeded labels and change colorization as you suggested, this is a great quick visualizations of trees. Something reportlab based that looks like biologists expect a phylogenetic tree to look would also be very useful. There is a benefit in familiarity of display. Building something generally usable like that is a longer term project. > I think I've seen that app, but I thought it was backed by AppEngine. Neat > stuff. I will learn BioSQL for my own benefit, but I don't think there's > enough time left in GSoC for me to add a useful PhyloDB adapter to > Biopython. So that, along with refactoring Nexus.Trees to use > Bio.Tree.BaseTree, would be a good project to continue with in the fall, at > a slower pace and with more discussion along the way. Yes, the AppEngine display is also BioSQL on the backend; I ported over some of the tables to the object representation used in AppEngine. 
I also have used the relational schema in work projects -- it generally is just a good place to get started. Agreed on the timelines for GSoC. We'd be very happy to have you continue that on those projects into the fall. Both are very useful additions to the great work you've already done. Brad From biopython at maubp.freeserve.co.uk Thu Aug 6 14:39:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 6 Aug 2009 15:39:33 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> <20090708130649.GY17086@sobchak.mgh.harvard.edu> <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com> Message-ID: <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com> On Tue, Aug 4, 2009 at 8:29 PM, Peter wrote: > On Thu, Jul 9, 2009 at 10:18 AM, Peter wrote: >> On Wed, Jul 8, 2009 at 2:06 PM, Brad Chapman wrote: >>> How about adding a function like "run_arguments" to the >>> commandlines that returns the commandline as a list. >> >> That would be a simple alternative to my vague idea "Maybe we >> can make the command line wrapper object more list like to make >> subprocess happy without needing to create a string?", which may >> not be possible. Either way, this will require a bit of work on the >> Bio.Application parameter objects... > > By defining an __iter__ method, we can make the Biopython > application wrapper object sufficiently list-like that it can be > passed directly to subprocess. I think I have something working > (only tested on Linux so far), at least for the case where none > of the arguments have spaces or quotes in them. The current Bio.Application code works around generating command line strings, and works fine cross platform. Making the Bio.Application objects "list like" and getting this to work cross platform isn't looking easy. Spaces on Windows are causing me big headaches. Switching to lists of arguments appears to work fine on Unix (specifically tested on Linux and Mac OS X), but things are more complicated Windows. Basically using an array/list of arguments is normal on Unix, but on Windows things get passed as strings. The upshot is different Windows tools (or libraries used to compile them) have to parse their command line string themselves, so different tools do it differently. The result is you *may* need to adopt different spaces/quotes escaping for different command line tools on Windows. Now, if you give subprocess a list, on Windows it must first be turned into a string, before subprocess can use the Windows API to run it. The subprocess function list2cmdline does this, but the conventions it follows are not universal. I have examples of working command line strings for ClustalW and PRANK where both the executable and some of the arguments have spaces in them. It seems the quoting I was using to make ClustalW (or PRANK) happy cannot be achieved via subprocess.list2cmdline (and I suspect this applies to other tools too). I will try and look into this further. 
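For reference, this is the kind of list-like wrapper I mean by "defining
__iter__" (a minimal sketch with a made-up class, not the real
Bio.Application code, which stores its parameters rather differently):

import subprocess

class _ExampleCommandline(object):
    # Hypothetical stand-in for a command line wrapper: just enough
    # to show the __iter__ trick.
    def __init__(self, program, **params):
        self.program_name = program
        self.parameters = params
    def __iter__(self):
        yield self.program_name
        for key, value in self.parameters.items():
            yield "-%s=%s" % (key, value)

cline = _ExampleCommandline("echo", d="example.fasta")
assert list(cline) == ["echo", "-d=example.fasta"]
# On Unix, subprocess simply calls list() on anything that isn't a string,
# so the wrapper can be passed straight to Popen with no quoting step:
child = subprocess.Popen(cline, stdout=subprocess.PIPE)
print child.communicate()[0]

As noted above, that behaves itself on Linux and Mac OS X; the trouble starts
when subprocess has to rebuild a single command string on Windows.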
However, even if it is possible, I don't think we can implement the list approach in time for Biopython 1.51, as there are just too many potential pitfalls. I have in the meantime extended the command line tool unit tests somewhat to include more examples with spaces in the filenames [I'm beginning to think replacing Bio.Application.generic_run with a simpler helper function would be easier in the short term, continuing to just using a string with subprocess, but haven't given up yet.] Peter From biopython at maubp.freeserve.co.uk Thu Aug 6 15:48:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 6 Aug 2009 16:48:12 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> <20090708130649.GY17086@sobchak.mgh.harvard.edu> <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com> <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com> Message-ID: <320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com> On Thu, Aug 6, 2009 at 3:39 PM, Peter wrote: > Now, if you give subprocess a list, on Windows it must first be turned > into a string, before subprocess can use the Windows API to run it. > The subprocess function list2cmdline does this, but the conventions it > follows are not universal. > > I have examples of working command line strings for ClustalW and PRANK > where both the executable and some of the arguments have spaces in > them. It seems the quoting I was using to make ClustalW (or PRANK) > happy cannot be achieved via subprocess.list2cmdline (and I suspect > this applies to other tools too). e.g. This is a valid and working command line for PRANK, which works both at the command line, or in Python via subprocess when given as a string: C:\repository\biopython\Tests>"C:\Program Files\prank.exe" -d=Quality/example.fasta -o="temp with space" -f=11 -convert Now, breaking up the arguments according to the description given in the subprocess.list2cmdline docstring, I think the arguments are: "C:\Program Files\prank.exe" -d=Quality/example.fasta -o="temp with space" -f=11 -convert Of these, the middle guy causes problems. By my reading of the subprocess.list2cmdline docstring this is valid: >> 2) A string surrounded by double quotation marks is >> interpreted as a single argument, regardless of white >> space or pipe characters contained within. A quoted >> string can be embedded in an argument. The example -o="temp with space" is a string surrounded by double quotes, "temp with space", embedded in an argument. Unfortunately, giving these five strings to subprocess.list2cmdline results in a mess as it never checks to see if the arguments are already quoted (as we have done for the program name and also the output filename base). We can pass the program name in without the quotes, and list2cmdline will do the right thing. But there is no way for the -o argument to be handled that I can see. This may be a bug in subprocess.list2cmdline, but it is certainly a real limitation in my opinion. 
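For anyone who wants to reproduce this without a Windows box, the escaping
itself is easy to see (the output shown is my reading of the list2cmdline
source, so treat it as approximate):

from subprocess import list2cmdline

args = [r'C:\Program Files\prank.exe',
        '-d=Quality/example.fasta',
        '-o="temp with space"',
        '-f=11',
        '-convert']
print list2cmdline(args)
# Expected something like:
#   "C:\Program Files\prank.exe" -d=Quality/example.fasta "-o=\"temp with space\"" -f=11 -convert
# i.e. the already-quoted -o argument gets wrapped in a second set of quotes
# with the inner ones escaped, which PRANK does not understand.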
So, it would appear that (on Windows) making our command line wrappers act like lists (by defining __iter__) will not work in general. The other approach which would allow our command line wrappers to be passed directly to subprocess is to make them more string like - but the subprocess code checks for string command lines using isinstance(args, types.StringTypes) which means we would have to subclass str (or unicode). I'm not sure if this can be made to work yet... Peter From biopython at maubp.freeserve.co.uk Thu Aug 6 16:05:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 6 Aug 2009 17:05:24 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> <20090708130649.GY17086@sobchak.mgh.harvard.edu> <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com> <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com> <320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com> Message-ID: <320fb6e00908060905i4a326327t504385ec55b0230c@mail.gmail.com> On Thu, Aug 6, 2009 at 4:48 PM, Peter wrote: > The other approach which would allow our command line wrappers > to be passed directly to subprocess is to make them more string > like - but the subprocess code checks for string command lines > using isinstance(args, types.StringTypes) which means we would > have to subclass str (or unicode). I'm not sure if this can be made > to work yet... Thinking about it a bit more, str and unicode are immutable objects, but we want the command line wrapper to be mutable (e.g. to add, change or remove parameters and arguments). So it won't work. Going back to my the original email, we could replace Bio.Application.generic_run instead: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006344.html > > Possible helper functions that come to mind are: > (a) Returns the return code (integer) only. This would basically > be a cross-platform version of os.system using the subprocess > module internally. > (b) Returns the return code (integer) plus the stdout and stderr > (which would have to be StringIO handles, with the data in > memory). This would be a direct replacement for the current > Bio.Application.generic_run function. > (c) Returns the stdout (and stderr) handles. This basically is > recreating a deprecated Python popen*() function, which seems > silly. Or we just declare both Bio.Application.generic_run and ApplicationResult obsolete, and simply recommend using subprocess with str(cline) as before. Would someone like to proof read (and test) the tutorial in CVS where I switched all the generic_run usage to subprocess? Peter From biopython at maubp.freeserve.co.uk Sat Aug 8 11:14:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 8 Aug 2009 12:14:18 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? 
In-Reply-To: <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <20090728220943.GJ68751@sobchak.mgh.harvard.edu> <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com> Message-ID: <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com> On Wed, Jul 29, 2009 at 8:43 AM, Peter wrote: > On Tue, Jul 28, 2009 at 11:09 PM, Brad Chapman wrote: >> Extending this to AlignIO and TreeIO as Eric suggested is >> also great. > > Whatever we do for Bio.SeqIO, we can follow the same pattern > for Bio.AlignIO etc. > >> So +1 from me, >> Brad > > And we basically had a +0 from Michiel, and a +1 from Eric. > And I like the idea but am not convinced we need it. Maybe > we should put the suggestion forward on the main discussion > list for debate? I've stuck a branch up on github which (thus far) simply defines the Bio.SeqIO.convert and Bio.AlignIO.convert functions. Adding optimised code can come later. http://github.com/peterjc/biopython/commits/convert Right now (based on the other thread), I've experimented with making the convert functions accept either handles or filenames. This will make the convert function even more of a convenience wrapper, in addition to its role as a standardised API to allow file format specific optimisations. Taking handles and/or filenames does rather complicate things, and not just for remembering to close the handles. There are issues like should we silently replace any existing output file (I went for yes), and should the output file be deleted if the conversion fails part way (I went for no)? Dealing with just handles would free us from all these considerations. You could even consider using Python's temporary file support to write the file to a temp location, and only at the end move it to the desired location. However that is getting far too complicated for my liking (and may runs into permissions issues on Unix). If anyone wants to do this, they can do it explicitly in the calling script. How does this look so far? Peter From biopython at maubp.freeserve.co.uk Sat Aug 8 19:41:20 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 8 Aug 2009 20:41:20 +0100 Subject: [Biopython-dev] Unit tests for deprecated modules? In-Reply-To: <320fb6e00808190704p4d19eb27if2927466a27f9b2a@mail.gmail.com> References: <320fb6e00808190352sd6437e0qb2898e39b15287b3@mail.gmail.com> <48AACE23.3050107@biologie.uni-kl.de> <320fb6e00808190704p4d19eb27if2927466a27f9b2a@mail.gmail.com> Message-ID: <320fb6e00908081241r1b23498du43fe19a6cc349c97@mail.gmail.com> Last year we talked about what to do with the unit tests for deprecated modules, http://lists.open-bio.org/pipermail/biopython-dev/2008-August/004137.html On Tue, Aug 19, 2008, Peter wrote: > Are there any strong views about when to remove unit tests for > deprecated modules? I can see two main approaches: > > (a) Remove the unit test when the code is deprecated, as this avoids > warning messages from the test suite. > (b) Remove the unit test only when the deprecated code is actually > removed, as continuing to test the code will catch any unexpected > breakage of the deprecated code. > > I lean towards (b), but wondered what other people think. > > Peter On Tue, Aug 19, 2008, Michiel de Hoon wrote: > I would say (a). In my opinion, deprecated means that the module > is in essence no longer part of Biopython; we just keep it around > to give people time to change. 
Also, deprecation warnings distract > from real warnings and errors in the unit tests, are likely to confuse > users, and give the impression that Biopython is not clean. I don't > remember a case where we had to resurrect a deprecated module, > so we may as well remove the unit test right away. > > --Michiel On Tue, Aug 19, 2008, Frank Kauff wrote: > I favor option a. Deprecated modules are no longer under development, > so there's not much need for a unit test. A failed test would probably > not trigger any action anyway, because nobody's going to do much > bugfixing in deprecated modules. > > Frank So, what we agreed last year was to remove tests for deprecated modules. This issue has come up again with the deprecation of Bio.Fasta, and the question of what to do with test_Fasta.py I'd like to suggest a third option: Keep the tests for deprecated modules, but silence the deprecation warning. e.g. make can test_Fasta.py silence the Bio.Fasta deprecation warning. Hiding the warning would prevent the likely user confusion on running the test suite (an issue Michiel pointed out last year). Keeping the test will prevent us accidentally breaking Bio.Fasta during the phasing out period. Any thoughts? Peter From biopython at maubp.freeserve.co.uk Sat Aug 8 19:50:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 8 Aug 2009 20:50:47 +0100 Subject: [Biopython-dev] Unit tests for deprecated modules? In-Reply-To: <320fb6e00908081241r1b23498du43fe19a6cc349c97@mail.gmail.com> References: <320fb6e00808190352sd6437e0qb2898e39b15287b3@mail.gmail.com> <48AACE23.3050107@biologie.uni-kl.de> <320fb6e00808190704p4d19eb27if2927466a27f9b2a@mail.gmail.com> <320fb6e00908081241r1b23498du43fe19a6cc349c97@mail.gmail.com> Message-ID: <320fb6e00908081250j189ba590o5cd9c6e98f596193@mail.gmail.com> On Sat, Aug 8, 2009 at 8:41 PM, Peter wrote: > Last year we talked about what to do with the unit tests for deprecated modules, > http://lists.open-bio.org/pipermail/biopython-dev/2008-August/004137.html > ... > I'd like to suggest a third option: Keep the tests for deprecated > modules, but silence the deprecation warning. e.g. make > test_Fasta.py silence the Bio.Fasta deprecation warning. I've done that in CVS as a proof of principle, replacing: from Bio import Fasta with: import warnings warnings.filterwarnings("ignore", category=DeprecationWarning) from Bio import Fasta warnings.resetwarnings() There may be a more elegant way to do this, but it works. Peter From bugzilla-daemon at portal.open-bio.org Mon Aug 10 13:43:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 10 Aug 2009 09:43:15 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200908101343.n7ADhF4c020240@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1303 is|0 |1 obsolete| | ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-10 09:43 EST ------- (From update of attachment 1303) This file is already a tiny bit out of date - I've started working on this on a git branch. 
http://github.com/peterjc/biopython/commits/sff See also James Casbon's parser, also on github: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006456.html http://github.com/jamescasbon/biopython/tree/sff It looks like we could try and merge the two. James' code looks like it doesn't need seek/tell, which means it should work on any input handle (not just an open file). Note neither parser yet copes with paired end data (and I have not yet found any test files to work on). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Aug 10 16:46:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 17:46:16 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <20090728220943.GJ68751@sobchak.mgh.harvard.edu> <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com> <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com> Message-ID: <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com> On Sat, Aug 8, 2009 at 12:14 PM, Peter wrote: > I've stuck a branch up on github which (thus far) simply defines > the Bio.SeqIO.convert and Bio.AlignIO.convert functions. > Adding optimised code can come later. > > http://github.com/peterjc/biopython/commits/convert There is now a new file Bio/SeqIO/_convert.py on this branch, and a few optimised conversions have been done. In particular GenBank/EMBL to FASTA, any FASTQ to FASTA, and inter-conversion between any of the three FASTQ formats. In terms of speed, this new code takes under a minute to convert a 7 million short read FASTQ file to another FASTQ variant, or to a (line wrapped) FASTA file. In comparison, using Bio.SeqIO parse/write takes over five minutes. In terms of code organisation within Bio/SeqIO/_convert.py I am (as with Bio.SeqIO etc for parsing and writing) just using a dictionary of functions, keyed on the format names. Initially, as you can tell from the code history, I was thinking about having each sub-function potentially dealing with more than one conversion (e.g. GenBank to anything not needing features), but have removed this level of complication in the most recent commit. The current Bio/SeqIO/_convert.py file actually looks very long and complicated - but if you ignore the doctests (which I would probably more to a dedicated unit test), it isn't that much code at all. Would anyone like to try this out? Peter From eric.talevich at gmail.com Mon Aug 10 17:44:31 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 10 Aug 2009 13:44:31 -0400 Subject: [Biopython-dev] GSoC Weekly Update 12: PhyloXML for Biopython Message-ID: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com> Hi folks, Previously I (Aug. 3-7): - Refactored the PhyloXML parser somewhat, to behave more like the other Biopython parsers and also handle 'other' elements better - Reorganized Bio.Tree a bit, generalizing the Tree base class and improving BaseTree-PhyloXML interoperability - Worked on networkx export and graphviz display - Added some more tests (thanks, Diana!) - Added TreeIO.convert(), to match the AlignIO and SeqIO modules Next week (Aug. 
10-14) I will: - Update the wiki documentation - Fix any surprises that come up during testing Automated testing: - Check unit tests for complete coverage - Re-run performance benchmarks - Run tests and benchmarks on alternate platforms - Check epydoc's generated API documentation Remarks: - Performance of the I/O functions is close to what it was before, in the best of times; parsing Taxonomy nodes incrementally seems to have helped. - Drawing trees with Graphviz is still ugly. Hopefully I can fix it this week, but if not, I'll probably do it after GSoC because I like pretty things. - Presumably, any discussion of merging with Biopython will have to wait until after the biopython-1.51 release. I'll be around. For GSoC requirements, I'm planning on just dumping the Bio.Tree and Bio.TreeIO modules along with the unit test suite as standalone files, rather than as a patch set since the last upstream revision I pulled was just a random untagged one around the time of the last beta release. Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From matzke at berkeley.edu Mon Aug 10 20:23:15 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 10 Aug 2009 13:23:15 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <20090804222731.GA12604@sobchak.mgh.harvard.edu> References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> Message-ID: <4A8081B3.2080600@berkeley.edu> Hi all...updates... Summary: Major focus is getting the GBIF access/search/parse module into "done"/submittable shape. This primarily requires getting the documentation and testing up to biopython specs. I have a fair bit of documentation and testing, need advice (see below) for specifics on what it should look like. Brad Chapman wrote: > Hi Nick; > Thanks for the update -- great to see things moving along. > >> - removed any reliance on lagrange tree module, refactored all phylogeny >> code to use the revised Bio.Nexus.Tree module > > Awesome -- glad this worked for you. Are the lagrange_* files in > Bio.Geography still necessary? If not, we should remove them from > the repository to clean things up. Ah, they had been deleted locally but it took an extra command to delete on git. Done. > > More generally, it would be really helpful if we could do a bit of > housekeeping on the repository. The Geography namespace has a lot of > things in it which belong in different parts of the tree: > > - The test code should move to the 'Tests' directory as a set of > test_Geography* files that we can use for unit testing the code. OK, I will do this. Should I try and figure out the unittest stuff? I could use a simple example of what this is supposed to look like. > - Similarly there are a lot of data files in there which are > appear to be test related; these could move to Tests/Geography Will do. > - What is happening with the Nodes_v2 and Treesv2 files? They look > like duplicates of the Nexus Nodes and Trees with some changes. 
> Could we roll those changes into the main Nexus code to avoid > duplication? Yeah, these were just copies with your bug fix, and with a few mods I used to track crashes. Presumably I don't need these with after a fresh download of biopython. >> - Code dealing with GBIF xml output completely refactored into the >> following classes: >> >> * ObsRecs (observation records & search results/summary) >> * ObsRec (an individual observation record) >> * XmlString (functions for cleaning xml returned by Gbif) >> * GbifXml (extention of capabilities for ElementTree xml trees, parsed >> from GBIF xml returns. > > I'm agreed with Hilmar -- the user classes would probably benefit from expanded > naming. There is a art to naming to get them somewhere between the hideous > RidicuouslyLongNamesWithEverythingSpecified names and short truncated names. > Specifically, you've got a lot of filler in the names -- dbfUtils, > geogUtils, shpUtils. The Utils probably doesn't tell the user much > and makes all of the names sort of blend together, just as the Rec/Recs > pluralization hides a quite large difference in what the classes hold. Will work on this, these should be made part of the GbifObservationRecord() object or be accessed by it, basically they only exist to classify lat/long points into user-specified areas. > Something like Observation and ObservationSearchResult would make it > clear immediately what they do and the information they hold. Agreed, here is a new scheme for the names (changes already made): ============= class GbifSearchResults(): GbifSearchResults is a class for holding a series of GbifObservationRecord records, and processing them e.g. into classified areas. Also can hold a GbifDarwincoreXmlString record (the raw output returned from a GBIF search) and a GbifXmlTree (a class for holding/processing the ElementTree object returned by parsing the GbifDarwincoreXmlString). class GbifObservationRecord(): GbifObservationRecord is a class for holding an individual observation at an individual lat/long point. class GbifDarwincoreXmlString(str): GbifDarwincoreXmlString is a class for holding the xmlstring returned by a GBIF search, & processing it to plain text, then an xmltree (an ElementTree). GbifDarwincoreXmlString inherits string methods from str (class String). class GbifXmlTree(): gbifxml is a class for holding and processing xmltrees of GBIF records. ============= ...description of methods below... > >> This week: > > What are your thoughts on documentation? As a naive user of these > tools without much experience with the formats, I could offer better > feedback if I had an idea of the public APIs and how they are > expected to be used. Moreover, cookbook and API documentation is something > we will definitely need to integrate into Biopython. How does this fit > in your timeline for the remaining weeks? The API is really just the interface with GBIF. I think developing a cookbook entry is pretty easy, I assume you want something like one of the entries in the official biopython cookbook? Re: API documentation...are you just talking about the function descriptions that are typically in """ """ strings beneath the function definitions? I've got that done. Again, if there is more, an example of what it should look like would be useful. Documentation for the GBIF stuff below. ============ gbif_xml.py Functions for accessing GBIF, downloading records, processing them into a class, and extracting information from the xmltree in that class. 
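To show how I expect these renamed classes to fit together, here is a rough
usage sketch (the import path and the search key are guesses on my part, and
whether the download call stores the records itself or needs an extra
extraction step is glossed over; only the method names are real, taken from
the summary further down):

from Bio.Geography import gbif_xml   # module/file name from my branch

params = {'scientificname': 'Mytilus edulis'}   # guessed GBIF search key
results = gbif_xml.GbifSearchResults()
print results.get_numhits(params)               # count hits before downloading
results.get_all_records_by_increment(params, 100)
results.print_records_to_file('mytilus_hits.txt')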
class GbifObservationRecord(Exception): pass class GbifObservationRecord(): GbifObservationRecord is a class for holding an individual observation at an individual lat/long point. __init__(self): This is an instantiation class for setting up new objects of this class. latlong_to_obj(self, line): Read in a string, read species/lat/long to GbifObservationRecord object This can be slow, e.g. 10 seconds for even just ~1000 records. parse_occurrence_element(self, element): Parse a TaxonOccurrence element, store in OccurrenceRecord fill_occ_attribute(self, element, el_tag, format='str'): Return the text found in matching element matching_el.text. find_1st_matching_subelement(self, element, el_tag, return_element): Burrow down into the XML tree, retrieve the first element with the matching tag. record_to_string(self): Print the attributes of a record to a string class GbifDarwincoreXmlString(Exception): pass class GbifDarwincoreXmlString(str): GbifDarwincoreXmlString is a class for holding the xmlstring returned by a GBIF search, & processing it to plain text, then an xmltree (an ElementTree). GbifDarwincoreXmlString inherits string methods from str (class String). __init__(self, rawstring=None): This is an instantiation class for setting up new objects of this class. fix_ASCII_lines(self, endline=''): Convert each line in an input string into pure ASCII (This avoids crashes when printing to screen, etc.) _fix_ASCII_line(self, line): Convert a single string line into pure ASCII (This avoids crashes when printing to screen, etc.) _unescape(self, text): # Removes HTML or XML character references and entities from a text string. @param text The HTML (or XML) source text. @return The plain text, as a Unicode string, if necessary. source: http://effbot.org/zone/re-sub.htm#unescape-html _fix_ampersand(self, line): Replaces "&" with "&" in a string; this is otherwise not caught by the unescape and unicodedata.normalize functions. class GbifXmlTreeError(Exception): pass class GbifXmlTree(): gbifxml is a class for holding and processing xmltrees of GBIF records. __init__(self, xmltree=None): This is an instantiation class for setting up new objects of this class. print_xmltree(self): Prints all the elements & subelements of the xmltree to screen (may require fix_ASCII to input file to succeed) print_subelements(self, element): Takes an element from an XML tree and prints the subelements tag & text, and the within-tag items (key/value or whatnot) _element_items_to_dictionary(self, element_items): If the XML tree element has items encoded in the tag, e.g. key/value or whatever, this function puts them in a python dictionary and returns them. extract_latlongs(self, element): Create a temporary pseudofile, extract lat longs to it, return results as string. Inspired by: http://www.skymind.com/~ocrow/python_string/ (Method 5: Write to a pseudo file) _extract_latlong_datum(self, element, file_str): Searches an element in an XML tree for lat/long information, and the complete name. Searches recursively, if there are subelements. file_str is a string created by StringIO in extract_latlongs() (i.e., a temp filestr) extract_all_matching_elements(self, start_element, el_to_match): Returns a list of the elements, picking elements by TaxonOccurrence; this should return a list of elements equal to the number of hits. _recursive_el_match(self, element, el_to_match, output_list): Search recursively through xmltree, starting with element, recording all instances of el_to_match. 
find_to_elements_w_ancs(self, el_tag, anc_el_tag): Burrow into XML to get an element with tag el_tag, return only those el_tags underneath a particular parent element parent_el_tag xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag, match_el_list): Recursively burrows down to find whatever elements with el_tag exist inside a parent_el_tag. create_sub_xmltree(self, element): Create a subset xmltree (to avoid going back to irrelevant parents) _xml_burrow_up(self, element, anc_el_tag, found_anc): Burrow up xml to find anc_el_tag _xml_burrow_up_cousin(element, cousin_el_tag, found_cousin): Burrow up from element of interest, until a cousin is found with cousin_el_tag _return_parent_in_xmltree(self, child_to_search_for): Search through an xmltree to get the parent of child_to_search_for _return_parent_in_element(self, potential_parent, child_to_search_for, returned_parent): Search through an XML element to return parent of child_to_search_for find_1st_matching_element(self, element, el_tag, return_element): Burrow down into the XML tree, retrieve the first element with the matching tag extract_numhits(self, element): Search an element of a parsed XML string and find the number of hits, if it exists. Recursively searches, if there are subelements. class GbifSearchResults(Exception): pass class GbifSearchResults(): GbifSearchResults is a class for holding a series of GbifObservationRecord records, and processing them e.g. into classified areas. __init__(self, gbif_recs_xmltree=None): This is an instantiation class for setting up new objects of this class. print_records(self): Print all records in tab-delimited format to screen. print_records_to_file(self, fn): Print the attributes of a record to a file with filename fn latlongs_to_obj(self): Takes the string from extract_latlongs, puts each line into a GbifObservationRecord object. Return a list of the objects Functions devoted to accessing/downloading GBIF records access_gbif(self, url, params): Helper function to access various GBIF services choose the URL ("url") from here: http://data.gbif.org/ws/rest/occurrence params are a dictionary of key/value pairs "self._open" is from Bio.Entrez.self._open, online here: http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#self._open Get the handle of results (looks like e.g.: > ) (open with results_handle.read() ) _get_hits(self, params): Get the actual hits that are be returned by a given search (this allows parsing & gradual downloading of searches larger than e.g. 1000 records) It will return the LAST non-none instance (in a standard search result there should be only one, anyway). get_xml_hits(self, params): Returns hits like _get_hits, but returns a parsed XML tree. get_record(self, key): Given the key, get a single record, return xmltree for it. get_numhits(self, params): Get the number of hits that will be returned by a given search (this allows parsing & gradual downloading of searches larger than e.g. 1000 records) It will return the LAST non-none instance (in a standard search result there should be only one, anyway). xmlstring_to_xmltree(self, xmlstring): Take the text string returned by GBIF and parse to an XML tree using ElementTree. Requires the intermediate step of saving to a temporary file (required to make ElementTree.parse work, apparently) tempfn = 'tempxml.xml' fh = open(tempfn, 'w') fh.write(xmlstring) fh.close() get_all_records_by_increment(self, params, inc): Download all of the records in stages, store in list of elements. Increments of e.g. 
100 to not overload server extract_occurrences_from_gbif_xmltree_list(self, gbif_xmltree): Extract all of the 'TaxonOccurrence' elements to a list, store them in a GbifObservationRecord. _paramsdict_to_string(self, params): Converts the python dictionary of search parameters into a text string for submission to GBIF _open(self, cgi, params={}): Function for accessing online databases. Modified from: http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html Helper function to build the URL and open a handle to it (PRIVATE). Open a handle to GBIF. cgi is the URL for the cgi script to access. params is a dictionary with the options to pass to it. Does some simple error checking, and will raise an IOError if it encounters one. This function also enforces the "three second rule" to avoid abusing the GBIF servers (modified after NCBI requirement). ============ > > Thanks again. Hope this helps, > Brad Very much, thanks!! Nick -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From matzke at berkeley.edu Mon Aug 10 20:25:10 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 10 Aug 2009 13:25:10 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <4A8081B3.2080600@berkeley.edu> References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> Message-ID: <4A808226.5020302@berkeley.edu> PS: Evidence of interest in this GBIF functionality already, see fwd below... PPS: Commit with updates names, deleted old files here: http://github.com/nmatzke/biopython/commits/Geography -------- Original Message -------- Subject: Re: biogeopython Date: Fri, 07 Aug 2009 16:34:26 -0700 From: Nick Matzke Reply-To: matzke at berkeley.edu Organization: Dept. Integ. 
Biology, UC Berkeley To: James Pringle References: <4A7C6DEE.1000305 at berkeley.edu> Coolness, let me know how it works for you, feedback appreciated at this stage. Cheers! Nick James Pringle wrote: > Thanks! > Jamie > > On Fri, Aug 7, 2009 at 2:09 PM, Nick Matzke > wrote: > > Hi Jamie! > > It's still under development, eventually it will be a biopython > module, but what I've got should do exactly what you need. > > Just take the files from the most recent commit here: > http://github.com/nmatzke/biopython/commits/Geography > > ...and run test_gbif_xml.py to get the idea, it will search on a > taxon name, count/download all hits, parse the xml to a set of > record objects, output each record to screen or tab-delimited file, > etc. > > Cheers! > Nick > > > > > > James Pringle wrote: > > Dear Mr. Matzke-- > > I am an oceanographer at the University of New Hampshire, and > with my colleagues John Wares and Jeb Byers am looking at the > interaction of ocean circulation and species ranges. As part > of that effort, I am using GBIF data, and was looking at your > Summer-of-Code project. I want to start from a species name > and get lat/long of occurance data. Is you toolbox in usable > shape (I am an ok pythonista)? What is the best way to download > a tested version of it (I can figure out how to get code from > CVS/GIT, etc, so I am just looking for a pointer to a stable-ish > tree)? > > Cheers, > & Thanks > Jamie Pringle > > > -- > ==================================================== > Nicholas J. Matzke > Ph.D. Candidate, Graduate Student Researcher > Huelsenbeck Lab > Center for Theoretical Evolutionary Genomics > 4151 VLSB (Valley Life Sciences Building) > Department of Integrative Biology > University of California, Berkeley > > Lab websites: > http://ib.berkeley.edu/people/lab_detail.php?lab=54 > http://fisher.berkeley.edu/cteg/hlab.html > Dept. personal page: > http://ib.berkeley.edu/people/students/person_detail.php?person=370 > Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html > Lab phone: 510-643-6299 > Dept. fax: 510-643-6264 > Cell phone: 510-301-0179 > Email: matzke at berkeley.edu > > Mailing address: > Department of Integrative Biology > 3060 VLSB #3140 > Berkeley, CA 94720-3140 > > ----------------------------------------------------- > "[W]hen people thought the earth was flat, they were wrong. When > people thought the earth was spherical, they were wrong. But if you > think that thinking the earth is spherical is just as wrong as > thinking the earth is flat, then your view is wronger than both of > them put together." > > Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical > Inquirer, 14(1), 35-44. Fall 1989. > http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm > ==================================================== > > Nick Matzke wrote: > Hi all...updates... > > Summary: Major focus is getting the GBIF access/search/parse module into > "done"/submittable shape. This primarily requires getting the > documentation and testing up to biopython specs. I have a fair bit of > documentation and testing, need advice (see below) for specifics on what > it should look like. > > > Brad Chapman wrote: >> Hi Nick; >> Thanks for the update -- great to see things moving along. >> >>> - removed any reliance on lagrange tree module, refactored all >>> phylogeny code to use the revised Bio.Nexus.Tree module >> >> Awesome -- glad this worked for you. Are the lagrange_* files in >> Bio.Geography still necessary? 
If not, we should remove them from >> the repository to clean things up. > > > Ah, they had been deleted locally but it took an extra command to delete > on git. Done. > >> >> More generally, it would be really helpful if we could do a bit of >> housekeeping on the repository. The Geography namespace has a lot of >> things in it which belong in different parts of the tree: >> >> - The test code should move to the 'Tests' directory as a set of >> test_Geography* files that we can use for unit testing the code. > > OK, I will do this. Should I try and figure out the unittest stuff? I > could use a simple example of what this is supposed to look like. > > >> - Similarly there are a lot of data files in there which >> appear to be test related; these could move to Tests/Geography > > Will do. > >> - What is happening with the Nodes_v2 and Treesv2 files? They look >> like duplicates of the Nexus Nodes and Trees with some changes. >> Could we roll those changes into the main Nexus code to avoid >> duplication? > > Yeah, these were just copies with your bug fix, and with a few mods I > used to track crashes. Presumably I don't need these after a fresh > download of biopython. > > > >>> - Code dealing with GBIF xml output completely refactored into the >>> following classes: >>> >>> * ObsRecs (observation records & search results/summary) >>> * ObsRec (an individual observation record) >>> * XmlString (functions for cleaning xml returned by Gbif) >>> * GbifXml (extension of capabilities for ElementTree xml trees, >>> parsed from GBIF xml returns). >> >> I'm agreed with Hilmar -- the user classes would probably benefit from >> expanded >> naming. There is an art to naming to get them somewhere between the >> hideous RidiculouslyLongNamesWithEverythingSpecified names and short >> truncated names. >> Specifically, you've got a lot of filler in the names -- dbfUtils, >> geogUtils, shpUtils. The Utils probably doesn't tell the user much >> and makes all of the names sort of blend together, just as the >> Rec/Recs pluralization hides a quite large difference in what the >> classes hold. > > Will work on this, these should be made part of the > GbifObservationRecord() object or be accessed by it, basically they only > exist to classify lat/long points into user-specified areas. > >> Something like Observation and ObservationSearchResult would make it >> clear immediately what they do and the information they hold. > > > Agreed, here is a new scheme for the names (changes already made): > > ============= > class GbifSearchResults(): > > GbifSearchResults is a class for holding a series of > GbifObservationRecord records, and processing them e.g. into classified > areas. > > Also can hold a GbifDarwincoreXmlString record (the raw output returned > from a GBIF search) and a GbifXmlTree (a class for holding/processing > the ElementTree object returned by parsing the GbifDarwincoreXmlString). > > > > class GbifObservationRecord(): > > GbifObservationRecord is a class for holding an individual observation > at an individual lat/long point. > > > > class GbifDarwincoreXmlString(str): > > GbifDarwincoreXmlString is a class for holding the xmlstring returned by > a GBIF search, & processing it to plain text, then an xmltree (an > ElementTree). > > GbifDarwincoreXmlString inherits string methods from str (class String). > > > > class GbifXmlTree(): > gbifxml is a class for holding and processing xmltrees of GBIF records. > ============= > > ...description of methods below...
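To make the new scheme concrete, a rough usage sketch follows; the import path, search parameters and exact call sequence are assumptions pieced together from the method listing below, not a final API:

from gbif_xml import GbifSearchResults   # hypothetical import path; the module layout was still being renamed

# Hypothetical GBIF search parameters (a plain dictionary of key/value pairs)
params = {"scientificname": "Genista monspessulana", "maxresults": 100}

results = GbifSearchResults()
# Download the hits in increments (e.g. 100 at a time) so the server is not overloaded
results.get_all_records_by_increment(params, 100)
# Each TaxonOccurrence hit becomes a GbifObservationRecord; dump them to screen
# or to a tab-delimited file
results.print_records()
results.print_records_to_file("observations.txt")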
> > >> >>> This week: >> >> What are your thoughts on documentation? As a naive user of these >> tools without much experience with the formats, I could offer better >> feedback if I had an idea of the public APIs and how they are >> expected to be used. Moreover, cookbook and API documentation is >> something we will definitely need to integrate into Biopython. How >> does this fit in your timeline for the remaining weeks? > > The API is really just the interface with GBIF. I think developing a > cookbook entry is pretty easy, I assume you want something like one of > the entries in the official biopython cookbook? > > Re: API documentation...are you just talking about the function > descriptions that are typically in """ """ strings beneath the function > definitions? I've got that done. Again, if there is more, an example > of what it should look like would be useful. > > Documentation for the GBIF stuff below. > > ============ > gbif_xml.py > Functions for accessing GBIF, downloading records, processing them into > a class, and extracting information from the xmltree in that class. > > > class GbifObservationRecord(Exception): pass > class GbifObservationRecord(): > GbifObservationRecord is a class for holding an individual observation > at an individual lat/long point. > > > __init__(self): > > This is an instantiation class for setting up new objects of this class. > > > > latlong_to_obj(self, line): > > Read in a string, read species/lat/long to GbifObservationRecord object > This can be slow, e.g. 10 seconds for even just ~1000 records. > > > parse_occurrence_element(self, element): > > Parse a TaxonOccurrence element, store in OccurrenceRecord > > > fill_occ_attribute(self, element, el_tag, format='str'): > > Return the text found in matching element matching_el.text. > > > > find_1st_matching_subelement(self, element, el_tag, return_element): > > Burrow down into the XML tree, retrieve the first element with the > matching tag. > > > record_to_string(self): > > Print the attributes of a record to a string > > > > > > > > class GbifDarwincoreXmlString(Exception): pass > > class GbifDarwincoreXmlString(str): > GbifDarwincoreXmlString is a class for holding the xmlstring returned by > a GBIF search, & processing it to plain text, then an xmltree (an > ElementTree). > > GbifDarwincoreXmlString inherits string methods from str (class String). > > > > __init__(self, rawstring=None): > > This is an instantiation class for setting up new objects of this class. > > > > fix_ASCII_lines(self, endline=''): > > Convert each line in an input string into pure ASCII > (This avoids crashes when printing to screen, etc.) > > > _fix_ASCII_line(self, line): > > Convert a single string line into pure ASCII > (This avoids crashes when printing to screen, etc.) > > > _unescape(self, text): > > # > Removes HTML or XML character references and entities from a text string. > > @param text The HTML (or XML) source text. > @return The plain text, as a Unicode string, if necessary. > source: http://effbot.org/zone/re-sub.htm#unescape-html > > > _fix_ampersand(self, line): > > Replaces "&" with "&" in a string; this is otherwise > not caught by the unescape and unicodedata.normalize functions. > > > > > > > > class GbifXmlTreeError(Exception): pass > class GbifXmlTree(): > gbifxml is a class for holding and processing xmltrees of GBIF records. > > __init__(self, xmltree=None): > > This is an instantiation class for setting up new objects of this class. 
> > > print_xmltree(self): > > Prints all the elements & subelements of the xmltree to screen (may require > fix_ASCII to input file to succeed) > > > print_subelements(self, element): > > Takes an element from an XML tree and prints the subelements tag & text, > and > the within-tag items (key/value or whatnot) > > > _element_items_to_dictionary(self, element_items): > > If the XML tree element has items encoded in the tag, e.g. key/value or > whatever, this function puts them in a python dictionary and returns > them. > > > extract_latlongs(self, element): > > Create a temporary pseudofile, extract lat longs to it, > return results as string. > > Inspired by: http://www.skymind.com/~ocrow/python_string/ > (Method 5: Write to a pseudo file) > > > > > _extract_latlong_datum(self, element, file_str): > > Searches an element in an XML tree for lat/long information, and the > complete name. Searches recursively, if there are subelements. > > file_str is a string created by StringIO in extract_latlongs() (i.e., a > temp filestr) > > > > extract_all_matching_elements(self, start_element, el_to_match): > > Returns a list of the elements, picking elements by TaxonOccurrence; > this should > return a list of elements equal to the number of hits. > > > > _recursive_el_match(self, element, el_to_match, output_list): > > Search recursively through xmltree, starting with element, recording all > instances of el_to_match. > > > find_to_elements_w_ancs(self, el_tag, anc_el_tag): > > Burrow into XML to get an element with tag el_tag, return only those > el_tags underneath a particular parent element parent_el_tag > > > xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag, > match_el_list): > > Recursively burrows down to find whatever elements with el_tag exist > inside a parent_el_tag. > > > > create_sub_xmltree(self, element): > > Create a subset xmltree (to avoid going back to irrelevant parents) > > > > _xml_burrow_up(self, element, anc_el_tag, found_anc): > > Burrow up xml to find anc_el_tag > > > > _xml_burrow_up_cousin(element, cousin_el_tag, found_cousin): > > Burrow up from element of interest, until a cousin is found with > cousin_el_tag > > > > > _return_parent_in_xmltree(self, child_to_search_for): > > Search through an xmltree to get the parent of child_to_search_for > > > > _return_parent_in_element(self, potential_parent, child_to_search_for, > returned_parent): > > Search through an XML element to return parent of child_to_search_for > > > find_1st_matching_element(self, element, el_tag, return_element): > > Burrow down into the XML tree, retrieve the first element with the > matching tag > > > > > extract_numhits(self, element): > > Search an element of a parsed XML string and find the > number of hits, if it exists. Recursively searches, > if there are subelements. > > > > > > > > > > > > > class GbifSearchResults(Exception): pass > > class GbifSearchResults(): > > GbifSearchResults is a class for holding a series of > GbifObservationRecord records, and processing them e.g. into classified > areas. > > > > __init__(self, gbif_recs_xmltree=None): > > This is an instantiation class for setting up new objects of this class. > > > > print_records(self): > > Print all records in tab-delimited format to screen. > > > > > print_records_to_file(self, fn): > > Print the attributes of a record to a file with filename fn > > > > latlongs_to_obj(self): > > Takes the string from extract_latlongs, puts each line into a > GbifObservationRecord object. 
> > Return a list of the objects > > > Functions devoted to accessing/downloading GBIF records > access_gbif(self, url, params): > > Helper function to access various GBIF services > > choose the URL ("url") from here: > http://data.gbif.org/ws/rest/occurrence > > params are a dictionary of key/value pairs > > "self._open" is from Bio.Entrez.self._open, online here: > http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#self._open > > Get the handle of results > (looks like e.g.: object at 0x48117f0>> ) > > (open with results_handle.read() ) > > > _get_hits(self, params): > > Get the actual hits that are be returned by a given search > (this allows parsing & gradual downloading of searches larger > than e.g. 1000 records) > > It will return the LAST non-none instance (in a standard search result > there > should be only one, anyway). > > > > > get_xml_hits(self, params): > > Returns hits like _get_hits, but returns a parsed XML tree. > > > > > get_record(self, key): > > Given the key, get a single record, return xmltree for it. > > > > get_numhits(self, params): > > Get the number of hits that will be returned by a given search > (this allows parsing & gradual downloading of searches larger > than e.g. 1000 records) > > It will return the LAST non-none instance (in a standard search result > there > should be only one, anyway). > > > xmlstring_to_xmltree(self, xmlstring): > > Take the text string returned by GBIF and parse to an XML tree using > ElementTree. > Requires the intermediate step of saving to a temporary file (required > to make > ElementTree.parse work, apparently) > > > > tempfn = 'tempxml.xml' > fh = open(tempfn, 'w') > fh.write(xmlstring) > fh.close() > > > > > > get_all_records_by_increment(self, params, inc): > > Download all of the records in stages, store in list of elements. > Increments of e.g. 100 to not overload server > > > > extract_occurrences_from_gbif_xmltree_list(self, gbif_xmltree): > > Extract all of the 'TaxonOccurrence' elements to a list, store them in a > GbifObservationRecord. > > > > _paramsdict_to_string(self, params): > > Converts the python dictionary of search parameters into a text > string for submission to GBIF > > > > _open(self, cgi, params={}): > > Function for accessing online databases. > > Modified from: > http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html > > Helper function to build the URL and open a handle to it (PRIVATE). > > Open a handle to GBIF. cgi is the URL for the cgi script to access. > params is a dictionary with the options to pass to it. Does some > simple error checking, and will raise an IOError if it encounters one. > > This function also enforces the "three second rule" to avoid abusing > the GBIF servers (modified after NCBI requirement). > ============ > > >> >> Thanks again. Hope this helps, >> Brad > > Very much, thanks!! > Nick > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. 
fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From biopython at maubp.freeserve.co.uk Mon Aug 10 20:49:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 21:49:29 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <4A8081B3.2080600@berkeley.edu> References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> Message-ID: <320fb6e00908101349o46162d5n4b91819895c32f8f@mail.gmail.com> On Mon, Aug 10, 2009 at 9:23 PM, Nick Matzke wrote: > Hi all...updates... > > Summary: Major focus is getting the GBIF access/search/parse module into > "done"/submittable shape. ?This primarily requires getting the documentation > and testing up to biopython specs. ?I have a fair bit of documentation and > testing, need advice (see below) for specifics on what it should look like. > >> - The test code should move to the 'Tests' directory as a set of >> ?test_Geography* files that we can use for unit testing the code. > > OK, I will do this. ?Should I try and figure out the unittest stuff? ?I > could use a simple example of what this is supposed to look like. You can either go for "unittest" based tests (generally better, but more of a learning curve - but useful for any python project), or our own Biopython specific "print and compare" tests (basically sample scripts with their expected output). Read the tests chapter in the Biopython Tutorial if you haven't already. (And if you think anything could be clearer, or you spot a typo, let us know please - feedback would be great). Peter From matzke at berkeley.edu Mon Aug 10 21:10:26 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 10 Aug 2009 14:10:26 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <320fb6e00908101349o46162d5n4b91819895c32f8f@mail.gmail.com> References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> <320fb6e00908101349o46162d5n4b91819895c32f8f@mail.gmail.com> Message-ID: <4A808CC2.6000308@berkeley.edu> Peter wrote: > On Mon, Aug 10, 2009 at 9:23 PM, Nick Matzke wrote: >> Hi all...updates... >> >> Summary: Major focus is getting the GBIF access/search/parse module into >> "done"/submittable shape. 
This primarily requires getting the documentation >> and testing up to biopython specs. I have a fair bit of documentation and >> testing, need advice (see below) for specifics on what it should look like. >> >>> - The test code should move to the 'Tests' directory as a set of >>> test_Geography* files that we can use for unit testing the code. >> OK, I will do this. Should I try and figure out the unittest stuff? I >> could use a simple example of what this is supposed to look like. > > You can either go for "unittest" based tests (generally better, but more > of a learning curve - but useful for any python project), or our own > Biopython specific "print and compare" tests (basically sample scripts > with their expected output). > > Read the tests chapter in the Biopython Tutorial if you haven't already. > (And if you think anything could be clearer, or you spot a typo, let us > know please - feedback would be great). Thanks! Nick > > Peter > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From biopython at maubp.freeserve.co.uk Tue Aug 11 12:19:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 11 Aug 2009 13:19:25 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <20090728220943.GJ68751@sobchak.mgh.harvard.edu> <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com> <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com> <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com> Message-ID: <320fb6e00908110519k313d6d34g40502fd2578326e1@mail.gmail.com> On Mon, Aug 10, 2009 at 5:46 PM, Peter wrote: > In terms of speed, this new code takes under a minute to > convert a 7 million short read FASTQ file to another FASTQ > variant, or to a (line wrapped) FASTA file. In comparison, > using Bio.SeqIO parse/write takes over five minutes. If anyone is interested in the details, here I am using a 7 million entry FASTQ file of short reads (length 36bp) from a Solexa FASTQ format file (downloaded from the NCBI and then converted from the Sanger FASTQ format). 
I'm timing conversion from Solexa to Sanger FASTQ as it is a more common operation, and I can include the MAQ script for comparison. I pipe the output via grep and word count as a check on the conversion. Using a (patched) version of MAQ's fq_all2std.pl we get about 4 mins: $ time perl ../biopython/Tests/Quality/fq_all2std.pl sol2std SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l 7047668 real 3m58.978s user 4m13.475s sys 0m3.705s And using a patched version of EMBOSS 6.1.0 (without the optimisations Peter Rice has mentioned), we get 3m42s. $ time seqret -filter -sformat fastq-solexa -osformat fastq-sanger < SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l 7047668 real 3m41.625s user 3m56.753s sys 0m4.091s Using the latest Biopython in CVS (or the git master branch), with Bio.SeqIO.parse/write, takes about twice this, 7m11s: $ time python biopython_solexa2sanger.py < SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l 7047668 real 7m10.706s user 7m27.597s sys 0m3.850s This is at least a marked improvement over Biopython 1.51b with Bio.SeqIO.parse/write, which took about 17 minutes! The bad news is while the Bio.SeqIO FASTQ read/write in CVS is faster than in Biopython 1.51b, it is also much less elegant. I'm think once I've finished adding test cases (and probably after 1.51 is out) it might be worth while trying to make it more beautiful without sacrificing too much of the speed gain. Now to the good news, using my github branch with the convert function we get a massive reduction to under a minute (52s): $ time python convert_solexa2sanger.py < SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l 7047668 real 0m51.618s user 1m7.735s sys 0m3.162s We have a winner! Assuming of course there are no mistakes ;) In fact, these measurements are a little misleading because I am including grep (to check the record count) and the output isn't actually going to disk. Doing the grep on its own takes about 15s: $ time grep "^@SRR" SRR001666_1.fastq_solexa | wc -l 7047668 real 0m15.318s user 0m17.890s sys 0m1.087s However, if you actually output to a file the disk speed itself becomes important when the conversion is this fast: $ time python convert_solexa2sanger.py < SRR001666_1.fastq_solexa > temp.fastq real 1m3.448s user 0m49.672s sys 0m4.826s $ time seqret -filter -sformat fastq-solexa -osformat fastq-sanger < SRR001666_1.fastq_solexa > temp.fastq real 3m55.086s user 3m39.548s sys 0m5.998s $ time perl ../biopython/Tests/Quality/fq_all2std.pl sol2std SRR001666_1.fastq_solexa > temp.fastq real 4m10.245s user 3m54.880s sys 0m5.085s $ time python ../biopython/Tests/Quality/biopython_solexa2sanger.py < SRR001666_1.fastq_solexa > temp.fastq real 7m27.879s user 7m9.084s sys 0m6.008s Nevertheless, the Bio.SeqIO.convert(...) function still wins for now. Peter For those interested, here are the tiny little Biopython scripts I'm using: # biopython_solexa2sanger.py #FASTQ conversion using Bio.SeqIO, needs Biopython 1.50 or later. import sys from Bio import SeqIO records = SeqIO.parse(sys.stdin, "fastq-solexa") SeqIO.write(records, sys.stdout, "fastq") and: #convert_solexa2sanger.py #High performance FASTQ conversion using Bio.SeqIO.convert(...) #function likely to be in Biopython 1.52 onwards. 
import sys from Bio import SeqIO SeqIO.convert(sys.stdin, "fastq-solexa", sys.stdout, "fastq") From chapmanb at 50mail.com Tue Aug 11 13:10:19 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 11 Aug 2009 09:10:19 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <4A8081B3.2080600@berkeley.edu> References: <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> Message-ID: <20090811131019.GW12604@sobchak.mgh.harvard.edu> Hi Nick; > Summary: Major focus is getting the GBIF access/search/parse module into > "done"/submittable shape. This primarily requires getting the > documentation and testing up to biopython specs. I have a fair bit of > documentation and testing, need advice (see below) for specifics on what > it should look like. Awesome. Thanks for working on the cleanup for this. > OK, I will do this. Should I try and figure out the unittest stuff? I > could use a simple example of what this is supposed to look like. In addition to Peter's pointers, here is a simple example from a small thing I wrote: http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py You can copy/paste the unit test part to get a base, and then replace the t_* functions with your own real tests. Simple scripts that generate consistent output are also fine; that's the print and compare approach. > > - What is happening with the Nodes_v2 and Treesv2 files? They look > > like duplicates of the Nexus Nodes and Trees with some changes. > > Could we roll those changes into the main Nexus code to avoid > > duplication? > > Yeah, these were just copies with your bug fix, and with a few mods I > used to track crashes. Presumably I don't need these with after a fresh > download of biopython. Cool. It would be great if we could weed these out as well. > The API is really just the interface with GBIF. I think developing a > cookbook entry is pretty easy, I assume you want something like one of > the entries in the official biopython cookbook? Yes, that would work great. What I was thinking of are some examples where you provide background and motivation: Describe some useful information you want to get from GBIF, and then show how to do it. This is definitely the most useful part as it gives people working examples to start with. From there they can usually browse the lower level docs or code to figure out other specific things. > Re: API documentation...are you just talking about the function > descriptions that are typically in """ """ strings beneath the function > definitions? I've got that done. Again, if there is more, an example > of what it should look like would be useful. That looks great for API level docs. You are right on here; for this week I'd focus on the cookbook examples and cleanup stuff. My other suggestion would be to rename these to follow Biopython conventions, something like: gbif_xml -> GbifXml shpUtils -> ShapefileUtils geogUtils -> GeographyUtils dbfUtils -> DbfUtils The *Utils might have underscores if they are not intended to be called directly. 
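To make the "unittest" option concrete, here is a minimal self-contained sketch in that style; the sample XML, class name and test name are invented, and a real test would instead parse a small response file checked into Tests/Geography:

import unittest
from xml.etree import ElementTree

# Tiny made-up stand-in for a GBIF response, just to keep the sketch runnable.
SAMPLE_XML = '<list><TaxonOccurrence key="1"/><TaxonOccurrence key="2"/></list>'

class TaxonOccurrenceTest(unittest.TestCase):
    def test_count_occurrences(self):
        """Parse the sample XML and count the TaxonOccurrence elements."""
        root = ElementTree.fromstring(SAMPLE_XML)
        hits = root.findall("TaxonOccurrence")
        self.assertEqual(len(hits), 2)

if __name__ == "__main__":
    runner = unittest.TextTestRunner(verbosity=2)
    unittest.main(testRunner=runner)

The print-and-compare alternative is even simpler: a plain script whose expected output is stored alongside it and compared by the test runner.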
Thanks for all your hard work, Brad From chapmanb at 50mail.com Tue Aug 11 13:20:57 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 11 Aug 2009 09:20:57 -0400 Subject: [Biopython-dev] GSoC Weekly Update 12: PhyloXML for Biopython In-Reply-To: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com> References: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com> Message-ID: <20090811132057.GX12604@sobchak.mgh.harvard.edu> Hi Eric; All sounds great -- looks like you are in good shape for finishing things up this week. Really great work. > - Presumably, any discussion of merging with Biopython will have to wait > until after the biopython-1.51 release. I'll be around. For GSoC > requirements, I'm planning on just dumping the Bio.Tree and Bio.TreeIO > modules along with the unit test suite as standalone files, rather than > as a patch set since the last upstream revision I pulled was just a > random untagged one around the time of the last beta release. We were discussing a release at the end of this week or over the weekend. I think we should roll this in soon after that so anyone can get it from the main trunk. I don't see any major issues with integrating it. How did you like the Git/GitHub experience? One thing we should push after this release is moving over to that as the official repository. Since you have been doing full time Git work this summer, your experience will be really helpful. I still rely on CVS as a bit of a crutch, but should learn to do things fully in Git. Brad From biopython at maubp.freeserve.co.uk Tue Aug 11 16:13:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 11 Aug 2009 17:13:58 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00908060905i4a326327t504385ec55b0230c@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> <20090708130649.GY17086@sobchak.mgh.harvard.edu> <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com> <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com> <320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com> <320fb6e00908060905i4a326327t504385ec55b0230c@mail.gmail.com> Message-ID: <320fb6e00908110913x6cfe7826xa683a6dc130da26e@mail.gmail.com> On Thu, Aug 6, 2009 at 5:05 PM, Peter wrote: > Or we just declare both Bio.Application.generic_run and > ApplicationResult obsolete, and simply recommend using > subprocess with str(cline) as before. Would someone like to > proof read (and test) the tutorial in CVS where I switched all > the generic_run usage to subprocess? > I've just marked Bio.Application.generic_run and ApplicationResult as obsolete in CVS. I am content to wait for a consensus about any replacement for generic_run once more people have tried using subprocess directly. Peter From biopython at maubp.freeserve.co.uk Tue Aug 11 16:44:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 11 Aug 2009 17:44:11 +0100 Subject: [Biopython-dev] Drafting announcement for Biopython 1.51? Message-ID: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> Hi David & John, Would either of you be able to draft a release announcement for Biopython 1.51? We're aiming for the end of this week... touch wood. 
I'm pretty sure the NEWS and DEPRECATED files are up to date (if anyone can spot any omissions, please let us know), these try and summarise changes for each release. Unless you have CVS or git installed, the easiest way to read these files is currently from the github website: http://github.com/biopython/biopython/tree/master Thanks, Peter P.S. Don't be afraid to repeat things from the Biopython 1.51 beta announcement: http://news.open-bio.org/news/2009/06/biopython-151-beta-released/ http://lists.open-bio.org/pipermail/biopython-announce/2009-June/000057.html From eric.talevich at gmail.com Tue Aug 11 18:50:02 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 11 Aug 2009 14:50:02 -0400 Subject: [Biopython-dev] GSoC Weekly Update 12: PhyloXML for Biopython In-Reply-To: <20090811132057.GX12604@sobchak.mgh.harvard.edu> References: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com> <20090811132057.GX12604@sobchak.mgh.harvard.edu> Message-ID: <3f6baf360908111150q495e541bv405b25f0d74127fd@mail.gmail.com> On Tue, Aug 11, 2009 at 9:20 AM, Brad Chapman wrote: > > How did you like the Git/GitHub experience? One thing we should push > after this release is moving over to that as the official > repository. Since you have been doing full time Git work this > summer, your experience will be really helpful. I still rely on CVS > as a bit of a crutch, but should learn to do things fully in Git. > > I liked it a lot! I've spent some time with Subversion, Bazaar, Mercurial and Git now, and I'm confident that Git was the right choice for Biopython. My commit history shows a quick flurry of activity on each of the past few Fridays -- that's from a couple days of exploration toward the end of the week, then repeated calls to "git add -i" to pick out the parts that are worth keeping. I'm careful with git-rebase, but "git commit --amend" gets a fair amount of use. I could add a section on the Biopython wiki's GitUsage page, called something like "Managing Commits", giving some examples of this. GitHub has been down briefly a few times. It was only a problem because it happened on Monday mornings, when I wanted to push an updated README to my public fork at the same time as my weekly update e-mail to this list. Having a mirror on GitHub is great for getting started with Biopython development, but I'm still unclear on how changes should propagate back upstream after Biopython switches from CVS to Git. Pull requests? Core devs pushing to a central Git repository on OBF servers? Maybe the BioRuby folks have advice; if this has been settled on biopython-dev, I've missed it. Anyway. To create the final patch tarball next Monday for GSoC, I believe the right incantation looks like this: git format-patch -o gsoc-phyloxml master...phyloxml tar czf gsoc-phyloxml.tgz gsoc-phyloxml That's cleaner than I expected it to be. Neat. Cheers, Eric From winda002 at student.otago.ac.nz Wed Aug 12 05:47:13 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Wed, 12 Aug 2009 17:47:13 +1200 Subject: [Biopython-dev] Drafting announcement for Biopython 1.51? In-Reply-To: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> Message-ID: <4A825761.10106@student.otago.ac.nz> Peter wrote: > Hi David & John, > > Would either of you be able to draft a release announcement for > Biopython 1.51? We're aiming for the end of this week... touch wood. 
> We'll definitely aim to have something for the list to check out in the next 24hrs. I guess the main points are all the Cool New Stuff from the beta being in a stable release for the first time, FASTQ has been shown to play nicely with across a bunch of projects and Application.generic_run() is now on the deprecation path? On that note, would it be useful to have a cookbook example or even a blog-post ready to go showing a few of the ways one might use subprocess to run commands defined with Biopython? I'm happy to put something together that others can evaluate. Cheers, David From biopython at maubp.freeserve.co.uk Wed Aug 12 09:49:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 12 Aug 2009 10:49:50 +0100 Subject: [Biopython-dev] Drafting announcement for Biopython 1.51? In-Reply-To: <4A825761.10106@student.otago.ac.nz> References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> <4A825761.10106@student.otago.ac.nz> Message-ID: <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> On Wed, Aug 12, 2009 at 6:47 AM, David Winter wrote: > > Peter wrote: >> >> Hi David & John, >> >> Would either of you be able to draft a release announcement for >> Biopython 1.51? We're aiming for the end of this week... touch wood. > > We'll definitely aim to have something for the list to check out in the next > 24hrs. I guess the main points are all the Cool New Stuff from the beta > being in a stable release for the first time, FASTQ has been shown to play > nicely with across a bunch of projects and Application.generic_run() is now > on the deprecation path? Historically we haven't made a big thing about deprecations in the release announcements. Maybe we should - in which case also note that Bio.Fasta has finally been deprecated. > On that note, would it be useful to have a cookbook example or even a > blog-post ready to go showing a few of the ways one might use subprocess to > run commands defined with Biopython? I'm happy to put something together > that others can evaluate. The tutorial has several examples at the end of the chapter on alignments (because lots of the wrappers at the moment are for alignment tools). I've just updated the copy online to the current version from CVS (dated 10 August 2009). If you can spot any errors in the next couple of days we can get them fixed before the release. Peter From biopython at maubp.freeserve.co.uk Wed Aug 12 12:54:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 12 Aug 2009 13:54:15 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> Message-ID: <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> On Thu, Jul 23, 2009 at 10:34 AM, Peter wrote: > On Wed, Jul 22, 2009 at 9:51 PM, James Casbon wrote: >> I don't think there is much in it really. 
You have a factored >> BinaryFile class, I have classes for the components of the SFF file. >> Both are based around struct. I have now written a third variant (loosely based on Jose's code). This is just a single generator function (also based on struct). Right now it is a slightly long function, but it can be refactored easily enough. It is also a lot faster than Jose's code, which is a big plus point for large files. See: http://github.com/peterjc/biopython/tree/sff I haven't compared my new code against yours for speed yet James, because your parser didn't like my large SFF file. You have hard coded it to expect read names of length 14, and 400 flows per read. I have some data from Sanger where the read names are length 14, but there are 800 flows per read. Having the two reference parsers to look at was educational, so thank you both (James and Jose) for sharing your code. I now understand the SFF file format much better, and am now confident I could design an indexer to provide dictionary-like access to it - a possible addition to Bio.SeqIO - see this thread: http://lists.open-bio.org/pipermail/biopython/2009-June/005312.html > Jose's code uses seek/tell which means it has to have a handle > to an actual file. He also used binary read mode - I'm not sure if > this was essential or not. Binary mode was not essential - opening an SFF file in default mode also seemed to work fine with Jose's code. > James' code seems to make a single pass through the file handle, > without using seek/tell to jump about. I think this is nicer, as it is > consistent with the other SeqIO parsers, and should work on > more types of handles (e.g. from gzip, StringIO, or even a > network connection). I've also avoided using seek/tell in my rewrite. > It looks like you (James) construct Seq objects using the full > untrimmed sequence as is. I was undecided on if trimmed or > untrimmed should be the default, but the idea of some kind of > masked or trimmed Seq object had come up on the mailing list > which might be useful here (and in contig alignments). i.e. > something which acts like a Seq object giving the trimmed > sequence, but which also contains the full sequence and trim > positions. I'm still thinking about this. One simplistic option (as used on my branch) would be to have two input formats in Bio.SeqIO, one untrimmed and one trimmed, e.g. "sff" and "sff-trim". Peter From winda002 at student.otago.ac.nz Thu Aug 13 00:32:55 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 13 Aug 2009 12:32:55 +1200 Subject: [Biopython-dev] Draft announcement for Biopython 1.51 In-Reply-To: <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> <4A825761.10106@student.otago.ac.nz> <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> Message-ID: <4A835F37.5040907@student.otago.ac.nz> Hi all, here is a draft announcement to go out when 1.51 is built and ready to go. Comments and corrections are very welcome (should we keep the deprecation paragraph in?) I've also added a draft post to the OBF blog with this text marked up with links and ready to go, hopefully that way whoever builds the release can just ask someone with an account there (Brad and Peter at least) to push the post once everything is ready.
++ We are pleased to announce the release of Biopython 1.51.This new stable release enhances version 1.50 (released in April) by extending the functionality of existing modules, adding a set of application wrappers for popular alignment programs and fixing a number of minor bugs. In particular, the SeqIO module can now write genbank files that include features and deal with FASTQ files created by Illumina 1.3+. Support for this format allows interconversion between FASTQ files using Sloexa, Sanger and Ilumina quality scores and has been validated against the the BioPerl and EMBOSS implementations of this format. Biopython 1.51 is the first stable release to include the Align.Applications module which allows users to define command line wrappers for popular alignment programs including ClustalW, Muscle and T-Coffee. ?? This new release also spells the beginning of the end for some of Biopython's older tools. Bio.Fasta and the application tools ApplicationResult and generic_run() have been marked as deprecated which means they can still be imported but doing who warn the user that these functions will be removed in the future. Bio.Fasta has been superseded by SeqIO's support for the Fasta format while we now suggest using the subprocess module from the Python Standard Library to call applications - use of this module is extensively documented in section 6.3 of the Biopython Tutorial and Cookbook. ?? As always the Tutorial and Cookbook has been updated to document the other changes made since the last release. Thank you to everyone who tested our 1.51 beta or submitted bugs since out last stable release and to all of our contributors Sources and Windows Installer for the new release are available from the downloads page. ++ From winda002 at student.otago.ac.nz Thu Aug 13 00:37:12 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 13 Aug 2009 12:37:12 +1200 Subject: [Biopython-dev] Drafting announcement for Biopython 1.51? In-Reply-To: <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> <4A825761.10106@student.otago.ac.nz> <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> Message-ID: <4A836038.1060609@student.otago.ac.nz> >> On that note, would it be useful to have a cookbook example or even a >> blog-post ready to go showing a few of the ways one might use subprocess to >> run commands defined with Biopython? I'm happy to put something together >> that others can evaluate. >> > > The tutorial has several examples at the end of the chapter on > alignments (because lots of the wrappers at the moment are for > alignment tools). I've just updated the copy online to the current > version from CVS (dated 10 August 2009). If you can spot any > errors in the next couple of days we can get them fixed before > the release. > > Peter > > OK, I had only looked at the doc strings (my editor chokes on long text files and I don't have anything to set Tex docs with) so didn't know that existed. That looks really good (and the feeding output into handles bit is pretty wizardly!) Cheers, David From biopython at maubp.freeserve.co.uk Thu Aug 13 10:00:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 11:00:49 +0100 Subject: [Biopython-dev] Drafting announcement for Biopython 1.51? 
In-Reply-To: <4A836038.1060609@student.otago.ac.nz> References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> <4A825761.10106@student.otago.ac.nz> <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> <4A836038.1060609@student.otago.ac.nz> Message-ID: <320fb6e00908130300x3b4f1eb7m7711b76e0e03fd8a@mail.gmail.com> On Thu, Aug 13, 2009 at 1:37 AM, David Winter wrote: > > OK, I had only looked at the doc strings (my editor chokes on long > text files and I don't have anything to set Tex docs with) so didn't > know that existed. TeX or LaTeX files are just plain text with some magic markup e.g. \emph{text to emphasise}. Any decent text editor should be able to load them, and some will even colour code things. Even if you don't understand the markup, most of the time you can actually read the raw files directly and understand them. But yeah, the PDF or HTML output is what most people will want to look at ;) > That looks really good (and the feeding output into handles > bit is pretty wizardly!) Yeah - it is pretty cool. Sadly not all command line tools will accept input via stdin, so this kind of thing isn't always possible. Peter From biopython at maubp.freeserve.co.uk Thu Aug 13 10:10:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 11:10:44 +0100 Subject: [Biopython-dev] Draft announcement for Biopython 1.51 In-Reply-To: <4A835F37.5040907@student.otago.ac.nz> References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com> <4A825761.10106@student.otago.ac.nz> <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com> <4A835F37.5040907@student.otago.ac.nz> Message-ID: <320fb6e00908130310n8efa09dv81963277e607da52@mail.gmail.com> Thanks for the first draft David, On Thu, Aug 13, 2009 at 1:32 AM, David Winter wrote: > In particular, the SeqIO module can now write genbank files that include > features and deal with FASTQ files created by Illumina 1.3+. Support for > this format allows interconversion between FASTQ files using Sloexa, Sanger > and Ilumina quality scores and has been validated against the the BioPerl > and EMBOSS implementations of this format. Typo: Sloexa -> Solexa. I would probably rephrase the rest a little, there are some subtleties with 3 container formats but only 2 scoring systems... In particular, the SeqIO module can now write GenBank with features, and deal with FASTQ files created by Illumina 1.3+. Support for this format allows interconversion between FASTQ files using the Sanger, Solexa or Illumina 1.3+ FASTQ variants, using conventions agreed with the BioPerl and EMBOSS projects. [BioPerl and EMBOSS are still working on the FASTQ variants, so we haven't actually got everything cross validated yet.] > ?? > This new release also spells the beginning of the end for some of > Biopython's older tools. Bio.Fasta and the application tools > ApplicationResult and generic_run() have been marked as deprecated which > means they can still be imported but doing who warn the user that these > functions will be removed in the future. Bio.Fasta has been superseded by > SeqIO's support for the Fasta format while we now suggest using the > subprocess module from the Python Standard Library to call applications - > use of this module is extensively documented in section 6.3 of the Biopython > Tutorial and Cookbook. > ?? I would omit that, or at least cut it down a lot. It might also be worth mentioning we no longer include Martel/Mindy, and thus don't have any dependence on mxTextTools. 
Also we don't support Python 2.3 anymore. P.S. I try and avoid referring to sections of the Tutorial by number, as these often change from release to release. Thanks, Peter From biopython at maubp.freeserve.co.uk Thu Aug 13 13:02:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 14:02:17 +0100 Subject: [Biopython-dev] [Biopython] Trimming adaptors sequences In-Reply-To: <20090813124432.GB90165@sobchak.mgh.harvard.edu> References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> <20090810131650.GP12604@sobchak.mgh.harvard.edu> <320fb6e00908121621m25e37f20pbd8e5e01c26b13a7@mail.gmail.com> <20090813124432.GB90165@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908130602n607add6fme67f7934234a5540@mail.gmail.com> On Thu, Aug 13, 2009 at 1:44 PM, Brad Chapman wrote: >> However, if you just want speed AND you really want to have a FASTQ >> input file, try the underlying Bio.SeqIO.QualityIO.FastqGeneralIterator >> parser which gives plain strings, and handle the output yourself. Working >> directly with Python strings is going to be faster than using Seq and >> SeqRecord objects. You can even opt for outputting FASTQ files - as >> long as you leave the qualities as an encoded string, you can just slice >> that too. The downside is the code will be very specific. e.g. something >> along these lines: >> >> from Bio.SeqIO.QualityIO import FastqGeneralIterator >> in_handle = open(input_fastq_filename) >> out_handle = open(output_fastq_filename, "w") >> for title, seq, qual in FastqGeneralIterator(in_handle) : >> #Do trim logic here on the string seq >> if trim : >> seq = seq[start:end] >> qual = qual[start:end] # kept as ASCII string! >> #Save the (possibly trimmed) FASTQ record: >> out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) >> out_handle.close() >> in_handle.close() > > Nice -- I will have to play with this. I hadn't dug into the current > SeqRecord slicing code at all but I wonder if there is a way to keep > the SeqRecord interface but incorporate some of these speed ups > for common cases like this FASTQ trimming. I suggest we continue this on the dev mailing list (this reply is cross posted), as it is starting to get rather technical. When you really care about speed, any object creation becomes an issue. Right now for *any* record we have at least the following objects being created: SeqRecord, Seq, two lists (for features and dbxrefs), two dicts (for annotation and the per letter annotation), and the restricted dict (for per letter annotations), and at least four strings (sequence, id, name and description). Perhaps some lazy instantiation might be worth exploring... for example make dbxref, features, annotations or letter_annotations into properties where the underlying object isn't created unless accessed. [Something to try after Biopython 1.51 is out?] I would guess (but haven't timed it) that for trimming FASTQ SeqRecords, a big part of the overhead is that we are using Python lists of integers (rather than just a string) for the scores. So sticking with the current SeqRecord object as is, one speed up we could try would be to leave the FASTQ quality string as an encoded string (rather than turning into integer quality scores, and back again on output). It would be a hack, but adding this as another SeqIO format name, e.g. "fastq-raw" or "fastq-ascii", might work. We'd still need a new letter_annotations key, say "fastq_qual_ascii". This idea might work, but it does seem ugly.
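As a rough illustration of the lazy instantiation idea (a toy class, not Biopython code): a property can defer building a per-record dictionary until somebody actually touches it, so records that are only sliced or written straight back out never pay for it.

class LazyRecord(object):
    """Toy example: the annotations dict is only built on first access."""
    def __init__(self, seq, id):
        self.seq = seq
        self.id = id
        self._annotations = None   # nothing created yet

    @property
    def annotations(self):
        if self._annotations is None:
            self._annotations = {}   # created lazily, on demand
        return self._annotations

rec = LazyRecord("ACGT", "demo")
# No dictionary exists until the attribute is first used:
rec.annotations["note"] = "created on first access"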
Peter From biopython at maubp.freeserve.co.uk Thu Aug 13 17:33:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 18:33:41 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <1250099579.4a83017b4e97c@webmail.upv.es> References: <200904161146.28203.jblanca@btc.upv.es> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <1250099579.4a83017b4e97c@webmail.upv.es> Message-ID: <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> [Jose - you didn't CC the list with your reply] On Wed, Aug 12, 2009 at 6:52 PM, Blanca Postigo Jose Miguel wrote: > > Hi: > > I just love free software :) It's great to watch how the code is being improved > by the work of so many people. I hope to get some time to get a look at the > latest sff reader. You'll probably be interested to know I've made some excellent progress with the (optional) SFF index block. I note that the specifications (both on the NCBI page and in the Roche manual) appear to suggest that the index block could appear in the middle of the read data. However, in all the examples I have looked at, the index is actually at the end. http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#sff Sadly the format of the index isn't documented, but I think I have reverse engineered the format that Roche SFF files are using. In a slight twist of the specification they are actually using the index block for both XML meta data AND an index of the read offsets. This will dovetail nicely with the indexing support in Bio.SeqIO which I am working on for Biopython 1.52, branch on github. I expect to have fast random access to reads in an SFF file very soon. See http://github.com/peterjc/biopython/tree/convert >> > It looks like you (James) construct Seq objects using the full >> > untrimmed sequence as is. I was undecided on if trimmed or >> > untrimmed should be the default, but the idea of some kind of >> > masked or trimmed Seq object had come up on the mailing list >> > which might be useful here (and in contig alignments). i.e. >> > something which acts like a Seq object giving the trimmed >> > sequence, but which also contains the full sequence and trim >> > positions. >> >> I'm still thinking about this. One simplistic option (as used on >> my branch) would be to have two input formats in Bio.SeqIO, >> one untrimmed and one trimmed, e.g. "sff" and "sff-trim". > > I think that some way to mask the SeqRecord or Seq object > would be great. It would be useful for many tasks, not just this > one. Sure - if we can come up with a suitable design...
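One possible shape for such a design, as a very rough sketch (plain strings rather than Seq objects, and the attribute names are invented): an object that behaves like the clipped read but keeps the full sequence and the trim coordinates.

class TrimmedSeq(object):
    """Sketch only: acts like the trimmed read, but keeps the full data."""
    def __init__(self, full_seq, trim_start, trim_end):
        self.full_seq = full_seq        # complete, untrimmed sequence
        self.trim_start = trim_start    # first base to keep (0-based)
        self.trim_end = trim_end        # one past the last base to keep

    def __str__(self):
        return self.full_seq[self.trim_start:self.trim_end]

    def __len__(self):
        return self.trim_end - self.trim_start

read = TrimmedSeq("tcagACGTACGTACGTtttt", 4, 16)
str(read)        # "ACGTACGTACGT" -- the trimmed sequence
len(read)        # 12
read.full_seq    # the untrimmed sequence and clip points remain available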
Peter From biopython at maubp.freeserve.co.uk Thu Aug 13 17:38:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 18:38:43 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <1250099579.4a83017b4e97c@webmail.upv.es> <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> Message-ID: <320fb6e00908131038v567ed86fjb775d810fb69e7d@mail.gmail.com> Peter wrote: > > Sadly the format of the index isn't documented, but I think I have > reverse engineered the format that Roche SFF files are using. In a > slight twist of the specification they are actually using the index bock > for both XML meta data AND and index of the read offsets. I'm not the first to notice this, see for example the Celera Assembler looks in a Roche SFF file's XML meta data to determine how the quality scores were called: http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Roche_454_Platforms Peter From jblanca at btc.upv.es Fri Aug 14 06:01:42 2009 From: jblanca at btc.upv.es (Blanca Postigo Jose Miguel) Date: Fri, 14 Aug 2009 08:01:42 +0200 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00908131456o2103578cs3ed1130c622307a2@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <1250099579.4a83017b4e97c@webmail.upv.es> <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> <1250192416.4a846c2045f94@webmail.upv.es> <320fb6e00908131456o2103578cs3ed1130c622307a2@mail.gmail.com> Message-ID: <1250229702.4a84fdc6c403a@webmail.upv.es> Mensaje citado por Peter : > On Thu, Aug 13, 2009 at 8:40 PM, Blanca Postigo Jose > Miguel wrote: > > > >> This will dovetail nicely with the indexing support in Bio.SeqIO > >> which I am working on for Biopython 1.52, branch on github. > >> I expect to have fast random access to reads in an SFF file > >> very soon. See http://github.com/peterjc/biopython/tree/convert > > > > I've written some code to solve a similar problem. Maybe you > > could take a look to it. It's in the classes FileIndex and > > FileSequenceIndex at: > > > > > http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/biolib_seqio_utils.py > > > > Did you see this thread? > http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html > > The coding style is quite different, but it looks the essential idea > is the same - we both scan the file to find each record, and use > a dictionary to record the offset. Interestingly you and Peio also > keeps the record's length in the dictionary, which will double the > memory requirements - for something you don't actually need. > > Peter > > P.S. You can forward or CC this back to the list if you like. 
We keep the record length to be able to return the record without having to scan the file again. Jose Blanca From biopython at maubp.freeserve.co.uk Fri Aug 14 09:36:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Aug 2009 10:36:31 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <1250229702.4a84fdc6c403a@webmail.upv.es> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <1250099579.4a83017b4e97c@webmail.upv.es> <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> <1250192416.4a846c2045f94@webmail.upv.es> <320fb6e00908131456o2103578cs3ed1130c622307a2@mail.gmail.com> <1250229702.4a84fdc6c403a@webmail.upv.es> Message-ID: <320fb6e00908140236v547ea056g965c7b7cd61d555c@mail.gmail.com> On Fri, Aug 14, 2009 at 7:01 AM, Blanca Postigo Jose Miguel wrote: > >> The coding style is quite different, but it looks the essential idea >> is the same - we both scan the file to find each record, and use >> a dictionary to record the offset. Interestingly you and Peio also >> keeps the record's length in the dictionary, which will double the >> memory requirements - for something you don't actually need. > > We keep the record length to be able to return the record without > having to scan the file again. If you want to be able to extract the raw record, that makes sense. It is still a trade off between memory usage and speed of access, and depending on your requirements either way makes sense. For Bio.SeqIO, I want to parse the raw record on access via the key in order to return a SeqRecord, so I have no need to keep the raw record length in memory. I'm using this github branch: http://github.com/peterjc/biopython/commits/index Peter From biopython at maubp.freeserve.co.uk Fri Aug 14 11:57:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Aug 2009 12:57:26 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <1250099579.4a83017b4e97c@webmail.upv.es> <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> Message-ID: <320fb6e00908140457k66e747dep881abf0b044ab9c1@mail.gmail.com> On Thu, Aug 13, 2009 at 6:33 PM, Peter wrote: > > You'll probably be interested to know I've made some excellent progress > with the (optional) SFF index block. I note that the specifications (both > on the NCBI page and in the Roche manual) appear to suggest that the > index block could appear in the middle of the the read data. However, > in all the examples I have looked at, the index is actually at the end. > > http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#sff > > Sadly the format of the index isn't documented, but I think I have > reverse engineered the format that Roche SFF files are using. 
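To make the trade-off being discussed here concrete, here is a minimal sketch of the idea: scan the file once, record where each record starts, and optionally how long it is, so the raw record can later be returned by seeking instead of re-parsing. This is not the actual FileIndex or Bio.SeqIO code, and the plain FASTA example and file names are made up purely for illustration:

    # Rough sketch of the offset-index idea discussed in this thread.
    def build_index(filename, keep_length=True):
        """Scan a FASTA file once, recording where each record starts.

        Returns a dict mapping record id -> offset, or -> (offset, length)
        if keep_length is True (Jose's approach, which lets you return the
        raw record later without re-scanning the file).
        """
        index = {}
        handle = open(filename, "rb")  # binary mode, so offsets are byte offsets
        key = None
        start = 0
        while True:
            offset = handle.tell()
            line = handle.readline()
            if key is not None and (not line or line.startswith(b">")):
                index[key] = (start, offset - start) if keep_length else start
            if not line:
                break
            if line.startswith(b">"):
                key = line[1:].split()[0].decode()
                start = offset
        handle.close()
        return index

    def get_raw_record(filename, index, key):
        """Return the raw text of one record using an (offset, length) index."""
        offset, length = index[key]
        handle = open(filename, "rb")
        handle.seek(offset)
        raw = handle.read(length)
        handle.close()
        return raw

With keep_length=False you store only the offset (roughly half the memory), but on access you must parse forward from that offset to rebuild the record, which is essentially the approach Peter describes for Bio.SeqIO; keeping the length as well is what lets Jose hand back the raw bytes directly.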
In > a slight twist of the specification they are actually using the index > block for both XML meta data AND an index of the read offsets. > > This will dovetail nicely with the indexing support in Bio.SeqIO > which I am working on for Biopython 1.52, branch on github. > I expect to have fast random access to reads in an SFF file > very soon. See http://github.com/peterjc/biopython/tree/convert Sorry, wrong branch - my "index" branch has the indexing (as well as SFF files and the Bio.SeqIO.convert() functionality): http://github.com/peterjc/biopython/tree/index I've got this code working nicely for reading or indexing SFF files. Testing with a 2GB SFF file with 660808 Roche 454 reads, using the Roche index I can load this in under 3 seconds and retrieve any single record almost instantly. If the index is missing (or not in the expected format) I have to scan the file to build my own index, and that takes about 11 seconds - which is still fine :) Peter From biopython at maubp.freeserve.co.uk Fri Aug 14 12:00:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Aug 2009 13:00:15 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> Message-ID: <320fb6e00908140500n56e7ccbcl7123099b8de06ccf@mail.gmail.com> On Wed, Aug 12, 2009 at 1:54 PM, Peter wrote: > >> Jose's code uses seek/tell which means it has to have a handle >> to an actual file. He also used binary read mode - I'm not sure if >> this was essential or not. > > Binary mode was not essential - opening an SFF file in default > mode also seemed to work fine with Jose's code. Having worked on this more, default mode or binary mode are fine. However, as you might expect, you can't use Python's universal read lines mode when parsing SFF files. Peter From biopython at maubp.freeserve.co.uk Fri Aug 14 13:25:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Aug 2009 14:25:43 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <1250252775.4a8557e7d9ae4@webmail.upv.es> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <1250099579.4a83017b4e97c@webmail.upv.es> <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com> <1250192416.4a846c2045f94@webmail.upv.es> <320fb6e00908131456o2103578cs3ed1130c622307a2@mail.gmail.com> <1250229702.4a84fdc6c403a@webmail.upv.es> <320fb6e00908140236v547ea056g965c7b7cd61d555c@mail.gmail.com> <1250252775.4a8557e7d9ae4@webmail.upv.es> Message-ID: <320fb6e00908140625v5f5bd338qc081e0e5091df9bf@mail.gmail.com> Jose wrote: >>> We keep the record length to be able to return the record without >>> having to scan the file again. Peter wrote: >> If you want to be able to extract the raw record, that makes sense. 
>> It is still a trade off between memory usage and speed of access, >> and depending on your requirements either way makes sense. >> >> For Bio.SeqIO, I want to parse the raw record on access via the >> key in order to return a SeqRecord, so I have no need to keep >> the raw record length in memory. I'm using this github branch: >> http://github.com/peterjc/biopython/commits/index Jose wrote: > We want the raw record because we plan to use this FileIndex on several > different files, not just for sequences. In fact you have an example on how to > use it for sequences in SequenceFileIndex, a class that uses the general > FileIndex. I think that this FileIndex class will be able even to index xml > files. This is the motivation for the design. I see - that makes sense. Peter From biopython at maubp.freeserve.co.uk Fri Aug 14 15:20:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Aug 2009 16:20:21 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <20090728220943.GJ68751@sobchak.mgh.harvard.edu> <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com> <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com> <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com> Message-ID: <320fb6e00908140820uf86603bh408dc93f99a3641a@mail.gmail.com> On Mon, Aug 10, 2009 at 5:46 PM, Peter wrote: > On Sat, Aug 8, 2009 at 12:14 PM, Peter wrote: >> I've stuck a branch up on github which (thus far) simply defines >> the Bio.SeqIO.convert and Bio.AlignIO.convert functions. >> Adding optimised code can come later. >> >> http://github.com/peterjc/biopython/commits/convert > > There is now a new file Bio/SeqIO/_convert.py on this > branch, and a few optimised conversions have been done. > In particular GenBank/EMBL to FASTA, any FASTQ to > FASTA, and inter-conversion between any of the three > FASTQ formats. > > The current Bio/SeqIO/_convert.py file actually looks very > long and complicated - but if you ignore the doctests (which > I would probably move to a dedicated unit test), it isn't that > much code at all. I have now moved all the test code to a new unit test file, test_SeqIO_convert.py, and think this code is ready for public testing/review, with a the aim of inclusion in Biopython 1.52 (i.e. it can wait until after 1.51 is done). I would still need to add this to the tutorial, but that won't take very long. Peter From bugzilla-daemon at portal.open-bio.org Fri Aug 14 15:23:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 Aug 2009 11:23:14 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200908141523.n7EFNExJ014906@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-14 11:23 EST ------- (In reply to comment #2) > (From update of attachment 1303 [details]) > This file is already a tiny bit out of date - I've started working on this > on a git branch. > > http://github.com/peterjc/biopython/commits/sff Actually, I got rid of that branch after merging it into my work on Bio.SeqIO indexing. I can now parse the Roche SFF index, allowing fast random access to the reads. 
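As a hedged illustration of where this thread is heading, this is roughly how the Bio.SeqIO.convert() and Bio.SeqIO.index() functions are meant to be used once the branches are merged. The file names and read name below are made up, and the exact signatures are assumptions based on the API as it was eventually released rather than the branch code being discussed here:

    from Bio import SeqIO

    # Convert a FASTQ file to FASTA in one call (optimised conversions were
    # being added for common pairs such as FASTQ -> FASTA).
    count = SeqIO.convert("reads.fastq", "fastq", "reads.fasta", "fasta")
    print("Converted %i records" % count)

    # Dict-like random access to a large file without loading it into memory;
    # only the offsets are kept, and each record is parsed on access.
    reads = SeqIO.index("reads.sff", "sff")
    print("%i reads indexed" % len(reads))
    record = reads["EM7RLCS01DQ6BO"]  # made-up 454 read name, parsed on demand
    print("%s, length %i" % (record.id, len(record.seq)))

The index behaves like a read-only dictionary, but only the file offsets live in memory, which is exactly the offset-versus-length trade-off debated above.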
See: http://github.com/peterjc/biopython/commits/index http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006603.html Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Fri Aug 14 21:08:32 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 14 Aug 2009 17:08:32 -0400 Subject: [Biopython-dev] Biopython 1.51 code freeze Message-ID: <20090814210832.GL90165@sobchak.mgh.harvard.edu> Hey all; I'll be doing the 1.51 release this weekend, so am declaring an official code freeze until things get finished. If you have any last minute bugs or issues please check them in this evening; otherwise no more CVS commits until 1.51 is officially rolled and announced. Like, um, go outside this weekend or something. David -- thanks for writing up the release announcement. Everyone -- thanks for all your hard work on getting things ready for the release. After this is rolled we should be able to start checking in new functionality for 1.52 and beyond. Have a great weekend, Brad From biopython at maubp.freeserve.co.uk Sat Aug 15 12:09:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 15 Aug 2009 13:09:39 +0100 Subject: [Biopython-dev] Biopython 1.51 code freeze In-Reply-To: <20090814210832.GL90165@sobchak.mgh.harvard.edu> References: <20090814210832.GL90165@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908150509m322bfbd5yc55ab67b2af733a@mail.gmail.com> On Fri, Aug 14, 2009 at 10:08 PM, Brad Chapman wrote: > Hey all; > I'll be doing the 1.51 release this weekend, so am declaring an > official code freeze until things get finished. If you have any last > minute bugs or issues please check them in this evening; otherwise > no more CVS commits until 1.51 is officially rolled and announced. > Like, um, go outside this weekend or something. Cool - now that it has stopped raining, I might do that ;) Peter From tiagoantao at gmail.com Sat Aug 15 18:05:40 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sat, 15 Aug 2009 19:05:40 +0100 Subject: [Biopython-dev] Biopython 1.51 code freeze In-Reply-To: <20090814210832.GL90165@sobchak.mgh.harvard.edu> References: <20090814210832.GL90165@sobchak.mgh.harvard.edu> Message-ID: <6d941f120908151105l7144f806ub43b6aa761ed22a8@mail.gmail.com> Outside in this case means 37C and planting trees under heavy sun (with a short break for checking email on my mobile behind a shadow). Congratz on 1.51. I intend to start checking in new functionality in around 2 weeks. If someone wants to have a look at the code that is on git(genepop branch) and criticize, feel free. back to the trees now. 2009/8/14, Brad Chapman : > Hey all; > I'll be doing the 1.51 release this weekend, so am declaring an > official code freeze until things get finished. If you have any last > minute bugs or issues please check them in this evening; otherwise > no more CVS commits until 1.51 is officially rolled and announced. > Like, um, go outside this weekend or something. > > David -- thanks for writing up the release announcement. > > Everyone -- thanks for all your hard work on getting things ready for > the release. After this is rolled we should be able to start checking in > new functionality for 1.52 and beyond. 
> > Have a great weekend, > Brad > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Enviada a partir do meu dispositivo m?vel "A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin From chapmanb at 50mail.com Mon Aug 17 00:48:26 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 16 Aug 2009 20:48:26 -0400 Subject: [Biopython-dev] Biopython 1.51 status Message-ID: <20090817004826.GA4221@kunkel> Hey all; 1.51 is all checked and prepped and ready to go. However, I don't appear to have a user account on portal.open-bio.org, so can't transfer the new tarballs and api over there. Peter, you had mentioned you could do the windows installers. When you do those, could you also transfer over these tarballs and stick them in the right places: http://chapmanb.50mail.com/biopython-1.51.tar.gz http://chapmanb.50mail.com/biopython-1.51.zip http://chapmanb.50mail.com/api.tar.gz If you can do that I'll update the website and send out announcements in the morning. Thanks much. 1.51 on the way, Brad From biopython at maubp.freeserve.co.uk Mon Aug 17 09:04:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 10:04:16 +0100 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <20090817004826.GA4221@kunkel> References: <20090817004826.GA4221@kunkel> Message-ID: <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> On Mon, Aug 17, 2009 at 1:48 AM, Brad Chapman wrote: > Hey all; > 1.51 is all checked and prepped and ready to go. However, I don't > appear to have a user account on portal.open-bio.org, so can't > transfer the new tarballs and api over there. Right - your old account probably as had its password reset or something - do you want to contact the OBF or should I? > Peter, you had mentioned you could do the windows installers. > When you do those, could you also transfer over these tarballs > and stick them in the right places: > > http://chapmanb.50mail.com/biopython-1.51.tar.gz > http://chapmanb.50mail.com/biopython-1.51.zip > http://chapmanb.50mail.com/api.tar.gz Will do... > If you can do that I'll update the website and send out > announcements in the morning. Thanks much. Give me an hour or so ;) Peter From biopython at maubp.freeserve.co.uk Mon Aug 17 10:01:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 11:01:47 +0100 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> References: <20090817004826.GA4221@kunkel> <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> Message-ID: <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> On Mon, Aug 17, 2009 at 10:04 AM, Peter wrote: >> If you can do that I'll update the website and send out >> announcements in the morning. Thanks much. > > Give me an hour or so ;) OK, all uploaded, including the new tutorial. I also did the wiki (as it was simple for me to get the new file sizes), and added version 1.51 to bugzilla (not sure if you have the relevent permissions there or not - could you check?). Over to you now Brad for the release announcements (OBF blog, email) and PyPi, http://pypi.python.org/pypi/biopython/ and anything else on the list. 
Thanks, Peter From chapmanb at 50mail.com Mon Aug 17 12:16:18 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 17 Aug 2009 08:16:18 -0400 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> References: <20090817004826.GA4221@kunkel> <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> Message-ID: <20090817121618.GD12768@sobchak.mgh.harvard.edu> Peter; Thanks for the help with this. Everything else is all finished up -- news posted and message sent to the lists. The announcement e-mail only needs to be approved on biopython-announce. I wrote a message to open-bio support to get my password reset on portal, so hopefully we'll get that all sorted. It's great to have this out. Thanks again to everyone for all the hard work, Brad > On Mon, Aug 17, 2009 at 10:04 AM, Peter wrote: > >> If you can do that I'll update the website and send out > >> announcements in the morning. Thanks much. > > > > Give me an hour or so ;) > > OK, all uploaded, including the new tutorial. I also did the wiki > (as it was simple for me to get the new file sizes), and added > version 1.51 to bugzilla (not sure if you have the relevent > permissions there or not - could you check?). > > Over to you now Brad for the release announcements (OBF > blog, email) and PyPi, http://pypi.python.org/pypi/biopython/ > and anything else on the list. > > Thanks, > > Peter From biopython at maubp.freeserve.co.uk Mon Aug 17 12:17:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 13:17:53 +0100 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <20090817121618.GD12768@sobchak.mgh.harvard.edu> References: <20090817004826.GA4221@kunkel> <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> <20090817121618.GD12768@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908170517s7b1a6fb4h37bd2d22046dc3a@mail.gmail.com> On Mon, Aug 17, 2009 at 1:16 PM, Brad Chapman wrote: > Peter; > Thanks for the help with this. Everything else is all finished up -- > news posted and message sent to the lists. The announcement e-mail > only needs to be approved on biopython-announce. Done. > I wrote a message to open-bio support to get my password reset on portal, > so hopefully we'll get that all sorted. Cool. > It's great to have this out. Thanks again to everyone for all the hard > work, > Brad And thank you Brad :) Peter From biopython at maubp.freeserve.co.uk Mon Aug 17 12:43:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 13:43:01 +0100 Subject: [Biopython-dev] Moving from CVS to git Message-ID: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> Hi all, Now that Biopython 1.51 is out (thanks Brad), we should discuss finally moving from CVS to git. This was something we talked about at BOSC/ISMB 2009, but not everyone was there. We have two main options: (a) Move from CVS (on the OBF servers) to github. All our developers will need to get github accounts, and be added as "collaborators" to the existing github repository. I would want a mechanism in place to backup the repository to the OBF servers (Bartek already has something that should work). (b) Move from CVS to git (on the OBF servers). All our developers can continue to use their existing OBF accounts. Bartek's existing scripts could be modified to push the updates from this OBF git repository onto github. 
In either case, there will be some "plumbing" work required, for example I'd like to continue to offer a recent source code dump at http://biopython.open-bio.org/SRC/biopython/ etc. Given we don't really seem to have the expertise "in house" to run an OBF git server ourselves right now, option (a) is simplest, and as I recall those of us at BOSC where OK with this plan. Assuming we go down this route (CVS to github), everyone with an existing CVS account should setup a github account if they want to continue to have commit access (e.g. Frank, Iddo). I would suggest that initially you get used to working with git and github BEFORE trying anything directly on what would be the "official" repository. It took me a while and I'm still learning ;) Is this agreeable? Are there any other suggestions? [Once this is settled, we can talk about things like merge requests and if they should be accompanied by a Bugzilla ticket or not.] Peter From eric.talevich at gmail.com Mon Aug 17 14:02:02 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 17 Aug 2009 10:02:02 -0400 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> References: <20090817004826.GA4221@kunkel> <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> Message-ID: <3f6baf360908170702y7fde75b5l491ddf0cba0143e3@mail.gmail.com> On Mon, Aug 17, 2009 at 6:01 AM, Peter wrote: > On Mon, Aug 17, 2009 at 10:04 AM, Peter > wrote: > >> If you can do that I'll update the website and send out > >> announcements in the morning. Thanks much. > > > > Give me an hour or so ;) > > OK, all uploaded, including the new tutorial. I also did the wiki > (as it was simple for me to get the new file sizes), and added > version 1.51 to bugzilla (not sure if you have the relevent > permissions there or not - could you check?). > > Over to you now Brad for the release announcements (OBF > blog, email) and PyPi, http://pypi.python.org/pypi/biopython/ > and anything else on the list. > > Thanks, > > Peter > Great to see the release went smoothly! I'm probably being impatient here, but was a tag created for v1.51 final? I don't see it in GitHub yet, and it's been slightly over an hour since the last push. Thanks, Eric From chapmanb at 50mail.com Mon Aug 17 14:17:58 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 17 Aug 2009 10:17:58 -0400 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <3f6baf360908170702y7fde75b5l491ddf0cba0143e3@mail.gmail.com> References: <20090817004826.GA4221@kunkel> <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> <3f6baf360908170702y7fde75b5l491ddf0cba0143e3@mail.gmail.com> Message-ID: <20090817141758.GE12768@sobchak.mgh.harvard.edu> Hi Eric; > Great to see the release went smoothly! I'm probably being impatient here, > but was a tag created for v1.51 final? I don't see it in GitHub yet, and > it's been slightly over an hour since the last push. It was tagged last evening as biopython-151: > cvs log setup.py | head RCS file: /home/repository/biopython/biopython/setup.py,v Working file: setup.py head: 1.171 branch: locks: strict access list: symbolic names: biopython-151: 1.171 biopython-151b: 1.168 Maybe there is an issue with tags pushing to Git. Bartek and Peter were discussing this, but I don't remember the ultimate conclusion. 
Brad From bartek at rezolwenta.eu.org Mon Aug 17 14:29:31 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 17 Aug 2009 16:29:31 +0200 Subject: [Biopython-dev] Biopython 1.51 status In-Reply-To: <20090817141758.GE12768@sobchak.mgh.harvard.edu> References: <20090817004826.GA4221@kunkel> <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com> <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com> <3f6baf360908170702y7fde75b5l491ddf0cba0143e3@mail.gmail.com> <20090817141758.GE12768@sobchak.mgh.harvard.edu> Message-ID: <8b34ec180908170729g6c13333dk98d722cdb1d54bf0@mail.gmail.com> On Mon, Aug 17, 2009 at 4:17 PM, Brad Chapman wrote: > Hi Eric; > >> Great to see the release went smoothly! I'm probably being impatient here, >> but was a tag created for v1.51 final? I don't see it in GitHub yet, and >> it's been slightly over an hour since the last push. > > Maybe there is an issue with tags pushing to Git. Bartek and Peter > were discussing this, but I don't remember the ultimate conclusion. The ultimate conclusion will be reached when we move to github... ;) But for now, I'll just need to convert this tag manually. Just give me a few hours Bartek From bartek at rezolwenta.eu.org Mon Aug 17 15:07:32 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 17 Aug 2009 17:07:32 +0200 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> Message-ID: <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> On Mon, Aug 17, 2009 at 2:43 PM, Peter wrote: > Hi all, > > Given we don't really seem to have the expertise "in house" > to run an OBF git server ourselves right now, option (a) is > simplest, and as I recall those of us at BOSC where OK > with this plan. > > Assuming we go down this route (CVS to github), everyone > with an existing CVS account should setup a github account > if they want to continue to have commit access (e.g. Frank, > Iddo). I would suggest that initially you get used to working > with git and github BEFORE trying anything directly on what > would be the "official" repository. It took me a while and I'm > still learning ;) > > Is this agreeable? Are there any other suggestions? > > [Once this is settled, we can talk about things like merge > requests and if they should be accompanied by a Bugzilla > ticket or not.] > Hi All, I absolutely agree here with Peter, i.e. I would suggest we move now from CVS to a git branch hosted on github. Since I'm more involved in the technical setup we currently have, I'd also add a few more technical arguments for this move: - While current setup is working, it is suboptimal because there is an extra conversion step both for accepting changes done by people in git (git to CVS) and propagating releases (CVS to github). - Once we move to git as our version control system, we need to have a "master" branch which will be easily available for viewing and branching: we can't do it now on open-bio servers (it requires git installation and some server-side scripts to have a browseable repository), also "moving" to github is easier because it actually requires no physical action, we just need to stop updating CVS. 
- If anyone have fears of depending on github, I think it's much less of a problem than with CVS, moving our "master" branch from github to somewhere else is very easy and does not require any action on the side of github, we just post the branch somewhere, and start pushing there (you can find a list of possible hosting solutions here: http://git.or.cz/gitwiki/GitHosting) - Regarding the backups of the github branch: I'm already doing this. If you have a shell account on dev.open-bio.org, you can get the current git branch of biopython from /home/bartek/git_branch (location subject to change), so this would require no additional work, although it would be optimal, to actually install git on open-bio server, so that the updating script can be run from there. If we had that, we could actually hook it up directly to github, so that instead of running once in an hour, it would be run after each push to the branch (http://github.com/guides/post-receive-hooks) To summarize, I'm ready to switch off the part of my script which is updating the gihub branch from CVS. For now, I would leave the part that is making backups of github branch on open-bio server (via rsync). Once we have git installed on dev.open-bio, I can hook it up to notifications from github. cheers Bartek From biopython at maubp.freeserve.co.uk Mon Aug 17 15:52:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 16:52:36 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> Message-ID: <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> On Mon, Aug 17, 2009 at 4:07 PM, Bartek Wilczynski wrote: > > Hi All, > > I absolutely agree here with Peter, i.e. I would suggest we move now from > CVS to a git branch hosted on github. > > Since I'm more involved in the technical setup we currently have, I'd also add > a few more technical arguments for this move: > > - While current setup is working, it is suboptimal because there is an extra > ?conversion step both for accepting changes done by people in git (git to CVS) > ?and propagating releases (CVS to github). Yeah - this works but for anything non-trivial it would be a pain. > - Once we move to git as our version control system, we need to have a > "master" branch ?which will be easily available for viewing and branching: > we can't do it now on open-bio?servers (it requires git installation and > some server-side scripts to have a browseable?repository), ... My impression from talking to OBF guys is if we really want to we can do this, but it require us (Biopython) to take care of installing and running git on an OBF machine. > ... also "moving" to github is easier because it actually > requires no physical action, ?we just need to stop updating CVS. Yes - this is the big plus of "option (a)" over "option (b)" in my earlier email. > - If anyone have fears of depending on github, I think it's much less > of a problem than with CVS, ?moving our "master" branch from > github to somewhere else is very easy and does not require any > action on the side of github, we just post the branch somewhere, > and start pushing there (you can find a list of possible hosting > solutions here: http://git.or.cz/gitwiki/GitHosting) Yes, it is good to know we won't be tied to github (unless we start using more of the tools they offer on top of git itself). 
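For the record, the sort of OBF-side backup being referred to here (a mirror of the github repository refreshed from cron) need not be complicated. A minimal sketch, with made-up paths and URL and explicitly not Bartek's actual script, assuming git is installed on the server, might be:

    #!/usr/bin/env python
    # Hypothetical cron job keeping an off-site mirror of the github
    # repository up to date (paths and URL are placeholders).
    import os
    import subprocess

    MIRROR = "/home/backup/biopython.git"   # bare mirror location (made up)
    UPSTREAM = "git://github.com/biopython/biopython.git"

    if not os.path.isdir(MIRROR):
        # First run: create a bare mirror clone with all branches and tags
        subprocess.check_call(["git", "clone", "--mirror", UPSTREAM, MIRROR])
    else:
        # Later runs: fetch anything new from github into the mirror
        subprocess.check_call(["git", "--git-dir=" + MIRROR,
                               "remote", "update", "--prune"])

Run hourly from cron, or triggered by a github post-receive hook, something like this keeps an off-site copy without anyone having to administer a public git server.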
> - Regarding the backups of the github branch: I'm already doing this. > If you have a shell account on dev.open-bio.org, you can get the > current ?git branch of biopython from /home/bartek/git_branch >?(location subject to change), so this would require no additional > work, Yes - that is what I was hinting at in my email (trying to be brief). > ... although it would be optimal, to actually install git on open-bio > server, so that the updating script can be run from there. Yes. Even something as simple as a cron job running on an OBF server would satisfy me from a back up point of view. > If we had that, we could actually hook it up directly to github, > so that instead of running once in an hour, it would be run >?after each push to the branch (http://github.com/guides/post-receive-hooks) More complex, but worth considering. > To summarize, I'm ready to switch off the part of my script which is > updating the gihub branch from CVS. Good. We'll also want to ask the OBF admins to make CVS read only once we move. > For now, I would leave the part that is making backups of github > branch on open-bio server (via rsync). That would be my plan for the short term. We can then talk to the OBF server admins about how we can do this better. > Once we have git installed on dev.open-bio, I can hook it > up to notifications from github. If we go to the trouble of installing git on the OBF servers ;) Peter From mhampton at d.umn.edu Mon Aug 17 15:42:39 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 17 Aug 2009 10:42:39 -0500 (CDT) Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: References: Message-ID: Hi, I am preparing biopython-1.51 for inclusion as an optional package for Sage (www.sagemath.org). I ran the test suite and got 8 errors; I am not sure if these are all expected. The KDTree ones I have seen before, but some look new. My test log is available at: http://sage.math.washington.edu/home/mhampton/biopython-1.51-testlog.txt in case anyone wants to take a look. Biopython has been available in Sage for several years as an optional package, but I would like to make it a standard component. This has become much more likely since the clean-up of Numeric and mx-texttools dependencies. I think the only real issue is setting up some testing during the Sage package installation, which is my motivation for really understanding the test failures. Cheers, Marshall Hampton Department of Mathematics and Statistics University of Minnesota, Duluth From biopython at maubp.freeserve.co.uk Mon Aug 17 16:28:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 17:28:27 +0100 Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: References: Message-ID: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> On Mon, Aug 17, 2009 at 4:42 PM, Marshall Hampton wrote: > > Hi, > > I am preparing biopython-1.51 for inclusion as an optional package for Sage > (www.sagemath.org). ?I ran the test suite and got 8 errors; I am not sure if > these are all expected. I wouldn't have expected any failures. > The KDTree ones I have seen before, but some look new. ?My test log is > available at: > > http://sage.math.washington.edu/home/mhampton/biopython-1.51-testlog.txt > > in case anyone wants to take a look. 
This one should be simple: test_EMBOSS.py ValueError: Disagree on file ig IntelliGenetics/VIF_mase-pro.txt in genbank format: 16 vs 1 records This is a known regression in EMBOSS 6.1.0 which will be fixed in their next release. Can you check this by running embossversion? The others are all ImportErrors (e.g. cannot import name _CKDTree) I rather suspect you are running the test suite BEFORE compiling the C extensions, and that this may similarly affect Bio.Restriction. > Biopython has been available in Sage for several years as an optional > package, but I would like to make it a standard component. This has > has become much more likely since the clean-up of Numeric and > mx-texttools dependencies. Cool. > I think the only real issue is setting up some testing during the Sage > package installation, which is my motivation for really understanding > the test failures. I don't know anything about your test framework, but surely other packages (e.g. NumPy) have a similar requirement (compile before test) so this should be fixable. Regards, Peter From biopython at maubp.freeserve.co.uk Mon Aug 17 16:35:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 17:35:53 +0100 Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> Message-ID: <320fb6e00908170935v1919a5b3h3497e12f3156a477@mail.gmail.com> On Mon, Aug 17, 2009 at 5:28 PM, Peter wrote: > > The others are all ImportErrors (e.g. cannot import name _CKDTree) > I rather suspect you are running the test suite BEFORE compiling > the C extensions, and that this may similarly affect Bio.Restriction. Also this line is interesting - it suggest you have not installed NumPy, or not told sage is is a dependency? test_Cluster ... skipping. If you want to use Bio.Cluster, install NumPy first and then reinstall Biopython P.S. Why does this page talk about Biopython version "4.2b"? http://wiki.sagemath.org/Sage_Spkg_Tracking Peter From matzke at berkeley.edu Mon Aug 17 19:48:33 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 17 Aug 2009 12:48:33 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <20090811131019.GW12604@sobchak.mgh.harvard.edu> References: <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> <20090811131019.GW12604@sobchak.mgh.harvard.edu> Message-ID: <4A89B411.4090501@berkeley.edu> Pencils down update: I have uploaded the relevant test scripts and data files to git, and deleted old loose files. http://github.com/nmatzke/biopython/commits/Geography Here is a simple draft tutorial: http://biopython.org/wiki/BioGeography#Tutorial Strangely, while working on the tutorial I discovered that I did something somewhere in the last revision that is messing up the parsing of automatically downloaded records from GBIF, I am tracking this down currently and will upload as soon as I find it. I would like to thank everyone for the opportunity to participate in GSoC, and to thank everyone for their help. 
For me, this summer turned into more of a "growing from a scripter to a programmer" summer than I expected initially. As a result I spent a more time refactoring and retracing my steps than I figured. However I think the resulting main product, a GBIF interface and associated tools, is much better than it would have been without the advice & encouragement of Brad, Hilmar, etc. I will be using this for my own research and will continue developing it. Cheers! Nick Brad Chapman wrote: > Hi Nick; > >> Summary: Major focus is getting the GBIF access/search/parse module into >> "done"/submittable shape. This primarily requires getting the >> documentation and testing up to biopython specs. I have a fair bit of >> documentation and testing, need advice (see below) for specifics on what >> it should look like. > > Awesome. Thanks for working on the cleanup for this. > >> OK, I will do this. Should I try and figure out the unittest stuff? I >> could use a simple example of what this is supposed to look like. > > In addition to Peter's pointers, here is a simple example from a > small thing I wrote: > > http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py > > You can copy/paste the unit test part to get a base, and then > replace the t_* functions with your own real tests. > > Simple scripts that generate consistent output are also fine; that's > the print and compare approach. > >>> - What is happening with the Nodes_v2 and Treesv2 files? They look >>> like duplicates of the Nexus Nodes and Trees with some changes. >>> Could we roll those changes into the main Nexus code to avoid >>> duplication? >> Yeah, these were just copies with your bug fix, and with a few mods I >> used to track crashes. Presumably I don't need these with after a fresh >> download of biopython. > > Cool. It would be great if we could weed these out as well. > >> The API is really just the interface with GBIF. I think developing a >> cookbook entry is pretty easy, I assume you want something like one of >> the entries in the official biopython cookbook? > > Yes, that would work great. What I was thinking of are some examples > where you provide background and motivation: Describe some useful > information you want to get from GBIF, and then show how to do it. > This is definitely the most useful part as it gives people working > examples to start with. From there they can usually browse the lower > level docs or code to figure out other specific things. > >> Re: API documentation...are you just talking about the function >> descriptions that are typically in """ """ strings beneath the function >> definitions? I've got that done. Again, if there is more, an example >> of what it should look like would be useful. > > That looks great for API level docs. You are right on here; for this > week I'd focus on the cookbook examples and cleanup stuff. > > My other suggestion would be to rename these to follow Biopython > conventions, something like: > > gbif_xml -> GbifXml > shpUtils -> ShapefileUtils > geogUtils -> GeographyUtils > dbfUtils -> DbfUtils > > The *Utils might have underscores if they are not intended to be > called directly. > > Thanks for all your hard work, > Brad > -- ==================================================== Nicholas J. Matzke Ph.D. 
Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From mhampton at d.umn.edu Mon Aug 17 20:46:42 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 17 Aug 2009 15:46:42 -0500 (CDT) Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: <320fb6e00908170935v1919a5b3h3497e12f3156a477@mail.gmail.com> References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> <320fb6e00908170935v1919a5b3h3497e12f3156a477@mail.gmail.com> Message-ID: On Mon, 17 Aug 2009, Peter wrote: > On Mon, Aug 17, 2009 at 5:28 PM, Peter wrote: >> >> The others are all ImportErrors (e.g. cannot import name _CKDTree) >> I rather suspect you are running the test suite BEFORE compiling >> the C extensions, and that this may similarly affect Bio.Restriction. > > Also this line is interesting - it suggest you have not installed NumPy, > or not told sage is is a dependency? > test_Cluster ... skipping. If you want to use Bio.Cluster, install > NumPy first and then reinstall Biopython Numpy is included in Sage, so I guess there is some sort of path problem. I'll give it another look. > P.S. Why does this page talk about Biopython version "4.2b"? > http://wiki.sagemath.org/Sage_Spkg_Tracking > > Peter > I have no idea, that was simply wrong. I have corrected that wiki page. Thanks for the feedback! Marshall Hampton From mhampton at d.umn.edu Mon Aug 17 21:25:28 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 17 Aug 2009 16:25:28 -0500 (CDT) Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> Message-ID: On Mon, 17 Aug 2009, Peter wrote: > This one should be simple: test_EMBOSS.py > ValueError: Disagree on file ig IntelliGenetics/VIF_mase-pro.txt in > genbank format: 16 vs 1 records > This is a known regression in EMBOSS 6.1.0 which will be fixed > in their next release. Can you check this by running embossversion? My emboss version is 6.1.0, so that explains that. 
After copying the Tests folder from the source to my site-packages directory, most of the errors go away, except for the one mentioned above and this one: ERROR: test_SeqIO_online ---------------------------------------------------------------------- Traceback (most recent call last): File "run_tests.py", line 248, in runTest suite = unittest.TestLoader().loadTestsFromName(name) File "/Users/mh/sagestuff/sage-4.1/local/lib/python/unittest.py", line 576, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "test_SeqIO_online.py", line 62, in record = SeqIO.read(handle, format) # checks there is exactly one record File "/Users/mh/sagestuff/sage-4.1/local/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", line 485, in read raise ValueError("No records found in handle") ValueError: No records found in handle ...not sure what the problem might be with that. -Marshall From mhampton at d.umn.edu Mon Aug 17 21:31:43 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 17 Aug 2009 16:31:43 -0500 (CDT) Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> Message-ID: I hope this isn't too much email, I can just post to the dev list if you'd like. Anyway, I manually ran my last test failure, test_SeqIO_online.py, and when I do that everything looks OK: thorn:16:28:30:site-packages: sage -python Tests/test_SeqIO_online.py Checking Bio.ExPASy.get_sprot_raw() - Fetching O23729 Got MAPAMEEIRQAQRAEGPAA...GAE [5Y08l+HJRDIlhLKzFEfkcKd1dkM] len 394 Checking Bio.Entrez.efetch() - Fetching X52960 from genome as fasta Got TGGCTCGAACTGACTAGAA...GCT [Ktxz0HgMlhQmrKTuZpOxPZJ6zGU] len 248 - Fetching X52960 from genome as gb Got TGGCTCGAACTGACTAGAA...GCT [Ktxz0HgMlhQmrKTuZpOxPZJ6zGU] len 248 - Fetching 6273291 from nucleotide as fasta Got TATACATTAAAGGAGGGGG...AGA [bLhlq4mEFJOoS9PieOx4nhGnjAQ] len 902 - Fetching 6273291 from nucleotide as gb Got TATACATTAAAGGAGGGGG...AGA [bLhlq4mEFJOoS9PieOx4nhGnjAQ] len 902 - Fetching 16130152 from protein as fasta Got MKVKVLSLLVPALLVAGAA...YQF [fCjcjMFeGIrilHAn6h+yju267lg] len 367 - Fetching 16130152 from protein as gb Got MKVKVLSLLVPALLVAGAA...YQF [fCjcjMFeGIrilHAn6h+yju267lg] len 367 Not sure where to go from here, but it seems that things are basically working correctly. -Marshall Hampton On Mon, 17 Aug 2009, Marshall Hampton wrote: > > On Mon, 17 Aug 2009, Peter wrote: >> This one should be simple: test_EMBOSS.py >> ValueError: Disagree on file ig IntelliGenetics/VIF_mase-pro.txt in >> genbank format: 16 vs 1 records >> This is a known regression in EMBOSS 6.1.0 which will be fixed >> in their next release. Can you check this by running embossversion? > > My emboss version is 6.1.0, so that explains that. 
> > After copying the Tests folder from the source to my site-packages directory, > most of the errors go away, except for the one mentioned above and this one: > > ERROR: test_SeqIO_online > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "run_tests.py", line 248, in runTest > suite = unittest.TestLoader().loadTestsFromName(name) > File "/Users/mh/sagestuff/sage-4.1/local/lib/python/unittest.py", line 576, > in loadTestsFromName > module = __import__('.'.join(parts_copy)) > File "test_SeqIO_online.py", line 62, in > record = SeqIO.read(handle, format) # checks there is exactly one record > File > "/Users/mh/sagestuff/sage-4.1/local/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", > line 485, in read > raise ValueError("No records found in handle") > ValueError: No records found in handle > > ...not sure what the problem might be with that. > > -Marshall > From biopython at maubp.freeserve.co.uk Mon Aug 17 21:37:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 22:37:05 +0100 Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> Message-ID: <320fb6e00908171437y1c4565f8jb1a19a369d389357@mail.gmail.com> On Mon, Aug 17, 2009 at 10:25 PM, Marshall Hampton wrote: > > After copying the Tests folder from the source to my site-packages > directory, most of the errors go away, Well that does suggest some sort of path issue, but moving the test directory around that isn't a very good solution. > except for the one mentioned above and this one: Assuming the "one mentioned above" was the EMBOSS one, fine. > ERROR: test_SeqIO_online > ---------------------------------------------------------------------- > Traceback (most recent call last): > ?File "run_tests.py", line 248, in runTest > ? ?suite = unittest.TestLoader().loadTestsFromName(name) > ?File "/Users/mh/sagestuff/sage-4.1/local/lib/python/unittest.py", line 576, > in loadTestsFromName > ? ?module = __import__('.'.join(parts_copy)) > ?File "test_SeqIO_online.py", line 62, in > ? ?record = SeqIO.read(handle, format) # checks there is exactly one record > ?File > "/Users/mh/sagestuff/sage-4.1/local/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", > line 485, in read > ? ?raise ValueError("No records found in handle") > ValueError: No records found in handle > > ...not sure what the problem might be with that. That is an online test using the NCBI's web services. This could be a transient failure due to the network. Peter From biopython at maubp.freeserve.co.uk Mon Aug 17 21:43:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 22:43:06 +0100 Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> Message-ID: <320fb6e00908171443g7b9fe780h1024d2f584be7b18@mail.gmail.com> On Mon, Aug 17, 2009 at 10:31 PM, Marshall Hampton wrote: > > I hope this isn't too much email, I can just post to the dev list if > you'd like. 
Doing it on the mailing list is fine, I'd read it either way ;) >?Anyway, I manually ran my last test failure, test_SeqIO_online.py, > and when I do that everything looks OK: > > thorn:16:28:30:site-packages: sage -python Tests/test_SeqIO_online.py > Checking Bio.ExPASy.get_sprot_raw() > - Fetching O23729 > ?Got MAPAMEEIRQAQRAEGPAA...GAE [5Y08l+HJRDIlhLKzFEfkcKd1dkM] len 394 > Checking Bio.Entrez.efetch() > - Fetching X52960 from genome as fasta > ?Got TGGCTCGAACTGACTAGAA...GCT [Ktxz0HgMlhQmrKTuZpOxPZJ6zGU] len 248 > - Fetching X52960 from genome as gb > ?Got TGGCTCGAACTGACTAGAA...GCT [Ktxz0HgMlhQmrKTuZpOxPZJ6zGU] len 248 > - Fetching 6273291 from nucleotide as fasta > ?Got TATACATTAAAGGAGGGGG...AGA [bLhlq4mEFJOoS9PieOx4nhGnjAQ] len 902 > - Fetching 6273291 from nucleotide as gb > ?Got TATACATTAAAGGAGGGGG...AGA [bLhlq4mEFJOoS9PieOx4nhGnjAQ] len 902 > - Fetching 16130152 from protein as fasta > ?Got MKVKVLSLLVPALLVAGAA...YQF [fCjcjMFeGIrilHAn6h+yju267lg] len 367 > - Fetching 16130152 from protein as gb > ?Got MKVKVLSLLVPALLVAGAA...YQF [fCjcjMFeGIrilHAn6h+yju267lg] len 367 > > Not sure where to go from here, but it seems that things are basically > working correctly. > > -Marshall Hampton That fits with it being a transient network issue. Some of our units tests like Tests/test_SeqIO_online.py are simple "print and compare" scripts, which are intended to be run via the run_tests.py script to validate their output. You can try this: sage -python Tests/run_tests.py test_SeqIO_online.py Or, manually compare that output to the expected output in file Tests/ouput/test_SeqIO_online - but it looks fine to me by eye. Peter From dalke at dalkescientific.com Mon Aug 17 21:36:41 2009 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 17 Aug 2009 23:36:41 +0200 Subject: [Biopython-dev] old Martel release Message-ID: Hi all, Does anyone here have a copy of my *old* Martel code? Something from the pre-1.0 days? I can't find it anywhere, and it looks like I did things back then on the biopython.org machines. An example URL was: http://www.biopython.org/~dalke/Martel/Martel-0.5.tar.gz I'm specifically looking for the molfile format I developed. That was 9 years ago and several machines back in time. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Aug 17 21:40:11 2009 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 17 Aug 2009 23:40:11 +0200 Subject: [Biopython-dev] old Martel release In-Reply-To: References: Message-ID: <0528A9A7-4DE9-4078-819F-4FD342B8D88D@dalkescientific.com> On Aug 17, 2009, at 11:36 PM, Andrew Dalke wrote: > Does anyone here have a copy of my *old* Martel code? Ha! archive.org has it. Didn't think they kept .tar.gz files, but they do! Andrew dalke at dalkescientific.com From eric.talevich at gmail.com Mon Aug 17 21:47:22 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 17 Aug 2009 17:47:22 -0400 Subject: [Biopython-dev] GSoC final update: PhyloXML for Biopython Message-ID: <3f6baf360908171447i2e3c592em5960269600e80f1b@mail.gmail.com> Hi all, Here's a final changelog for Aug. 10-14: - Added a 'terminal' argument to the find() method on BaseTree.Tree, for filtering internal/external nodes. This makes get_leaf_nodes() a trivial function, and total_branch_length is pretty simple too. - Updated the example phyloXML files to v1.10 schema-compliant copies from phyloxml.org; couple bug fixes. - Removed the project's README.rst file, so Bio/PhyloXML/ is no longer controlled by Git. 
I'll merge any useful information from there into the Biopython wiki documentation. - Pulled the Biopython 1.51 release into my master branch, and merged that into the phyloxml branch, so this branch (and the required GSoC patch tarball) will apply cleanly to the publicly released Biopython 1.51 source tree. - Documented most of what's been done on the Biopython wiki: http://www.biopython.org/wiki/PhyloXML http://www.biopython.org/wiki/TreeIO http://www.biopython.org/wiki/Tree *Future plans* There are a few tangential projects that deserve more attention over the next few months, and I'm going to create separate Git branches for each of them, to make it easier to share: - Port the Newick tree parser and methods from Bio.Newick to Bio.Tree and TreeIO. - Improve the graph drawing and networkx integration - BioSQL adapter between Bio.Tree.BaseTree and PhyloDB tables - Possibly, play with other tree representations -- nested-set, as PhyloDB does, and relationship matrix, which could bring NumPy into play (in a separate Bio.Tree.Matrix module) Finally, massive thanks to Brad and Christian for mentoring, Hilmar for overseeing the whole project, Peter and the Biopython folks for their guidance, and the various BioPerl monks and BioRubyists who shared their wisdom. All the best, Eric https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From biopython at maubp.freeserve.co.uk Mon Aug 17 21:48:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 22:48:19 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <4A89B411.4090501@berkeley.edu> References: <20090708124841.GX17086@sobchak.mgh.harvard.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> <20090811131019.GW12604@sobchak.mgh.harvard.edu> <4A89B411.4090501@berkeley.edu> Message-ID: <320fb6e00908171448l296abbb8yb509893cfbaaaa24@mail.gmail.com> On Mon, Aug 17, 2009 at 8:48 PM, Nick Matzke wrote: > I would like to thank everyone for the opportunity to participate in GSoC, > and to thank everyone for their help. ?For me, this summer turned into more > of a "growing from a scripter to a programmer" summer than I expected > initially. ?As a result I spent a more time refactoring and retracing my > steps than I figured. ?However I think the resulting main product, a GBIF > interface and associated tools, is much better than it would have been > without the advice & encouragement of Brad, Hilmar, etc. ?I will be using > this for my own research and will continue developing it. That sounds like this has been a successful project, and from my Biopython point of view the bit about you planing to continue using and developing the code in your research is especially good news ;) Cheers! 
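Picking up Eric's changelog item above about the 'terminal' argument to find(): as a sketch only (the method name, the terminal keyword and the branch_length attribute are assumptions based on his description, not a final API), the two helpers he mentions really can be this small:

    # Assumes a Tree.find() that yields nodes and accepts terminal=True,
    # and nodes carrying a branch_length attribute, per the description above.
    def get_leaf_nodes(tree):
        """All external (terminal) nodes of the tree."""
        return list(tree.find(terminal=True))

    def total_branch_length(tree):
        """Sum of branch lengths over every node in the tree."""
        return sum(node.branch_length or 0 for node in tree.find())

The "or 0" simply guards against nodes with no branch length set, such as the root.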
Peter From biopython at maubp.freeserve.co.uk Mon Aug 17 21:54:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 22:54:45 +0100 Subject: [Biopython-dev] old Martel release In-Reply-To: <0528A9A7-4DE9-4078-819F-4FD342B8D88D@dalkescientific.com> References: <0528A9A7-4DE9-4078-819F-4FD342B8D88D@dalkescientific.com> Message-ID: <320fb6e00908171454k267ed02djc982bf312b6285bb@mail.gmail.com> On Mon, Aug 17, 2009 at 10:40 PM, Andrew Dalke wrote: > On Aug 17, 2009, at 11:36 PM, Andrew Dalke wrote: >> >> ?Does anyone here have a copy of my *old* Martel code? > > Ha! archive.org has it. > > Didn't think they kept .tar.gz files, but they do! Lucky :) I don't know what is in it, but your dalke user account is still there on biopython.org - which would probably still have all the http://www.biopython.org/~dalke website content. I guess your password has expired or something. Give the OBF guys an email? You might have some other bits and pieces still there... Peter From mhampton at d.umn.edu Mon Aug 17 21:45:07 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 17 Aug 2009 16:45:07 -0500 (CDT) Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: <320fb6e00908171437y1c4565f8jb1a19a369d389357@mail.gmail.com> References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> <320fb6e00908171437y1c4565f8jb1a19a369d389357@mail.gmail.com> Message-ID: Yep, I tried again and the test_SeqIO_online was ok, so I guess it was a transient failure. I agree that copying my Tests folder isn't a great solution. I will try to increase my understanding of the biopython test framework - I am used to the Sage method of mainly using docstring tests. -Marshall On Mon, 17 Aug 2009, Peter wrote: > On Mon, Aug 17, 2009 at 10:25 PM, Marshall Hampton wrote: >> >> After copying the Tests folder from the source to my site-packages >> directory, most of the errors go away, > > Well that does suggest some sort of path issue, but moving the > test directory around that isn't a very good solution. > >> except for the one mentioned above and this one: > > Assuming the "one mentioned above" was the EMBOSS one, fine. > >> ERROR: test_SeqIO_online >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> ?File "run_tests.py", line 248, in runTest >> ? ?suite = unittest.TestLoader().loadTestsFromName(name) >> ?File "/Users/mh/sagestuff/sage-4.1/local/lib/python/unittest.py", line 576, >> in loadTestsFromName >> ? ?module = __import__('.'.join(parts_copy)) >> ?File "test_SeqIO_online.py", line 62, in >> ? ?record = SeqIO.read(handle, format) # checks there is exactly one record >> ?File >> "/Users/mh/sagestuff/sage-4.1/local/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", >> line 485, in read >> ? ?raise ValueError("No records found in handle") >> ValueError: No records found in handle >> >> ...not sure what the problem might be with that. > > That is an online test using the NCBI's web services. This could > be a transient failure due to the network. 
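For anyone wanting to see what this online test actually exercises, the heart of it is one Entrez fetch plus one SeqIO parse, roughly as below. The e-mail address is a placeholder, X52960 is one of the accessions from the test output quoted earlier, and network access is of course required:

    from Bio import Entrez, SeqIO

    Entrez.email = "your.name@example.com"  # NCBI ask that you identify yourself

    # Fetch one nucleotide record as FASTA and parse it; SeqIO.read()
    # insists on exactly one record, which is what the test checks.
    handle = Entrez.efetch(db="nucleotide", id="X52960",
                           rettype="fasta", retmode="text")
    record = SeqIO.read(handle, "fasta")
    handle.close()
    print("%s, length %i" % (record.id, len(record.seq)))

If the network request fails and the handle comes back empty, SeqIO.read() raises the "No records found in handle" ValueError seen in the traceback above, which is why a transient NCBI or network glitch shows up this way.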
> > Peter > From biopython at maubp.freeserve.co.uk Mon Aug 17 21:57:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 22:57:40 +0100 Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion In-Reply-To: References: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com> <320fb6e00908171437y1c4565f8jb1a19a369d389357@mail.gmail.com> Message-ID: <320fb6e00908171457w73ca3699y11dbe255cd2748df@mail.gmail.com> On Mon, Aug 17, 2009 at 10:45 PM, Marshall Hampton wrote: > > Yep, I tried again and the test_SeqIO_online was ok, so I guess it was a > transient failure. Good :) > I agree that copying my Tests folder isn't a great solution. ?I will try to > increase my understanding of the biopython test framework - I am > used to the Sage method of mainly using docstring tests. If it helps, there is a whole chapter in our tutorial, but most of this is aimed at people wanting to write unit tests for us. http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Please point out any typos or things that can be clarified. Thanks, Peter From bugzilla-daemon at portal.open-bio.org Tue Aug 18 10:01:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 Aug 2009 06:01:25 -0400 Subject: [Biopython-dev] [Bug 2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py In-Reply-To: Message-ID: <200908181001.n7IA1PWk030525@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2619 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-18 06:01 EST ------- Just to note the Ubuntu/Debian packages for Biopython list flex as a build dependency, and patch our setup.py file to re-enable the Bio.PDB.mmCIF.MMCIFlex extension. This is a neat solution until we can update our setup.py to detect flex on its own. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From hlapp at gmx.net Tue Aug 18 16:09:15 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 18 Aug 2009 12:09:15 -0400 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> Message-ID: On Aug 17, 2009, at 11:52 AM, Peter wrote: > My impression from talking to OBF guys is if we really want to we > can do this, but it require us (Biopython) to take care of installing > and running git on an OBF machine. That's how I would put it too. Moreover, if you as people who want this and know more about it already than anyone else among root-l can't be bothered to take the initiative to spearhead this on OBF servers, the argument that OBF "sysadmins" (which in essence is all of us who know how to do this) should do the work is a lot less strong than it might have to be. I.e., if you don't feel this would be time well invested for you, it is probably even less well invested for other OBFers. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Tue Aug 18 16:39:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 18 Aug 2009 17:39:23 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> Message-ID: <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> On Tue, Aug 18, 2009 at 5:09 PM, Hilmar Lapp wrote: > > On Aug 17, 2009, at 11:52 AM, Peter wrote: > >> My impression from talking to OBF guys is if we really want to we >> can do this, but it require us (Biopython) to take care of installing >> and running git on an OBF machine. > > That's how I would put it too. Moreover, if you as people who want this and > know more about it already than anyone else among root-l can't be bothered > to take the initiative to spearhead this on OBF servers, the argument that > OBF "sysadmins" (which in essence is all of us who know how to do this) > should do the work is a lot less strong than it might have to be. I.e., if > you don't feel this would be time well invested for you, it is probably even > less well invested for other OBFers. Sure. Right now I don't think anyone at Biopython knows exactly what would be involved in running a gitserver, and it would take some investment of time to get to that point. In the long term I think running git on an OBF machine would be a good idea, but I don't personally want to spend time learning how to do that right now. By using github, we don't have to invest a lot of upfront effort in configuring a git server right away. I think it makes sense to just move Biopython to github in the short term, in the medium term we can (expertise permitting) get a git mirror running on an OBF machine, and then other tools like the git equivalent of ViewCVS (and if need be then abandon github - we won't be locked into anything permanent). Peter From fkauff at biologie.uni-kl.de Wed Aug 19 07:36:45 2009 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Wed, 19 Aug 2009 09:36:45 +0200 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> Message-ID: <4A8BAB8D.1000109@biologie.uni-kl.de> Hi all, On 08/17/2009 02:43 PM, Peter wrote: > Hi all, > > Now that Biopython 1.51 is out (thanks Brad), we should > discuss finally moving from CVS to git. This was something > we talked about at BOSC/ISMB 2009, but not everyone was > there. We have two main options: > > (a) Move from CVS (on the OBF servers) to github. All our > developers will need to get github accounts, and be added > as "collaborators" to the existing github repository. I would > want a mechanism in place to backup the repository to the > OBF servers (Bartek already has something that should > work). > > I agree, this sounds at this point like the most feasible way to go. In the long run we can still reconsider to run git on the OBF servers, but t this point running such a server is an additional amount of work that brings no additional benefit. Cheers, Frank > (b) Move from CVS to git (on the OBF servers). 
All our > developers can continue to use their existing OBF accounts. > Bartek's existing scripts could be modified to push the > updates from this OBF git repository onto github. > > In either case, there will be some "plumbing" work required, > for example I'd like to continue to offer a recent source code > dump at http://biopython.open-bio.org/SRC/biopython/ etc. > > Given we don't really seem to have the expertise "in house" > to run an OBF git server ourselves right now, option (a) is > simplest, and as I recall those of us at BOSC where OK > with this plan. > > Assuming we go down this route (CVS to github), everyone > with an existing CVS account should setup a github account > if they want to continue to have commit access (e.g. Frank, > Iddo). I would suggest that initially you get used to working > with git and github BEFORE trying anything directly on what > would be the "official" repository. It took me a while and I'm > still learning ;) > > Is this agreeable? Are there any other suggestions? > > [Once this is settled, we can talk about things like merge > requests and if they should be accompanied by a Bugzilla > ticket or not.] > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > From matzke at berkeley.edu Wed Aug 19 08:56:59 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Wed, 19 Aug 2009 01:56:59 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion In-Reply-To: <4A89B411.4090501@berkeley.edu> References: <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> <4A78696E.8010808@berkeley.edu> <20090804222731.GA12604@sobchak.mgh.harvard.edu> <4A8081B3.2080600@berkeley.edu> <20090811131019.GW12604@sobchak.mgh.harvard.edu> <4A89B411.4090501@berkeley.edu> Message-ID: <4A8BBE5B.10705@berkeley.edu> OK, I nailed the bug, which was stemming from HTML links inside GBIF XML results which in some situations were screwing up parsing etc. So I've updated the tutorial to add the chunk about downloading an arbitrarily large number of records, in user-specified increments, with an appropriate time-delay between server requests. Also added a chunk on classifying records into user-specified geographic areas based on their latitude/longitude. Also updated the test scripts and test results files, and deleted some remaining loose/unnecessary files. Updated tutorial: http://biopython.org/wiki/BioGeography#Tutorial Github commits: http://github.com/nmatzke/biopython/commits/Geography I think I've reached a good stopping point for the moment, I welcome comments on the tutorial and/or on the prospects for turning this into an official biopython module, etc. Thanks again, and cheers! Nick Nick Matzke wrote: > Pencils down update: I have uploaded the relevant test scripts and data > files to git, and deleted old loose files. > http://github.com/nmatzke/biopython/commits/Geography > > Here is a simple draft tutorial: > http://biopython.org/wiki/BioGeography#Tutorial > > Strangely, while working on the tutorial I discovered that I did > something somewhere in the last revision that is messing up the parsing > of automatically downloaded records from GBIF, I am tracking this down > currently and will upload as soon as I find it. 
> > I would like to thank everyone for the opportunity to participate in > GSoC, and to thank everyone for their help. For me, this summer turned > into more of a "growing from a scripter to a programmer" summer than I > expected initially. As a result I spent a more time refactoring and > retracing my steps than I figured. However I think the resulting main > product, a GBIF interface and associated tools, is much better than it > would have been without the advice & encouragement of Brad, Hilmar, etc. > I will be using this for my own research and will continue developing it. > > Cheers! > Nick > > > Brad Chapman wrote: >> Hi Nick; >> >>> Summary: Major focus is getting the GBIF access/search/parse module >>> into "done"/submittable shape. This primarily requires getting the >>> documentation and testing up to biopython specs. I have a fair bit >>> of documentation and testing, need advice (see below) for specifics >>> on what it should look like. >> >> Awesome. Thanks for working on the cleanup for this. >> >>> OK, I will do this. Should I try and figure out the unittest stuff? >>> I could use a simple example of what this is supposed to look like. >> >> In addition to Peter's pointers, here is a simple example from a >> small thing I wrote: >> >> http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py >> >> You can copy/paste the unit test part to get a base, and then >> replace the t_* functions with your own real tests. >> >> Simple scripts that generate consistent output are also fine; that's >> the print and compare approach. >> >>>> - What is happening with the Nodes_v2 and Treesv2 files? They look >>>> like duplicates of the Nexus Nodes and Trees with some changes. >>>> Could we roll those changes into the main Nexus code to avoid >>>> duplication? >>> Yeah, these were just copies with your bug fix, and with a few mods I >>> used to track crashes. Presumably I don't need these with after a >>> fresh download of biopython. >> >> Cool. It would be great if we could weed these out as well. >> >>> The API is really just the interface with GBIF. I think developing a >>> cookbook entry is pretty easy, I assume you want something like one >>> of the entries in the official biopython cookbook? >> >> Yes, that would work great. What I was thinking of are some examples >> where you provide background and motivation: Describe some useful >> information you want to get from GBIF, and then show how to do it. >> This is definitely the most useful part as it gives people working >> examples to start with. From there they can usually browse the lower >> level docs or code to figure out other specific things. >> >>> Re: API documentation...are you just talking about the function >>> descriptions that are typically in """ """ strings beneath the >>> function definitions? I've got that done. Again, if there is more, >>> an example of what it should look like would be useful. >> >> That looks great for API level docs. You are right on here; for this >> week I'd focus on the cookbook examples and cleanup stuff. >> >> My other suggestion would be to rename these to follow Biopython >> conventions, something like: >> >> gbif_xml -> GbifXml >> shpUtils -> ShapefileUtils >> geogUtils -> GeographyUtils >> dbfUtils -> DbfUtils >> >> The *Utils might have underscores if they are not intended to be >> called directly. >> >> Thanks for all your hard work, >> Brad >> > -- ==================================================== Nicholas J. Matzke Ph.D. 
Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From bugzilla-daemon at portal.open-bio.org Wed Aug 19 09:29:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 Aug 2009 05:29:36 -0400 Subject: [Biopython-dev] [Bug 2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py In-Reply-To: Message-ID: <200908190929.n7J9TaR0006301@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2619 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-19 05:29 EST ------- (In reply to comment #5) > Just to note the Ubuntu/Debian packages for Biopython list flex as a build > dependency, and patch our setup.py file to re-enable the Bio.PDB.mmCIF.MMCIFlex > extension. This is a neat solution until we can update our setup.py to detect > flex on its own. > Alex Lancaster has kindly done the same for the latest Fedora RPM package (Biopython 1.51). See https://admin.fedoraproject.org/community/?package=python-biopython#package_maintenance -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bartek at rezolwenta.eu.org Wed Aug 19 09:45:20 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Wed, 19 Aug 2009 11:45:20 +0200 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> Message-ID: <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> Hi guys, On Tue, Aug 18, 2009 at 6:39 PM, Peter wrote: > On Tue, Aug 18, 2009 at 5:09 PM, Hilmar Lapp wrote: >> >> On Aug 17, 2009, at 11:52 AM, Peter wrote: >> >>> My impression from talking to OBF guys is if we really want to we >>> can do this, but it require us (Biopython) to take care of installing >>> and running git on an OBF machine. >> >> That's how I would put it too. 
Moreover, if you as people who want this and >> know more about it already than anyone else among root-l can't be bothered >> to take the initiative to spearhead this on OBF servers, the argument that >> OBF "sysadmins" (which in essence is all of us who know how to do this) >> should do the work is a lot less strong than it might have to be. I.e., if >> you don't feel this would be time well invested for you, it is probably even >> less well invested for other OBFers. > > Sure. Right now I don't think anyone at Biopython knows exactly > what would be involved in running a gitserver, and it would take > some investment of time to get to that point. > I think there is some grave misunderstanding here. There is nothing magical or difficult in installing git on OBF servers. It's just a package. There is no effort to be spearheaded by anyone. The command "yum install git" needs to be run by someone with root privileges. That's it. It's absolutely enough to allow people with obf developer accounts to use git for development. As for running a git-protocol-server, this is a bit more complicated and can be done in many more ways than with CVS. I don't think that anyone is expecting OBF to provide git repository hosting in a standardized way (currently only BioRuby uses git and they seem to be fine with github, similar for biopython) The importance of having git installed on OBF machines comes from the fact that it can be useful for many things even if we don't host the repository on OBF servers. Most importantly, for doing regular backups of git branch from github to OBF servers we need a machine with git installed. Currently it's my work machine, but I think it would be a much better setup if we could do it directly from an OBF machine. > In the long term I think running git on an OBF machine would be a > good idea, but I don't personally want to spend time learning how to > do that right now. By using github, we don't have to invest a lot of > upfront effort in configuring a git server right away. > > I think it makes sense to just move Biopython to github in the short term, > in the medium term we can (expertise permitting) get a git mirror running > on an OBF machine, and then other tools like the git equivalent of > ViewCVS (and if need be then abandon github - we won't be locked > into anything permanent). > I don't quite understand what do you mean by "running git". Once we have git installed, you can use push and pull over ssh to a branch sitting on OBF machine. We can also make the mirror available for people (read-only) through http (just place the repo in a directory published with apache, no extra software required), But I don't think this makes much sense if we actually want to use collaborative features of github. In my opinion this would only bring confusion: either we make the github branch official or not. The most difficult part is the "viewCVS" replacement. There is the gitweb.cgi script, which is (in my opinion) inferior to github interface. Installing it wouldn't be difficult (it's CGI) so we could do it, but is it better than github here? I'm not sure.. (you can see how it would look on a slightly out-of-date biopython branch on my machine: http://83.243.39.60/cgi-bin/gitweb.cgi?p=biopython.git;a=summary ) To summarize, I think that the only thing we really need from OBF is to have git installed (Hilmar, can you help with this? I tried to even compile it on dev.open-bio.org but there it depends on multiple libraries and I gave up...) 
best regards Bartek From biopython at maubp.freeserve.co.uk Wed Aug 19 09:58:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 19 Aug 2009 10:58:12 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> Message-ID: <320fb6e00908190258u2262a3b9s256dff5db38ddd41@mail.gmail.com> On Wed, Aug 19, 2009 at 10:45 AM, Bartek Wilczynski wrote: >> Sure. Right now I don't think anyone at Biopython knows exactly >> what would be involved in running a gitserver, and it would take >> some investment of time to get to that point. > > I think there is some grave misunderstanding here. You have certainly clarified a few things for me ;) > There is nothing magical or difficult in installing git on OBF > servers. It's ?just a package. There is no effort to be spearheaded > by anyone. The command "yum install git" needs to be run by > someone with root privileges. That's it. It's absolutely enough > to allow people with obf developer accounts to use git for > development. Oh. That is less complicated than I realised - assuming all the existing dev accounts have SSH access. > As for running a git-protocol-server, this is a bit more complicated > and can be done in many more ways than with CVS. I don't think > that anyone is expecting OBF to provide git repository hosting in > a standardized way (currently only BioRuby uses git and they > seem to be fine with github, similar for biopython) > > The importance of having git installed on OBF machines comes > from the fact that it can be useful for many things even if we don't > host the repository on OBF servers. I had been assuming we would also need the git-protocol-server, and to mess about with the firewall and perhaps webserver, but if I understand you correctly even *just* the core git tool running on the OBF would be useful (even if just for backups). So let's try and do that... > ... > To summarize, I think that the only thing we really need from OBF is > to have git installed Any of the OBF server admins should be able to install the git *package* for us (this should be trivial as long as the Linux OS is fairly up to date). We should probably ask via a support request on on the root-l mailing list... let's just give Hilmar a chance to reply first. Peter From hlapp at gmx.net Wed Aug 19 22:17:20 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 19 Aug 2009 18:17:20 -0400 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> Message-ID: <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> On Aug 19, 2009, at 5:45 AM, Bartek Wilczynski wrote: > To summarize, I think that the only thing we really need from OBF is > to have git installed > (Hilmar, can you help with this? 
I tried to even compile it on > dev.open-bio.org but there it depends on multiple libraries and I > gave up...) Post to root-l (copied here, for convenience) and ask if someone can set you up with the necessary privileges, assuming that you are volunteering to do the installation? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Thu Aug 20 10:01:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 11:01:54 +0100 Subject: [Biopython-dev] Deprecating Bio.AlignAce and Bio.MEME? Message-ID: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com> Hi Bartek, With the introduction of Bio.Motif, we declared Bio.AlignAce and Bio.MEME as obsolete as of release 1.50 in the DEPRECATED file. I note we didn't update the module docstrings themselves to make this more prominent. Do you think we can officially deprecate Bio.AlignAce and Bio.MEME for the next release (i.e. put this in their docstrings and issue deprecation warnings)? Peter From barwil at gmail.com Thu Aug 20 10:10:23 2009 From: barwil at gmail.com (Bartek Wilczynski) Date: Thu, 20 Aug 2009 12:10:23 +0200 Subject: [Biopython-dev] Deprecating Bio.AlignAce and Bio.MEME? In-Reply-To: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com> References: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com> Message-ID: <8b34ec180908200310ue02430fsbd18116f3389bf89@mail.gmail.com> On Thu, Aug 20, 2009 at 12:01 PM, Peter wrote: > Hi Bartek, > > With the introduction of Bio.Motif, we declared Bio.AlignAce and > Bio.MEME as obsolete as of release 1.50 in the DEPRECATED file. I note > we didn't update the module docstrings themselves to make this more > prominent. > > Do you think we can officially deprecate Bio.AlignAce and Bio.MEME for > the next release (i.e. put this in their docstrings and issue > deprecation warnings)? I think so. Should I change something in the docstrings? Bartek From biopython at maubp.freeserve.co.uk Thu Aug 20 10:20:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 11:20:30 +0100 Subject: [Biopython-dev] Deprecating Bio.AlignAce and Bio.MEME? In-Reply-To: <8b34ec180908200310ue02430fsbd18116f3389bf89@mail.gmail.com> References: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com> <8b34ec180908200310ue02430fsbd18116f3389bf89@mail.gmail.com> Message-ID: <320fb6e00908200320k6d8902e0r4742a92a5956b1ed@mail.gmail.com> On Thu, Aug 20, 2009 at 11:10 AM, Bartek Wilczynski wrote: >> Do you think we can officially deprecate Bio.AlignAce and Bio.MEME for >> the next release (i.e. put this in their docstrings and issue >> deprecation warnings)? > > I think so. ?Should I change something in the docstrings? > The start of the module docstring should be a one line description of the module - just include "(DEPRECATED)" at the end. 
Then it will show up nicely in the API docs: http://biopython.org/DIST/docs/api/ If you look at that page you should be able to see entries like this: * Bio.Fasta: Utilities for working with FASTA-formatted sequences (DEPRECATED) * Bio.FilteredReader: Code for more fancy file handles (OBSOLETE) Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 11:28:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 12:28:46 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO Message-ID: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> Hi all, You may recall a thread back in June with Cedar Mckay (cc'd - not sure if he follows the dev list or not) about indexing large sequence files - specifically FASTA files but any sequential file format. I posted some rough code which did this building on Bio.SeqIO: http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html I have since generalised this, and have something which I think would be ready for merging into the main trunk for wider testing. The code is on github on my "index" branch at the moment, http://github.com/peterjc/biopython/commits/index This would add a new function to Bio.SeqIO, provisionally called indexed_dict, which takes two arguments: filename and format name (e.g. "fasta"), plus an optional alphabet. This will return a dictionary like object, using SeqRecord identifiers as keys, and SeqRecord objects as values. There is (deliberately) no way to allow the user to choose a different keying mechanism (although I can see how to do this at a severe performance cost). As with the Bio.SeqIO.convert() function, the new addition of Bio.SeqIO.indexed_dict() will be the only public API change. Everything else is deliberately private, allowing us the freedom to change details if required in future. The essential idea is the same my post in June. Nothing about the existing SeqIO framework is changed (so this won't break anything). For each file we scan though it looking for new records, not the file offset, and extract the identifier string. These are stored in a normal (private) Python dictionary. On requesting a record, we seek to the appropriate offset and parse the data into a SeqRecord. For simple file formats we can do this by calling Bio.SeqIO.parse(). For complex file formats (such as SFF files, or anything else with important information in a header), the implementation is a little more complicated - but we can provide the same API to the user. Note that the indexing step does not fully parse the file, and thus may ignore corrupt/invalid records. Only when (if) they are accessed will this trigger a parser error. This is a shame, but means the indexing can (in general) be done very fast. I am proposing to merge all of this (except the SFF file support), but would welcome feedback (even after a merger). I already have basic unit tests, covering the following SeqIO file formats: "ace", "embl", "fasta", "fastq" (all three variants), "genbank"/"gb", "ig", "phd", "pir", and "swiss" (plus "sff" but I don't think that parser is ready to be checked in yet). 
An example using the new code, this takes just a few seconds to index this 238MB GenBank file, and record access is almost instant: >>> from Bio import SeqIO >>> gb_dict = SeqIO.indexed_dict("gbpln1.seq", "gb") >>> len(gb_dict) 59918 >>> gb_dict.keys()[:5] ['AB246540.1', 'AB038764.1', 'AB197776.1', 'AB036027.1', 'AB161026.1'] >>> record = gb_dict["AB433451.1"] >>> print record.id, len(record), len(record.features) AB433451.1 590 2 And using a 1.3GB FASTQ file, indexing is about a minute, and again, record access is almost instant: >>> from Bio import SeqIO >>> fq_dict = SeqIO.indexed_dict("SRR001666_1.fastq", "fastq") >>> len(fq_dict) 7047668 >>> fq_dict.keys()[:4] ['SRR001666.2320093', 'SRR001666.2320092', 'SRR001666.1250635', 'SRR001666.2354360'] >>> record = fq_dict["SRR001666.2765432"] >>> print record.id, record.seq SRR001666.2765432 CTGGCGGCGGTGCTGGAAGGACTGACCCGCGGCATC Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 12:24:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 13:24:34 +0100 Subject: [Biopython-dev] Deprecating Bio.AlignAce and Bio.MEME? In-Reply-To: <8b34ec180908200450r15823d18q87a8cbfccbdc9b13@mail.gmail.com> References: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com> <8b34ec180908200310ue02430fsbd18116f3389bf89@mail.gmail.com> <320fb6e00908200320k6d8902e0r4742a92a5956b1ed@mail.gmail.com> <8b34ec180908200450r15823d18q87a8cbfccbdc9b13@mail.gmail.com> Message-ID: <320fb6e00908200524k126ca330n86c3e8516777113c@mail.gmail.com> On Thu, Aug 20, 2009 at 12:50 PM, Bartek Wilczynski wrote: > > On Thu, Aug 20, 2009 at 12:20 PM, Peter wrote: > >> The start of the module docstring should be a one line description of >> the module - just include "(DEPRECATED)" at the end. Then it will >> show up nicely in the API docs: http://biopython.org/DIST/docs/api/ > > Done. Should be in CVS now. Sorry I was unclear - I was only talking about the docstrings. In addition we need to actually issue a deprecation warning (via the warnings module), and update the DEPRECATED file in the root folder. I've done this in CVS - sorry for any confusion. I've also tried to clarify the procedure on the wiki, http://biopython.org/wiki/Deprecation_policy If you can add a couple of examples to the AlignAce and MEME module docstrings showing a short example using the deprecated module, and the equivalent using Bio.Motif, that would be great. Thanks, Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 12:43:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 13:43:07 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> Message-ID: <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> On Thu, Aug 20, 2009 at 10:14 AM, Bartek Wilczynski wrote: > > Hi all, > > As the biopython project is moving now its development from CVS to > git, it would be very helpful for us if git software was installed on > dev.open-bio.org machine. 
> > The most convenient for us would be if someone with root privileges on > this machine would install the package (it's in the centos > repository). I can also do the installation myself, as suggested by > Hilmar (assuming I get the permissions required for package > installation, account=bartek). Bartek - do you think we need git on any of the other OBF machines in addition to dev.open-bio.org (current IP 207.154.17.71)? However, I'd like to have http://biopython.org/SRC/biopython kept up to date (also available via www.biopython.org and biopython.open-bio.org - these are all the same machine, IP 207.154.17.70). It might be easiest to do that with git installed on that machine too - or do you think it would be simpler to push the latest files from dev.open-bio.org instead? There is also the public CVS server, cvs.biopython.org aka cvs.open-bio.org (IP 207.154.17.75) but I doubt we will need to worry about that one in future. Peter From bartek at rezolwenta.eu.org Thu Aug 20 13:06:53 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 20 Aug 2009 15:06:53 +0200 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> Message-ID: <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> On Thu, Aug 20, 2009 at 2:43 PM, Peter wrote: > Bartek - do you think we need git on any of the other OBF machines > in addition to dev.open-bio.org (current IP 207.154.17.71)? What we _need_ is a single machine, where we can run scripts from cron and where git is installed. That's why I requested the installation on dev.open-bio machine (it happens to be the only one I have an account on). The idea is to run something from cron and pull from github to have a backup copy of an up-to-date branch. The scripts can (after each update) push to other machines. > > However, I'd like to have http://biopython.org/SRC/biopython > kept up to date (also available via www.biopython.org and > biopython.open-bio.org - these are all the same machine, > IP 207.154.17.70). It might be easiest to do that with git > installed on that machine too - or do you think it would be > simpler to push the latest files from dev.open-bio.org instead? There is no need for git on the www-server machine if we only want to publish the code, or a read-only git branch over http for download. I think it's easier to have a single place where cron jobs are run. However, If we wanted to hook the scripts to github notifications rather than to cron, then we need some way to trigger scripts by a hit to a webpage, in which case it _might_ be easier to set things up on the machine with a web server. But I think we should be fine with the machinery running on the dev. machine. There is one remaining issue: We would need to have some directory where the branch would be kept. Currently it sits in my home directory whic probably should be changed to something like /home/biopython/git_branch. 
I am in biopython group, but currently /home/biopython does not even allow me to see /home/biopython, not to mention writing into it. I think it would be the best to set the scripts to run as biopython user. > > There is also the public CVS server, cvs.biopython.org aka > cvs.open-bio.org (IP 207.154.17.75) but I doubt we will need > to worry about that one in future. Certainly. I don't think we need to worry about this one. Bartek From biopython at maubp.freeserve.co.uk Thu Aug 20 13:24:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 14:24:15 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> Message-ID: <320fb6e00908200624i43b650e5q8478b0b5c12af67b@mail.gmail.com> On Thu, Aug 20, 2009 at 2:06 PM, Bartek Wilczynski wrote: > On Thu, Aug 20, 2009 at 2:43 PM, Peter wrote: > >> Bartek - do you think we need git on any of the other OBF machines >> in addition to dev.open-bio.org (current IP 207.154.17.71)? > > What we _need_ is a single machine, where we can run scripts from cron > and where git is installed. That's why I requested the installation on > dev.open-bio machine (it happens to be the only one I have an account > on). The idea is to run something from cron and pull from github to > have a backup copy of an up-to-date branch. The scripts can (after > each update) push to other machines. > > ... > > There is no need for git on the www-server machine if we only want to > publish the code, or a read-only git branch over http for download. I > think it's easier to have a single place where cron jobs are run. So just push a dump of the latest code to http://biopython.org/SRC/biopython or push fresh epydoc api docs to http://biopython.org/DIST/docs/api-live/ or whatever from dev.open-bio.org. That sounds fine to me. > There is one remaining issue: We would need to have some directory > where the branch would be kept. Currently it sits in my home directory > which probably should be changed to something like > /home/biopython/git_branch. I am in biopython group, but currently > /home/biopython does not even allow me to see /home/biopython, not to > mention writing into it. I think it would be the best to set the > scripts to run as biopython user. Yes - we'll need some OBF admin input there... 
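For what it's worth, the cron-driven backup being discussed could be as small as a script along these lines (a rough sketch only - the mirror location and repository URL are assumptions for illustration, not the agreed setup):

import os
import subprocess

# Assumed paths/URL, purely for illustration.
MIRROR = "/home/biopython/git_branch/biopython.git"
UPSTREAM = "git://github.com/biopython/biopython.git"

if not os.path.isdir(MIRROR):
    # First run: create a bare mirror clone of the github repository.
    subprocess.check_call(["git", "clone", "--mirror", UPSTREAM, MIRROR])
else:
    # Later runs (e.g. nightly from cron): fetch any new commits and branches.
    subprocess.check_call(["git", "--git-dir=" + MIRROR, "remote", "update"])

Run from cron as the biopython user, something like this would keep an up-to-date bare copy on the OBF machine even when github is unreachable.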
Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 13:28:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 14:28:22 +0100 Subject: [Biopython-dev] [Root-l] Moving from CVS to git In-Reply-To: References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> Message-ID: <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com> On Thu, Aug 20, 2009 at 2:15 PM, Chris Fields wrote: > > I would be interested in that as well. > > It appears dev.open-bio.org has apt (there is an /etc/apt directory), but > I'm failing to find apt-get in my PATH. ?Haven't installed on it yet, but a > packaged version would probably be easier. If we can have a packaged version of git on dev.open-bio.org from the Linux distro, that would be easiest (especially for keeping it up to date). > Also, are we planning ro mirrors on portal for anon access, or should we > (ab)use github for that purpose? ?To me a ro mirror sorta defeats the > purpose of git... For Biopython we plan to use github (initially at least) for committing changes. This will also allow anonymous access. A public OBF read only mirror of a git repository is still useful for people to clone from, and keep the local copy up to date - plus as a backup for if/when github is congested or unavailable. But not essential. Peter From dag at sonsorol.org Thu Aug 20 13:41:55 2009 From: dag at sonsorol.org (Chris Dagdigian) Date: Thu, 20 Aug 2009 09:41:55 -0400 Subject: [Biopython-dev] [Root-l] Moving from CVS to git In-Reply-To: <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com> Message-ID: Git is now installed via 'yum' on dev.open-bio.org Regards, Chris On Aug 20, 2009, at 9:28 AM, Peter wrote: > On Thu, Aug 20, 2009 at 2:15 PM, Chris Fields > wrote: >> >> I would be interested in that as well. >> >> It appears dev.open-bio.org has apt (there is an /etc/apt >> directory), but >> I'm failing to find apt-get in my PATH. Haven't installed on it >> yet, but a >> packaged version would probably be easier. > > If we can have a packaged version of git on dev.open-bio.org from the > Linux distro, that would be easiest (especially for keeping it up to > date). > >> Also, are we planning ro mirrors on portal for anon access, or >> should we >> (ab)use github for that purpose? To me a ro mirror sorta defeats the >> purpose of git... > > For Biopython we plan to use github (initially at least) for > committing > changes. This will also allow anonymous access. > > A public OBF read only mirror of a git repository is still useful > for people to > clone from, and keep the local copy up to date - plus as a backup for > if/when github is congested or unavailable. But not essential. 
> > Peter > > _______________________________________________ > Root-l mailing list > Root-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/root-l From mjldehoon at yahoo.com Thu Aug 20 13:58:03 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 20 Aug 2009 06:58:03 -0700 (PDT) Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> Message-ID: <818036.61284.qm@web62408.mail.re1.yahoo.com> I just have two suggestions: Since indexed_dict returns a dictionary-like object, it may make sense for the _IndexedSeqFileDict to inherit from a dict. Another issue is whether we can fold indexed_dict and to_dict into one. Right now we have def to_dict(sequences, key_function=None) : def indexed_dict(filename, format, alphabet=None) : What if we have a single function "dictionary" that can take sequences, a handle, or a filename, and optionally the format, alphabet, key_function, and a parameter "indexed" that indicates if the file should be indexed or kept into memory? Or something like that. Otherwise, the code looks really nice. Thanks! --Michiel --- On Thu, 8/20/09, Peter wrote: > From: Peter > Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO > To: "Biopython-Dev Mailing List" > Cc: "Cedar McKay" > Date: Thursday, August 20, 2009, 7:28 AM > Hi all, > > You may recall a thread back in June with Cedar Mckay (cc'd > - not > sure if he follows the dev list or not) about indexing > large sequence > files - specifically FASTA files but any sequential file > format. I posted > some rough code which did this building on Bio.SeqIO: > http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html > > I have since generalised this, and have something which I > think > would be ready for merging into the main trunk for wider > testing. > The code is on github on my "index" branch at the moment, > http://github.com/peterjc/biopython/commits/index > > This would add a new function to Bio.SeqIO, provisionally > called > indexed_dict, which takes two arguments: filename and > format > name (e.g. "fasta"), plus an optional alphabet. This will > return a > dictionary like object, using SeqRecord identifiers as > keys, and > SeqRecord objects as values. There is (deliberately) no way > to > allow the user to choose a different keying mechanism > (although > I can see how to do this at a severe performance cost). > > As with the Bio.SeqIO.convert() function, the new addition > of > Bio.SeqIO.indexed_dict() will be the only public API > change. > Everything else is deliberately private, allowing us the > freedom > to change details if required in future. > > The essential idea is the same my post in June. Nothing > about > the existing SeqIO framework is changed (so this won't > break > anything). For each file we scan though it looking for new > records, > not the file offset, and extract the identifier string. > These are stored > in a normal (private) Python dictionary. On requesting a > record, we > seek to the appropriate offset and parse the data into a > SeqRecord. > For simple file formats we can do this by calling > Bio.SeqIO.parse(). > > For complex file formats (such as SFF files, or anything > else with > important information in a header), the implementation is a > little > more complicated - but we can provide the same API to the > user. > > Note that the indexing step does not fully parse the file, > and > thus may ignore corrupt/invalid records. 
Only when (if) > they are > accessed will this trigger a parser error. This is a shame, > but > means the indexing can (in general) be done very fast. > > I am proposing to merge all of this (except the SFF file > support), > but would welcome feedback (even after a merger). I > already > have basic unit tests, covering the following SeqIO file > formats: > "ace", "embl", "fasta", "fastq" (all three variants), > "genbank"/"gb", > "ig", "phd", "pir", and "swiss" (plus "sff" but I don't > think that > parser is ready to be checked in yet). > > An example using the new code, this takes just a few > seconds > to index this 238MB GenBank file, and record access is > almost > instant: > > >>> from Bio import SeqIO > >>> gb_dict = SeqIO.indexed_dict("gbpln1.seq", > "gb") > >>> len(gb_dict) > 59918 > >>> gb_dict.keys()[:5] > ['AB246540.1', 'AB038764.1', 'AB197776.1', 'AB036027.1', > 'AB161026.1'] > >>> record = gb_dict["AB433451.1"] > >>> print record.id, len(record), > len(record.features) > AB433451.1 590 2 > > And using a 1.3GB FASTQ file, indexing is about a minute, > and > again, record access is almost instant: > > >>> from Bio import SeqIO > >>> fq_dict = > SeqIO.indexed_dict("SRR001666_1.fastq", "fastq") > >>> len(fq_dict) > 7047668 > >>> fq_dict.keys()[:4] > ['SRR001666.2320093', 'SRR001666.2320092', > 'SRR001666.1250635', > 'SRR001666.2354360'] > >>> record = fq_dict["SRR001666.2765432"] > >>> print record.id, record.seq > SRR001666.2765432 CTGGCGGCGGTGCTGGAAGGACTGACCCGCGGCATC > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Thu Aug 20 14:13:00 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 15:13:00 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <818036.61284.qm@web62408.mail.re1.yahoo.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <818036.61284.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e00908200713q3fddd010x8f355260bb98b063@mail.gmail.com> On Thu, Aug 20, 2009 at 2:58 PM, Michiel de Hoon wrote: > > I just have two suggestions: > > Since indexed_dict returns a dictionary-like object, it may make sense > for the _IndexedSeqFileDict to inherit from a dict. We'd have to override things like values() to prevent explosions in memory, and just give a not implemented exception. But yes, good point. > Another issue is whether we can fold indexed_dict and to_dict into one. > Right now we have > > def to_dict(sequences, key_function=None) : > > def indexed_dict(filename, format, alphabet=None) : > > What if we have a single function "dictionary" that can take sequences, a > handle, or a filename, and optionally the format, alphabet, key_function, > and a parameter "indexed" that indicates if the file should be indexed or > kept into memory? Or something like that. I wondered about this, but there are a couple of important differences between my file indexer, and the existing to_dict function. For the Bio.SeqIO.to_dict() function, the optional key_function argument maps a SeqRecord to the desired index (by default the record's id is used). Supporting a key_function for indexing files in the same way would mean every single record in the file must be parsed into a SeqRecord while building the index. 
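For comparison, this is roughly how the existing in-memory approach is used with a key_function (a sketch based only on the to_dict() signature quoted above; the file name and key rule are invented for illustration):

from Bio import SeqIO

# Hypothetical input file; every record is parsed into a SeqRecord up front.
handle = open("example.fasta")
records = SeqIO.parse(handle, "fasta")
# With full SeqRecords in hand, any key_function is cheap to apply:
by_accession = SeqIO.to_dict(records,
                             key_function=lambda rec: rec.id.split("|")[0])
handle.close()
print len(by_accession)

Offering the same flexibility for the on-disk index is the question here.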
This is possible, but would really really slow things down - and while I considered it, I don't like this idea at all. Instead each format indexer has essentially got a "mini parser" which just extracts the id string, so things are much much faster. Also, the to_dict function can be used on any sequences - not just from a file. They could be a list of SeqRecords, or a generator expression filtering output from Bio.SeqIO.parse(). Anything at all really. Finally I had better explain my thoughts on indexing and handles versus filenames. For the SeqIO (and AlignIO etc) parsers, and handle which supports the basic read/readline/iteration functionality can be used. For the indexed_dict() function as written, we need to keep the handle open for as long as the dictionary is kept in memory. We also must have a handle which supports seek and tell (e.g. not a urllib handle, or compressed files). Finally, the mode the file was opened in can be important (e.g. for SFF files universal read lines mode must not be used). So while indexed_dict could take a file handle (instead of a filename) there are a lot of provisos. I felt just taking a filename was the simplest solution here. > Otherwise, the code looks really nice. Thanks! Great - thanks for your comments. Peter From bartek at rezolwenta.eu.org Thu Aug 20 14:19:50 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 20 Aug 2009 16:19:50 +0200 Subject: [Biopython-dev] [Root-l] Moving from CVS to git In-Reply-To: References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com> Message-ID: <8b34ec180908200719k74ebe1ccqa9cdf61684963997@mail.gmail.com> On Thu, Aug 20, 2009 at 3:41 PM, Chris Dagdigian wrote: > > Git is now installed via 'yum' on dev.open-bio.org > Wonderful, thanks a lot. Bartek From biopython at maubp.freeserve.co.uk Thu Aug 20 14:42:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 15:42:35 +0100 Subject: [Biopython-dev] [Root-l] Moving from CVS to git In-Reply-To: <28E30EC3-5E07-4AB1-A60A-155A5D179223@illinois.edu> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com> <28E30EC3-5E07-4AB1-A60A-155A5D179223@illinois.edu> Message-ID: <320fb6e00908200742u2e0fbc16v3f17f1e00b13f634@mail.gmail.com> On Thu, Aug 20, 2009 at 3:24 PM, Chris Fields wrote: > > Thanks Chris D! ?Not sure, but can we view repos on dev similar to portal > (via gitweb or similar)? ?Or should we mirror these over to portal for that > purpose? > > chris Again, this falls into the nice to have in the medium/long term, but not essential in the short term (for Biopython to move from CVS to git). We can manage with the github web interface for history etc. 
Peter From dag at sonsorol.org Thu Aug 20 15:30:31 2009 From: dag at sonsorol.org (Chris Dagdigian) Date: Thu, 20 Aug 2009 11:30:31 -0400 Subject: [Biopython-dev] [Root-l] Moving from CVS to git In-Reply-To: <28E30EC3-5E07-4AB1-A60A-155A5D179223@illinois.edu> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com> <28E30EC3-5E07-4AB1-A60A-155A5D179223@illinois.edu> Message-ID: <7A8D8D5E-4C39-4713-B92F-3A384374DCAC@sonsorol.org> Sure, just need informed advice on the 'best' packages to install and possibly some install help if I get stuck somewhere. -Chris On Aug 20, 2009, at 10:24 AM, Chris Fields wrote: > Thanks Chris D! Not sure, but can we view repos on dev similar to > portal (via gitweb or similar)? Or should we mirror these over to > portal for that purpose? > > chris > > On Aug 20, 2009, at 8:41 AM, Chris Dagdigian wrote: > >> >> Git is now installed via 'yum' on dev.open-bio.org >> >> Regards, >> Chris >> >> >> On Aug 20, 2009, at 9:28 AM, Peter wrote: >> >>> On Thu, Aug 20, 2009 at 2:15 PM, Chris >>> Fields wrote: >>>> >>>> I would be interested in that as well. >>>> >>>> It appears dev.open-bio.org has apt (there is an /etc/apt >>>> directory), but >>>> I'm failing to find apt-get in my PATH. Haven't installed on it >>>> yet, but a >>>> packaged version would probably be easier. >>> >>> If we can have a packaged version of git on dev.open-bio.org from >>> the >>> Linux distro, that would be easiest (especially for keeping it up >>> to date). >>> >>>> Also, are we planning ro mirrors on portal for anon access, or >>>> should we >>>> (ab)use github for that purpose? To me a ro mirror sorta defeats >>>> the >>>> purpose of git... >>> >>> For Biopython we plan to use github (initially at least) for >>> committing >>> changes. This will also allow anonymous access. >>> >>> A public OBF read only mirror of a git repository is still useful >>> for people to >>> clone from, and keep the local copy up to date - plus as a backup >>> for >>> if/when github is congested or unavailable. But not essential. >>> >>> Peter >>> >>> _______________________________________________ >>> Root-l mailing list >>> Root-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/root-l >> From biopython at maubp.freeserve.co.uk Thu Aug 20 16:19:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 17:19:19 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00908200713q3fddd010x8f355260bb98b063@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <818036.61284.qm@web62408.mail.re1.yahoo.com> <320fb6e00908200713q3fddd010x8f355260bb98b063@mail.gmail.com> Message-ID: <320fb6e00908200919o6721161bie98951da2e89af9c@mail.gmail.com> Peter wrote: > Michiel wrote: >> >> I just have two suggestions: >> >> Since indexed_dict returns a dictionary-like object, it may make sense >> for the _IndexedSeqFileDict to inherit from a dict. > > We'd have to override things like values() to prevent explosions in memory, > and just give a not implemented exception. But yes, good point. 
Done on github - I also had to override all the writeable dict methods like pop and clear which don't make sense here. The code for the class is now a bit longer, but is certainly more dict-like. I also had to implement __str__ and __repr__ to do something I think is useful and sensible. Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 18:07:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 19:07:38 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00908200919o6721161bie98951da2e89af9c@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <818036.61284.qm@web62408.mail.re1.yahoo.com> <320fb6e00908200713q3fddd010x8f355260bb98b063@mail.gmail.com> <320fb6e00908200919o6721161bie98951da2e89af9c@mail.gmail.com> Message-ID: <320fb6e00908201107u4c09fd7dj1bcc60ceabe0ecf9@mail.gmail.com> On Thu, Aug 20, 2009 at 5:19 PM, Peter wrote: > > Done on github - I also had to override all the writeable dict methods like > pop and clear which don't make sense here. The code for the class is now > a bit longer, but is certainly more dict-like. I also had to implement __str__ > and __repr__ to do something I think is useful and sensible. > I have checked this new indexing functionality into CVS, but the github branch is still there for the SFF file support (parsing and indexing). We can of course still easily tweak the naming or the public side of the API. In the meantime I'll think about updating the tutorial... Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 18:11:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 19:11:16 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> Message-ID: <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com> On Thu, Aug 20, 2009 at 2:06 PM, Bartek Wilczynski wrote: > On Thu, Aug 20, 2009 at 2:43 PM, Peter wrote: > >> Bartek - do you think we need git on any of the other OBF machines >> in addition to dev.open-bio.org (current IP 207.154.17.71)? > > What we _need_ is a single machine, where we can run scripts from cron > and where git is installed. That's why I requested the installation on > dev.open-bio machine (it happens to be the only one I have an account > on). The idea is to run something from cron and pull from github to > have a backup copy of an up-to-date branch. The scripts can (after > each update) push to other machines. Bartek, now that Chris D has kindly installed git on dev.open-bio.org, can you look into backing up our github repository onto dev.open-bio.org? Initially just running a cron job using your own user account should be fine. 
Thanks, Peter From bartek at rezolwenta.eu.org Thu Aug 20 21:07:06 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 20 Aug 2009 23:07:06 +0200 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com> Message-ID: <8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com> On Thu, Aug 20, 2009 at 8:11 PM, Peter wrote: > Bartek, now that Chris D has kindly installed git on dev.open-bio.org, > can you look into backing up our github repository onto dev.open-bio.org? > Initially just running a cron job using your own user account should be fine. I've only quickly tested git, and I was able to pull from github with no problems. I will try porting thew scripts from my machine to dev.open-bio tomorrow. In the meantime, I've checked that biopython account on dev.open-bio machine is assigned to Brad Marshall. I haven't seen him posting to the list lately. Does anyone have the access to this account? cheers Barte -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From bugzilla-daemon at portal.open-bio.org Fri Aug 21 12:26:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 08:26:45 -0400 Subject: [Biopython-dev] [Bug 2867] Bio.PDB.PDBList.update_pdb calls invalid os.cmd In-Reply-To: Message-ID: <200908211226.n7LCQjX7025910@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2867 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 08:26 EST ------- I'm going to assume the attempted fix worked (included with Biopython 1.51 final), and close this bug. Please reopen it if there is still a problem. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 12:52:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 08:52:24 -0400 Subject: [Biopython-dev] [Bug 2544] Bio.GenBank and SeqFeature improvements In-Reply-To: Message-ID: <200908211252.n7LCqOOt026458@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2544 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 08:52 EST ------- (In reply to comment #5) > > I'm leaving this bug open for defining __repr__ for the > Bio.SeqFeature.Reference object ... ONLY. > Done in CVS, marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 13:07:23 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 09:07:23 -0400 Subject: [Biopython-dev] [Bug 2711] GenomeDiagram.py: write() and write_to_string() are inefficient and don't check inputs In-Reply-To: Message-ID: <200908211307.n7LD7NoU026962@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2711 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|REOPENED |RESOLVED Resolution| |FIXED ------- Comment #28 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 09:07 EST ------- (In reply to comment #27) > So the only remaining issue is a unit test involving at least checks for > the presence of renderPM due to versions of reportlab less than 2.2. Added test_GraphicsBitmaps.py to CVS which will make sure we can output a bitmap, and flag renderPM as a missing (optional) dependency if not found. Marking issue as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 13:11:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 09:11:48 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200908211311.n7LDBmUT027199@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #26 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 09:11 EST ------- Marking this old bug as fixed, given the work around. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 13:24:56 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 09:24:56 -0400 Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq + SeqRecord objects / define __contains__ method In-Reply-To: Message-ID: <200908211324.n7LDOuP7027608@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2853 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 09:24 EST ------- (In reply to comment #3) > Patch for Seq object checked in. > > Leaving bug open for possible similar addition to the SeqRecord object. > Done in Bio/SeqRecord.py CVS revision 1.43, marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 13:24:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 09:24:59 -0400 Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string? In-Reply-To: Message-ID: <200908211324.n7LDOxRW027624@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 Bug 2351 depends on bug 2853, which changed state. Bug 2853 Summary: Support the "in" keyword with Seq + SeqRecord objects / define __contains__ method http://bugzilla.open-bio.org/show_bug.cgi?id=2853 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 13:55:58 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 09:55:58 -0400 Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO In-Reply-To: Message-ID: <200908211355.n7LDtwvC028668@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2865 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 09:55 EST ------- I've checked in a slightly revised version of Cymon's patch to allow Bio.SeqIO to write "phd" files. Checking in Tests/test_SeqIO_QualityIO.py; /home/repository/biopython/biopython/Tests/test_SeqIO_QualityIO.py,v <-- test_SeqIO_QualityIO.py new revision: 1.14; previous revision: 1.13 done Checking in Tests/output/test_SeqIO; /home/repository/biopython/biopython/Tests/output/test_SeqIO,v <-- test_SeqIO new revision: 1.51; previous revision: 1.50 done Checking in Bio/SeqIO/__init__.py; /home/repository/biopython/biopython/Bio/SeqIO/__init__.py,v <-- __init__.py new revision: 1.58; previous revision: 1.57 done Checking in Bio/SeqIO/PhdIO.py; /home/repository/biopython/biopython/Bio/SeqIO/PhdIO.py,v <-- PhdIO.py new revision: 1.8; previous revision: 1.7 done Cymon - could you double check this please? 
I made one change regarding the filename/record description, and also you hadn't rounded the Solexa scores to the nearest integer value after they were converted to PHRED scores. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 14:21:42 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 10:21:42 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200908211421.n7LELgY3029289@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 10:21 EST ------- This should be fixed in CVS (pushed to github hourly), although I used a slighlty different style to break up the long test methods. Please reopen this bug if the problem persists. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 14:21:44 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 10:21:44 -0400 Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch In-Reply-To: Message-ID: <200908211421.n7LELi5g029306@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2895 Bug 2895 depends on bug 2891, which changed state. Bug 2891 Summary: Jython test_NCBITextParser fix+patch http://bugzilla.open-bio.org/show_bug.cgi?id=2891 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Aug 21 14:21:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 10:21:47 -0400 Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch In-Reply-To: Message-ID: <200908211421.n7LELll4029321@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2893 Bug 2893 depends on bug 2891, which changed state. Bug 2891 Summary: Jython test_NCBITextParser fix+patch http://bugzilla.open-bio.org/show_bug.cgi?id=2891 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
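To illustrate the Solexa/PHRED rounding point from the Phd writer comment above: Solexa scores convert to non-integer PHRED values, so they need rounding before being written out. A small sketch using Bio.SeqIO.QualityIO; the input score here is arbitrary.

    from Bio.SeqIO import QualityIO

    # Solexa 10 converts to roughly PHRED 10.41, so the value must be
    # rounded to the nearest integer before writing it to the file.
    solexa = 10
    phred = QualityIO.phred_quality_from_solexa(solexa)
    print phred               # about 10.41
    print int(round(phred))   # 10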
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 14:21:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 21 Aug 2009 10:21:50 -0400 Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch In-Reply-To: Message-ID: <200908211421.n7LELobS029336@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2892 Bug 2892 depends on bug 2891, which changed state. Bug 2891 Summary: Jython test_NCBITextParser fix+patch http://bugzilla.open-bio.org/show_bug.cgi?id=2891 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From dmikewilliams at gmail.com Sun Aug 23 17:47:53 2009 From: dmikewilliams at gmail.com (Mike Williams) Date: Sun, 23 Aug 2009 13:47:53 -0400 Subject: [Biopython-dev] how to determine BioPython version number Message-ID: Hi there. About a year ago a message was posted that suggested using Martel.__version__ to determine that BioPython versio number. A couple weeks ago the draft announcment for BioPython 1.51 said that Martel is no longer included. If Martel is no longer included, is there some other way for a program to determine the version number of BioPython that is installed? Tried searching for this, but found nothing relevant. Mike Below are snippets from the two messages referred to above: subject: [Biopython-dev] determining the version Peter biopython at maubp.freeserve.co.uk Wed Sep 24 17:12:24 EDT 2008 > Somewhat related to this, what is the appropriate way to find the version of > BioPython installed within Python? So I'm not the only person to have wondered about this. For now, I can only suggest an ugly workarround: import Martel print Martel.__version__ Since Biopython 1.45, by convention the Martel version has been incremented to match that of Biopython. Of course, in a few releases time we probably won't be including Martel any more. On Thu, Aug 13, 2009 at 6:10 AM, Peter wrote: subject: [Biopython-dev] Draft announcement for Biopython 1.51 ... we no longer include Martel/Mindy, and thus don't have any dependence on mxTextTools. From biopython at maubp.freeserve.co.uk Sun Aug 23 19:58:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 23 Aug 2009 20:58:07 +0100 Subject: [Biopython-dev] how to determine BioPython version number In-Reply-To: References: Message-ID: <320fb6e00908231258w73f9d38fo9b726fa2fb7dcec@mail.gmail.com> On Sun, Aug 23, 2009 at 6:47 PM, Mike Williams wrote: > > Hi there. ?About a year ago a message was posted that suggested using > Martel.__version__ to determine that BioPython versio number. You are looking at at old thread, and missed what happened since, try: import Bio Bio.__version__ I think this deserves a FAQ entry in the next release of the tutorial... The Martel version "trick" was a work around for determining the version which worked for a few moderately old versions of Biopython (prior to us adding Bio.__version__). 
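To illustrate the above, a minimal version check with a fallback for older installs where only the Martel trick worked (the fallback is just a sketch and assumes the bundled Martel package is still importable):

    import Bio
    try:
        print Bio.__version__
    except AttributeError:
        # Older Biopython releases without Bio.__version__; fall back on
        # the old Martel work around described above.
        import Martel
        print Martel.__version__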
Peter From dmikewilliams at gmail.com Sun Aug 23 20:14:51 2009 From: dmikewilliams at gmail.com (Mike Williams) Date: Sun, 23 Aug 2009 16:14:51 -0400 Subject: [Biopython-dev] how to determine BioPython version number In-Reply-To: <320fb6e00908231258w73f9d38fo9b726fa2fb7dcec@mail.gmail.com> References: <320fb6e00908231258w73f9d38fo9b726fa2fb7dcec@mail.gmail.com> Message-ID: On Sun, Aug 23, 2009 at 3:58 PM, Peter wrote: > You are looking at at old thread, and missed what happened since, try: > > import Bio > Bio.__version__ > > I think this deserves a FAQ entry in the next release of the tutorial... > > The Martel version "trick" was a work around for determining the > version which worked for a few moderately old versions of Biopython > (prior to us adding Bio.__version__). > Thanks Peter. I had two problems, looking at an old thread and having an older versions of BioPython, 1.48 and 1.49 on fedora 10 and 11. The method you supplied works fine with the 1.51b version I just got from cvs. Mike From bugzilla-daemon at portal.open-bio.org Mon Aug 24 15:15:44 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Aug 2009 11:15:44 -0400 Subject: [Biopython-dev] [Bug 2904] New: Interface for Novoalign Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2904 Summary: Interface for Novoalign Product: Biopython Version: 1.51 Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: osvaldo.zagordi at bsse.ethz.ch Hi, I wrote an interface for the short sequence alignment program Novoalign (www.novocraft.com). All I did was to modify the interface for Muscle. I might cover some other aligner in the near future. Hope it's useful to someone. Best, Osvaldo -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Aug 24 15:16:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Aug 2009 11:16:48 -0400 Subject: [Biopython-dev] [Bug 2904] Interface for Novoalign In-Reply-To: Message-ID: <200908241516.n7OFGm98032344@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2904 ------- Comment #1 from osvaldo.zagordi at bsse.ethz.ch 2009-08-24 11:16 EST ------- Created an attachment (id=1361) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1361&action=view) Interface to run novoalign (www.novocraft.com) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Aug 24 15:21:07 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Aug 2009 11:21:07 -0400 Subject: [Biopython-dev] [Bug 2905] New: Short read alignment format Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2905 Summary: Short read alignment format Product: Biopython Version: 1.51 Platform: All OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: osvaldo.zagordi at bsse.ethz.ch Hi again, is there any plan to develop some parsers for alignment of short reads? 
There's a lot of formats around, and the most serious proposal for a format I've seen is SAM (http://samtools.sourceforge.net/). I should start writing something to parse this output soon. Any suggestion on where to start from (in order not to depend on some module that will be soon obsolete)? Thanks, Osvaldo -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Aug 25 01:34:02 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Aug 2009 21:34:02 -0400 Subject: [Biopython-dev] [Bug 2907] New: When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2907 Summary: When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' Product: Biopython Version: 1.51b Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: david.wyllie at ndm.ox.ac.uk When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' not 'bp' although the .seq.alphabet is set (correctly, I think) to generic_dna. The background here is that we're annotating some viral genomes computationally (however, the annotation isn't necessary for the problem here, see below) and then writing the output to .gb format. After this we load the file using LaserGene (a commercial sequence editing program) to have a look at it etc. This doesn't work terribly well because of the 'aa' designation in the header line. Apart from this, the export seems ok. I'm using a git download from mid-June 09. here is an example which illustrates this: # load dependencies from Bio import Entrez from Bio import SeqIO from Bio import SeqRecord from Bio.Alphabet import generic_protein, generic_dna # get a sequence from Genbank print "going to recover a sequence from genbank...." ifh = Entrez.efetch(db="nucleotide",id="DQ923122",rettype="gb") # parse the file handle recordlist=[] print "OK, got the records from genbank, parsing ..." for record in SeqIO.parse(ifh, "genbank"): recordlist.append(record) ifh.close() # write it to a file for thisrecord in recordlist: # confirm it's dna assert (type(thisrecord.seq.alphabet)==type(generic_dna)), "We are supposed to be dealing with a DNA sequence, but we aren't, can't continue." # write to gb ofn=thisrecord.id+".gb" print "Writing thisrecord to ",ofn ofh=open(ofn,"w") SeqIO.write([thisrecord], ofh, "gb") ofh.close exit() # top lines of the genbank file reads as follows # #LOCUS DQ923122 34250 aa DNA VRL 01-JAN-1980 #DEFINITION Human adenovirus 52 isolate T03-2244, complete genome. #ACCESSION DQ923122 #VERSION DQ923122.2 GI:124375632 #KEYWORDS #SOURCE Human adenovirus 52 # ORGANISM Human adenovirus 52 # Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Mastadenovirus; # unclassified Human adenoviruses #FEATURES Location/Qualifiers # source 1..34250 # /country="USA" # /isolate="T03-2244" # /mol_type="genomic DNA" # /organism="Human adenovirus 52" # /db_xref="taxon:332179 Thank you for any advice you have to offer. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Aug 25 01:36:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Aug 2009 21:36:48 -0400 Subject: [Biopython-dev] [Bug 2907] When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' In-Reply-To: Message-ID: <200908250136.n7P1amWC017814@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2907 ------- Comment #1 from david.wyllie at ndm.ox.ac.uk 2009-08-24 21:36 EST ------- Created an attachment (id=1362) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1362&action=view) test case, which is the same as that pasted into the message -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Aug 25 01:37:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 24 Aug 2009 21:37:48 -0400 Subject: [Biopython-dev] [Bug 2907] When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' In-Reply-To: Message-ID: <200908250137.n7P1bm9c017839@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2907 ------- Comment #2 from david.wyllie at ndm.ox.ac.uk 2009-08-24 21:37 EST ------- Created an attachment (id=1363) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1363&action=view) example of the genbank file written -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Aug 25 09:40:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 25 Aug 2009 05:40:29 -0400 Subject: [Biopython-dev] [Bug 2907] When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' In-Reply-To: Message-ID: <200908250940.n7P9eT7w001376@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2907 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-25 05:40 EST ------- Hi David, I spotted this (aa/bp mix up in the LOCUS line) after the beta was out, and it should already be fixed in Biopython 1.51 final. Please update and retest, and if there is still a problem please reopen this bug. Thanks! Note that unless I was going to modify the annotation (which the background use case suggests you are), I would save the raw GenBank record from Entrez directly to disk (since parsing it and then writing it back out with SeqIO isn't yet perfect - e.g. the date in the LOCUS line). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
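Following the suggestion in the comment above, saving the raw GenBank record straight to disk (rather than round-tripping it through SeqIO) looks something like this; the email address and output filename are placeholders:

    from Bio import Entrez

    Entrez.email = "your.name@example.com"   # placeholder address
    # Fetch the record used in the bug report and write out the raw
    # GenBank text unchanged, without re-parsing it via SeqIO.
    handle = Entrez.efetch(db="nucleotide", id="DQ923122", rettype="gb")
    out = open("DQ923122.gb", "w")
    out.write(handle.read())
    out.close()
    handle.close()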
From bugzilla-daemon at portal.open-bio.org Tue Aug 25 10:09:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 25 Aug 2009 06:09:50 -0400 Subject: [Biopython-dev] [Bug 2907] When a genomic record has been loaded using eFetch, if it is written to genbank format the header line refers to 'aa' In-Reply-To: Message-ID: <200908251009.n7PA9o4T002461@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2907 ------- Comment #4 from david.wyllie at ndm.ox.ac.uk 2009-08-25 06:09 EST ------- thank you - this is indeed fixed in the latest git version. Best wishes David (In reply to comment #3) > Hi David, > > I spotted this (aa/bp mix up in the LOCUS line) after the beta was out, and it > should already be fixed in Biopython 1.51 final. Please update and retest, and > if there is still a problem please reopen this bug. Thanks! > > Note that unless I was going to modify the annotation (which the background use > case suggests you are), I would save the raw GenBank record from Entrez > directly to disk (since parsing it and then writing it back out with SeqIO > isn't yet perfect - e.g. the date in the LOCUS line). > > Peter > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Aug 25 10:33:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 11:33:56 +0100 Subject: [Biopython-dev] Command line wrappers for assembly tools Message-ID: <320fb6e00908250333m6dc4b8eew475dc309f1e3ddb4@mail.gmail.com> Hi all, Osvaldo Zagordi has recently offered a Bio.Application style command line wrapper for Novoalign (a commercial short read aligner from Novocraft), see enhancement Bug 2904, and the Novocraft website: http://bugzilla.open-bio.org/show_bug.cgi?id=2904 http://www.novocraft.com/products.html Note that Novocraft do offer a trial/evaluation version, but I have no idea what the terms and conditions are, and I personally do not have access to the commercial tool (e.g. for testing the wrapper). Nevertheless, this would be a nice addition to Biopython. I personally would like to have wrappers for some of the "off instrument" applications from Roche 454 (e.g. the Newbler assembler, read mapper and perhaps their SFF tools), which I have been using. These are Linux only (which is a pain as Windows and Mac OS X are out), but Roche seem relatively relaxed about making the software available to any academics using their sequencer (I'd suggest anyone interested contact your local sequencing centre for this). While some of these tools would fit under Bio.Align.Applications, does creating a similar collection at Bio.Sequencing.Applications make more sense? For example, the Roche sffinfo tool isn't in itself a alignment application - but it is related to DNA sequencing. Peter From mjldehoon at yahoo.com Tue Aug 25 10:41:20 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 25 Aug 2009 03:41:20 -0700 (PDT) Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) In-Reply-To: Message-ID: <54938.41623.qm@web62405.mail.re1.yahoo.com> I did (3) and (4) below, and I added a __str__ method but I didn't touch the other print functions (2). For (1), maybe a better way is to subclass the SeqMat class for each of the matrix types instead of storing the matrix type in self.mat_type. Any comments or objections (especially Iddo)? 
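A rough sketch of the subclassing idea in point (1), i.e. one SeqMat subclass per matrix type instead of a mat_type attribute; the class names are purely illustrative and nothing like this has been committed:

    from Bio.SubsMat import SeqMat

    class AcceptedReplacementsMatrix(SeqMat):
        """Accepted replacements matrix (previously mat_type=ACCREP)."""

    class ObservedFrequencyMatrix(SeqMat):
        """Observed frequency matrix (previously mat_type=OBSFREQ)."""

    class LogOddsMatrix(SeqMat):
        """Log odds matrix (previously mat_type=LO)."""

The existing mat_type checks would then become isinstance() checks, and any type specific behaviour could move into the relevant subclass.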
--Michiel. --- On Sat, 7/25/09, Iddo Friedberg wrote: > I'm the author of subsmat IIRC. > Everything sounds good, but I would not make 2.6 changes > that will break on 2.5. Ubuntu still uses 2.5 and I imagine > other linux distros do too. > 1) The matrix types (NOTYPE = 0, ACCREP = 1, OBSFREQ = 2, > SUBS = 3, EXPFREQ = 4, LO = 5) are now global variables (at > the level of Bio.SubsMat). I think that these should be > class variables of the Bio.SubsMat.SeqMat class. > > > > > 2) The print_mat method. It would be more Pythonic to use > __str__, __format__ for this, though the latter is only > available for Python versions >= 2.6. > > > > 3) The __sum__ method. I guess that this was intended to be > __add__? > > > > 4) The sum_letters attribute. To calculate the sum of all > values for a given letter, currently the following two > functions are involved: > > > > ? def all_letters_sum(self): > > ? ? ?for letter in self.alphabet.letters: > > ? ? ? ? self.sum_letters[letter] = > self.letter_sum(letter) > > > > ? def letter_sum(self,letter): > > ? ? ?assert letter in self.alphabet.letters > > ? ? ?sum = 0. > > ? ? ?for i in self.keys(): > > ? ? ? ? if letter in i: > > ? ? ? ? ? ?if i[0] == i[1]: > > ? ? ? ? ? ? ? sum += self[i] > > ? ? ? ? ? ?else: > > ? ? ? ? ? ? ? sum += (self[i] / 2.) > > ? ? ?return sum > > > > As you can see, the result is not returned, but stored in > an attribute called sum_letters. I suggest to replace this > with the following: > > > > ? ?def sum(self): > > ? ? ? ?result = {} > > ? ? ? ?for letter in self.alphabet.letters: > > ? ? ? ? ? ?result[letter] = 0.0 > > ? ? ? ?for pair, value in self: > > ? ? ? ? ? ?i1, i2 = pair > > ? ? ? ? ? ?if i1==i2: > > ? ? ? ? ? ? ? ?result[i1] += value > > ? ? ? ? ? ?else: > > ? ? ? ? ? ? ? ?result[i1] += value / 2 > > ? ? ? ? ? ? ? ?result[i2] += value / 2 > > ? ? ? ?return result > > > > so without storing the result in an attribute. > > > > > > Any comments, objections? > > > > --Michiel > > > > --- On Fri, 7/24/09, Michiel de Hoon > wrote: > > > > > From: Michiel de Hoon > > > Subject: Re: [Biopython-dev] Calculating motif scores > > > To: "Bartek Wilczynski" > > > Cc: biopython-dev at biopython.org > > > Date: Friday, July 24, 2009, 5:34 AM > > > > > > > As for the PWM being a separate class and used by > the > > > motif: > > > > I don't know. I'm using > Bio.SubsMat.FreqTable for > > > implementing > > > > frequency table, so I understand that the new > PWM > > > class would > > > > be basically a "smarter" FreqTable. > I'm not sure > > > whether it > > > > solves any problems... > > > > > > Wow, I didn't even know the Bio.SubsMat module > existed. > > > As we have several different but related modules > > > (Bio.Motif, Bio.SubstMat, Bio.Align), I think we > should > > > define the purpose and scope of each of these > modules. > > > Maybe a good way to start is the documentation. > Bio.SubsMat > > > is currently divided into two chapters (14.4 and > 16.2). I'll > > > have a look at this over the weekend to see if this > can be > > > cleaned up a bit. > > > > > > --Michiel. > > > > > > > > > ? ? ? 
> > > _______________________________________________ > > > Biopython-dev mailing list > > > Biopython-dev at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > > > > > > > > > > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From bartek at rezolwenta.eu.org Tue Aug 25 10:52:24 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 25 Aug 2009 12:52:24 +0200 Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) In-Reply-To: <54938.41623.qm@web62405.mail.re1.yahoo.com> References: <54938.41623.qm@web62405.mail.re1.yahoo.com> Message-ID: <8b34ec180908250352r259a310egbf19963cff43e099@mail.gmail.com> On Tue, Aug 25, 2009 at 12:41 PM, Michiel de Hoon wrote: > I did (3) and (4) below, and I added a __str__ method but I didn't touch the other print functions (2). > > For (1), maybe a better way is to subclass the SeqMat class for each of the matrix types instead of storing the matrix type in self.mat_type. Any comments or objections (especially Iddo)? > Hi, I don't have any objections here. Just for clarification: is it now in CVS or on some git branch? cheers Bartek From biopython at maubp.freeserve.co.uk Tue Aug 25 10:59:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 11:59:38 +0100 Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) In-Reply-To: <8b34ec180908250352r259a310egbf19963cff43e099@mail.gmail.com> References: <54938.41623.qm@web62405.mail.re1.yahoo.com> <8b34ec180908250352r259a310egbf19963cff43e099@mail.gmail.com> Message-ID: <320fb6e00908250359i2c35d8b0pe84d590a9527b8bb@mail.gmail.com> On Tue, Aug 25, 2009 at 11:52 AM, Bartek Wilczynski wrote: > I don't have any objections here. Just for clarification: is it now in > CVS or on some git branch? All on CVS still (and thus being pushed to gitgub). Do you want to give us a git status update on the other thread? http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006620.html Peter From bartek at rezolwenta.eu.org Tue Aug 25 11:58:05 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 25 Aug 2009 13:58:05 +0200 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com> <8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com> Message-ID: <8b34ec180908250458p22339d96uda0251eb29f031b6@mail.gmail.com> Hi all, Time for an update on how things are with git and biopython. On Thu, Aug 20, 2009 at 11:07 PM, Bartek Wilczynski wrote: > I've only quickly tested git, and I was able to pull from github with > no problems. I will try porting thew scripts from my machine to > dev.open-bio tomorrow. That works fine. I've set up a crontab script (/home/bartek/github_backup.sh) on dev.open-bio machine which fetches the current github branch and saves it to /home/bartek/biopython_from_github. 
Then it creates a "bare repository" (/home/bartek/biopython.git) which can be then used by others. If you have an shell account on the dev machine, you should be able to clone it over ssh with the following command: git clone ssh://_YOUR_USERNAME_ at dev.open-bio.org/~bartek/biopython.git if this is put into a directory accesible via http, one can also clone (anonymously) over http. I don't have an account on biopython www server, but I was able to put it on my server (just to check if it works). You can fetch it like this: git clone http://bartek.rezolwenta.eu.org/biopython.git In conclusion: it works. I would say, that the next important step is to decide when to stop commiting to CVS... I'm just waiting for a signal to terminate the updates from CVS to github and we are done. In the meantime, it would make sense to make it more stable which involves some technical details (mostly related to user accounts) Namely, we need to - set up these scripts on biopython account instead of my own (see below) - decide whether we want other things to be done by these scripts (generating src tarballs, etc) > > In the meantime, I've checked that biopython account on dev.open-bio > machine is assigned to Brad Marshall. I haven't seen him posting to > the list lately. Does anyone have the access to this account? This would come in handy now. Anybody knows how to access this account? cheers Bartek From biopython at maubp.freeserve.co.uk Tue Aug 25 12:13:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 13:13:24 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <8b34ec180908250458p22339d96uda0251eb29f031b6@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com> <8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com> <8b34ec180908250458p22339d96uda0251eb29f031b6@mail.gmail.com> Message-ID: <320fb6e00908250513o54bad8beo43b5c82a84579120@mail.gmail.com> On Tue, Aug 25, 2009 at 12:58 PM, Bartek Wilczynski wrote: > Hi all, > > Time for an update on how things are with git and biopython. > > On Thu, Aug 20, 2009 at 11:07 PM, Bartek > Wilczynski wrote: >> I've only quickly tested git, and I was able to pull from github with >> no problems. I will try porting thew scripts from my machine to >> dev.open-bio tomorrow. > > That works fine. I've set up a crontab script > (/home/bartek/github_backup.sh) on dev.open-bio machine which fetches > the current github branch and saves it to > /home/bartek/biopython_from_github. Then it creates a "bare > repository" (/home/bartek/biopython.git) which can be then used by > others. If you have an shell account on the dev machine, you should be > able to clone it over ssh with the following command: > git clone ssh://_YOUR_USERNAME_ at dev.open-bio.org/~bartek/biopython.git Yes, that works for me (and thus in theory anyone with a dev account). > if this is put into a directory accesible via http, one can also clone > (anonymously) over http. I don't have an account on biopython www > server, but I was able to put it on my server (just to check if it > works). 
You can fetch it like this: > git clone http://bartek.rezolwenta.eu.org/biopython.git Excellent. We can ask the OBF to give you access to biopython.org (and Brad too since it would have helped when he did the recent release) which would help setting this stuff up [and see below] > In conclusion: it works. I would say, that the next important step is > to decide when to stop commiting to CVS... I'm just waiting for a > signal to terminate the updates from CVS to github and we are done. OK - so the basics are ready (backing up from github to an OBF machine). Good job. > In the meantime, it would make sense to make it more stable which > involves some technical details (mostly related to user accounts) > Namely, we need to > - set up these scripts on biopython account instead of my own (see below) > - decide whether we want other things to be done by these scripts > (generating src tarballs, etc) > >> In the meantime, I've checked that biopython account on dev.open-bio >> machine is assigned to Brad Marshall. I haven't seen him posting to >> the list lately. Does anyone have the access to this account? > > This would come in handy now. Anybody knows how to access this account? I have no idea who Brad Marshall is. We'll have to take this up with the OBF. I'll email you off list... Peter From biopython at maubp.freeserve.co.uk Tue Aug 25 13:23:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 14:23:15 +0100 Subject: [Biopython-dev] Moving from CVS to git In-Reply-To: <320fb6e00908250513o54bad8beo43b5c82a84579120@mail.gmail.com> References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com> <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com> <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net> <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com> <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com> <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com> <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com> <8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com> <8b34ec180908250458p22339d96uda0251eb29f031b6@mail.gmail.com> <320fb6e00908250513o54bad8beo43b5c82a84579120@mail.gmail.com> Message-ID: <320fb6e00908250623j19daa0cey429265f8c2bcb4ff@mail.gmail.com> >>> In the meantime, I've checked that biopython account on dev.open-bio >>> machine is assigned to Brad Marshall. I haven't seen him posting to >>> the list lately. Does anyone have the access to this account? >> >> This would come in handy now. Anybody knows how to access this account? > > I have no idea who Brad Marshall is. We'll have to take this up with > the OBF. I'll email you off list... Just for the record, on closer inspection, Brad Marshall has/had a separate account but it included "biopython" in the user's name. I presume he was another former contributor to the project. Peter From biopython at maubp.freeserve.co.uk Tue Aug 25 15:11:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 16:11:34 +0100 Subject: [Biopython-dev] Fwd: More FASTQ examples for cross project testing In-Reply-To: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com> References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com> Message-ID: <320fb6e00908250811r18aaec6fj6c2f0e40996fda0a@mail.gmail.com> Hi all, This was posted to the OBF cross project mailing list, but if any of you guys have some sample FASTQ data please consider sharing a small sample (e.g. the first ten reads). 
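For anyone wanting to share a small sample as suggested above, trimming a local FASTQ file down to its first ten records is easy with Bio.SeqIO; the filenames are placeholders and a Sanger style FASTQ file is assumed:

    from itertools import islice
    from Bio import SeqIO

    # Take just the first ten records and write them back out unchanged.
    records = islice(SeqIO.parse(open("my_reads.fastq"), "fastq"), 10)
    out = open("my_reads_first10.fastq", "w")
    SeqIO.write(records, out, "fastq")
    out.close()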
We would need this to be "no-strings attached" so that it could be used in any of the OBF projects under their assorted open source licences. In addition to the notes below, I would be interested in is any FASTQ files from your local sequence centre, which may use their own conventions for the record title lines (e.g. record names). Thanks, Peter P.S. Rather that trying to send any attachments to the mailing list, please email me personally. ---------- Forwarded message ---------- From: Peter Date: Tue, Aug 25, 2009 at 12:24 PM Subject: More FASTQ examples for cross project testing To: open-bio-l at lists.open-bio.org Cc: Peter Rice , Chris Fields Hi all, I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl) off list about this plan. I'm going to co-ordinate putting together a set of valid FASTQ files for shared testing (to supplement the existing set of invalid FASTQ files already done and being used in Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon). What I have in mind is: XXX_original_YYY.fastq - sample input XXX_as_sanger.fastq - reference output XXX_as_solexa.fastq - reference output XXX_as_illumina.fastq - reference output where XXX is some name (e.g. wrapped1, wrapped2, shortreads, longreads, sanger_full_range, solexa_full_range ...) and YYY is the FASTQ variant (sanger, solexa or illumina) for the "input" file. For example, we might have: wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping, perhaps repeating the title on the plus lines wrapped1_as_sanger.fastq - The same data but using the consensus of no line wrapping and omitting the repeated title on the plus lines. wrapped1_as_solexa.fastq - As above, but converted in Solexa scores (ASCII offset 64), with capping at Solexa 62 (ASCII 126). wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII offset 64, with capping at PHRED 62 (ASCII 126). Here "wrapped1" would be a Sanger FASTQ file with some line wrapping (e.g. at 60 characters). I will include "sanger_full_range" which would cover all the valid PHRED scores from 0 to 93, and similarly for Solexa and Illumina files - these are important for testing the score conversions. I have some ideas for deliberately tricky (but valid) files which should properly test any parser. The point is we have "perhaps odd but valid" originals, plus the "cleaned up" versions (using the same FASTQ variant), and "cleaned up" versions in the other two FASTQ variants. Ideally asking Biopython/BioPerl/EMBOSS to convert the XXX_original_YYY.fastq files into any of the three FASTQ variants will give exactly the same as the reference outputs. If anyone has any comments or suggestions please speak up (e.g. my suggested naming conventions). Real life examples of FASTQ files anyone has had trouble parsing (even with 3rd party tools) would be particularly useful - although we'd probably want to cut down big example files in order to keep the dataset to a reasonable size. Thanks, Peter From biopython at maubp.freeserve.co.uk Wed Aug 26 11:36:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 26 Aug 2009 12:36:36 +0100 Subject: [Biopython-dev] [Biopython] Filtering SeqRecord feature list / nested SeqFeatures Message-ID: <320fb6e00908260436wbbd461bt205ada4fcc5c802c@mail.gmail.com> Hi all, I've retitled this thread (originally on the main list) to focus on the more general idea of filtering SeqRecord feature list (as that has very little to do with SQLAlchemy) and how this interact with nested SeqFeature objects. 
On Wed, Aug 26, 2009, Peter wrote: > On Wed, Aug 26, 2009, Kyle Ellrott wrote: >> I've added a new database function lookupFeature to quickly search for >> sequences features without have to load all of them for any particular >> sequence. >> ... > > Interesting - and potentially useful if you are interested in just > part of the genome (e.g. an operon). > > Have you tested this on composite features (e.g. a join)? > Without looking into the details of your code this isn't clear. > > I wonder how well this would scale with a big BioSQL database > ... > > On the other hand, if all the record's features have already been > loaded into memory, there would just be thousands of locations > to look at - it might be quicker. > > This brings me to another idea for how this interface might work, > via the SeqRecord - how about adding a method like this: > > def filtered_features(self, start=None, end=None, type=None): > > Note I think it would also be nice to filter on the feature type (e.g. > CDS or gene). This method would return a sublist of the full > feature list (i.e. a list of those SeqFeature objects within the > range given, and of the appropriate type). This could initially > be implemented with a simple loop, but there would be scope > for building an index or something more clever. > > [Note we are glossing over some potentially ambiguous > cases with complex composite locations, where the "start" > and "end" may differ from the "span" of the feature.] > > The DBSeqRecord would be able to do the same (just inherit > the method), but you could try doing this via an SQL query, ... Brad, it occurred to me this idea (a filtered_features method on the SeqRecord) might cause trouble with what I believe you have in mind for parsing GFF files into nested SeqFeatures. Is that still your plan? In particular, if you have save a CDS feature within a gene feature, and the user asked for all the CDS features, simply scanning the top level features list would miss it. Would it be safe to assume (or even enforce) that subfeatures are always *with* the location spanned by the parent feature? Even with this proviso, a daughter feature may still be small enough to pass a start/end filter, even if the parent feature is not. Again, scanning the top level features list would miss it. All these issues go away if we continue to treat the SeqRecord features list as a flat list, and only use the SeqFeature subfeatures list purely for storing composite locations (i.e. sub regions of the parent feature - not for true subfeatures). There are other downsides to using nested SubFeatures, it will probably require a lot of reworking of the GenBank output due to how composite features like joins are currently stored, and I haven't even looked at the BioSQL side of things. You may have looked at that already though, so I may just be worrying about nothing. Peter From eoc210 at googlemail.com Sun Aug 30 19:33:59 2009 From: eoc210 at googlemail.com (Ed Cannon) Date: Sun, 30 Aug 2009 20:33:59 +0100 Subject: [Biopython-dev] OBO2OWL parser / converter Message-ID: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com> Hi All, I would like to thank you guys for all your hard work and effort in making biopython a great piece of open software. I would also like to introduce myself, my name is Ed Cannon, I am a postdoc at Cambridge University working in the fields of chemo/bioinformatics and semantic web technologies in the group of Peter Murray-Rust. 
Since a fair amount of my work involves ontologies, I have written an open biomedical ontology (.obo) to web ontology language (.owl) converter. The resultant file can be loaded and used from Protege. I was wondering if this software would be of any interest to the biopython community? I have just sent a pull request to biopython on github. The code is located at my branch on my account: http://github.com/eoc21/biopython/tree/eoc21Branch. Thanks, Ed From krother at rubor.de Mon Aug 31 11:19:07 2009 From: krother at rubor.de (Kristian Rother) Date: Mon, 31 Aug 2009 13:19:07 +0200 Subject: [Biopython-dev] RNA module contributions Message-ID: <4A9BB1AB.1070608@rubor.de> Hi, to start work on RNA modules, I'd like to contribute some of our tested modules to BioPython. Before I place them into my GIT branch, it would be great to get some comments: Bio.RNA.SecStruc - represents a RNA secondary structures, - recognizing of SSEs (helix, loop, bulge, junction) - recognizing pseudoknots Bio.RNA.ViennaParser - parses RNA secondary structures in the Vienna format into SecStruc objects. Bio.RNA.BpseqParser - parses RNA secondary structures in the Bpseq format into SecStruc objects. Connected to RNA, but with a wider focus: Bio.???.ChemicalGroupFinder - identifies chemical groups (ribose, carboxyl, etc) in a molecule graph (place to be defined yet) There is a contribution from Bjoern Gruening as well: Bio.PDB.PDBMLParser - creates PDB.Structure objects from PDB-XML files. Comments and suggestions welcome! Best Regards, Kristian Rother From hlapp at gmx.net Mon Aug 31 12:17:43 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 31 Aug 2009 08:17:43 -0400 Subject: [Biopython-dev] OBO2OWL parser / converter In-Reply-To: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com> References: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com> Message-ID: <3AA994B7-B2FB-4D3B-A929-D6F5A9297BB2@gmx.net> Hi Ed - is your converter operating in a way that is congruent with (or even utilizing) the mapping and the converter provided by the NCBO and Berkeley Ontology projects? http://www.bioontology.org/wiki/index.php/OboInOwl:Main_Page If not, I'm not sure how beneficial it is for users to have multiple and possibly conflicting mappings. -hilmar On Aug 30, 2009, at 3:33 PM, Ed Cannon wrote: > Hi All, > > I would like to thank you guys for all your hard work and effort in > making > biopython a great piece of open software. > > I would also like to introduce myself, my name is Ed Cannon, I am a > postdoc > at Cambridge University working in the fields of chemo/ > bioinformatics and > semantic web technologies in the group of Peter Murray-Rust. > > Since a fair amount of my work involves ontologies, I have written > an open > biomedical ontology (.obo) to web ontology language (.owl) > converter. The > resultant file can be loaded and used from Protege. I was wondering > if this > software would be of any interest to the biopython community? I > have just > sent a pull request to biopython on github. The code is located at > my branch > on my account: http://github.com/eoc21/biopython/tree/eoc21Branch. 
> > Thanks,
> > Ed
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk  Mon Aug 31 12:42:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 13:42:52 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
Message-ID: <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>

On Thu, Aug 20, 2009 at 12:28 PM, Peter wrote:
> Hi all,
>
> You may recall a thread back in June with Cedar Mckay (cc'd - not
> sure if he follows the dev list or not) about indexing large sequence
> files - specifically FASTA files but any sequential file format. I posted
> some rough code which did this building on Bio.SeqIO:
> http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html

The Bio.SeqIO.indexed_dict() functionality is in CVS/github now
as I would like some wider testing. My earlier email explained the
implementation approach, and gave some example code:
http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006654.html

This aims to solve a fairly narrow problem - dictionary-like random
access to any record in a sequence file as a SeqRecord via the
record id string as the key. It should work on any sequential file
format, and can even work on binary SFF files (code on a branch in
github still).

Bio.SeqIO.to_dict() has always offered a very simple in memory
solution (a Python dictionary of SeqRecord objects) which is fine
for small files (e.g. a few thousand FASTA entries), but won't scale
much more than that.

Using a BioSQL database would also allow random access to any
SeqRecord (and not just by looking it up by the identifier), but I
doubt it would scale well to 10s of millions of short read sequences.
It is also non-trivial to install the DB itself, the schema and the
Python bindings.

The new Bio.SeqIO.indexed_dict() code offers a read only dictionary
interface which does work for millions of reads. As implemented,
there is still a memory bound as all the keys and their associated
file offsets are held in memory. For example, a 7 million record
FASTQ file taking 1.3GB on disk seems to need almost 700MB in memory
(just a very crude measurement). Although clearly this is much more
capable than the naive full dictionary in memory approach (which is
out of the question here), this too could become a real bottleneck
before long.

Biopython's old Martel/Mindy code used to build an on disk index,
which avoided this memory constraint. However, we've removed that
(due to mxTextTools breakage etc). In any case, it was also much,
much slower:
http://lists.open-bio.org/pipermail/biopython/2009-June/005309.html

Using a Bio.SeqIO.indexed_dict() style API, we could of course build
an index file on disk to avoid this potential memory problem.
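To make the key-to-offset idea above concrete, here is a rough, purely
illustrative sketch for the FASTA case. This is not the actual
Bio.SeqIO.indexed_dict() implementation; the class name and the example
file name are made up. Only the id to offset mapping is held in memory,
and each record is re-read from disk on demand:

# Illustrative sketch only - not the real Bio.SeqIO code; names invented.

class FastaOffsetDict:
    """Read-only, dict-like random access to records in a FASTA file.

    Only the id -> file offset mapping is kept in memory; the
    sequences stay on disk until they are actually requested.
    """

    def __init__(self, filename):
        self._filename = filename
        self._offsets = {}
        with open(filename) as handle:
            while True:
                offset = handle.tell()
                line = handle.readline()
                if not line:
                    break
                if line.startswith(">"):
                    # Use the first word of the title line as the key.
                    key = line[1:].split(None, 1)[0]
                    self._offsets[key] = offset

    def __len__(self):
        return len(self._offsets)

    def __contains__(self, key):
        return key in self._offsets

    def keys(self):
        return self._offsets.keys()

    def __getitem__(self, key):
        """Seek to the stored offset and re-parse just that record."""
        with open(self._filename) as handle:
            handle.seek(self._offsets[key])
            title = handle.readline()[1:].rstrip()
            seq_parts = []
            for line in handle:
                if line.startswith(">"):
                    break
                seq_parts.append(line.strip())
        return title, "".join(seq_parts)

With something like index = FastaOffsetDict("example.fasta") you can then
do index["some_id"] without the whole file ever being in memory. The real
code returns SeqRecord objects and handles the other sequential SeqIO
formats, but the offset trick is the same.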
As Cedar suggested, this index file could be handled transparently (created and deleted automatically), or indeed could be explicitly persisted/reloaded to avoid re-indexing unnecessarily: http://lists.open-bio.org/pipermail/biopython/2009-June/005265.html Sticking to the narrow use case of (read only) random access to a sequence file, all we really need to store is the lookup table of keys (or their Python hash) and offsets in the original file. If they are fast enough, we might even be able to reuse the old Martel/ Mindy index file format... or the OBDA specification if that is still in use: http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html Another option (like the shelve idea we talked about last month) is to parse the sequence file with SeqIO, and serialise all the SeqRecord objects to disk, e.g. with pickle or some key/value database. This is potentially very complex (e.g. arbitrary Python objects in the annotation), and could lead to a very large "index" file on disk. On the other hand, some possible back ends would allow editing the database... which could be very useful. Brad - do you have any thoughts? I know you did some work with key/value indexers: http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/ Peter From chapmanb at 50mail.com Mon Aug 31 12:54:52 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 31 Aug 2009 08:54:52 -0400 Subject: [Biopython-dev] [Biopython] Filtering SeqRecord feature list / nested SeqFeatures In-Reply-To: <320fb6e00908260436wbbd461bt205ada4fcc5c802c@mail.gmail.com> References: <320fb6e00908260436wbbd461bt205ada4fcc5c802c@mail.gmail.com> Message-ID: <20090831125452.GA75451@sobchak.mgh.harvard.edu> Peter and Kyle; > I've retitled this thread (originally on the main list) to focus on the > more general idea of filtering SeqRecord feature list (as that has > very little to do with SQLAlchemy) and how this interact with > nested SeqFeature objects. Sorry to have missed this thread in real time; I was out of town last week. Generally, it is great we are focusing on standard queries and building up APIs to make them more intuitive. Nice. > Brad, it occurred to me this idea (a filtered_features method > on the SeqRecord) might cause trouble with what I believe you > have in mind for parsing GFF files into nested SeqFeatures. > Is that still your plan? Yes, that was still the idea although I haven't dug into it much beyond last time we discussed this. This is the direct translation of the GFF way of handling multiple transcripts and coding features, and seems like the intuitive way to handle the problem. > In particular, if you have save a CDS feature within a gene > feature, and the user asked for all the CDS features, simply > scanning the top level features list would miss it. I think we'll be okay here. With nesting everything would still be stored in the seqfeature table. The seqfeature_relationship table defines the nesting relationship but for the sake of queries all of the features can be treated as flat directly related to the bioentry of interest. Secondarily, you would need to reconstitute the nested relationship if that is of interest, but for the query example of "give me all features of this type in this region" you could return a simple flat iterator of them. > Would it be safe to assume (or even enforce) that subfeatures > are always *with* the location spanned by the parent feature? 
> Even with this proviso, a daughter feature may still be small > enough to pass a start/end filter, even if the parent feature > is not. Again, scanning the top level features list would miss > it. The within assumption makes sense to me here. There may be pathological cases that fall outside of this, but no examples are coming to mind right now. > There are other downsides to using nested SubFeatures, > it will probably require a lot of reworking of the GenBank > output due to how composite features like joins are > currently stored, and I haven't even looked at the BioSQL > side of things. You may have looked at that already > though, so I may just be worrying about nothing. Agreed. My thought was to prototype this with GFF and then think further about GenBank features. Initially, I just want to get the GFF parsing documented and in the Biopython repository, and then the BioSQL storage would be a logical next step. Brad From chapmanb at 50mail.com Mon Aug 31 12:58:54 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 31 Aug 2009 08:58:54 -0400 Subject: [Biopython-dev] Command line wrappers for assembly tools In-Reply-To: <320fb6e00908250333m6dc4b8eew475dc309f1e3ddb4@mail.gmail.com> References: <320fb6e00908250333m6dc4b8eew475dc309f1e3ddb4@mail.gmail.com> Message-ID: <20090831125854.GB75451@sobchak.mgh.harvard.edu> Hi all; > Osvaldo Zagordi has recently offered a Bio.Application style command line > wrapper for Novoalign (a commercial short read aligner from Novocraft), see > enhancement Bug 2904, and the Novocraft website: > http://bugzilla.open-bio.org/show_bug.cgi?id=2904 > http://www.novocraft.com/products.html Very nice. I've been meaning to play with Novoalign and have heard some good things. > While some of these tools would fit under Bio.Align.Applications, does > creating a similar collection at Bio.Sequencing.Applications make more > sense? For example, the Roche sffinfo tool isn't in itself a alignment > application - but it is related to DNA sequencing. I like the idea of a Sequencing namespace or at least something different than the current Align, which implicitly refers mostly to multiple alignment programs. Brad From biopython at maubp.freeserve.co.uk Mon Aug 31 13:11:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 31 Aug 2009 14:11:19 +0100 Subject: [Biopython-dev] Command line wrappers for assembly tools In-Reply-To: <20090831125854.GB75451@sobchak.mgh.harvard.edu> References: <320fb6e00908250333m6dc4b8eew475dc309f1e3ddb4@mail.gmail.com> <20090831125854.GB75451@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908310611i2ce6a639i550631cb47a02050@mail.gmail.com> On Mon, Aug 31, 2009 at 1:58 PM, Brad Chapman wrote: > Hi all; > >> Osvaldo Zagordi has recently offered a Bio.Application style command line >> wrapper for Novoalign (a commercial short read aligner from Novocraft), see >> enhancement Bug 2904, and the Novocraft website: >> http://bugzilla.open-bio.org/show_bug.cgi?id=2904 >> http://www.novocraft.com/products.html > > Very nice. I've been meaning to play with Novoalign and have heard > some good things. Cool. Do you think you'll be able to try that out, and test Osvaldo's wrapper at the same time? >> While some of these tools would fit under Bio.Align.Applications, does >> creating a similar collection at Bio.Sequencing.Applications make more >> sense? For example, the Roche sffinfo tool isn't in itself a alignment >> application - but it is related to DNA sequencing. 
> > I like the idea of a Sequencing namespace or at least something > different than the current Align, which implicitly refers mostly to > multiple alignment programs. That sounds like a plan then... Peter From biopython at maubp.freeserve.co.uk Mon Aug 31 13:15:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 31 Aug 2009 14:15:42 +0100 Subject: [Biopython-dev] [Biopython] Filtering SeqRecord feature list / nested SeqFeatures In-Reply-To: <20090831125452.GA75451@sobchak.mgh.harvard.edu> References: <320fb6e00908260436wbbd461bt205ada4fcc5c802c@mail.gmail.com> <20090831125452.GA75451@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908310615y23051634sbe6076fa9667296b@mail.gmail.com> On Mon, Aug 31, 2009 at 1:54 PM, Brad Chapman wrote: >> There are other downsides to using nested SubFeatures, >> it will probably require a lot of reworking of the GenBank >> output due to how composite features like joins are >> currently stored, and I haven't even looked at the BioSQL >> side of things. You may have looked at that already >> though, so I may just be worrying about nothing. > > Agreed. My thought was to prototype this with GFF and then > think further about GenBank features. Initially, I just want to > get the GFF parsing documented and in the Biopython > repository, and then the BioSQL storage would be a logical > next step. If (as Michiel and I suggested) your GFF parser returns some generic object (e.g. a GFF record class, or a tuple of basic python types including a dictionary of annotation), then yes, that can be checked in without side effects. However, if your code goes straight to SeqRecord and SeqFeature objects, we are going to have to deal with how BioSQL and the existing SeqIO output code will react (e.g. the GenBank output). Peter From chapmanb at 50mail.com Mon Aug 31 13:24:51 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 31 Aug 2009 09:24:51 -0400 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> Message-ID: <20090831132451.GD75451@sobchak.mgh.harvard.edu> Hi Peter; > The Bio.SeqIO.indexed_dict() functionality is in CVS/github now > as I would like some wider testing. My earlier email explained the > implementation approach, and gave some example code: > http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006654.html Sweet. I pulled this from your branch earlier for something I was doing at work and it's great stuff. My only suggestion would be to change the function name to make it clear it's an in memory index. This will clear us up for similar file based index functions. > Another option (like the shelve idea we talked about last month) > is to parse the sequence file with SeqIO, and serialise all the > SeqRecord objects to disk, e.g. with pickle or some key/value > database. This is potentially very complex (e.g. arbitrary Python > objects in the annotation), and could lead to a very large "index" > file on disk. On the other hand, some possible back ends would > allow editing the database... which could be very useful. My thought here was to use BioSQL and the SQLite mappings for serializing. We build off a tested and existing serialization, and also guide people into using BioSQL for larger projects. 
Essentially, we would build an API on top of existing BioSQL
functionality that creates the index by loading the SQL and then
pushes the parsed records into it.

> Brad - do you have any thoughts? I know you did some work
> with key/value indexers:
> http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/

I've been using MongoDB (http://www.mongodb.org/display/DOCS/Home)
extensively and it rocks; it's fast and scales well. The bit of work
that is needed is translating objects into JSON representations.
There are object mappers like MongoKit
(http://bitbucket.org/namlook/mongokit/) that help with this.

Connecting these thoughts together, a rough two-step development plan
would be:

- Modify the underlying Biopython BioSQL representation to be object
  based, using SQLAlchemy. This is essentially what I'd suggested as
  a building block from Kyle's implementation.
- Use this to provide object mappings for object-based stores, like
  MongoDB/MongoKit or Google App Engine.

Brad

From biopython at maubp.freeserve.co.uk  Mon Aug 31 13:49:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 14:49:40 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <20090831132451.GD75451@sobchak.mgh.harvard.edu>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>

On Mon, Aug 31, 2009 at 2:24 PM, Brad Chapman wrote:
>
> Hi Peter;
>
>> The Bio.SeqIO.indexed_dict() functionality is in CVS/github now
>> as I would like some wider testing. My earlier email explained the
>> implementation approach, and gave some example code:
>> http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006654.html
>
> Sweet. I pulled this from your branch earlier for something I was
> doing at work and it's great stuff.

Thanks :) What file formats were you working on, and how many records?

> My only suggestion would be to
> change the function name to make it clear it's an in memory index.
> This will clear us up for similar file based index functions.

True. Have you got any bright ideas for a better name? While the
index is in memory, the SeqRecord objects are not (unlike the
original Bio.SeqIO.to_dict() function). Or we have one function
Bio.SeqIO.indexed_dict() which can either use an in memory index,
OR an on disk index, offering the same functionality.

>> Another option (like the shelve idea we talked about last month)
>> is to parse the sequence file with SeqIO, and serialise all the
>> SeqRecord objects to disk, e.g. with pickle or some key/value
>> database. This is potentially very complex (e.g. arbitrary Python
>> objects in the annotation), and could lead to a very large "index"
>> file on disk. On the other hand, some possible back ends would
>> allow editing the database... which could be very useful.
>
> My thought here was to use BioSQL and the SQLite mappings for
> serializing. We build off a tested and existing serialization, and
> also guide people into using BioSQL for larger projects.
> Essentially, we would build an API on top of existing BioSQL
> functionality that creates the index by loading the SQL and then
> pushes the parsed records into it.

Using BioSQL in this way is a much more general tool than simply
"indexing a sequence file". It feels like a sledgehammer to crack a nut.
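As an aside, the sort of JSON translation Brad describes above might
look roughly like the following. This is a hand-rolled sketch, not
MongoKit and not any existing Biopython/BioSQL mapping, and it
deliberately ignores features and per-letter annotations such as
quality scores:

# Rough sketch only - not MongoKit, not an existing Biopython mapping.
import json
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def seqrecord_to_document(record):
    """Flatten a SeqRecord into a JSON-friendly dict (deliberately lossy)."""
    return {
        "_id": record.id,
        "name": record.name,
        "description": record.description,
        "seq": str(record.seq),
        # Force annotation values to strings so json.dumps cannot choke
        # on arbitrary Python objects.
        "annotations": dict((key, str(value))
                            for key, value in record.annotations.items()),
    }

record = SeqRecord(Seq("ACGTACGT"), id="demo1", description="toy example")
print(json.dumps(seqrecord_to_document(record), indent=2))

A document store could then hold one such dict per record; deciding how
features and per-letter annotations round-trip is exactly the hard part
being discussed here.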
Also, do you expect it to scale well for 10 million plus short reads? It may do, but on the other hand it may not. You will also face the (file format specific but potentially significant) up front cost of parsing the full file in order to get the SeqRecord objects which are then mapped into the database. My new Bio.SeqIO.indexed_dict() code (whatever we call it) avoids this and the speed up is very nice (file format specific of course). Also while the current BioSQL mappings are "tried and tested", they don't cover everything, in particular per-letter-annotation such as a set of quality scores (something that needs addressing anyway, probably with JSON or XML serialisation). All the above make me lean towards a less ambitious target (read only dictionary access to a sequence file), which just requires having an (on disk) index of file offsets (which could be done with SQLite or anything else suitable). This choice could even be done on the fly at run time (e.g. we look at the size of the file to decide if we should use an in memory index or on disk - or start out in memory and if the number of records gets too big, switch to on disk). Peter From mjldehoon at yahoo.com Mon Aug 31 13:50:37 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 31 Aug 2009 06:50:37 -0700 (PDT) Subject: [Biopython-dev] Fw: Re: RNA module contributions Message-ID: <444088.91207.qm@web62408.mail.re1.yahoo.com> Forgot to forward this to the list. --- On Mon, 8/31/09, Michiel de Hoon wrote: > From: Michiel de Hoon > Subject: Re: [Biopython-dev] RNA module contributions > To: "Kristian Rother" > Date: Monday, August 31, 2009, 9:49 AM > Hi Kristian, > > As I am working in transcriptomics, I'll be happy to see > some more RNA modules in Biopython. Thanks! > Just one comment for now: > Recent parsers in Biopython use a function rather than a > class. > So instead of > > from Bio import ThisOrThatModule > handle = open("myinputfile") > parser = ThisOrThatModule.Parser() > record = parser.parse(handle) > > you would have > > from Bio import ThisOrThatModule > handle = open("myinputfile") > record = ThisOrThatModule.read(handle) > > This assumes that myinputfile contains only one record. If > you have input files with multiple records, you can use > > from Bio import ThisOrThatModule > handle = open("myinputfile") > records = ThisOrThatModule.parse(handle) > > where the parse function is a generator function. > > How about the following for the RNA module? > > from Bio import RNA > handle = open("myinputfile") > record = RNA.read(handle, format="vienna") > # or format="bpseq", as appropriate > > where record will be a Bio.RNA.SecStruc object. > > For consistency with other Biopython modules, you might > also consider to rename Bio.RNA.SecStruc as Bio.RNA.Record. > On the other hand, the name SecStruc is more informative, > and maybe some day there will be other kinds of records in > Bio.RNA. > > Thanks! > > --Michiel. > > --- On Mon, 8/31/09, Kristian Rother > wrote: > > > From: Kristian Rother > > Subject: [Biopython-dev] RNA module contributions > > To: "Biopython-Dev Mailing List" > > Date: Monday, August 31, 2009, 7:19 AM > > > > Hi, > > > > to start work on RNA modules, I'd like to contribute > some > > of our tested modules to BioPython. 
Before I place them into my GIT branch, it would be great to get some
comments:
>
> Bio.RNA.SecStruc
>     - represents a RNA secondary structures,
>     - recognizing of SSEs (helix, loop, bulge, junction)
>     - recognizing pseudoknots
>
> Bio.RNA.ViennaParser
>     - parses RNA secondary structures in the Vienna format into
>       SecStruc objects.
>
> Bio.RNA.BpseqParser
>     - parses RNA secondary structures in the Bpseq format into
>       SecStruc objects.
>
> Connected to RNA, but with a wider focus:
>
> Bio.???.ChemicalGroupFinder
>     - identifies chemical groups (ribose, carboxyl, etc) in a
>       molecule graph (place to be defined yet)
>
> There is a contribution from Bjoern Gruening as well:
>
> Bio.PDB.PDBMLParser
>     - creates PDB.Structure objects from PDB-XML files.
>
> Comments and suggestions welcome!
>
> Best Regards,
>    Kristian Rother
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From biopython at maubp.freeserve.co.uk  Mon Aug 31 17:44:44 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 18:44:44 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
Message-ID: <320fb6e00908311044h24cd62d9n809582c7d32e5824@mail.gmail.com>

On Mon, Aug 31, 2009 at 2:49 PM, Peter wrote:
> All the above make me lean towards a less ambitious target
> (read only dictionary access to a sequence file), which just
> requires having an (on disk) index of file offsets (which could
> be done with SQLite or anything else suitable). This choice
> could even be done on the fly at run time (e.g. we look at the
> size of the file to decide if we should use an in memory index
> or on disk - or start out in memory and if the number of records
> gets too big, switch to on disk).

With the current code (in memory dictionary mapping keys to file
offsets), the 7 million record FASTQ file (1.3GB on disk) required
almost 700MB in memory. Indexing took about 1 min. This is probably
OK for many potential uses.

I just did a quick hack to use shelve (default settings) to hold the
key to file offset mapping. RAM usage was about 10MB, the index file
about 320MB (could have been a little more, my code cleaned up after
itself), but indexing took about 12 minutes.
http://github.com/peterjc/biopython/tree/index-shelve

I also did a proof of principle implementation using SQLite to hold
the key to file offset mapping. This also needed only about 10MB of
RAM, the SQLite index file was about 400MB and indexing took about
8 minutes. Perhaps this can be sped up...
http://github.com/peterjc/biopython/tree/index-sqlite

On the bright side, these all work for all the previously supported
indexable file formats, even SFF - which is pretty cool. The trade
off of 1 minute and 700MB RAM (in memory) versus 8 minutes but only
10MB RAM (using SQLite) means neither solution will suit every use
case.
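The SQLite experiment above boils down to a single two-column table of
record ids and file offsets. A stripped-down sketch of that idea (this
is not the code on the index-sqlite branch; the table, function and
file names here are invented):

# Illustrative sketch of an on disk key -> offset index using SQLite.
import sqlite3

def build_index(index_filename, id_offset_pairs):
    """Store (record id, file offset) pairs in a small SQLite database."""
    con = sqlite3.connect(index_filename)
    con.execute("CREATE TABLE IF NOT EXISTS offsets "
                "(record_id TEXT PRIMARY KEY, file_offset INTEGER)")
    con.executemany("INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                    id_offset_pairs)
    con.commit()
    con.close()

def lookup_offset(index_filename, record_id):
    """Return the stored file offset for one record id (or None)."""
    con = sqlite3.connect(index_filename)
    row = con.execute("SELECT file_offset FROM offsets WHERE record_id = ?",
                      (record_id,)).fetchone()
    con.close()
    return None if row is None else row[0]

# The offsets would normally come from scanning the sequence file once.
build_index("example.idx", [("read1", 0), ("read2", 1234)])
print(lookup_offset("example.idx", "read2"))

Wrapping lookups like this behind the same dictionary-style interface
would keep the in memory and on disk back ends interchangeable.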
So unless the SQLite dict approach can be sped up, it may be
worthwhile to support both this and the in-memory index - although I
haven't worked out how best to arrange my code to achieve this
elegantly. Anyway, using SQLite like this seems workable (especially
since for Python 2.5+ it is included in the standard library).

Another option is the Berkeley DB library (especially if we can do
this following the OBF OBDA standard for the index file), but while
bsddb was included in Python 2.x, it has been deprecated for Python
2.6+ and removed in Python 3.0+. It is still available as a third
party install though...

Peter
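Finally, going back to Michiel's read()/parse() suggestion for the
proposed RNA module earlier in this digest, a minimal sketch of what a
Vienna (dot-bracket) reader might do is shown below. None of this is
existing Biopython code - the class and function names are only
placeholders, and pseudoknots are ignored:

# Placeholder sketch for the proposed Bio.RNA.read(handle, format="vienna");
# not existing Biopython code.

class SecStruc(object):
    """Toy secondary structure record: sequence, dot-bracket, base pairs."""

    def __init__(self, name, sequence, dot_bracket):
        self.name = name
        self.sequence = sequence
        self.dot_bracket = dot_bracket
        self.base_pairs = self._pairs(dot_bracket)

    @staticmethod
    def _pairs(dot_bracket):
        """Match '(' with ')' to give 0-based (i, j) pairs (no pseudoknots)."""
        stack, pairs = [], []
        for j, symbol in enumerate(dot_bracket):
            if symbol == "(":
                stack.append(j)
            elif symbol == ")":
                pairs.append((stack.pop(), j))
        return pairs

def read(handle):
    """Read a single record in Vienna (dot-bracket) format."""
    name = None
    lines = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
        else:
            lines.append(line)
    sequence = lines[0]
    # RNAfold-style output may append the energy, e.g. "((..)) (-1.20)".
    dot_bracket = lines[1].split()[0]
    return SecStruc(name, sequence, dot_bracket)

from io import StringIO
record = read(StringIO(u">demo\nGGGAAACCC\n(((...)))\n"))
print(record.base_pairs)

A bpseq reader could return the same kind of object, which is what
makes a single RNA.read(handle, format=...) entry point attractive.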