From bugzilla-daemon at portal.open-bio.org Sat Aug 1 16:46:38 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 16:46:38 -0400
Subject: [Biopython-dev] [Bug 2894] New: Jython List difference causes
failed assertion in CodonTable Fix+Patch
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2894
Summary: Jython List difference causes failed assertion in
CodonTable Fix+Patch
Product: Biopython
Version: 1.51b
Platform: Other
OS/Version: Other
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: kellrott at ucsd.edu
Different list behaviour in Jython causes an assertion to fail because the last
two elements of the produced list are swapped. I haven't taken the time to figure
out whether this is caused by sloppy list usage or Jython list weirdness. At this
point, I will assume that list order doesn't matter and simply expand the
assertion to allow both cases...
list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values)
Python : ['TGA', 'TAA', 'TAG', 'TAR', 'TRA']
Jython : ['TGA', 'TAA', 'TAG', 'TRA', 'TAR']
NOTE: Fixing this bug causes setup.py to fail (java.lang.ClassFormatError:
Invalid method Code length) because it exposes previously untested bugs
*** biopython-1.51b_orig/Bio/Data/CodonTable.py 2009-05-08 14:20:19.000000000
-0700
--- biopython-1.51b/Bio/Data/CodonTable.py 2009-08-01 13:30:46.000000000
-0700
***************
*** 615,621 ****
  assert list_ambiguous_codons(['TAG', 'TGA'],IUPACData.ambiguous_dna_values) == ['TAG', 'TGA']
  assert list_ambiguous_codons(['TAG', 'TAA'],IUPACData.ambiguous_dna_values) == ['TAG', 'TAA', 'TAR']
  assert list_ambiguous_codons(['UAG', 'UAA'],IUPACData.ambiguous_rna_values) == ['UAG', 'UAA', 'UAR']
! assert list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values) == ['TGA', 'TAA', 'TAG', 'TAR', 'TRA']
# Forward translation is "onto", that is, any given codon always maps
# to the same protein, or it doesn't map at all. Thus, I can build
--- 615,623 ----
  assert list_ambiguous_codons(['TAG', 'TGA'],IUPACData.ambiguous_dna_values) == ['TAG', 'TGA']
  assert list_ambiguous_codons(['TAG', 'TAA'],IUPACData.ambiguous_dna_values) == ['TAG', 'TAA', 'TAR']
  assert list_ambiguous_codons(['UAG', 'UAA'],IUPACData.ambiguous_rna_values) == ['UAG', 'UAA', 'UAR']
! #Jython BUG? For some order Jython swaps the order of the last two elements...
! assert list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values) == ['TGA', 'TAA', 'TAG', 'TAR', 'TRA'] or\
!     list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values) == ['TGA', 'TAA', 'TAG', 'TRA', 'TAR']
# Forward translation is "onto", that is, any given codon always maps
# to the same protein, or it doesn't map at all. Thus, I can build
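If the differing orders really do stem from implementation-dependent set/dict
iteration order, an order-insensitive comparison would cover both interpreters
without enumerating every ordering. A minimal sketch, with a hypothetical helper
name (not part of the actual patch):

```python
# Hypothetical sketch: compare codon lists as sorted sequences so that
# implementation-dependent iteration order (CPython vs Jython) cannot
# break the assertion. The helper name is illustrative only.
def same_codons(result, expected):
    return sorted(result) == sorted(expected)

# Both observed orderings pass the order-insensitive check:
assert same_codons(['TGA', 'TAA', 'TAG', 'TAR', 'TRA'],
                   ['TGA', 'TAA', 'TAG', 'TRA', 'TAR'])
```

This still verifies membership and length exactly; only the positional order is
relaxed.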
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sat Aug 1 17:16:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 17:16:48 -0400
Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed
assertion in CodonTable Fix+Patch
In-Reply-To:
Message-ID: <200908012116.n71LGmgG031493@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2894
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-01 17:16 EST -------
(In reply to comment #0)
> Different list behaviour in Jython causes an assertion to fail because the last
> two elements of the produced list are swapped. I haven't taken the time to
> figure out whether this is caused by sloppy list usage or Jython list weirdness. ...
Are you using Biopython 1.51b, or the latest code from CVS/github? This sounds
like a duplicate of Bug 2887 (set order is Python implementation dependent).
Peter
From bugzilla-daemon at portal.open-bio.org Sat Aug 1 22:46:47 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 22:46:47 -0400
Subject: [Biopython-dev] [Bug 2895] New:
Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2895
Summary: Bio.Restriction.Restriction_Dictionary Jython Error
Fix+Patch
Product: Biopython
Version: 1.51b
Platform: Other
OS/Version: Other
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: kellrott at ucsd.edu
BugsThisDependsOn: 2891,2892,2893,2894
Jython is limited by the JVM's maximum method size; overly large methods cause JVM
exceptions (java.lang.ClassFormatError: Invalid method Code length ...).
The Bio.Restriction.Restriction_Dictionary module defines too much data in the
base method. By breaking the defined dicts into pieces held in separate
methods, then merging them, the code will compile correctly in Jython.
Patch:
11,12c11,14
< rest_dict = \
< {'AarI': {'charac': (11, 8, None, None, 'CACCTGC'),
---
>
>
> def RestDict1():
> return {'AarI': {'charac': (11, 8, None, None, 'CACCTGC'),
1503,1504c1505,1508
< 'suppl': ('I',)},
< 'BbvCI': {'charac': (2, -2, None, None, 'CCTCAGC'),
---
> 'suppl': ('I',)} }
>
> def RestDict2():
> return { 'BbvCI': {'charac': (2, -2, None, None, 'CCTCAGC'),
3500c3504,3508
< 'suppl': ('X',)},
---
> 'suppl': ('X',)} }
>
>
> def RestDict3():
> return {
4497c4505,4508
< 'suppl': ('I',)},
---
> 'suppl': ('I',)} }
>
> def RestDict4():
> return {
5494,5495c5505,5508
< 'suppl': ('E', 'G', 'I', 'M', 'N', 'V')},
< 'DrdI': {'charac': (7, -7, None, None, 'GACNNNNNNGTC'),
---
> 'suppl': ('E', 'G', 'I', 'M', 'N', 'V')} }
>
> def RestDict5():
> return { 'DrdI': {'charac': (7, -7, None, None, 'GACNNNNNNGTC'),
6479c6492,6495
< 'suppl': ('N',)},
---
> 'suppl': ('N',)} }
>
> def RestDict6():
> return {
7194,7195c7210,7214
< 'suppl': ('N',)},
< 'Hpy8I': {'charac': (3, -3, None, None, 'GTNNAC'),
---
> 'suppl': ('N',)} }
>
>
> def RestDict7():
> return { 'Hpy8I': {'charac': (3, -3, None, None, 'GTNNAC'),
8491c8510,8513
< 'suppl': ()},
---
> 'suppl': ()} }
>
> def RestDict8():
> return {
9608c9630,9634
< 'suppl': ('F',)},
---
> 'suppl': ('F',)} }
>
>
> def RestDict9():
> return {
11992,11993c12018,12051
< suppliers = \
< {'A': ('Amersham Pharmacia Biotech',
---
>
>
> rest_dict = {}
> tmp = RestDict1()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict2()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict3()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict4()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict5()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict6()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict7()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict8()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict9()
> for a in tmp:
> rest_dict[a] = tmp[a]
>
>
> def Suppliers():
> return {'A': ('Amersham Pharmacia Biotech',
13626,13627c13684,13692
< typedict = \
< {'type145': (('NonPalindromic',
---
>
>
> suppliers = Suppliers()
>
>
>
>
> def TypeDict():
> return {'type145': (('NonPalindromic',
14498a14564,14567
>
> typedict = TypeDict()
>
>
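The merge loops in the patch above all follow one pattern. A minimal sketch of
the same technique, with hypothetical helper names and illustrative entries only
(the real dicts hold thousands of enzymes), might look like:

```python
# Sketch of the approach in the patch: split one oversized dict
# literal into several small functions, then merge their results so
# no single compiled method exceeds the JVM Code-length limit.
def _rest_dict_part1():
    return {'AarI': {'site': 'CACCTGC'}}   # illustrative entry only

def _rest_dict_part2():
    return {'BbvCI': {'site': 'CCTCAGC'}}  # illustrative entry only

rest_dict = {}
for part in (_rest_dict_part1, _rest_dict_part2):
    rest_dict.update(part())  # merge each piece into the full dict
```

Since each helper compiles into its own JVM method, none of them individually
hits the limit; dict.update() then rebuilds the single rest_dict the rest of the
module expects.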
From bugzilla-daemon at portal.open-bio.org Sat Aug 1 22:46:49 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 22:46:49 -0400
Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch
In-Reply-To:
Message-ID: <200908020246.n722knhV005000@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2891
kellrott at ucsd.edu changed:
What |Removed |Added
----------------------------------------------------------------------------
OtherBugsDependingOnThis| |2895
From bugzilla-daemon at portal.open-bio.org Sat Aug 1 22:46:50 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 22:46:50 -0400
Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch
In-Reply-To:
Message-ID: <200908020246.n722koqM005006@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2892
kellrott at ucsd.edu changed:
What |Removed |Added
----------------------------------------------------------------------------
OtherBugsDependingOnThis| |2895
From bugzilla-daemon at portal.open-bio.org Sat Aug 1 22:46:51 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 22:46:51 -0400
Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch
In-Reply-To:
Message-ID: <200908020246.n722kpGh005015@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2893
kellrott at ucsd.edu changed:
What |Removed |Added
----------------------------------------------------------------------------
OtherBugsDependingOnThis| |2895
From bugzilla-daemon at portal.open-bio.org Sat Aug 1 22:46:52 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 22:46:52 -0400
Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed
assertion in CodonTable Fix+Patch
In-Reply-To:
Message-ID: <200908020246.n722kq8g005021@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2894
kellrott at ucsd.edu changed:
What |Removed |Added
----------------------------------------------------------------------------
OtherBugsDependingO| |2895
nThis| |
From eric.talevich at gmail.com Mon Aug 3 10:57:59 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 3 Aug 2009 10:57:59 -0400
Subject: [Biopython-dev] GSoC Weekly Update 11: PhyloXML for Biopython
Message-ID: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
Hi all,
Previously (July 27-31) I:
- Added the remaining checks for restricted tokens
- Modified the tree, parser and writer for phyloXML 1.10 support -- it
  validates now, and unit tests pass. PhyloXML 1.00 validation breaks, but
  that won't affect anyone except BioPerl, and they said they can deal with
  it on their end
- Changed how the Parser and Writer classes work to resemble other
  Biopython parser classes more closely
- Picked standard attributes for BaseTree's Tree and Node objects (informed
  by PhyloDB, though the names are slightly different); added properties
  to PhyloXML's Clade to mimic both types
- Made SeqRecord conversion actually work (with reasonable round-tripping
  capability); added a unit test
- Changed __str__ methods to not include the object's class name if there's
  another representative label to use (e.g. name) -- that's easy enough to
  add in the caller
- Sorted out the TreeIO read/parse/write API and added some support for
  the Newick format, as recommended by Peter on biopython-dev
- Split some "plumbing" (depth_first_search) off from the Tree.find()
  method. Since there are a lot of potentially useful methods to have on
  phylogenetic tree objects, I think it's best to distinguish between
  "porcelain" (specific, easy-to-use methods for common operations) and
  "plumbing" (generalized or low-level methods/algorithms that porcelain
  can rely on) in the Tree class in Bio.Tree.BaseTree.
- Started a function for networkx export. The edges are screwy right
  now, so I haven't checked it in yet.
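The porcelain/plumbing split described above can be sketched roughly like this;
all names here are illustrative stand-ins, not the actual Bio.Tree.BaseTree API:

```python
# Illustrative sketch only: a generic depth-first generator (plumbing)
# that a friendlier find() method (porcelain) builds on.
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def depth_first_search(node):
    # Plumbing: yield every node in pre-order, depth first.
    yield node
    for child in node.children:
        for sub in depth_first_search(child):
            yield sub

def find(tree, name):
    # Porcelain: return the first node with a matching name,
    # or None if no node matches.
    for node in depth_first_search(tree):
        if node.name == name:
            return node
    return None
```

Any number of porcelain methods (filtering by attribute, counting tips, and so
on) can then share the single traversal generator instead of each re-implementing
the walk.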
This week (Aug. 3-7) I will:
Scan the code base for lingering TODO/ENH/XXX comments
Discuss merging back upstream
Work on enhancements (time permitting):
- Clean up the Parser class a bit more, to resemble Writer
- Finish networkx export
- Port common methods to Bio.Tree.BaseTree (from Bio.Nexus.Trees and
  other packages)
Run automated testing:
- Re-run performance benchmarks
- Run tests and benchmarks on alternate platforms
- Check epydoc's generated API documentation and fix docstrings
Update wiki documentation with new features:
- Tree: base classes, find() etc.
- TreeIO: 'phyloxml', 'nexus', 'newick' wrappers; PhyloXMLIO extras; warn
  that Nexus/Newick wrappers don't return Bio.Tree objects yet
- PhyloXML: singular properties, improved str()
Remarks:
- Most of the work done this week and last, shuffling base classes and
  adding various checks, actually made the I/O functions a little slower.
  I don't think this will be a big deal, and the changes were necessary,
  but it's still a little disappointing.
- The networkx export will look pretty cool. After exporting a Biopython
  tree to a networkx graph, it takes a couple more imports and commands to
  draw the tree to the screen or a file. Would anyone find it handy to have
  a short function in Bio.Tree or Bio.Graphics to go straight from a tree
  to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe graphviz)
- I have to admit this: I don't know anything about BioSQL. How would I use
  and test the PhyloDB extension, and what's involved in writing a
  Biopython interface for it?
Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML
From krother at rubor.de Mon Aug 3 11:11:15 2009
From: krother at rubor.de (Kristian Rother)
Date: Mon, 03 Aug 2009 17:11:15 +0200
Subject: [Biopython-dev] RFC: RNAStructure package for BioPython
Message-ID: <4A76FE13.6050203@rubor.de>
Hi,
We have created a lot of code that works on RNA structures in Poznan,
Poland. There are some jewels that I consider useful and mature enough
to reach a wider audience. I'd be interested in refactoring and
packaging them as an RNAStructure package and contributing it to BioPython.
I just discussed the possibilities with Magdalena Musielak & Tomasz
Puton who wrote & tested significant portions of the code. They came up
with a list of 'most wanted' Use Cases:
- Calculate RNA base pairs
- Generate RNA secondary structures from 3D structures
- Recognize pseudoknots
- Recognize modified nucleotides in RNA 3D structures.
- Superimpose two RNA molecules.
The existing code already makes heavy use of Bio.PDB, and has few
dependencies apart from that.
Any comments on how this kind of functionality would fit into BioPython
are welcome.
Best Regards,
Kristian Rother
www.rubor.de
Structural Bioinformatics Group
UAM Poznan
From bugzilla-daemon at portal.open-bio.org Mon Aug 3 12:28:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 3 Aug 2009 12:28:39 -0400
Subject: [Biopython-dev] [Bug 2896] New: BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
Summary: BLAST XML parser: stripped leading/trailing spaces in
Hsp_midline
Product: Biopython
Version: 1.50
Platform: All
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: volkmer at mpi-cbg.de
Parsing an XML output file from NCBI BLAST, using blastp with the complexity
filter on, omits leading/trailing spaces in the hsp match line:
hsp.query
u'XXXXPSPTSLATSHPPLSSMSPYMTI------PQQYLYISKIRSKLSQCALT-RHHH-RELDLRKMV'
hsp.match
u'P+ T L S PPL S+S + PQ+ L+ + R+K+ + + RHHH R LDL ++V'
This makes it more awkward to evaluate the alignment. It would be best if
query, subject and alignment always had the same length. The BLAST XML output
file at least has the correct Hsp_midline:
<Hsp_qseq>XXXXPSPTSLATSHPPLSSMSPYMTI------PQQYLYISKIRSKLSQCALT-RHHH-RELDLRKMV</Hsp_qseq>
<Hsp_hseq>EFFEPAITGLYYS-PPLFSVSRLTGLLHLLERPQETLF-TNYRNKIKRLDIPLRHHHIRHLDLEQLV</Hsp_hseq>
<Hsp_midline> P+ T L S PPL S+S + PQ+ L+ + R+K+ + + RHHH R
LDL ++V</Hsp_midline>
And as the plaintext parser gives the complete alignment line, it would be nice
to get the same behaviour.
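As a hedged illustration of the point (this is not Biopython's parser, and the
midline below is a made-up fragment), reading the element directly shows that
the XML itself preserves the spacing; it is the parsing layer that strips it:

```python
# Read Hsp_midline with plain ElementTree: the element text keeps its
# leading/trailing spaces, so the alignment columns stay in register.
import xml.etree.ElementTree as ET

xml = '<Hsp><Hsp_midline>  P+  T L   RHHH </Hsp_midline></Hsp>'
midline = ET.fromstring(xml).findtext('Hsp_midline')
# midline still starts and ends with its original spaces; applying
# .strip() to it is what would lose the alignment offsets.
```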
Thanks,
Michael
From bugzilla-daemon at portal.open-bio.org Mon Aug 3 13:20:24 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 3 Aug 2009 13:20:24 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908031720.n73HKOFr019079@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-03 13:20 EST -------
Could you attach a complete XML file we could use for a unit test please?
From biopython at maubp.freeserve.co.uk Mon Aug 3 16:48:49 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 3 Aug 2009 21:48:49 +0100
Subject: [Biopython-dev] Deprecating Bio.Fasta?
In-Reply-To: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com>
References: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com>
Message-ID: <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com>
On 22 June 2009, I wrote:
> ...
> I'd like to officially deprecate Bio.Fasta for the next release (Biopython
> 1.51), which means you can continue to use it for a couple more
> releases, but at import time you will see a warning message. See also:
> http://biopython.org/wiki/Deprecation_policy
>
> Would this cause anyone any problems? If you are still using Bio.Fasta,
> it would be interesting to know if this is just some old code that hasn't
> been updated, or if there is some stronger reason for still using it.
No one replied, so I plan to make this change in CVS shortly, meaning
that Bio.Fasta will be deprecated in Biopython 1.51, i.e. it will still work
but will trigger a deprecation warning at import.
Please speak up ASAP if this concerns you.
Thanks,
Peter
From chapmanb at 50mail.com Mon Aug 3 18:38:47 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 3 Aug 2009 18:38:47 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11:
PhyloXML for Biopython
In-Reply-To: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
Message-ID: <20090803223847.GM8112@sobchak.mgh.harvard.edu>
Hi Eric;
Thanks for the update. Things are looking in great shape as we get
towards the home stretch.
> - Most of the work done this week and last, shuffling base classes and
> adding various checks, actually made the I/O functions a little slower.
> I don't think this will be a big deal, and the changes were necessary,
> but it's still a little disappointing.
The unfortunate influence of generalization. I think the adjustment
to the generalized Tree is a big win and gives a solid framework for
any future phylogenetic modules. I don't know what the numbers are
but as long as performance is reasonable, few people will complain.
This is always something to go back around on if it becomes a hangup
in the future.
> - The networkx export will look pretty cool. After exporting a Biopython
> tree to a networkx graph, it takes a couple more imports and commands to
> draw the tree to the screen or a file. Would anyone find it handy to have
> a short function in Bio.Tree or Bio.Graphics to go straight from a tree
> to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe graphviz)
Awesome. Looking forward to seeing some trees that come out of this.
It's definitely worthwhile to formalize the functionality to go
straight from a tree to png or pdf. This will add some more
localized dependencies, so I'm torn as to whether it would be best
as a utility function or an example script. Peter might have an
opinion here.
Either way, this would be really useful as a cookbook example with a
final figure. Being able to produce something pretty is a good way to
convince people to store trees in a reasonable format like PhyloXML.
> - I have to admit this: I don't know anything about BioSQL. How would I use
> and test the PhyloDB extension, and what's involved in writing a
> Biopython interface for it?
BioSQL and the PhyloDB extension are a set of relational database
tables. Looking at the SVN logs, it appears as if the main work on
PhyloDB has occurred on PostgreSQL with the MySQL tables perhaps
lagging behind, so my suggestion is to start with PostgreSQL.
Hilmar, please feel free to correct me here.
The schemas are available from SVN:
http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/sql
You'd want biosqldb-pg.sql and presumably also biosqldb-views-pg.sql
for BioSQL and biosql-phylodb-pg.sql and biosql-phylodata-pg.sql.
The Biopython docs are pretty nice on this -- you create the empty tables:
http://biopython.org/wiki/BioSQL#PostgreSQL
From there you should be able to browse to get a sense of what is
there. In terms of writing an interface, the first step is loading
the data where you can mimic what is done with SeqIO and BioSQL:
http://biopython.org/wiki/BioSQL#Loading_Sequences_into_a_database
Pass the database an iterator of trees and they are stored.
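That loading step might be sketched roughly as below. Everything here (the
`load_trees` name, the `db` object with an append-like insert) is a hypothetical
stand-in, not an existing Biopython or BioSQL API; a real implementation would
map each tree's nodes onto the PhyloDB tables:

```python
# Speculative sketch only: mimic the SeqIO/BioSQL pattern of passing
# the database an iterator of records. 'db' is any object supporting
# an append-like insert; a real version would issue PhyloDB INSERTs.
def load_trees(db, trees):
    count = 0
    for tree in trees:
        db.append(tree)  # placeholder for the per-tree database write
        count += 1
    return count
```

Like BioSeqDatabase.load() for sequences, it would consume the iterator and
report how many records were stored.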
The second step is retrieving and querying persisted trees. Here you
would want TreeDB objects that act like standard trees, but
retrieve information from the database on demand. Here are
Seq/SeqRecord models in BioSQL:
http://github.com/biopython/biopython/tree/master/BioSQL/BioSeq.py
So it's a bit of an extended task. Time frames being what they are,
any steps in this direction are useful. If you haven't played with
BioSQL before, it's worth a look for your own interest. The underlying
key/value model is really flexible and kind of models RDF triplets. I've
used BioSQL here recently as the backend for a web app that differs a
bit from the standard GenBank like thing, and found it very flexible.
Again, great stuff. Let me know if I can add to any of that,
Brad
From bugzilla-daemon at portal.open-bio.org Tue Aug 4 04:45:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 4 Aug 2009 04:45:03 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908040845.n748j36R015856@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
------- Comment #2 from volkmer at mpi-cbg.de 2009-08-04 04:45 EST -------
Created an attachment (id=1353)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1353&action=view)
blastp xml sample
From chapmanb at 50mail.com Tue Aug 4 08:32:39 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 4 Aug 2009 08:32:39 -0400
Subject: [Biopython-dev] RFC: RNAStructure package for BioPython
In-Reply-To: <4A76FE13.6050203@rubor.de>
References: <4A76FE13.6050203@rubor.de>
Message-ID: <20090804123239.GN8112@sobchak.mgh.harvard.edu>
Hi Kristian;
> We have created a lot of code that works on RNA structures in Poznan,
> Poland. There are some jewels that I consider useful and mature enough
> to meet a wider audience. I'd be interested in refactorizing and
> packaging them as a RNAStructure package and contribute it to BioPython.
This sounds great. I don't know enough about the area to comment
directly on your use cases -- my experience is limited to folding
structures with RNAFold and the like -- but it sounds like a solid
feature set.
> I just discussed the possibilities with Magdalena Musielak & Tomasz
> Puton who wrote & tested significant portions of the code. They came up
> with a list of 'most wanted' Use Cases:
>
> - Calculate RNA base pairs
> - Generate RNA secondary structures from 3D structures
> - Recognize pseudoknots
> - Recognize modified nucleotides in RNA 3D structures.
> - Superimpose two RNA molecules.
>
> The existing code massively uses Bio.PDB already, and has little
> dependancies apart from that.
You may also want to have a look at PyCogent, which has wrappers and
parsers for several command line programs involved with RNA structure,
along with a representation of RNA secondary structure:
http://pycogent.svn.sourceforge.net/viewvc/pycogent/trunk/cogent/struct/rna2d.py?view=markup
It would be great to complement this functionality, and interact
with PyCogent where feasible.
We could offer more specific suggestions as you get rolling with this
and there is code to review. Glad to have you interested,
Brad
From tiagoantao at gmail.com Tue Aug 4 11:29:36 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 4 Aug 2009 16:29:36 +0100
Subject: [Biopython-dev] 1.52
Message-ID: <6d941f120908040829g6531804dpe51e9f24720dab78@mail.gmail.com>
Hi,
I am currently working on the implementation of Genepop support on Bio.PopGen.
Genepop support will allow calculation of basic frequentist
statistics. This is the biggest addition to Bio.PopGen and makes the
module useful for a wide range of applications. In fact I never tried
to publicize Bio.PopGen in the population genetics community, but with
this addition, that will change.
The status is as follows:
1. Code: about 90% done.
Check http://github.com/tiagoantao/biopython/tree/genepop
2. Test code around 30% coverage
3. Documentation 50%
Check http://biopython.org/wiki/PopGen_dev_Genepop for a tutorial
under development.
This will be ready for 1.52. And I would like to make the code
available after the Summer vacation.
And it is 1.52 that this mail is really about ;)
I remember Peter writing about 1.52 being loosely scheduled for the Fall. I
have September blocked with work, but I managed to keep October mostly
clear just for this. So my request is: if there is a Fall release, please
don't schedule it for the first week of the Fall (which is still in
September) ;). Mid-October or somewhere around that time would be good.
Thanks a lot,
Tiago
--
"A man who dares to waste one hour of time has not discovered the
value of life" - Charles Darwin
From matzke at berkeley.edu Tue Aug 4 13:01:34 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Tue, 04 Aug 2009 10:01:34 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To:
References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel>
<86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org>
<20090707130248.GM17086@sobchak.mgh.harvard.edu>
<3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com>
<320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com>
<20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
Message-ID: <4A78696E.8010808@berkeley.edu>
Hi all, update:
Major improvements/fixes:
- removed any reliance on lagrange tree module, refactored all phylogeny
code to use the revised Bio.Nexus.Tree module
- tree functions put in TreeSum (tree summary) class
- added functions for calculating phylodiversity measures, including
necessary subroutines like subsetting trees, randomly selecting tips
from a larger pool
- Code dealing with GBIF xml output completely refactored into the
following classes:
* ObsRecs (observation records & search results/summary)
* ObsRec (an individual observation record)
* XmlString (functions for cleaning xml returned by Gbif)
* GbifXml (extension of capabilities for ElementTree xml trees, parsed
from GBIF xml returns)
- another suggestion implemented: dependencies on tempfiles eliminated
by using cStringIO file_str objects (temporary file-like strings, not
stored as temporary files) instead
- another suggestion implemented: the _open method from biopython's ncbi
www functionality has been copied & modified so that it is now a method
of ObsRecs, and doesn't contain NCBI-specific defaults etc. (it does
still include a 3-second waiting time between GBIF requests, figuring
that is good practice).
- function to download large numbers of records in increments
implemented as method of ObsRecs.
This week:
- Put GIS functions in a class (easy), allowing each ObsRec to be
classified into an area (easy)
- Improve extraction of data from GBIF xmltree -- my Utricularia
"practice XML file" didn't have problems, but with running online
searches, I am discovering some fields are not always filled in, etc.
This shouldn't be too hard, using the GbifXml xmltree searching
functions, and including defaults for exceptions.
- Function for converting points to KML for Google Earth display.
Code uploaded here:
http://github.com/nmatzke/biopython/commits/Geography
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================
From matzke at berkeley.edu Tue Aug 4 14:28:33 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Tue, 04 Aug 2009 11:28:33 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <58AA6396-760D-40BB-B07A-EF22282E78D5@duke.edu>
References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel>
<86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org>
<20090707130248.GM17086@sobchak.mgh.harvard.edu>
<3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com>
<320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com>
<20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<58AA6396-760D-40BB-B07A-EF22282E78D5@duke.edu>
Message-ID: <4A787DD1.40301@berkeley.edu>
Hilmar Lapp wrote:
>
> On Aug 4, 2009, at 1:01 PM, Nick Matzke wrote:
>
>> * ObsRecs (observation records & search results/summary)
>> * ObsRec (an individual observation record)
>
>
> I'll let the Biopython folks make the call on this, but in general I'd
> recommend that anyone trying to write reusable code spell out names,
> especially non-local names.
>
> The days in which the length of a variable or class name was somehow
> limited, or affected the speed of a program, have been over for more
> than a decade. I know the temptation is big to save a few keystrokes
> every time you have to type the name, but the time you will cost your
> fellow programmers who later try to understand your code is vastly
> greater. What prevents me from thinking that ObsRec is a class for an
> obsolete recording?
Good point, this is easy to fix, I will put it on the list. Cheers!
Nick
>
> Just my $0.02 :-)
>
> -hilmar
--
From biopython at maubp.freeserve.co.uk Tue Aug 4 14:44:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 4 Aug 2009 19:44:29 +0100
Subject: [Biopython-dev] RFC: RNAStructure package for BioPython
In-Reply-To: <4A76FE13.6050203@rubor.de>
References: <4A76FE13.6050203@rubor.de>
Message-ID: <320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com>
On Mon, Aug 3, 2009 at 4:11 PM, Kristian Rother wrote:
> Hi,
>
> We have created a lot of code that works on RNA structures in Poznan,
> Poland. There are some jewels that I consider useful and mature enough to
> meet a wider audience. I'd be interested in refactoring and packaging them
> as an RNAStructure package and contributing it to Biopython.
I remember we talked about this briefly at BOSC/ISMB - it sounds good.
Did you get a chance to talk to Thomas Hamelryck about this?
> I just discussed the possibilities with Magdalena Musielak & Tomasz Puton
> who wrote & tested significant portions of the code. They came up with a
> list of 'most wanted' Use Cases:
>
> - Calculate RNA base pairs
> - Generate RNA secondary structures from 3D structures
> - Recognize pseudoknots
> - Recognize modified nucleotides in RNA 3D structures.
> - Superimpose two RNA molecules.
>
> The existing code already makes heavy use of Bio.PDB, and has few
> dependencies apart from that.
>
> Any comments on how this kind of functionality would fit into Biopython
> are welcome.
I see you have already started a github branch, which is great:
http://github.com/krother/biopython/tree/rol
Am I right in thinking all of this code is for 3D RNA work? Maybe that
might give a good module name... Bio.RNA3D? Or Bio.PDB.RNA?
Did you have something in mind?
Peter
P.S. Who won the ISMB Art and Science Exhibition prize?
http://www.iscb.org/ismbeccb2009/artscience.php
From biopython at maubp.freeserve.co.uk Tue Aug 4 15:29:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 4 Aug 2009 20:29:47 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>
<8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com>
<20090706220453.GI17086@sobchak.mgh.harvard.edu>
<8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com>
<320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com>
<320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com>
<20090708130649.GY17086@sobchak.mgh.harvard.edu>
<320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com>
Message-ID: <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com>
On Thu, Jul 9, 2009 at 10:18 AM, Peter wrote:
> On Wed, Jul 8, 2009 at 2:06 PM, Brad Chapman wrote:
>> How about adding a function like "run_arguments" to the
>> commandlines that returns the commandline as a list.
>
> That would be a simple alternative to my vague idea "Maybe we
> can make the command line wrapper object more list like to make
> subprocess happy without needing to create a string?", which may
> not be possible. Either way, this will require a bit of work on the
> Bio.Application parameter objects...
By defining an __iter__ method, we can make the Biopython
application wrapper object sufficiently list-like that it can be
passed directly to subprocess. I think I have something working
(only tested on Linux so far), at least for the case where none
of the arguments have spaces or quotes in them.
If this works, it should make things a little easier in that we don't
have to do str(cline), and also I think it avoids the OS specific
behaviour of the shell argument as Brad noted earlier:
>> This avoids the shell nastiness with the argument list, is as
>> simple as it gets with subprocess, and gives users an easy
>> path to getting stdout, stderr and the return codes.
i.e. I am hoping we can replace this:
child = subprocess.Popen(str(cline), shell=(sys.platform!="win32"), ...)
with just:
child = subprocess.Popen(cline, ...)
where the "..." represents any messing about with stdin, stdout
and stderr.
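To make the idea concrete, here is a minimal, hypothetical sketch (not the actual Bio.Application code; the class and attribute names are illustrative only) of a wrapper object that becomes sufficiently list-like by defining __iter__ that it can be passed straight to subprocess.Popen:

```python
import subprocess

class CommandLineWrapper:
    """Hypothetical stand-in for a Biopython command line wrapper object."""
    def __init__(self, program, *arguments):
        self.program = program
        self.arguments = list(arguments)

    def __str__(self):
        # The old style: build a single command line string for the shell.
        return " ".join([self.program] + self.arguments)

    def __iter__(self):
        # Making the object iterable lets subprocess consume it as a
        # sequence of arguments, with no shell and no str(cline) needed.
        yield self.program
        for argument in self.arguments:
            yield argument

cline = CommandLineWrapper("echo", "hello")
# No str() call and no shell argument - subprocess converts the
# iterable to a list of arguments itself.
child = subprocess.Popen(cline, stdout=subprocess.PIPE)
stdout, _ = child.communicate()
```

This works because subprocess.Popen calls list() on any non-string argument, which in turn invokes __iter__ on the wrapper.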
Peter
From chapmanb at 50mail.com Tue Aug 4 18:27:31 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 4 Aug 2009 18:27:31 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <4A78696E.8010808@berkeley.edu>
References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com>
<20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
Message-ID: <20090804222731.GA12604@sobchak.mgh.harvard.edu>
Hi Nick;
Thanks for the update -- great to see things moving along.
> - removed any reliance on lagrange tree module, refactored all phylogeny
> code to use the revised Bio.Nexus.Tree module
Awesome -- glad this worked for you. Are the lagrange_* files in
Bio.Geography still necessary? If not, we should remove them from
the repository to clean things up.
More generally, it would be really helpful if we could do a bit of
housekeeping on the repository. The Geography namespace has a lot of
things in it which belong in different parts of the tree:
- The test code should move to the 'Tests' directory as a set of
test_Geography* files that we can use for unit testing the code.
- Similarly, there are a lot of data files in there which appear
to be test related; these could move to Tests/Geography
- What is happening with the Nodes_v2 and Treesv2 files? They look
like duplicates of the Nexus Nodes and Trees with some changes.
Could we roll those changes into the main Nexus code to avoid
duplication?
> - Code dealing with GBIF xml output completely refactored into the
> following classes:
>
> * ObsRecs (observation records & search results/summary)
> * ObsRec (an individual observation record)
> * XmlString (functions for cleaning xml returned by Gbif)
> * GbifXml (extension of capabilities for ElementTree xml trees, parsed
> from GBIF xml returns)
I agree with Hilmar -- the user classes would probably benefit from expanded
naming. There is an art to naming: aiming somewhere between hideous
RidiculouslyLongNamesWithEverythingSpecified names and short, truncated names.
Specifically, you've got a lot of filler in the names -- dbfUtils,
geogUtils, shpUtils. The Utils probably doesn't tell the user much
and makes all of the names sort of blend together, just as the Rec/Recs
pluralization hides quite a large difference in what the classes hold.
Something like Observation and ObservationSearchResult would make it
clear immediately what they do and the information they hold.
> This week:
What are your thoughts on documentation? As a naive user of these
tools without much experience with the formats, I could offer better
feedback if I had an idea of the public APIs and how they are
expected to be used. Moreover, cookbook and API documentation is something
we will definitely need to integrate into Biopython. How does this fit
in your timeline for the remaining weeks?
Thanks again. Hope this helps,
Brad
From hlapp at gmx.net Tue Aug 4 19:34:26 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 4 Aug 2009 19:34:26 -0400
Subject: [Biopython-dev] RFC: RNAStructure package for BioPython
In-Reply-To: <320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com>
References: <4A76FE13.6050203@rubor.de>
<320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com>
Message-ID:
On Aug 4, 2009, at 2:44 PM, Peter wrote:
> P.S. Who won the ISMB Art and Science Exhibition prize?
> http://www.iscb.org/ismbeccb2009/artscience.php
Guess who - Kristian did :-)
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From krother at rubor.de Wed Aug 5 04:07:12 2009
From: krother at rubor.de (Kristian Rother)
Date: Wed, 05 Aug 2009 10:07:12 +0200
Subject: [Biopython-dev] RFC: RNAStructure package for BioPython
Message-ID: <4A793DB0.5000805@rubor.de>
Hi Peter,
> I remember we talked about this briefly at BOSC/ISMB - it sounds good.
> Did you get a chance to talk to Thomas Hamelryck about this?
We talked at ISMB, but no details yet.
> Am I right in thinking all of this code is for 3D RNA work? Maybe that
> might give a good module name... Bio.RNA3D? Or Bio.PDB.RNA?
> Did you have something in mind?
I was thinking of 'RNAStructure' - I also like 'RNA' as long as it does
not violate any claims.
> P.S. Who won the ISMB Art and Science Exhibition prize?
> http://www.iscb.org/ismbeccb2009/artscience.php
The winning picture can be found here:
http://www.rubor.de/twentycharacters_en.html
Best Regards,
Kristian
From biopython at maubp.freeserve.co.uk Wed Aug 5 04:15:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 5 Aug 2009 09:15:36 +0100
Subject: [Biopython-dev] RFC: RNAStructure package for BioPython
In-Reply-To:
References: <4A76FE13.6050203@rubor.de>
<320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com>
Message-ID: <320fb6e00908050115y612d89b2h757f5aa59fbb99ed@mail.gmail.com>
On Wed, Aug 5, 2009 at 12:34 AM, Hilmar Lapp wrote:
>
> On Aug 4, 2009, at 2:44 PM, Peter wrote:
>
>> P.S. Who won the ISMB Art and Science Exhibition prize?
>> http://www.iscb.org/ismbeccb2009/artscience.php
>
> Guess who - Kristian did :-)
>
>        -hilmar
Ha! That's cool. Congratulations Kristian!
Peter
From biopython at maubp.freeserve.co.uk Wed Aug 5 06:29:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 5 Aug 2009 11:29:45 +0100
Subject: [Biopython-dev] Deprecating Bio.Fasta?
In-Reply-To: <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com>
References: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com>
<320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com>
Message-ID: <320fb6e00908050329m44fa2596ife06917306ae44ab@mail.gmail.com>
On Mon, Aug 3, 2009 at 9:48 PM, Peter wrote:
> On 22 June 2009, I wrote:
>> ...
>> I'd like to officially deprecate Bio.Fasta for the next release (Biopython
>> 1.51), which means you can continue to use it for a couple more
>> releases, but at import time you will see a warning message. See also:
>> http://biopython.org/wiki/Deprecation_policy
>> ...
>
> No one replied, so I plan to make this change in CVS shortly, meaning
> that Bio.Fasta will be deprecated in Biopython 1.51, i.e. it will still work
> but will trigger a deprecation warning at import.
>
> Please speak up ASAP if this concerns you.
I've just committed the deprecation of Bio.Fasta to CVS. This could be
reverted if anyone has a compelling reason (and tells us before we do
the final release of Biopython 1.51).
The docstring for Bio.Fasta should cover the typical situations for moving
from Bio.Fasta to Bio.SeqIO, but please feel free to ask on the mailing
list if you have a more complicated bit of old code that needs to be ported.
Thanks,
Peter
From bugzilla-daemon at portal.open-bio.org Wed Aug 5 07:29:41 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 5 Aug 2009 07:29:41 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908051129.n75BTf8i026537@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-05 07:29 EST -------
Thanks for the sample XML file. I could reproduce this, I think I have fixed
it.
hsp.query, hsp.match and hsp.sbjct should all be the same length.
Previously, at the end of each tag, our XML parser stripped the
leading/trailing white space from the tag's value before processing it. In
the case of Hsp_midline this is a very bad idea. However, the reason it did
this was that the way the current tag value was built up wasn't context
aware. In this particular case, there was white space outside tags like
Hsp_midline which really belongs to the parent tag (Hsp), but it was
wrongly being combined.
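As a rough illustration of why the stripping mattered (this is a toy fragment, not the actual NCBIXML parser code): in a BLAST HSP, the midline's leading and trailing spaces are significant alignment columns, so discarding them breaks the length invariant between query, match and subject lines.

```python
from xml.etree import ElementTree as ET

# A toy Hsp fragment: the query has mismatches at both ends, so the
# midline legitimately starts and ends with spaces.
xml = ("<Hsp>"
       "<Hsp_qseq>ACGTACGT</Hsp_qseq>"
       "<Hsp_midline>  ||||  </Hsp_midline>"
       "</Hsp>")
hsp = ET.fromstring(xml)
query = hsp.findtext("Hsp_qseq")
midline = hsp.findtext("Hsp_midline")

# Keeping the white space preserves the column-for-column correspondence...
assert len(query) == len(midline)
# ...while stripping it (as the old parser did) destroys that alignment.
assert len(midline.strip()) != len(query)
```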
Would you be able to test this please? All you really need to try this is the
new Bio/Blast/NCBIXML.py file (CVS revision 1.23). It might be easiest just to
update to the latest code in CVS (or on github), but I could attach the file
here if you like.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Aug 5 09:13:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 5 Aug 2009 09:13:40 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908051313.n75DDeFt031305@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
------- Comment #4 from volkmer at mpi-cbg.de 2009-08-05 09:13 EST -------
Hi Peter,
could you please attach the file?
The latest version of NCBIXML.py I get from cvs at code.open-bio.org still seems
to be from April 2009. When I try to specify revision 1.23 I get a checkout
warning and no file. Or is there a testing branch for this?
Michael
From bugzilla-daemon at portal.open-bio.org Wed Aug 5 09:27:45 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 5 Aug 2009 09:27:45 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908051327.n75DRjjg031915@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-05 09:27 EST -------
Created an attachment (id=1357)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1357&action=view)
Updated version of NCBIXML.py as in CVS revision 1.23
(In reply to comment #4)
> Hi Peter,
>
> could you please attach the file?
Sure.
> The latest version of NCBIXML.py I get from cvs at code.open-bio.org
> still seems to be from April 2009. When I try to specify revision
> 1.23 I get a checkout warning and no file. Or is there a testing
> branch for this?
Using code.open-bio.org (or its various aliases like cvs.biopython.org)
actually gives you access to a read only mirror of the real CVS data,
which is on dev.open-bio.org (for use by those with commit rights).
I'm not sure exactly how often the public mirror is updated, but I would
guess hourly. If you try again later it will probably work, but in the
meantime I have attached the new file to this bug.
Peter
From eric.talevich at gmail.com Wed Aug 5 18:31:31 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 5 Aug 2009 18:31:31 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11:
PhyloXML for Biopython
In-Reply-To: <20090803223847.GM8112@sobchak.mgh.harvard.edu>
References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
<20090803223847.GM8112@sobchak.mgh.harvard.edu>
Message-ID: <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com>
On Mon, Aug 3, 2009 at 6:38 PM, Brad Chapman wrote:
> Hi Eric;
> Thanks for the update. Things are looking in great shape as we get
> towards the home stretch.
>
> > - Most of the work done this week and last, shuffling base classes and
> > adding various checks, actually made the I/O functions a little slower.
> > I don't think this will be a big deal, and the changes were necessary,
> > but it's still a little disappointing.
>
> The unfortunate influence of generalization. I think the adjustment
> to the generalized Tree is a big win and gives a solid framework for
> any future phylogenetic modules. I don't know what the numbers are
> but as long as performance is reasonable, few people will complain.
> This is always something to go back around on if it becomes a hangup
> in the future.
>
The complete unit test suite used to take about 4.5 seconds, and now it
takes 5.8 seconds, though I've added a few more tests since then. I don't
think it will feel like it's hanging for most operations, besides parsing or
searching a huge tree.
> > - The networkx export will look pretty cool. After exporting a Biopython
> > tree to a networkx graph, it takes a couple more imports and commands to
> > draw the tree to the screen or a file. Would anyone find it handy to have
> > a short function in Bio.Tree or Bio.Graphics to go straight from a tree
> > to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe graphviz)
>
> Awesome. Looking forward to seeing some trees that come out of this.
> It's definitely worthwhile to formalize the functionality to go
> straight from a tree to png or pdf. This will add some more
> localized dependencies, so I'm torn as to whether it would be best
> as a utility function or an example script. Peter might have an
> opinion here.
>
> Either way, this would be really useful as a cookbook example with a
> final figure. Being able to produce some pretty is a good way to
> convince people to store trees in a reasonable format like PhyloXML.
>
OK, it works now but the resulting trees look a little odd. The options
needed to get a reasonable tree representation are fiddly, so I made
draw_graphviz() a separate function that basically just handles the RTFM
work (not trivial), while the graph export still happens in to_networkx().
Here are a few recipes and a taste of each dish. The matplotlib engine seems
usable for interactive exploration, albeit cluttered -- I can't hide the
internal clade identifiers since graphviz needs unique labels, though maybe
I could make them less prominent. Drawing directly to PDF gets cluttered for
big files, and if you stray from the default settings (I played with it a
bit to get it right), it can look surreal. There would still be some benefit
to having a reportlab-based tree module in Bio.Graphics, and maybe one day
I'll get around to that.
$ ipython -pylab
from Bio import Tree, TreeIO
apaf = TreeIO.read('apaf.xml', 'phyloxml')
Tree.draw_graphviz(apaf)
# http://etal.myweb.uga.edu/phylo-nx-apaf.png
Tree.draw_graphviz(apaf, 'apaf.pdf')
# http://etal.myweb.uga.edu/apaf.pdf
Tree.draw_graphviz(apaf, 'apaf.png', format='png', prog='dot')
# http://etal.myweb.uga.edu/apaf.png -- why it's best to leave the defaults alone
Thoughts: the internal node labels could be clear instead of red; if a node
doesn't have a name, it could check its taxonomy attribute to see if
anything's there; there's probably a way to make pygraphviz understand
distinct nodes that happen to have the same label, although I haven't found
it yet. Is PDF a good default format, or would PNG or PostScript be better?
> > - I have to admit this: I don't know anything about BioSQL. How would I use
> > and test the PhyloDB extension, and what's involved in writing a
> > Biopython interface for it?
>
> BioSQL and the PhyloDB extension are a set of relational database
> tables. Looking at the SVN logs, it appears as if the main work on
> PhyloDB has occurred on PostgreSQL with the MySQL tables perhaps
> lagging behind, so my suggestion is to start with PostgreSQL.
> Hilmar, please feel free to correct me here.
>
> [...]
>
> So it's a bit of an extended task. Time frames being what they are,
> any steps in this direction are useful. If you haven't played with
> BioSQL before, it's worth a look for your own interest. The underlying
> key/value model is really flexible and kind of models RDF triplets. I've
> used BioSQL here recently as the backend for a web app that differs a
> bit from the standard GenBank like thing, and found it very flexible.
>
>
I think I've seen that app, but I thought it was backed by AppEngine. Neat
stuff. I will learn BioSQL for my own benefit, but I don't think there's
enough time left in GSoC for me to add a useful PhyloDB adapter to
Biopython. So that, along with refactoring Nexus.Trees to use
Bio.Tree.BaseTree, would be a good project to continue with in the fall, at
a slower pace and with more discussion along the way.
Cheers,
Eric
From bugzilla-daemon at portal.open-bio.org Thu Aug 6 03:56:25 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 6 Aug 2009 03:56:25 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908060756.n767uPk1031552@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
------- Comment #6 from volkmer at mpi-cbg.de 2009-08-06 03:56 EST -------
(In reply to comment #3)
> I could reproduce this, I think I have fixed
> it.
> hsp.query, hsp.match and hsp.sbjct should all be the same length.
>
> Previously, at the end of each tag our XML parser strips the leading/trailing
> white space from the tag's value before processing it. In the case of
> Hsp_midline this is a very bad idea.
Ok, the fix seems to solve the problem.
Well I guess the only time when this problem appears is when you have
filtered/masked residues at the beginning/end of the query hsp. Otherwise the
hsp would just start with the first match and end with the last one.
Thanks,
Michael
From bugzilla-daemon at portal.open-bio.org Thu Aug 6 04:03:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 6 Aug 2009 04:03:03 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908060803.n76833YJ032257@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-06 04:03 EST -------
(In reply to comment #6)
>
> Ok, the fix seems to solve the problem.
>
Great - I'm marking this bug as fixed, thanks for your time reporting and then
testing this.
> Well I guess the only time when this problem appears is when you have
> filtered/masked residues at the beginning/end of the query hsp. Otherwise
> the hsp would just start with the first match and end with the last one.
I suspect there are other situations where it might happen, but the fix is
general.
Cheers,
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 6 04:06:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 Aug 2009 09:06:43 +0100
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11:
PhyloXML for Biopython
In-Reply-To: <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com>
References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
<20090803223847.GM8112@sobchak.mgh.harvard.edu>
<3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com>
Message-ID: <320fb6e00908060106h5b10442djea3f52fe9827108f@mail.gmail.com>
On Wed, Aug 5, 2009 at 11:31 PM, Eric Talevich wrote:
> OK, it works now but the resulting trees look a little odd. The options
> needed to get a reasonable tree representation are fiddly, so I made
> draw_graphviz() a separate function that basically just handles the RTFM
> work (not trivial), while the graph export still happens in to_networkx().
>
> Here are a few recipes and a taste of each dish. The matplotlib engine seems
> usable for interactive exploration, albeit cluttered -- I can't hide the
> internal clade identifiers since graphviz needs unique labels, though maybe
> I could make them less prominent. ...
Graphviz does need unique names, and the node labels default to the
node name - but you can override this and use a blank label if you want.
How are you calling Graphviz? There are several Python wrappers out
there, or you could just write a dot file directly and call the graphviz
command line tools.
Peter
From eric.talevich at gmail.com Thu Aug 6 08:47:47 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 6 Aug 2009 08:47:47 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11:
PhyloXML for Biopython
In-Reply-To: <320fb6e00908060106h5b10442djea3f52fe9827108f@mail.gmail.com>
References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
<20090803223847.GM8112@sobchak.mgh.harvard.edu>
<3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com>
<320fb6e00908060106h5b10442djea3f52fe9827108f@mail.gmail.com>
Message-ID: <3f6baf360908060547r8f299dao413b3657966fe9f4@mail.gmail.com>
On Thu, Aug 6, 2009 at 4:06 AM, Peter wrote:
> On Wed, Aug 5, 2009 at 11:31 PM, Eric Talevich wrote:
>
> > OK, it works now but the resulting trees look a little odd. The options
> > needed to get a reasonable tree representation are fiddly, so I made
> > draw_graphviz() a separate function that basically just handles the RTFM
> > work (not trivial), while the graph export still happens in to_networkx().
> >
> > Here are a few recipes and a taste of each dish. The matplotlib engine seems
> > usable for interactive exploration, albeit cluttered -- I can't hide the
> > internal clade identifiers since graphviz needs unique labels, though maybe
> > I could make them less prominent. ...
>
> Graphviz does need unique names, and the node labels default to the
> node name - but you can override this and use a blank label if you want.
> How are you calling Graphviz? There are several Python wrappers out
> there, or you could just write a dot file directly and call the graphviz
> command line tools.
>
I'm using the networkx and pygraphviz wrappers, since networkx already
partly wraps pygraphviz.
The direct networkx->matplotlib rendering engine figures out the
associations correctly when I pass a LabeledDiGraph instance, using Clade
objects as nodes and the str() representation as the label -- so
networkx.draw(tree) shows a tree with the internal nodes all labeled as
"Clade". But networkx.draw_graphviz(tree), while otherwise working the same
as the other networkx drawing functions, seems to convert nodes to strings
earlier, and then treats all "Clade" strings as the same node.
Surely there's a way to fix this through the networkx or pygraphviz API, but
I couldn't figure it out yesterday from the documentation and source code.
I'll poke at it some more today and try using blank labels.
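For anyone puzzled by the behaviour described above, here is a minimal, hypothetical illustration (plain Python, not actual networkx or pygraphviz code) of why two distinct Clade objects collapse into one node as soon as a drawing routine keys nodes by their string label rather than by identity:

```python
# Two distinct node objects can share the same str() label, so a drawing
# routine that converts nodes to strings early merges them incorrectly.
class Clade:
    def __str__(self):
        return "Clade"

a, b = Clade(), Clade()
edges = [(a, b)]

# Keyed by object identity: both nodes survive, as in networkx.draw().
by_identity = {id(node): str(node) for edge in edges for node in edge}
# Keyed by label: the two clades collapse into a single node, which is
# the effect seen with the early string conversion.
by_label = {str(node): node for edge in edges for node in edge}

assert len(by_identity) == 2
assert len(by_label) == 1
```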
Thanks,
Eric
From chapmanb at 50mail.com Thu Aug 6 09:14:42 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 6 Aug 2009 09:14:42 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11:
PhyloXML for Biopython
In-Reply-To: <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com>
References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
<20090803223847.GM8112@sobchak.mgh.harvard.edu>
<3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com>
Message-ID: <20090806131442.GG12604@sobchak.mgh.harvard.edu>
Hi Eric;
> OK, it works now but the resulting trees look a little odd. The options
> needed to get a reasonable tree representation are fiddly, so I made
> draw_graphviz() a separate function that basically just handles the RTFM
> work (not trivial), while the graph export still happens in to_networkx().
>
> Here are a few recipes and a taste of each dish. The matplotlib engine seems
> usable for interactive exploration, albeit cluttered -- I can't hide the
> internal clade identifiers since graphviz needs unique labels, though maybe
> I could make them less prominent. Drawing directly to PDF gets cluttered for
> big files, and if you stray from the default settings (I played with it a
> bit to get it right), it can look surreal. There would still be some benefit
> to having a reportlab-based tree module in Bio.Graphics, and maybe one day
> I'll get around to that.
This is a great start. I remember pygraphviz and the networkx
representation being a bit finicky last time I used them; in the end I
made a pygraphviz AGraph directly. Either way, if you can remove the
unneeded labels and change the colorization as you suggested, this will
be a great quick visualization of trees.
Something reportlab-based that looks the way biologists expect a
phylogenetic tree to look would also be very useful; there is real
benefit in familiarity of display. Building something generally
usable like that is a longer-term project.
> I think I've seen that app, but I thought it was backed by AppEngine. Neat
> stuff. I will learn BioSQL for my own benefit, but I don't think there's
> enough time left in GSoC for me to add a useful PhyloDB adapter to
> Biopython. So that, along with refactoring Nexus.Trees to use
> Bio.Tree.BaseTree, would be a good project to continue with in the fall, at
> a slower pace and with more discussion along the way.
Yes, the AppEngine display is also BioSQL on the backend; I ported over
some of the tables to the object representation used in AppEngine. I
also have used the relational schema in work projects -- it
generally is just a good place to get started.
Agreed on the timelines for GSoC. We'd be very happy to have you
continue on those projects into the fall. Both are very useful
additions to the great work you've already done.
Brad
From biopython at maubp.freeserve.co.uk Thu Aug 6 10:39:33 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 Aug 2009 15:39:33 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>
<8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com>
<20090706220453.GI17086@sobchak.mgh.harvard.edu>
<8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com>
<320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com>
<320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com>
<20090708130649.GY17086@sobchak.mgh.harvard.edu>
<320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com>
<320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com>
Message-ID: <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com>
On Tue, Aug 4, 2009 at 8:29 PM, Peter wrote:
> On Thu, Jul 9, 2009 at 10:18 AM, Peter wrote:
>> On Wed, Jul 8, 2009 at 2:06 PM, Brad Chapman wrote:
>>> How about adding a function like "run_arguments" to the
>>> commandlines that returns the commandline as a list.
>>
>> That would be a simple alternative to my vague idea "Maybe we
>> can make the command line wrapper object more list like to make
>> subprocess happy without needing to create a string?", which may
>> not be possible. Either way, this will require a bit of work on the
>> Bio.Application parameter objects...
>
> By defining an __iter__ method, we can make the Biopython
> application wrapper object sufficiently list-like that it can be
> passed directly to subprocess. I think I have something working
> (only tested on Linux so far), at least for the case where none
> of the arguments have spaces or quotes in them.
The current Bio.Application code revolves around generating command line
strings, and works fine cross platform. Making the Bio.Application
objects "list like" and getting this to work cross platform isn't
looking easy. Spaces on Windows are causing me big headaches.
Switching to lists of arguments appears to work fine on Unix
(specifically tested on Linux and Mac OS X), but things are more
complicated on Windows. Basically, using an array/list of arguments is
normal on Unix, but on Windows things get passed as strings. The
upshot is that different Windows tools (or the libraries used to
compile them) have to parse their command line string themselves, so
different tools do it differently. The result is you *may* need to
adopt different spaces/quotes escaping for different command line
tools on Windows.
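As a rough illustration, the "list like" idea under discussion amounts to defining __iter__ on a wrapper object so subprocess can consume it directly. This is only a sketch with a hypothetical class name, not Biopython's actual Bio.Application code:

```python
# Hypothetical wrapper sketch - not the real Bio.Application class.
class CommandlineSketch:
    def __init__(self, program, *arguments):
        self.program = program
        self.arguments = list(arguments)

    def __iter__(self):
        # Yield the program name, then each argument, which is what
        # subprocess.Popen expects from a list-style command.
        return iter([self.program] + self.arguments)

    def __str__(self):
        # Naive string form; real code must worry about quoting.
        return " ".join([self.program] + self.arguments)

cline = CommandlineSketch("echo", "hello world")
# On Unix the list form side-steps shell quoting entirely, e.g.:
#     subprocess.call(list(cline))
```

The attraction is that on Unix the list form never needs quoting at all; as discussed above, the catch is what happens when that list is turned back into a string on Windows.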
Now, if you give subprocess a list, on Windows it must first be turned
into a string, before subprocess can use the Windows API to run it.
The subprocess function list2cmdline does this, but the conventions it
follows are not universal.
I have examples of working command line strings for ClustalW and PRANK
where both the executable and some of the arguments have spaces in
them. It seems the quoting I was using to make ClustalW (or PRANK)
happy cannot be achieved via subprocess.list2cmdline (and I suspect
this applies to other tools too).
I will try and look into this further. However, even if it is
possible, I don't think we can implement the list approach in time for
Biopython 1.51, as there are just too many potential pitfalls.
I have in the meantime extended the command line tool unit tests
somewhat to include more examples with spaces in the filenames.
[I'm beginning to think replacing Bio.Application.generic_run with a
simpler helper function would be easier in the short term, continuing
to just use a string with subprocess, but I haven't given up yet.]
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 6 11:48:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 Aug 2009 16:48:12 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>
<8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com>
<20090706220453.GI17086@sobchak.mgh.harvard.edu>
<8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com>
<320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com>
<320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com>
<20090708130649.GY17086@sobchak.mgh.harvard.edu>
<320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com>
<320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com>
<320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com>
Message-ID: <320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com>
On Thu, Aug 6, 2009 at 3:39 PM, Peter wrote:
> Now, if you give subprocess a list, on Windows it must first be turned
> into a string, before subprocess can use the Windows API to run it.
> The subprocess function list2cmdline does this, but the conventions it
> follows are not universal.
>
> I have examples of working command line strings for ClustalW and PRANK
> where both the executable and some of the arguments have spaces in
> them. It seems the quoting I was using to make ClustalW (or PRANK)
> happy cannot be achieved via subprocess.list2cmdline (and I suspect
> this applies to other tools too).
e.g. This is a valid and working command line for PRANK, which works
both at the command line, or in Python via subprocess when given as
a string:
C:\repository\biopython\Tests>"C:\Program Files\prank.exe"
-d=Quality/example.fasta -o="temp with space" -f=11 -convert
Now, breaking up the arguments according to the description given in
the subprocess.list2cmdline docstring, I think the arguments are:
"C:\Program Files\prank.exe"
-d=Quality/example.fasta
-o="temp with space"
-f=11
-convert
Of these, the middle guy causes problems. By my reading of
the subprocess.list2cmdline docstring this is valid:
>> 2) A string surrounded by double quotation marks is
>> interpreted as a single argument, regardless of white
>> space or pipe characters contained within. A quoted
>> string can be embedded in an argument.
The example -o="temp with space" is a string surrounded by
double quotes, "temp with space", embedded in an argument.
Unfortunately, giving these five strings to subprocess.list2cmdline
results in a mess as it never checks to see if the arguments are
already quoted (as we have done for the program name and also
the output filename base). We can pass the program name in
without the quotes, and list2cmdline will do the right thing. But
there is no way for the -o argument to be handled that I can see.
This may be a bug in subprocess.list2cmdline, but it is certainly
a real limitation in my opinion.
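The mangling described above can be reproduced directly: subprocess.list2cmdline re-quotes any argument containing a space and backslash-escapes the embedded double quotes, rather than leaving an already-quoted argument alone:

```python
from subprocess import list2cmdline

args = [r"C:\Program Files\prank.exe",
        "-d=Quality/example.fasta",
        '-o="temp with space"',
        "-f=11",
        "-convert"]

# list2cmdline wraps each space-containing argument in quotes and
# escapes the embedded double quotes with backslashes, producing:
#   "C:\Program Files\prank.exe" -d=Quality/example.fasta
#       "-o=\"temp with space\"" -f=11 -convert
# i.e. the -o argument's quoting is escaped, which is not what
# PRANK's own command line parsing expects.
print(list2cmdline(args))
```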
So, it would appear that (on Windows) making our command line
wrappers act like lists (by defining __iter__) will not work in general.
The other approach which would allow our command line wrappers
to be passed directly to subprocess is to make them more string
like - but the subprocess code checks for string command lines
using isinstance(args, types.StringTypes) which means we would
have to subclass str (or unicode). I'm not sure if this can be made
to work yet...
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 6 12:05:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 Aug 2009 17:05:24 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>
<20090706220453.GI17086@sobchak.mgh.harvard.edu>
<8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com>
<320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com>
<320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com>
<20090708130649.GY17086@sobchak.mgh.harvard.edu>
<320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com>
<320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com>
<320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com>
<320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com>
Message-ID: <320fb6e00908060905i4a326327t504385ec55b0230c@mail.gmail.com>
On Thu, Aug 6, 2009 at 4:48 PM, Peter wrote:
> The other approach which would allow our command line wrappers
> to be passed directly to subprocess is to make them more string
> like - but the subprocess code checks for string command lines
> using isinstance(args, types.StringTypes) which means we would
> have to subclass str (or unicode). I'm not sure if this can be made
> to work yet...
Thinking about it a bit more, str and unicode are immutable objects,
but we want the command line wrapper to be mutable (e.g. to add,
change or remove parameters and arguments). So it won't work.
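A quick demonstration of the immutability point (CommandString is a made-up name for illustration):

```python
# A str subclass is still immutable: its value is fixed at creation,
# so "adding a parameter" can only mean building a brand new object.
class CommandString(str):
    pass

cline = CommandString("clustalw -infile=input.fasta")
modified = CommandString(cline + " -quiet")

# The original is untouched; there is no in-place mutation to offer.
print(cline)     # clustalw -infile=input.fasta
print(modified)  # clustalw -infile=input.fasta -quiet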
Going back to my original email, we could instead replace
Bio.Application.generic_run:
http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006344.html
>
> Possible helper functions that come to mind are:
> (a) Returns the return code (integer) only. This would basically
> be a cross-platform version of os.system using the subprocess
> module internally.
> (b) Returns the return code (integer) plus the stdout and stderr
> (which would have to be StringIO handles, with the data in
> memory). This would be a direct replacement for the current
> Bio.Application.generic_run function.
> (c) Returns the stdout (and stderr) handles. This basically is
> recreating a deprecated Python popen*() function, which seems
> silly.
Or we just declare both Bio.Application.generic_run and
ApplicationResult obsolete, and simply recommend using subprocess with
str(cline) as before. Would someone like to proofread (and test) the
tutorial in CVS, where I have switched all the generic_run usage to
subprocess?
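Option (b) could be sketched along these lines; run_command is a hypothetical name, not an actual Biopython function:

```python
import subprocess
try:
    from io import StringIO          # Python 3
except ImportError:
    from StringIO import StringIO    # Python 2

def run_command(command_string):
    """Run a command line string, return (return code, stdout, stderr).

    stdout and stderr come back as in-memory StringIO handles, roughly
    mirroring what Bio.Application.generic_run provides.
    """
    child = subprocess.Popen(str(command_string), shell=True,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE,
                             universal_newlines=True)
    stdout, stderr = child.communicate()
    return child.returncode, StringIO(stdout), StringIO(stderr)
```

Note this keeps the whole output in memory, which is fine for the typical generic_run use cases but not for tools producing very large output.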
Peter
From biopython at maubp.freeserve.co.uk Sat Aug 8 07:14:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 12:14:18 +0100
Subject: [Biopython-dev] Bio.SeqIO.convert function?
In-Reply-To: <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com>
References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com>
<20090728220943.GJ68751@sobchak.mgh.harvard.edu>
<320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com>
Message-ID: <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com>
On Wed, Jul 29, 2009 at 8:43 AM, Peter wrote:
> On Tue, Jul 28, 2009 at 11:09 PM, Brad Chapman wrote:
>> Extending this to AlignIO and TreeIO as Eric suggested is
>> also great.
>
> Whatever we do for Bio.SeqIO, we can follow the same pattern
> for Bio.AlignIO etc.
>
>> So +1 from me,
>> Brad
>
> And we basically had a +0 from Michiel, and a +1 from Eric.
> And I like the idea but am not convinced we need it. Maybe
> we should put the suggestion forward on the main discussion
> list for debate?
I've stuck a branch up on github which (thus far) simply defines
the Bio.SeqIO.convert and Bio.AlignIO.convert functions.
Adding optimised code can come later.
http://github.com/peterjc/biopython/commits/convert
Right now (based on the other thread), I've experimented
with making the convert functions accept either handles
or filenames. This will make the convert function even
more of a convenience wrapper, in addition to its role as a
standardised API to allow file format specific optimisations.
Taking handles and/or filenames does rather complicate
things, and not just for remembering to close the handles.
There are issues like should we silently replace any existing
output file (I went for yes), and should the output file be
deleted if the conversion fails part way (I went for no)?
Dealing with just handles would free us from all these
considerations.
You could even consider using Python's temporary file support
to write the file to a temp location, and only at the end move
it to the desired location. However, that is getting far too
complicated for my liking (and may run into permissions
issues on Unix). If anyone wants to do this, they can do it
explicitly in the calling script.
How does this look so far?
Peter
From biopython at maubp.freeserve.co.uk Sat Aug 8 15:41:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 20:41:20 +0100
Subject: [Biopython-dev] Unit tests for deprecated modules?
In-Reply-To: <320fb6e00808190704p4d19eb27if2927466a27f9b2a@mail.gmail.com>
References: <320fb6e00808190352sd6437e0qb2898e39b15287b3@mail.gmail.com>
<48AACE23.3050107@biologie.uni-kl.de>
<320fb6e00808190704p4d19eb27if2927466a27f9b2a@mail.gmail.com>
Message-ID: <320fb6e00908081241r1b23498du43fe19a6cc349c97@mail.gmail.com>
Last year we talked about what to do with the unit tests for deprecated modules,
http://lists.open-bio.org/pipermail/biopython-dev/2008-August/004137.html
On Tue, Aug 19, 2008, Peter wrote:
> Are there any strong views about when to remove unit tests for
> deprecated modules? I can see two main approaches:
>
> (a) Remove the unit test when the code is deprecated, as this avoids
> warning messages from the test suite.
> (b) Remove the unit test only when the deprecated code is actually
> removed, as continuing to test the code will catch any unexpected
> breakage of the deprecated code.
>
> I lean towards (b), but wondered what other people think.
>
> Peter
On Tue, Aug 19, 2008, Michiel de Hoon wrote:
> I would say (a). In my opinion, deprecated means that the module
> is in essence no longer part of Biopython; we just keep it around
> to give people time to change. Also, deprecation warnings distract
> from real warnings and errors in the unit tests, are likely to confuse
> users, and give the impression that Biopython is not clean. I don't
> remember a case where we had to resurrect a deprecated module,
> so we may as well remove the unit test right away.
>
> --Michiel
On Tue, Aug 19, 2008, Frank Kauff wrote:
> I favor option a. Deprecated modules are no longer under development,
> so there's not much need for a unit test. A failed test would probably
> not trigger any action anyway, because nobody's going to do much
> bugfixing in deprecated modules.
>
> Frank
So, what we agreed last year was to remove tests for deprecated
modules. This issue has come up again with the deprecation of
Bio.Fasta, and the question of what to do with test_Fasta.py.
I'd like to suggest a third option: keep the tests for deprecated
modules, but silence the deprecation warning, e.g. have
test_Fasta.py silence the Bio.Fasta deprecation warning. Hiding
the warning would prevent the likely user confusion on running
the test suite (an issue Michiel pointed out last year). Keeping
the test will prevent us accidentally breaking Bio.Fasta during
the phasing-out period.
Any thoughts?
Peter
From biopython at maubp.freeserve.co.uk Sat Aug 8 15:50:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 20:50:47 +0100
Subject: [Biopython-dev] Unit tests for deprecated modules?
In-Reply-To: <320fb6e00908081241r1b23498du43fe19a6cc349c97@mail.gmail.com>
References: <320fb6e00808190352sd6437e0qb2898e39b15287b3@mail.gmail.com>
<48AACE23.3050107@biologie.uni-kl.de>
<320fb6e00808190704p4d19eb27if2927466a27f9b2a@mail.gmail.com>
<320fb6e00908081241r1b23498du43fe19a6cc349c97@mail.gmail.com>
Message-ID: <320fb6e00908081250j189ba590o5cd9c6e98f596193@mail.gmail.com>
On Sat, Aug 8, 2009 at 8:41 PM, Peter wrote:
> Last year we talked about what to do with the unit tests for deprecated modules,
> http://lists.open-bio.org/pipermail/biopython-dev/2008-August/004137.html
> ...
> I'd like to suggest a third option: Keep the tests for deprecated
> modules, but silence the deprecation warning. e.g. make
> test_Fasta.py silence the Bio.Fasta deprecation warning.
I've done that in CVS as a proof of principle, replacing:
from Bio import Fasta
with:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from Bio import Fasta
warnings.resetwarnings()
There may be a more elegant way to do this, but it works.
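For what it's worth, on Python 2.6+ the same effect can be had with the catch_warnings context manager, which restores the previous filter state automatically on exit:

```python
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    # The deprecated import would go here, e.g.:
    # from Bio import Fasta
    warnings.warn("example deprecated usage", DeprecationWarning)
# Outside the block the original warning filters are back in force.
```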
Peter
From bugzilla-daemon at portal.open-bio.org Mon Aug 10 09:43:15 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 10 Aug 2009 09:43:15 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
files in Bio.SeqIO
In-Reply-To:
Message-ID: <200908101343.n7ADhF4c020240@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2837
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #1303 is|0 |1
obsolete| |
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-10 09:43 EST -------
(From update of attachment 1303)
This file is already a tiny bit out of date - I've started working on this on a
git branch.
http://github.com/peterjc/biopython/commits/sff
See also James Casbon's parser, also on github:
http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006456.html
http://github.com/jamescasbon/biopython/tree/sff
It looks like we could try and merge the two. James' code looks like it doesn't
need seek/tell, which means it should work on any input handle (not just an
open file).
Note that neither parser yet copes with paired-end data (and I have not
yet found any test files to work with).
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Mon Aug 10 12:46:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Aug 2009 17:46:16 +0100
Subject: [Biopython-dev] Bio.SeqIO.convert function?
In-Reply-To: <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com>
References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com>
<20090728220943.GJ68751@sobchak.mgh.harvard.edu>
<320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com>
<320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com>
Message-ID: <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com>
On Sat, Aug 8, 2009 at 12:14 PM, Peter wrote:
> I've stuck a branch up on github which (thus far) simply defines
> the Bio.SeqIO.convert and Bio.AlignIO.convert functions.
> Adding optimised code can come later.
>
> http://github.com/peterjc/biopython/commits/convert
There is now a new file Bio/SeqIO/_convert.py on this
branch, and a few optimised conversions have been done.
In particular GenBank/EMBL to FASTA, any FASTQ to
FASTA, and inter-conversion between any of the three
FASTQ formats.
In terms of speed, this new code takes under a minute to
convert a 7 million short read FASTQ file to another FASTQ
variant, or to a (line wrapped) FASTA file. In comparison,
using Bio.SeqIO parse/write takes over five minutes.
In terms of code organisation within Bio/SeqIO/_convert.py
I am (as with Bio.SeqIO etc for parsing and writing) just
using a dictionary of functions, keyed on the format names.
Initially, as you can tell from the code history, I was thinking
about having each sub-function potentially deal with more
than one conversion (e.g. GenBank to anything not needing
features), but I have removed this level of complication in the
most recent commit.
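The dispatch-table organisation described above can be sketched like so. The converter shown is a toy FASTQ-to-FASTA routine with made-up internals, not the code on the branch:

```python
from io import StringIO

def _fastq_to_fasta(in_handle, out_handle):
    # Toy optimised converter: stream FASTQ records four lines at a
    # time, writing title and sequence only (no record objects built).
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break
        seq = in_handle.readline().strip()
        in_handle.readline()  # the "+" separator line
        in_handle.readline()  # the quality line
        out_handle.write(">%s\n%s\n" % (title[1:].strip(), seq))
        count += 1
    return count

# Dictionary of converter functions, keyed on the format name pair:
_converter = {("fastq", "fasta"): _fastq_to_fasta}

def convert(in_handle, in_format, out_handle, out_format):
    try:
        conv = _converter[(in_format, out_format)]
    except KeyError:
        raise ValueError("No converter for %s to %s"
                         % (in_format, out_format))
    return conv(in_handle, out_handle)

out = StringIO()
n = convert(StringIO("@read1\nACGT\n+\n!!!!\n"), "fastq", out, "fasta")
print(n, out.getvalue())  # one record, ">read1\nACGT\n"
```

In the real code the fallback for an unlisted pair would be the generic Bio.SeqIO parse/write route rather than an error.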
The current Bio/SeqIO/_convert.py file actually looks very
long and complicated - but if you ignore the doctests (which
I would probably move to a dedicated unit test), it isn't that
much code at all.
Would anyone like to try this out?
Peter
From eric.talevich at gmail.com Mon Aug 10 13:44:31 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 10 Aug 2009 13:44:31 -0400
Subject: [Biopython-dev] GSoC Weekly Update 12: PhyloXML for Biopython
Message-ID: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com>
Hi folks,
Previously I (Aug. 3-7):
- Refactored the PhyloXML parser somewhat, to behave more like the other
Biopython parsers and also handle 'other' elements better
- Reorganized Bio.Tree a bit, generalizing the Tree base class and
improving BaseTree-PhyloXML interoperability
- Worked on networkx export and graphviz display
- Added some more tests (thanks, Diana!)
- Added TreeIO.convert(), to match the AlignIO and SeqIO modules
Next week (Aug. 10-14) I will:
- Update the wiki documentation
- Fix any surprises that come up during testing
Automated testing:
- Check unit tests for complete coverage
- Re-run performance benchmarks
- Run tests and benchmarks on alternate platforms
- Check epydoc's generated API documentation
Remarks:
- Performance of the I/O functions is close to what it was before, in
  the best of times; parsing Taxonomy nodes incrementally seems to have
  helped.
- Drawing trees with Graphviz is still ugly. Hopefully I can fix it this
  week, but if not, I'll probably do it after GSoC because I like pretty
  things.
- Presumably, any discussion of merging with Biopython will have to wait
  until after the biopython-1.51 release. I'll be around. For GSoC
  requirements, I'm planning on just dumping the Bio.Tree and Bio.TreeIO
  modules along with the unit test suite as standalone files, rather than
  as a patch set, since the last upstream revision I pulled was just a
  random untagged one around the time of the last beta release.
Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML
From matzke at berkeley.edu Mon Aug 10 16:23:15 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 10 Aug 2009 13:23:15 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <20090804222731.GA12604@sobchak.mgh.harvard.edu>
References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com>
<20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
Message-ID: <4A8081B3.2080600@berkeley.edu>
Hi all...updates...
Summary: Major focus is getting the GBIF access/search/parse module into
"done"/submittable shape. This primarily requires getting the
documentation and testing up to Biopython specs. I have a fair bit of
documentation and testing, but need advice (see below) on the specifics
of what it should look like.
Brad Chapman wrote:
> Hi Nick;
> Thanks for the update -- great to see things moving along.
>
>> - removed any reliance on lagrange tree module, refactored all phylogeny
>> code to use the revised Bio.Nexus.Tree module
>
> Awesome -- glad this worked for you. Are the lagrange_* files in
> Bio.Geography still necessary? If not, we should remove them from
> the repository to clean things up.
Ah, they had been deleted locally but it took an extra command to delete
on git. Done.
>
> More generally, it would be really helpful if we could do a bit of
> housekeeping on the repository. The Geography namespace has a lot of
> things in it which belong in different parts of the tree:
>
> - The test code should move to the 'Tests' directory as a set of
> test_Geography* files that we can use for unit testing the code.
OK, I will do this. Should I try and figure out the unittest stuff? I
could use a simple example of what this is supposed to look like.
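For what it's worth, a minimal Tests/test_Geography*-style file generally looks something like this (class and method names here are illustrative only, not the real test suite):

```python
import unittest

class GbifParsingTest(unittest.TestCase):
    """Illustrative test case; real tests would parse saved GBIF XML."""

    def test_latlong_in_range(self):
        # Stand-in assertions; a real test would build a
        # GbifObservationRecord from a sample file and check its fields.
        observed = {"lat": 37.87, "long": -122.26}
        self.assertTrue(-90 <= observed["lat"] <= 90)
        self.assertTrue(-180 <= observed["long"] <= 180)

# A command line entry point would normally be added at the bottom:
#   if __name__ == "__main__":
#       unittest.main()
```

Each test method makes self.assert* calls against known-good data checked into the Tests directory, so the whole file can run unattended.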
> - Similarly, there are a lot of data files in there which
> appear to be test related; these could move to Tests/Geography
Will do.
> - What is happening with the Nodes_v2 and Treesv2 files? They look
> like duplicates of the Nexus Nodes and Trees with some changes.
> Could we roll those changes into the main Nexus code to avoid
> duplication?
Yeah, these were just copies with your bug fix, and with a few mods I
used to track crashes. Presumably I won't need these after a fresh
download of Biopython.
>> - Code dealing with GBIF xml output completely refactored into the
>> following classes:
>>
>> * ObsRecs (observation records & search results/summary)
>> * ObsRec (an individual observation record)
>> * XmlString (functions for cleaning xml returned by Gbif)
>> * GbifXml (extension of capabilities for ElementTree xml trees, parsed
>> from GBIF xml returns)
>
> I'm agreed with Hilmar -- the user classes would probably benefit from expanded
> naming. There is an art to naming, getting them somewhere between the hideous
> RidiculouslyLongNamesWithEverythingSpecified names and short truncated names.
> Specifically, you've got a lot of filler in the names -- dbfUtils,
> geogUtils, shpUtils. The Utils probably doesn't tell the user much
> and makes all of the names sort of blend together, just as the Rec/Recs
> pluralization hides a quite large difference in what the classes hold.
Will work on this; these should be made part of the
GbifObservationRecord() object or be accessed by it. Basically, they only
exist to classify lat/long points into user-specified areas.
> Something like Observation and ObservationSearchResult would make it
> clear immediately what they do and the information they hold.
Agreed, here is a new scheme for the names (changes already made):
=============
class GbifSearchResults():
GbifSearchResults is a class for holding a series of
GbifObservationRecord records, and processing them e.g. into classified
areas.
Also can hold a GbifDarwincoreXmlString record (the raw output returned
from a GBIF search) and a GbifXmlTree (a class for holding/processing
the ElementTree object returned by parsing the GbifDarwincoreXmlString).
class GbifObservationRecord():
GbifObservationRecord is a class for holding an individual observation
at an individual lat/long point.
class GbifDarwincoreXmlString(str):
GbifDarwincoreXmlString is a class for holding the xmlstring returned by
a GBIF search, & processing it to plain text, then an xmltree (an
ElementTree).
GbifDarwincoreXmlString inherits string methods from str (class String).
class GbifXmlTree():
GbifXmlTree is a class for holding and processing xmltrees of GBIF records.
=============
...description of methods below...
>
>> This week:
>
> What are your thoughts on documentation? As a naive user of these
> tools without much experience with the formats, I could offer better
> feedback if I had an idea of the public APIs and how they are
> expected to be used. Moreover, cookbook and API documentation is something
> we will definitely need to integrate into Biopython. How does this fit
> in your timeline for the remaining weeks?
The API is really just the interface with GBIF. I think developing a
cookbook entry is pretty easy; I assume you want something like one of
the entries in the official Biopython cookbook?
Re: API documentation...are you just talking about the function
descriptions that are typically in """ """ strings beneath the function
definitions? I've got that done. Again, if there is more, an example
of what it should look like would be useful.
Documentation for the GBIF stuff below.
============
gbif_xml.py
Functions for accessing GBIF, downloading records, processing them into
a class, and extracting information from the xmltree in that class.
class GbifObservationRecord(Exception): pass
class GbifObservationRecord():
GbifObservationRecord is a class for holding an individual observation
at an individual lat/long point.
__init__(self):
This is an instantiation class for setting up new objects of this class.
latlong_to_obj(self, line):
Read in a string, read species/lat/long to GbifObservationRecord object
This can be slow, e.g. 10 seconds for even just ~1000 records.
parse_occurrence_element(self, element):
Parse a TaxonOccurrence element, store in OccurrenceRecord
fill_occ_attribute(self, element, el_tag, format='str'):
Return the text found in matching element matching_el.text.
find_1st_matching_subelement(self, element, el_tag, return_element):
Burrow down into the XML tree, retrieve the first element with the
matching tag.
record_to_string(self):
Print the attributes of a record to a string
class GbifDarwincoreXmlString(Exception): pass
class GbifDarwincoreXmlString(str):
GbifDarwincoreXmlString is a class for holding the xmlstring returned by
a GBIF search, & processing it to plain text, then an xmltree (an
ElementTree).
GbifDarwincoreXmlString inherits string methods from str (class String).
__init__(self, rawstring=None):
This is an instantiation class for setting up new objects of this class.
fix_ASCII_lines(self, endline=''):
Convert each line in an input string into pure ASCII
(This avoids crashes when printing to screen, etc.)
_fix_ASCII_line(self, line):
Convert a single string line into pure ASCII
(This avoids crashes when printing to screen, etc.)
_unescape(self, text):
Removes HTML or XML character references and entities from a text string.
@param text The HTML (or XML) source text.
@return The plain text, as a Unicode string, if necessary.
source: http://effbot.org/zone/re-sub.htm#unescape-html
_fix_ampersand(self, line):
Replaces "&amp;" with "&" in a string; this is otherwise
not caught by the unescape and unicodedata.normalize functions.
class GbifXmlTreeError(Exception): pass
class GbifXmlTree():
GbifXmlTree is a class for holding and processing xmltrees of GBIF records.
__init__(self, xmltree=None):
This is an instantiation class for setting up new objects of this class.
print_xmltree(self):
Prints all the elements & subelements of the xmltree to screen (may require
fix_ASCII to input file to succeed)
print_subelements(self, element):
Takes an element from an XML tree and prints the subelements tag & text, and
the within-tag items (key/value or whatnot)
_element_items_to_dictionary(self, element_items):
If the XML tree element has items encoded in the tag, e.g. key/value or
whatever, this function puts them in a python dictionary and returns
them.
extract_latlongs(self, element):
Create a temporary pseudofile, extract lat longs to it,
return results as string.
Inspired by: http://www.skymind.com/~ocrow/python_string/
(Method 5: Write to a pseudo file)
_extract_latlong_datum(self, element, file_str):
Searches an element in an XML tree for lat/long information, and the
complete name. Searches recursively, if there are subelements.
file_str is a string created by StringIO in extract_latlongs() (i.e., a
temp filestr)
extract_all_matching_elements(self, start_element, el_to_match):
Returns a list of the elements, picking elements by TaxonOccurrence;
this should return a list of elements equal to the number of hits.
_recursive_el_match(self, element, el_to_match, output_list):
Search recursively through xmltree, starting with element, recording all
instances of el_to_match.
find_to_elements_w_ancs(self, el_tag, anc_el_tag):
Burrow into XML to get an element with tag el_tag, return only those
el_tags underneath a particular parent element parent_el_tag
xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag,
match_el_list):
Recursively burrows down to find whatever elements with el_tag exist
inside a parent_el_tag.
create_sub_xmltree(self, element):
Create a subset xmltree (to avoid going back to irrelevant parents)
_xml_burrow_up(self, element, anc_el_tag, found_anc):
Burrow up xml to find anc_el_tag
_xml_burrow_up_cousin(element, cousin_el_tag, found_cousin):
Burrow up from element of interest, until a cousin is found with
cousin_el_tag
_return_parent_in_xmltree(self, child_to_search_for):
Search through an xmltree to get the parent of child_to_search_for
_return_parent_in_element(self, potential_parent, child_to_search_for,
returned_parent):
Search through an XML element to return parent of child_to_search_for
find_1st_matching_element(self, element, el_tag, return_element):
Burrow down into the XML tree, retrieve the first element with the
matching tag
extract_numhits(self, element):
Search an element of a parsed XML string and find the
number of hits, if it exists. Recursively searches,
if there are subelements.
class GbifSearchResultsError(Exception): pass
class GbifSearchResults():
GbifSearchResults is a class for holding a series of
GbifObservationRecord records, and processing them e.g. into classified
areas.
__init__(self, gbif_recs_xmltree=None):
This is an instantiation class for setting up new objects of this class.
print_records(self):
Print all records in tab-delimited format to screen.
print_records_to_file(self, fn):
Print the attributes of a record to a file with filename fn
latlongs_to_obj(self):
Takes the string from extract_latlongs, puts each line into a
GbifObservationRecord object.
Return a list of the objects
Functions devoted to accessing/downloading GBIF records
access_gbif(self, url, params):
Helper function to access various GBIF services
choose the URL ("url") from here:
http://data.gbif.org/ws/rest/occurrence
params are a dictionary of key/value pairs
"self._open" is from Bio.Entrez.self._open, online here:
http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#self._open
Get the handle of results
(the result is a file-like handle object)
(open with results_handle.read() )
_get_hits(self, params):
Get the actual hits that are returned by a given search
(this allows parsing & gradual downloading of searches larger
than e.g. 1000 records)
It will return the LAST non-none instance (in a standard search result there
should be only one, anyway).
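The "last non-none instance" behaviour described above can be sketched as a small helper (a hypothetical stand-in, not the actual _get_hits code):

```python
def last_non_none(items):
    """Return the last item that is not None, or None if all are."""
    result = None
    for item in items:
        if item is not None:
            result = item
    return result
```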
get_xml_hits(self, params):
Returns hits like _get_hits, but returns a parsed XML tree.
get_record(self, key):
Given the key, get a single record, return xmltree for it.
get_numhits(self, params):
Get the number of hits that will be returned by a given search
(this allows parsing & gradual downloading of searches larger
than e.g. 1000 records)
It will return the LAST non-none instance (in a standard search result there
should be only one, anyway).
xmlstring_to_xmltree(self, xmlstring):
Take the text string returned by GBIF and parse to an XML tree using
ElementTree.
Requires the intermediate step of saving to a temporary file
(required to make ElementTree.parse work, apparently)
tempfn = 'tempxml.xml'
fh = open(tempfn, 'w')
fh.write(xmlstring)
fh.close()
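For what it's worth, ElementTree can also parse a string directly, which might make the temporary file unnecessary (untested against real GBIF output; the XML snippet below is made up):

```python
from xml.etree import ElementTree

# A made-up XML snippet standing in for a GBIF response string.
xmlstring = "<gbifResponse><TaxonOccurrence key='1'/></gbifResponse>"
root = ElementTree.fromstring(xmlstring)  # parses the string, no temp file
```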
get_all_records_by_increment(self, params, inc):
Download all of the records in stages, store in list of elements.
Increments of e.g. 100 to not overload server
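The incremental download described above amounts to a simple paging loop; here is a hedged sketch in which fetch_page is a hypothetical stand-in for the actual per-page GBIF request:

```python
def get_all_by_increment(fetch_page, numhits, inc=100):
    """Fetch records in pages of size `inc` until `numhits` is covered."""
    pages = []
    for start in range(0, numhits, inc):
        pages.append(fetch_page(start, inc))  # one request per page of hits
    return pages
```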
extract_occurrences_from_gbif_xmltree_list(self, gbif_xmltree):
Extract all of the 'TaxonOccurrence' elements to a list, store them in a
GbifObservationRecord.
_paramsdict_to_string(self, params):
Converts the python dictionary of search parameters into a text
string for submission to GBIF
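The standard library already does this kind of dict-to-query-string conversion; a sketch (the parameter names here are illustrative, not the real GBIF keys):

```python
from urllib.parse import urlencode

# Illustrative parameters only; consult the GBIF service docs for real keys.
params = {"scientificname": "Homo sapiens", "maxresults": 100}
query = urlencode(params)  # e.g. "scientificname=Homo+sapiens&maxresults=100"
```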
_open(self, cgi, params={}):
Function for accessing online databases.
Modified from:
http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html
Helper function to build the URL and open a handle to it (PRIVATE).
Open a handle to GBIF. cgi is the URL for the cgi script to access.
params is a dictionary with the options to pass to it. Does some
simple error checking, and will raise an IOError if it encounters one.
This function also enforces the "three second rule" to avoid abusing
the GBIF servers (modified after NCBI requirement).
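The "three second rule" can be sketched like this (a minimal stand-alone version, not the actual Bio.Entrez code):

```python
import time

_last_request = [0.0]  # time of the previous request (module-level state)

def wait_politely(min_interval=3.0):
    """Sleep just long enough that requests are >= min_interval apart."""
    delay = _last_request[0] + min_interval - time.time()
    if delay > 0:
        time.sleep(delay)
    _last_request[0] = time.time()
```

A call to wait_politely() would then go at the top of _open, before each request is issued.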
============
>
> Thanks again. Hope this helps,
> Brad
Very much, thanks!!
Nick
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================
From matzke at berkeley.edu Mon Aug 10 16:25:10 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 10 Aug 2009 13:25:10 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <4A8081B3.2080600@berkeley.edu>
References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com>
<20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
<4A8081B3.2080600@berkeley.edu>
Message-ID: <4A808226.5020302@berkeley.edu>
PS: Evidence of interest in this GBIF functionality already, see fwd
below...
PPS: Commit with updated names and deleted old files here:
http://github.com/nmatzke/biopython/commits/Geography
-------- Original Message --------
Subject: Re: biogeopython
Date: Fri, 07 Aug 2009 16:34:26 -0700
From: Nick Matzke
Reply-To: matzke at berkeley.edu
Organization: Dept. Integ. Biology, UC Berkeley
To: James Pringle
References:
<4A7C6DEE.1000305 at berkeley.edu>
Coolness, let me know how it works for you, feedback appreciated at this
stage. Cheers!
Nick
James Pringle wrote:
> Thanks!
> Jamie
>
> On Fri, Aug 7, 2009 at 2:09 PM, Nick Matzke > wrote:
>
> Hi Jamie!
>
> It's still under development, eventually it will be a biopython
> module, but what I've got should do exactly what you need.
>
> Just take the files from the most recent commit here:
> http://github.com/nmatzke/biopython/commits/Geography
>
> ...and run test_gbif_xml.py to get the idea, it will search on a
> taxon name, count/download all hits, parse the xml to a set of
> record objects, output each record to screen or tab-delimited file,
> etc.
>
> Cheers!
> Nick
>
>
>
>
>
> James Pringle wrote:
>
> Dear Mr. Matzke--
>
> I am an oceanographer at the University of New Hampshire, and
> with my colleagues John Wares and Jeb Byers am looking at the
> interaction of ocean circulation and species ranges. As part
> of that effort, I am using GBIF data, and was looking at your
> Summer-of-Code project. I want to start from a species name
> and get lat/long of occurrence data. Is your toolbox in usable
> shape (I am an ok pythonista)? What is the best way to download
> a tested version of it (I can figure out how to get code from
> CVS/GIT, etc, so I am just looking for a pointer to a stable-ish
> tree)?
>
> Cheers,
> & Thanks
> Jamie Pringle
>
>
> --
> ====================================================
> Nicholas J. Matzke
> Ph.D. Candidate, Graduate Student Researcher
> Huelsenbeck Lab
> Center for Theoretical Evolutionary Genomics
> 4151 VLSB (Valley Life Sciences Building)
> Department of Integrative Biology
> University of California, Berkeley
>
> Lab websites:
> http://ib.berkeley.edu/people/lab_detail.php?lab=54
> http://fisher.berkeley.edu/cteg/hlab.html
> Dept. personal page:
> http://ib.berkeley.edu/people/students/person_detail.php?person=370
> Lab personal page:
http://fisher.berkeley.edu/cteg/members/matzke.html
> Lab phone: 510-643-6299
> Dept. fax: 510-643-6264
> Cell phone: 510-301-0179
> Email: matzke at berkeley.edu
>
> Mailing address:
> Department of Integrative Biology
> 3060 VLSB #3140
> Berkeley, CA 94720-3140
>
> -----------------------------------------------------
> "[W]hen people thought the earth was flat, they were wrong. When
> people thought the earth was spherical, they were wrong. But if you
> think that thinking the earth is spherical is just as wrong as
> thinking the earth is flat, then your view is wronger than both of
> them put together."
>
> Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical
> Inquirer, 14(1), 35-44. Fall 1989.
> http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
> ====================================================
>
>
Nick Matzke wrote:
> Hi all...updates...
>
> Summary: Major focus is getting the GBIF access/search/parse module into
> "done"/submittable shape. This primarily requires getting the
> documentation and testing up to biopython specs. I have a fair bit of
> documentation and testing, need advice (see below) for specifics on what
> it should look like.
>
>
> Brad Chapman wrote:
>> Hi Nick;
>> Thanks for the update -- great to see things moving along.
>>
>>> - removed any reliance on lagrange tree module, refactored all
>>> phylogeny code to use the revised Bio.Nexus.Tree module
>>
>> Awesome -- glad this worked for you. Are the lagrange_* files in
>> Bio.Geography still necessary? If not, we should remove them from
>> the repository to clean things up.
>
>
> Ah, they had been deleted locally but it took an extra command to delete
> on git. Done.
>
>>
>> More generally, it would be really helpful if we could do a bit of
>> housekeeping on the repository. The Geography namespace has a lot of
>> things in it which belong in different parts of the tree:
>>
>> - The test code should move to the 'Tests' directory as a set of
>> test_Geography* files that we can use for unit testing the code.
>
> OK, I will do this. Should I try and figure out the unittest stuff? I
> could use a simple example of what this is supposed to look like.
>
>
>> - Similarly there are a lot of data files in there which
>> appear to be test related; these could move to Tests/Geography
>
> Will do.
>
>> - What is happening with the Nodes_v2 and Treesv2 files? They look
>> like duplicates of the Nexus Nodes and Trees with some changes.
>> Could we roll those changes into the main Nexus code to avoid
>> duplication?
>
> Yeah, these were just copies with your bug fix, and with a few mods I
> used to track crashes. Presumably I don't need these after a fresh
> download of biopython.
>
>
>
>>> - Code dealing with GBIF xml output completely refactored into the
>>> following classes:
>>>
>>> * ObsRecs (observation records & search results/summary)
>>> * ObsRec (an individual observation record)
>>> * XmlString (functions for cleaning xml returned by Gbif)
>>> * GbifXml (extension of capabilities for ElementTree xml trees,
>>> parsed from GBIF xml returns).
>>
>> I'm agreed with Hilmar -- the user classes would probably benefit from
>> expanded naming. There is an art to naming to get them somewhere between
>> the hideous RidiculouslyLongNamesWithEverythingSpecified names and short
>> truncated names.
>> Specifically, you've got a lot of filler in the names -- dbfUtils,
>> geogUtils, shpUtils. The Utils probably doesn't tell the user much
>> and makes all of the names sort of blend together, just as the
>> Rec/Recs pluralization hides a quite large difference in what the
>> classes hold.
>
> Will work on this, these should be made part of the
> GbifObservationRecord() object or be accessed by it, basically they only
> exist to classify lat/long points into user-specified areas.
>
>> Something like Observation and ObservationSearchResult would make it
>> clear immediately what they do and the information they hold.
>
>
> Agreed, here is a new scheme for the names (changes already made):
>
> =============
> class GbifSearchResults():
>
> GbifSearchResults is a class for holding a series of
> GbifObservationRecord records, and processing them e.g. into classified
> areas.
>
> Also can hold a GbifDarwincoreXmlString record (the raw output returned
> from a GBIF search) and a GbifXmlTree (a class for holding/processing
> the ElementTree object returned by parsing the GbifDarwincoreXmlString).
>
>
>
> class GbifObservationRecord():
>
> GbifObservationRecord is a class for holding an individual observation
> at an individual lat/long point.
>
>
>
> class GbifDarwincoreXmlString(str):
>
> GbifDarwincoreXmlString is a class for holding the xmlstring returned by
> a GBIF search, & processing it to plain text, then an xmltree (an
> ElementTree).
>
> GbifDarwincoreXmlString inherits string methods from str (class String).
>
>
>
> class GbifXmlTree():
> gbifxml is a class for holding and processing xmltrees of GBIF records.
> =============
>
> ...description of methods below...
>
>
>>
>>> This week:
>>
>> What are your thoughts on documentation? As a naive user of these
>> tools without much experience with the formats, I could offer better
>> feedback if I had an idea of the public APIs and how they are
>> expected to be used. Moreover, cookbook and API documentation is
>> something we will definitely need to integrate into Biopython. How
>> does this fit in your timeline for the remaining weeks?
>
> The API is really just the interface with GBIF. I think developing a
> cookbook entry is pretty easy, I assume you want something like one of
> the entries in the official biopython cookbook?
>
> Re: API documentation...are you just talking about the function
> descriptions that are typically in """ """ strings beneath the function
> definitions? I've got that done. Again, if there is more, an example
> of what it should look like would be useful.
>
> Documentation for the GBIF stuff below.
>
> ============
> gbif_xml.py
> Functions for accessing GBIF, downloading records, processing them into
> a class, and extracting information from the xmltree in that class.
>
>
> class GbifObservationRecordError(Exception): pass
> class GbifObservationRecord():
> GbifObservationRecord is a class for holding an individual observation
> at an individual lat/long point.
>
>
> __init__(self):
>
> This is an instantiation class for setting up new objects of this class.
>
>
>
> latlong_to_obj(self, line):
>
> Read in a string, read species/lat/long to GbifObservationRecord object
> This can be slow, e.g. 10 seconds for even just ~1000 records.
>
>
> parse_occurrence_element(self, element):
>
> Parse a TaxonOccurrence element, store in OccurrenceRecord
>
>
> fill_occ_attribute(self, element, el_tag, format='str'):
>
> Return the text found in matching element matching_el.text.
>
>
>
> find_1st_matching_subelement(self, element, el_tag, return_element):
>
> Burrow down into the XML tree, retrieve the first element with the
> matching tag.
>
>
> record_to_string(self):
>
> Print the attributes of a record to a string
>
>
>
>
>
>
>
> class GbifDarwincoreXmlStringError(Exception): pass
>
> class GbifDarwincoreXmlString(str):
> GbifDarwincoreXmlString is a class for holding the xmlstring returned by
> a GBIF search, & processing it to plain text, then an xmltree (an
> ElementTree).
>
> GbifDarwincoreXmlString inherits string methods from str (class String).
>
>
>
> __init__(self, rawstring=None):
>
> This is an instantiation class for setting up new objects of this class.
>
>
>
> fix_ASCII_lines(self, endline=''):
>
> Convert each line in an input string into pure ASCII
> (This avoids crashes when printing to screen, etc.)
>
>
> _fix_ASCII_line(self, line):
>
> Convert a single string line into pure ASCII
> (This avoids crashes when printing to screen, etc.)
>
>
> _unescape(self, text):
>
> Removes HTML or XML character references and entities from a text string.
>
> @param text The HTML (or XML) source text.
> @return The plain text, as a Unicode string, if necessary.
> source: http://effbot.org/zone/re-sub.htm#unescape-html
>
>
> _fix_ampersand(self, line):
>
> Replaces "&" with "&amp;" in a string; this is otherwise
> not caught by the unescape and unicodedata.normalize functions.
>
>
>
>
>
>
>
> class GbifXmlTreeError(Exception): pass
> class GbifXmlTree():
> gbifxml is a class for holding and processing xmltrees of GBIF records.
>
> __init__(self, xmltree=None):
>
> This is an instantiation class for setting up new objects of this class.
>
>
> print_xmltree(self):
>
> Prints all the elements & subelements of the xmltree to screen (may require
> fix_ASCII to input file to succeed)
>
>
> print_subelements(self, element):
>
> Takes an element from an XML tree and prints the subelements tag & text,
> and the within-tag items (key/value or whatnot)
>
>
> _element_items_to_dictionary(self, element_items):
>
> If the XML tree element has items encoded in the tag, e.g. key/value or
> whatever, this function puts them in a python dictionary and returns
> them.
>
>
> extract_latlongs(self, element):
>
> Create a temporary pseudofile, extract lat longs to it,
> return results as string.
>
> Inspired by: http://www.skymind.com/~ocrow/python_string/
> (Method 5: Write to a pseudo file)
>
>
>
>
> _extract_latlong_datum(self, element, file_str):
>
> Searches an element in an XML tree for lat/long information, and the
> complete name. Searches recursively, if there are subelements.
>
> file_str is a string created by StringIO in extract_latlongs() (i.e., a
> temp filestr)
>
>
>
> extract_all_matching_elements(self, start_element, el_to_match):
>
> Returns a list of the elements, picking elements by TaxonOccurrence;
> this should return a list of elements equal to the number of hits.
>
>
>
> _recursive_el_match(self, element, el_to_match, output_list):
>
> Search recursively through xmltree, starting with element, recording all
> instances of el_to_match.
>
>
> find_to_elements_w_ancs(self, el_tag, anc_el_tag):
>
> Burrow into XML to get an element with tag el_tag, return only those
> el_tags underneath a particular parent element parent_el_tag
>
>
> xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag,
> match_el_list):
>
> Recursively burrows down to find whatever elements with el_tag exist
> inside a parent_el_tag.
>
>
>
> create_sub_xmltree(self, element):
>
> Create a subset xmltree (to avoid going back to irrelevant parents)
>
>
>
> _xml_burrow_up(self, element, anc_el_tag, found_anc):
>
> Burrow up xml to find anc_el_tag
>
>
>
> _xml_burrow_up_cousin(element, cousin_el_tag, found_cousin):
>
> Burrow up from element of interest, until a cousin is found with
> cousin_el_tag
>
>
>
>
> _return_parent_in_xmltree(self, child_to_search_for):
>
> Search through an xmltree to get the parent of child_to_search_for
>
>
>
> _return_parent_in_element(self, potential_parent, child_to_search_for,
> returned_parent):
>
> Search through an XML element to return parent of child_to_search_for
>
>
> find_1st_matching_element(self, element, el_tag, return_element):
>
> Burrow down into the XML tree, retrieve the first element with the
> matching tag
>
>
>
>
> extract_numhits(self, element):
>
> Search an element of a parsed XML string and find the
> number of hits, if it exists. Recursively searches,
> if there are subelements.
>
>
>
>
>
>
>
>
>
>
>
>
> class GbifSearchResultsError(Exception): pass
>
> class GbifSearchResults():
>
> GbifSearchResults is a class for holding a series of
> GbifObservationRecord records, and processing them e.g. into classified
> areas.
>
>
>
> __init__(self, gbif_recs_xmltree=None):
>
> This is an instantiation class for setting up new objects of this class.
>
>
>
> print_records(self):
>
> Print all records in tab-delimited format to screen.
>
>
>
>
> print_records_to_file(self, fn):
>
> Print the attributes of a record to a file with filename fn
>
>
>
> latlongs_to_obj(self):
>
> Takes the string from extract_latlongs, puts each line into a
> GbifObservationRecord object.
>
> Return a list of the objects
>
>
> Functions devoted to accessing/downloading GBIF records
> access_gbif(self, url, params):
>
> Helper function to access various GBIF services
>
> choose the URL ("url") from here:
> http://data.gbif.org/ws/rest/occurrence
>
> params are a dictionary of key/value pairs
>
> "self._open" is from Bio.Entrez.self._open, online here:
> http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#self._open
>
> Get the handle of results
> (the result is a file-like handle object, e.g. "<... object at 0x48117f0>>")
>
> (open with results_handle.read() )
>
>
> _get_hits(self, params):
>
> Get the actual hits that are returned by a given search
> (this allows parsing & gradual downloading of searches larger
> than e.g. 1000 records)
>
> It will return the LAST non-none instance (in a standard search result
> there should be only one, anyway).
>
>
>
>
> get_xml_hits(self, params):
>
> Returns hits like _get_hits, but returns a parsed XML tree.
>
>
>
>
> get_record(self, key):
>
> Given the key, get a single record, return xmltree for it.
>
>
>
> get_numhits(self, params):
>
> Get the number of hits that will be returned by a given search
> (this allows parsing & gradual downloading of searches larger
> than e.g. 1000 records)
>
> It will return the LAST non-none instance (in a standard search result
> there should be only one, anyway).
>
>
> xmlstring_to_xmltree(self, xmlstring):
>
> Take the text string returned by GBIF and parse to an XML tree using
> ElementTree.
> Requires the intermediate step of saving to a temporary file
> (required to make ElementTree.parse work, apparently)
>
>
>
> tempfn = 'tempxml.xml'
> fh = open(tempfn, 'w')
> fh.write(xmlstring)
> fh.close()
>
>
>
>
>
> get_all_records_by_increment(self, params, inc):
>
> Download all of the records in stages, store in list of elements.
> Increments of e.g. 100 to not overload server
>
>
>
> extract_occurrences_from_gbif_xmltree_list(self, gbif_xmltree):
>
> Extract all of the 'TaxonOccurrence' elements to a list, store them in a
> GbifObservationRecord.
>
>
>
> _paramsdict_to_string(self, params):
>
> Converts the python dictionary of search parameters into a text
> string for submission to GBIF
>
>
>
> _open(self, cgi, params={}):
>
> Function for accessing online databases.
>
> Modified from:
> http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html
>
> Helper function to build the URL and open a handle to it (PRIVATE).
>
> Open a handle to GBIF. cgi is the URL for the cgi script to access.
> params is a dictionary with the options to pass to it. Does some
> simple error checking, and will raise an IOError if it encounters one.
>
> This function also enforces the "three second rule" to avoid abusing
> the GBIF servers (modified after NCBI requirement).
> ============
>
>
>>
>> Thanks again. Hope this helps,
>> Brad
>
> Very much, thanks!!
> Nick
>
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================
From biopython at maubp.freeserve.co.uk Mon Aug 10 16:49:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Aug 2009 21:49:29 +0100
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <4A8081B3.2080600@berkeley.edu>
References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
<4A8081B3.2080600@berkeley.edu>
Message-ID: <320fb6e00908101349o46162d5n4b91819895c32f8f@mail.gmail.com>
On Mon, Aug 10, 2009 at 9:23 PM, Nick Matzke wrote:
> Hi all...updates...
>
> Summary: Major focus is getting the GBIF access/search/parse module into
> "done"/submittable shape. This primarily requires getting the documentation
> and testing up to biopython specs. I have a fair bit of documentation and
> testing, need advice (see below) for specifics on what it should look like.
>
>> - The test code should move to the 'Tests' directory as a set of
>> test_Geography* files that we can use for unit testing the code.
>
> OK, I will do this. Should I try and figure out the unittest stuff? I
> could use a simple example of what this is supposed to look like.
You can either go for "unittest" based tests (generally better, but more
of a learning curve - but useful for any python project), or our own
Biopython specific "print and compare" tests (basically sample scripts
with their expected output).
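A minimal sketch of the unittest style, with a stand-in parsing function (the real Bio.Geography API is assumed here, not shown):

```python
import unittest

def latlong_from_line(line):
    """Stand-in parser: 'species<TAB>lat<TAB>long' -> (species, float, float)."""
    name, lat, lon = line.split("\t")
    return name, float(lat), float(lon)

class LatLongTest(unittest.TestCase):
    def test_parse(self):
        rec = latlong_from_line("Genus_species\t37.87\t-122.26")
        self.assertEqual(rec[0], "Genus_species")
        self.assertAlmostEqual(rec[1], 37.87)
        self.assertAlmostEqual(rec[2], -122.26)

if __name__ == "__main__":
    unittest.main(exit=False)
```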
Read the tests chapter in the Biopython Tutorial if you haven't already.
(And if you think anything could be clearer, or you spot a typo, let us
know please - feedback would be great).
Peter
From matzke at berkeley.edu Mon Aug 10 17:10:26 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 10 Aug 2009 14:10:26 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <320fb6e00908101349o46162d5n4b91819895c32f8f@mail.gmail.com>
References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
<4A8081B3.2080600@berkeley.edu>
<320fb6e00908101349o46162d5n4b91819895c32f8f@mail.gmail.com>
Message-ID: <4A808CC2.6000308@berkeley.edu>
Peter wrote:
> On Mon, Aug 10, 2009 at 9:23 PM, Nick Matzke wrote:
>> Hi all...updates...
>>
>> Summary: Major focus is getting the GBIF access/search/parse module into
>> "done"/submittable shape. This primarily requires getting the documentation
>> and testing up to biopython specs. I have a fair bit of documentation and
>> testing, need advice (see below) for specifics on what it should look like.
>>
>>> - The test code should move to the 'Tests' directory as a set of
>>> test_Geography* files that we can use for unit testing the code.
>> OK, I will do this. Should I try and figure out the unittest stuff? I
>> could use a simple example of what this is supposed to look like.
>
> You can either go for "unittest" based tests (generally better, but more
> of a learning curve - but useful for any python project), or our own
> Biopython specific "print and compare" tests (basically sample scripts
> with their expected output).
>
> Read the tests chapter in the Biopython Tutorial if you haven't already.
> (And if you think anything could be clearer, or you spot a typo, let us
> know please - feedback would be great).
Thanks!
Nick
>
> Peter
>
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================
From biopython at maubp.freeserve.co.uk Tue Aug 11 08:19:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Aug 2009 13:19:25 +0100
Subject: [Biopython-dev] Bio.SeqIO.convert function?
In-Reply-To: <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com>
References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com>
<20090728220943.GJ68751@sobchak.mgh.harvard.edu>
<320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com>
<320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com>
<320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com>
Message-ID: <320fb6e00908110519k313d6d34g40502fd2578326e1@mail.gmail.com>
On Mon, Aug 10, 2009 at 5:46 PM, Peter wrote:
> In terms of speed, this new code takes under a minute to
> convert a 7 million short read FASTQ file to another FASTQ
> variant, or to a (line wrapped) FASTA file. In comparison,
> using Bio.SeqIO parse/write takes over five minutes.
If anyone is interested in the details, here I am using a 7 million
entry FASTQ file of short reads (length 36bp) from a Solexa FASTQ
format file (downloaded from the NCBI and then converted from the
Sanger FASTQ format). I'm timing conversion from Solexa to Sanger
FASTQ as it is a more common operation, and I can include the MAQ
script for comparison. I pipe the output via grep and word count as a
check on the conversion.
Using a (patched) version of MAQ's fq_all2std.pl we get about 4 mins:
$ time perl ../biopython/Tests/Quality/fq_all2std.pl sol2std
SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l
7047668
real 3m58.978s
user 4m13.475s
sys 0m3.705s
And using a patched version of EMBOSS 6.1.0 (without the optimisations
Peter Rice has mentioned), we get 3m42s.
$ time seqret -filter -sformat fastq-solexa -osformat fastq-sanger <
SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l
7047668
real 3m41.625s
user 3m56.753s
sys 0m4.091s
Using the latest Biopython in CVS (or the git master branch), with
Bio.SeqIO.parse/write, takes about twice this, 7m11s:
$ time python biopython_solexa2sanger.py < SRR001666_1.fastq_solexa |
grep "^@SRR" | wc -l
7047668
real 7m10.706s
user 7m27.597s
sys 0m3.850s
This is at least a marked improvement over Biopython 1.51b with
Bio.SeqIO.parse/write, which took about 17 minutes! The bad news is
while the Bio.SeqIO FASTQ read/write in CVS is faster than in
Biopython 1.51b, it is also much less elegant. I think once I've
finished adding test cases (and probably after 1.51 is out) it might
be worthwhile trying to make it more beautiful without sacrificing
too much of the speed gain.
Now to the good news, using my github branch with the convert function
we get a massive reduction to under a minute (52s):
$ time python convert_solexa2sanger.py < SRR001666_1.fastq_solexa |
grep "^@SRR" | wc -l
7047668
real 0m51.618s
user 1m7.735s
sys 0m3.162s
We have a winner! Assuming of course there are no mistakes ;)
In fact, these measurements are a little misleading because I am
including grep (to check the record count) and the output isn't
actually going to disk. Doing the grep on its own takes about 15s:
$ time grep "^@SRR" SRR001666_1.fastq_solexa | wc -l
7047668
real 0m15.318s
user 0m17.890s
sys 0m1.087s
However, if you actually output to a file the disk speed itself
becomes important when the conversion is this fast:
$ time python convert_solexa2sanger.py < SRR001666_1.fastq_solexa > temp.fastq
real 1m3.448s
user 0m49.672s
sys 0m4.826s
$ time seqret -filter -sformat fastq-solexa -osformat fastq-sanger <
SRR001666_1.fastq_solexa > temp.fastq
real 3m55.086s
user 3m39.548s
sys 0m5.998s
$ time perl ../biopython/Tests/Quality/fq_all2std.pl sol2std
SRR001666_1.fastq_solexa > temp.fastq
real 4m10.245s
user 3m54.880s
sys 0m5.085s
$ time python ../biopython/Tests/Quality/biopython_solexa2sanger.py <
SRR001666_1.fastq_solexa > temp.fastq
real 7m27.879s
user 7m9.084s
sys 0m6.008s
Nevertheless, the Bio.SeqIO.convert(...) function still wins for now.
Peter
For those interested, here are the tiny little Biopython scripts I'm using:
# biopython_solexa2sanger.py
#FASTQ conversion using Bio.SeqIO, needs Biopython 1.50 or later.
import sys
from Bio import SeqIO
records = SeqIO.parse(sys.stdin, "fastq-solexa")
SeqIO.write(records, sys.stdout, "fastq")
and:
#convert_solexa2sanger.py
#High performance FASTQ conversion using Bio.SeqIO.convert(...)
#function likely to be in Biopython 1.52 onwards.
import sys
from Bio import SeqIO
SeqIO.convert(sys.stdin, "fastq-solexa", sys.stdout, "fastq")
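For anyone curious what the Solexa-to-Sanger step in these scripts actually
does, it boils down to a per-character quality remapping. Here is a minimal
pure-Python sketch of that mapping (the standard log-odds to PHRED formula
with the Solexa 64 vs Sanger 33 ASCII offsets - this is just an
illustration, not Biopython's actual implementation):

```python
import math

def solexa_to_phred(q_solexa):
    # Convert a Solexa quality (log-odds scale) to a PHRED quality.
    return 10 * math.log10(10 ** (q_solexa / 10.0) + 1)

def convert_quality_string(solexa_quals):
    # Solexa FASTQ encodes qualities with ASCII offset 64,
    # Sanger FASTQ with ASCII offset 33.
    return "".join(
        chr(int(round(solexa_to_phred(ord(c) - 64))) + 33)
        for c in solexa_quals
    )
```

Note the two scales converge for high qualities (Solexa 40 maps to PHRED
40.0004), but diverge near zero (Solexa 0 maps to PHRED 3).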
From chapmanb at 50mail.com Tue Aug 11 09:10:19 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 11 Aug 2009 09:10:19 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <4A8081B3.2080600@berkeley.edu>
References: <20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
<4A8081B3.2080600@berkeley.edu>
Message-ID: <20090811131019.GW12604@sobchak.mgh.harvard.edu>
Hi Nick;
> Summary: Major focus is getting the GBIF access/search/parse module into
> "done"/submittable shape. This primarily requires getting the
> documentation and testing up to biopython specs. I have a fair bit of
> documentation and testing, need advice (see below) for specifics on what
> it should look like.
Awesome. Thanks for working on the cleanup for this.
> OK, I will do this. Should I try and figure out the unittest stuff? I
> could use a simple example of what this is supposed to look like.
In addition to Peter's pointers, here is a simple example from a
small thing I wrote:
http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py
You can copy/paste the unit test part to get a base, and then
replace the t_* functions with your own real tests.
Simple scripts that generate consistent output are also fine; that's
the print and compare approach.
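A minimal skeleton of the unittest style being described might look like
the following (the class name and stand-in data here are invented for
illustration - replace them with real GBIF search/parse calls):

```python
import unittest

class GbifParseTest(unittest.TestCase):
    # Hypothetical tests; swap the stand-in lists for real parsed output.
    def test_record_count(self):
        records = ["rec1", "rec2", "rec3"]  # stand-in for parsed results
        self.assertEqual(len(records), 3)

    def test_record_names(self):
        records = ["rec1", "rec2", "rec3"]
        for name in records:
            self.assertTrue(name.startswith("rec"))

def run_tests():
    # Collect and run the tests, returning the TestResult.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(GbifParseTest)
    return suite.run(unittest.TestResult())
```

The test methods play the role of the t_* functions mentioned above
(unittest's default loader wants them named test_*).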
> > - What is happening with the Nodes_v2 and Treesv2 files? They look
> > like duplicates of the Nexus Nodes and Trees with some changes.
> > Could we roll those changes into the main Nexus code to avoid
> > duplication?
>
> Yeah, these were just copies with your bug fix, and with a few mods I
> used to track crashes. Presumably I don't need these after a fresh
> download of biopython.
Cool. It would be great if we could weed these out as well.
> The API is really just the interface with GBIF. I think developing a
> cookbook entry is pretty easy, I assume you want something like one of
> the entries in the official biopython cookbook?
Yes, that would work great. What I was thinking of are some examples
where you provide background and motivation: Describe some useful
information you want to get from GBIF, and then show how to do it.
This is definitely the most useful part as it gives people working
examples to start with. From there they can usually browse the lower
level docs or code to figure out other specific things.
> Re: API documentation...are you just talking about the function
> descriptions that are typically in """ """ strings beneath the function
> definitions? I've got that done. Again, if there is more, an example
> of what it should look like would be useful.
That looks great for API level docs. You are right on here; for this
week I'd focus on the cookbook examples and cleanup stuff.
My other suggestion would be to rename these to follow Biopython
conventions, something like:
gbif_xml -> GbifXml
shpUtils -> ShapefileUtils
geogUtils -> GeographyUtils
dbfUtils -> DbfUtils
The *Utils might have underscores if they are not intended to be
called directly.
Thanks for all your hard work,
Brad
From chapmanb at 50mail.com Tue Aug 11 09:20:57 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 11 Aug 2009 09:20:57 -0400
Subject: [Biopython-dev] GSoC Weekly Update 12: PhyloXML for Biopython
In-Reply-To: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com>
References: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com>
Message-ID: <20090811132057.GX12604@sobchak.mgh.harvard.edu>
Hi Eric;
All sounds great -- looks like you are in good shape for finishing
things up this week. Really great work.
> - Presumably, any discussion of merging with Biopython will have to wait
> until after the biopython-1.51 release. I'll be around. For GSoC
> requirements, I'm planning on just dumping the Bio.Tree and Bio.TreeIO
> modules along with the unit test suite as standalone files, rather than
> as a patch set since the last upstream revision I pulled was just a
> random untagged one around the time of the last beta release.
We were discussing a release at the end of this week or over the
weekend. I think we should roll this in soon after that so anyone
can get it from the main trunk. I don't see any major issues with
integrating it.
How did you like the Git/GitHub experience? One thing we should push
after this release is moving over to that as the official
repository. Since you have been doing full time Git work this
summer, your experience will be really helpful. I still rely on CVS
as a bit of a crutch, but should learn to do things fully in Git.
Brad
From biopython at maubp.freeserve.co.uk Tue Aug 11 12:13:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Aug 2009 17:13:58 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <320fb6e00908060905i4a326327t504385ec55b0230c@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>
<8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com>
<320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com>
<320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com>
<20090708130649.GY17086@sobchak.mgh.harvard.edu>
<320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com>
<320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com>
<320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com>
<320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com>
<320fb6e00908060905i4a326327t504385ec55b0230c@mail.gmail.com>
Message-ID: <320fb6e00908110913x6cfe7826xa683a6dc130da26e@mail.gmail.com>
On Thu, Aug 6, 2009 at 5:05 PM, Peter wrote:
> Or we just declare both Bio.Application.generic_run and
> ApplicationResult obsolete, and simply recommend using
> subprocess with str(cline) as before. Would someone like to
> proof read (and test) the tutorial in CVS where I switched all
> the generic_run usage to subprocess?
>
I've just marked Bio.Application.generic_run and ApplicationResult as
obsolete in CVS.
I am content to wait for a consensus about any replacement for
generic_run once more people have tried using subprocess directly.
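For anyone following along, the subprocess pattern being recommended in
place of generic_run looks roughly like this (a sketch using a plain
command string where Biopython would supply str(cline) from one of its
application wrappers):

```python
import subprocess

# Stand-in for str(cline) from a Biopython application wrapper:
cline = "echo Hello World"

# Run the command, capturing stdout and stderr separately.
child = subprocess.Popen(str(cline), shell=True,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
stdout, stderr = child.communicate()
return_code = child.returncode
```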
Peter
From biopython at maubp.freeserve.co.uk Tue Aug 11 12:44:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Aug 2009 17:44:11 +0100
Subject: [Biopython-dev] Drafting announcement for Biopython 1.51?
Message-ID: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
Hi David & John,
Would either of you be able to draft a release announcement for
Biopython 1.51? We're aiming for the end of this week... touch wood.
I'm pretty sure the NEWS and DEPRECATED files are up to date (if
anyone can spot any omissions, please let us know), these try and
summarise changes for each release. Unless you have CVS or git
installed, the easiest way to read these files is currently from the
github website:
http://github.com/biopython/biopython/tree/master
Thanks,
Peter
P.S. Don't be afraid to repeat things from the Biopython 1.51 beta announcement:
http://news.open-bio.org/news/2009/06/biopython-151-beta-released/
http://lists.open-bio.org/pipermail/biopython-announce/2009-June/000057.html
From eric.talevich at gmail.com Tue Aug 11 14:50:02 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 11 Aug 2009 14:50:02 -0400
Subject: [Biopython-dev] GSoC Weekly Update 12: PhyloXML for Biopython
In-Reply-To: <20090811132057.GX12604@sobchak.mgh.harvard.edu>
References: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com>
<20090811132057.GX12604@sobchak.mgh.harvard.edu>
Message-ID: <3f6baf360908111150q495e541bv405b25f0d74127fd@mail.gmail.com>
On Tue, Aug 11, 2009 at 9:20 AM, Brad Chapman wrote:
>
> How did you like the Git/GitHub experience? One thing we should push
> after this release is moving over to that as the official
> repository. Since you have been doing full time Git work this
> summer, your experience will be really helpful. I still rely on CVS
> as a bit of a crutch, but should learn to do things fully in Git.
>
>
I liked it a lot! I've spent some time with Subversion, Bazaar, Mercurial
and Git now, and I'm confident that Git was the right choice for Biopython.
My commit history shows a quick flurry of activity on each of the past few
Fridays -- that's from a couple days of exploration toward the end of the
week, then repeated calls to "git add -i" to pick out the parts that are
worth keeping. I'm careful with git-rebase, but "git commit --amend" gets a
fair amount of use. I could add a section on the Biopython wiki's GitUsage
page, called something like "Managing Commits", giving some examples of
this.
GitHub has been down briefly a few times. It was only a problem because it
happened on Monday mornings, when I wanted to push an updated README to my
public fork at the same time as my weekly update e-mail to this list. Having
a mirror on GitHub is great for getting started with Biopython development,
but I'm still unclear on how changes should propagate back upstream after
Biopython switches from CVS to Git. Pull requests? Core devs pushing to a
central Git repository on OBF servers? Maybe the BioRuby folks have advice;
if this has been settled on biopython-dev, I've missed it.
Anyway.
To create the final patch tarball next Monday for GSoC, I believe the right
incantation looks like this:
git format-patch -o gsoc-phyloxml master...phyloxml
tar czf gsoc-phyloxml.tgz gsoc-phyloxml
That's cleaner than I expected it to be. Neat.
Cheers,
Eric
From winda002 at student.otago.ac.nz Wed Aug 12 01:47:13 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 12 Aug 2009 17:47:13 +1200
Subject: [Biopython-dev] Drafting announcement for Biopython 1.51?
In-Reply-To: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
Message-ID: <4A825761.10106@student.otago.ac.nz>
Peter wrote:
> Hi David & John,
>
> Would either of you be able to draft a release announcement for
> Biopython 1.51? We're aiming for the end of this week... touch wood.
>
We'll definitely aim to have something for the list to check out in the
next 24hrs. I guess the main points are all the Cool New Stuff from the
beta being in a stable release for the first time, FASTQ has been shown
to play nicely across a bunch of projects and
Application.generic_run() is now on the deprecation path?
On that note, would it be useful to have a cookbook example or even a
blog-post ready to go showing a few of the ways one might use subprocess
to run commands defined with Biopython? I'm happy to put something
together that others can evaluate.
Cheers,
David
From biopython at maubp.freeserve.co.uk Wed Aug 12 05:49:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 Aug 2009 10:49:50 +0100
Subject: [Biopython-dev] Drafting announcement for Biopython 1.51?
In-Reply-To: <4A825761.10106@student.otago.ac.nz>
References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
<4A825761.10106@student.otago.ac.nz>
Message-ID: <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
On Wed, Aug 12, 2009 at 6:47 AM, David
Winter wrote:
>
> Peter wrote:
>>
>> Hi David & John,
>>
>> Would either of you be able to draft a release announcement for
>> Biopython 1.51? We're aiming for the end of this week... touch wood.
>
> We'll definitely aim to have something for the list to check out in the next
> 24hrs. I guess the main points are all the Cool New Stuff from the beta
> being in a stable release for the first time, FASTQ has been shown to play
> nicely across a bunch of projects and Application.generic_run() is now
> on the deprecation path?
Historically we haven't made a big thing about deprecations in the
release announcements. Maybe we should - in which case also
note that Bio.Fasta has finally been deprecated.
> On that note, would it be useful to have a cookbook example or even a
> blog-post ready to go showing a few of the ways one might use subprocess to
> run commands defined with Biopython? I'm happy to put something together
> that others can evaluate.
The tutorial has several examples at the end of the chapter on
alignments (because lots of the wrappers at the moment are for
alignment tools). I've just updated the copy online to the current
version from CVS (dated 10 August 2009). If you can spot any
errors in the next couple of days we can get them fixed before
the release.
Peter
From biopython at maubp.freeserve.co.uk Wed Aug 12 08:54:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 Aug 2009 13:54:15 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
<21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com>
<320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com>
<21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>
<320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
<21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com>
<15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com>
<320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com>
<15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com>
<320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
Message-ID: <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
On Thu, Jul 23, 2009 at 10:34 AM, Peter wrote:
> On Wed, Jul 22, 2009 at 9:51 PM, James Casbon wrote:
>> I don't think there is much in it really. You have a factored
>> BinaryFile class, I have classes for the components of the SFF file.
>> Both are based around struct.
I have now written a third variant (loosely based on Jose's code).
This is just a single generator function (also based on struct).
Right now it is a slightly long function, but it can be refactored
easily enough. It is also a lot faster than Jose's code, which is a
big plus for large files. See:
http://github.com/peterjc/biopython/tree/sff
I haven't compared my new code against yours for speed yet
James, because your parser didn't like my large SFF file. You
have hard coded it to expect read names of length 14, and
400 flows per read. I have some data from Sanger where the
read names are length 14, but there are 800 flows per read.
Having the two reference parsers to look at was educational,
so thank you both (James and Jose) for sharing your code.
I now understand the SFF file format much better, and am now
confident I could design an indexer to provide dictionary like
access to it - a possible addition to Bio.SeqIO - see this thread:
http://lists.open-bio.org/pipermail/biopython/2009-June/005312.html
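As background, the SFF common header that these struct-based parsers all
start from is documented in the NCBI trace archive spec (31 bytes,
big-endian). A standalone sketch of unpacking it - just an illustration of
the documented layout, not any of the parsers being compared here:

```python
import struct

# SFF common header layout per the published spec: magic number, version,
# index offset/length, read count, header/key/flow sizes, and the
# flowgram format code (31 bytes total, big-endian).
COMMON_HEADER = ">I4sQIIHHHB"

def parse_common_header(data):
    fields = struct.unpack_from(COMMON_HEADER, data)
    header = dict(zip(
        ("magic_number", "version", "index_offset", "index_length",
         "number_of_reads", "header_length", "key_length",
         "number_of_flows", "flowgram_format"), fields))
    if header["magic_number"] != 0x2E736666:  # b".sff" as an integer
        raise ValueError("Not an SFF file (bad magic number)")
    return header
```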
> Jose's code uses seek/tell which means it has to have a handle
> to an actual file. He also used binary read mode - I'm not sure if
> this was essential or not.
Binary mode was not essential - opening an SFF file in default
mode also seemed to work fine with Jose's code.
> James' code seems to make a single pass through the file handle,
> without using seek/tell to jump about. I think this is nicer, as it is
> consistent with the other SeqIO parsers, and should work on
> more types of handles (e.g. from gzip, StringIO, or even a
> network connection).
I've also avoided using seek/tell in my rewrite.
> It looks like you (James) construct Seq objects using the full
> untrimmed sequence as is. I was undecided on if trimmed or
> untrimmed should be the default, but the idea of some kind of
> masked or trimmed Seq object had come up on the mailing list
> which might be useful here (and in contig alignments). i.e.
> something which acts like a Seq object giving the trimmed
> sequence, but which also contains the full sequence and trim
> positions.
I'm still thinking about this. One simplistic option (as used on
my branch) would be to have two input formats in Bio.SeqIO,
one untrimmed and one trimmed, e.g. "sff" and "sff-trim".
Peter
From winda002 at student.otago.ac.nz Wed Aug 12 20:32:55 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Thu, 13 Aug 2009 12:32:55 +1200
Subject: [Biopython-dev] Draft announcement for Biopython 1.51
In-Reply-To: <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
<4A825761.10106@student.otago.ac.nz>
<320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
Message-ID: <4A835F37.5040907@student.otago.ac.nz>
Hi all, here is a draft announcement to go out when 1.51 is built and
ready to go. Comments and corrections are very welcome (should we keep
the deprecation paragraph in?)
I've also added a draft post to the OBF blog with this text marked up
with links and ready to go, hopefully that way whoever builds the
release can just ask someone with an account there (Brad and Peter at
least) to push the post once everything is ready.
++
We are pleased to announce the release of Biopython 1.51. This new stable
release enhances version 1.50 (released in April) by extending the
functionality of existing modules, adding a set of application wrappers
for popular alignment programs and fixing a number of minor bugs.
In particular, the SeqIO module can now write genbank files that include
features and deal with FASTQ files created by Illumina 1.3+. Support for
this format allows interconversion between FASTQ files using Sloexa,
Sanger and Ilumina quality scores and has been validated against the the
BioPerl and EMBOSS implementations of this format.
Biopython 1.51 is the first stable release to include the
Align.Applications module which allows users to define command line
wrappers for popular alignment programs including ClustalW, Muscle and
T-Coffee.
This new release also spells the beginning of the end for some of
Biopython's older tools. Bio.Fasta and the application tools
ApplicationResult and generic_run() have been marked as deprecated which
means they can still be imported but doing so will warn the user that these
functions will be removed in the future. Bio.Fasta has been superseded
by SeqIO's support for the Fasta format while we now suggest using the
subprocess module from the Python Standard Library to call applications
- use of this module is extensively documented in section 6.3 of the
Biopython Tutorial and Cookbook.
As always the Tutorial and Cookbook has been updated to document the
other changes made since the last release.
Thank you to everyone who tested our 1.51 beta or submitted bugs since
our last stable release, and to all of our contributors.
Sources and Windows Installer for the new release are available from the
downloads page.
++
From winda002 at student.otago.ac.nz Wed Aug 12 20:37:12 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Thu, 13 Aug 2009 12:37:12 +1200
Subject: [Biopython-dev] Drafting announcement for Biopython 1.51?
In-Reply-To: <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
<4A825761.10106@student.otago.ac.nz>
<320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
Message-ID: <4A836038.1060609@student.otago.ac.nz>
>> On that note, would it be useful to have a cookbook example or even a
>> blog-post ready to go showing a few of the ways one might use subprocess to
>> run commands defined with Biopython? I'm happy to put something together
>> that others can evaluate.
>>
>
> The tutorial has several examples at the end of the chapter on
> alignments (because lots of the wrappers at the moment are for
> alignment tools). I've just updated the copy online to the current
> version from CVS (dated 10 August 2009). If you can spot any
> errors in the next couple of days we can get them fixed before
> the release.
>
> Peter
>
>
OK, I had only looked at the doc strings (my editor chokes on long
text files and I don't have anything to set Tex docs with) so didn't
know that existed. That looks really good (and the feeding output into
handles bit is pretty wizardly!)
Cheers,
David
From biopython at maubp.freeserve.co.uk Thu Aug 13 06:00:49 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Aug 2009 11:00:49 +0100
Subject: [Biopython-dev] Drafting announcement for Biopython 1.51?
In-Reply-To: <4A836038.1060609@student.otago.ac.nz>
References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
<4A825761.10106@student.otago.ac.nz>
<320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
<4A836038.1060609@student.otago.ac.nz>
Message-ID: <320fb6e00908130300x3b4f1eb7m7711b76e0e03fd8a@mail.gmail.com>
On Thu, Aug 13, 2009 at 1:37 AM, David
Winter wrote:
>
> OK, I had only looked at the doc strings (my editor chokes on long
> text files and I don't have anything to set Tex docs with) so didn't
> know that existed.
TeX or LaTeX files are just plain text with some magic markup
e.g. \emph{text to emphasise}. Any decent text editor should
be able to load them, and some will even colour code things.
Even if you don't understand the markup, most of the time you
can actually read the raw files directly and understand them.
But yeah, the PDF or HTML output is what most people will
want to look at ;)
> That looks really good (and the feeding output into handles
> bit is pretty wizardly!)
Yeah - it is pretty cool. Sadly not all command line tools will
accept input via stdin, so this kind of thing isn't always
possible.
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 13 06:10:44 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Aug 2009 11:10:44 +0100
Subject: [Biopython-dev] Draft announcement for Biopython 1.51
In-Reply-To: <4A835F37.5040907@student.otago.ac.nz>
References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
<4A825761.10106@student.otago.ac.nz>
<320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
<4A835F37.5040907@student.otago.ac.nz>
Message-ID: <320fb6e00908130310n8efa09dv81963277e607da52@mail.gmail.com>
Thanks for the first draft David,
On Thu, Aug 13, 2009 at 1:32 AM, David
Winter wrote:
> In particular, the SeqIO module can now write genbank files that include
> features and deal with FASTQ files created by Illumina 1.3+. Support for
> this format allows interconversion between FASTQ files using Sloexa, Sanger
> and Ilumina quality scores and has been validated against the the BioPerl
> and EMBOSS implementations of this format.
Typo: Sloexa -> Solexa. I would probably rephrase the rest a little, there
are some subtleties with 3 container formats but only 2 scoring systems...
In particular, the SeqIO module can now write GenBank with features, and
deal with FASTQ files created by Illumina 1.3+. Support for this format
allows interconversion between FASTQ files using the Sanger, Solexa
or Illumina 1.3+ FASTQ variants, using conventions agreed with the
BioPerl and EMBOSS projects.
[BioPerl and EMBOSS are still working on the FASTQ variants, so we
haven't actually got everything cross validated yet.]
> This new release also spells the beginning of the end for some of
> Biopython's older tools. Bio.Fasta and the application tools
> ApplicationResult and generic_run() have been marked as deprecated which
> means they can still be imported but doing so will warn the user that these
> functions will be removed in the future. Bio.Fasta has been superseded by
> SeqIO's support for the Fasta format while we now suggest using the
> subprocess module from the Python Standard Library to call applications -
> use of this module is extensively documented in section 6.3 of the Biopython
> Tutorial and Cookbook.
I would omit that, or at least cut it down a lot. It might also be worth
mentioning we no longer include Martel/Mindy, and thus don't have
any dependence on mxTextTools. Also we don't support Python 2.3
anymore.
P.S. I try and avoid referring to sections of the Tutorial by number, as
these often change from release to release.
Thanks,
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 13 09:02:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Aug 2009 14:02:17 +0100
Subject: [Biopython-dev] [Biopython] Trimming adaptors sequences
In-Reply-To: <20090813124432.GB90165@sobchak.mgh.harvard.edu>
References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com>
<20090810131650.GP12604@sobchak.mgh.harvard.edu>
<320fb6e00908121621m25e37f20pbd8e5e01c26b13a7@mail.gmail.com>
<20090813124432.GB90165@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00908130602n607add6fme67f7934234a5540@mail.gmail.com>
On Thu, Aug 13, 2009 at 1:44 PM, Brad Chapman wrote:
>> However, if you just want speed AND you really want to have a FASTQ
>> input file, try the underlying Bio.SeqIO.QualityIO.FastqGeneralIterator
>> parser which gives plain strings, and handle the output yourself. Working
>> directly with Python strings is going to be faster than using Seq and
>> SeqRecord objects. You can even opt for outputting FASTQ files - as
>> long as you leave the qualities as an encoded string, you can just slice
>> that too. The downside is the code will be very specific. e.g. something
>> along these lines:
>>
>> from Bio.SeqIO.QualityIO import FastqGeneralIterator
>> in_handle = open(input_fastq_filename)
>> out_handle = open(output_fastq_filename, "w")
>> for title, seq, qual in FastqGeneralIterator(in_handle):
>>     # Do trim logic here on the string seq
>>     if trim:
>>         seq = seq[start:end]
>>         qual = qual[start:end]  # kept as ASCII string!
>>     # Save the (possibly trimmed) FASTQ record:
>>     out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
>> out_handle.close()
>> in_handle.close()
>
> Nice -- I will have to play with this. I hadn't dug into the current
> SeqRecord slicing code at all but I wonder if there is a way to keep
> the SeqRecord interface but incorporate some of these speed ups
> for common cases like this FASTQ trimming.
I suggest we continue this on the dev mailing list (this reply
is cross posted), as it is starting to get rather technical.
When you really care about speed, any object creation becomes
an issue. Right now for *any* record we have at least the
following objects being created: SeqRecord, Seq, two lists (for
features and dbxrefs), two dicts (for annotation and the per letter
annotation), and the restricted dict (for per letter annotations),
and at least four strings (sequence, id, name and description).
Perhaps some lazy instantiation might be worth exploring... for
example make dbxref, features, annotations or letter_annotations
into properties where the underlying object isn't created unless
accessed. [Something to try after Biopython 1.51 is out?]
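That lazy instantiation idea could be prototyped with ordinary Python
properties. A toy sketch (class and attribute names invented here for
illustration - this is not the real SeqRecord):

```python
class LazyRecord:
    """Toy record that defers building its containers until accessed."""

    def __init__(self, seq, id):
        self.seq = seq
        self.id = id
        self._features = None  # list not created yet

    @property
    def features(self):
        # Build the empty list only on first access.
        if self._features is None:
            self._features = []
        return self._features
```

Records that never touch .features then skip one list allocation each,
which is the kind of saving that adds up over millions of short reads.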
I would guess (but haven't timed it) that for trimming FASTQ
SeqRecords, a big part of the overhead is that we are using
Python lists of integers (rather than just a string) for the scores.
So sticking with the current SeqRecord object as is, one speed
up we could try would be to leave the FASTQ quality string as an
encoded string (rather than turning it into integer quality scores,
and back again on output). It would be a hack, but adding this
as another SeqIO format name, e.g. "fastq-raw" or "fastq-ascii",
might work. We'd still need a new letter_annotations key, say
"fastq_qual_ascii". This idea might work, but it does seem ugly.
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 13 13:33:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Aug 2009 18:33:41 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <1250099579.4a83017b4e97c@webmail.upv.es>
References: <200904161146.28203.jblanca@btc.upv.es>
<21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>
<320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
<21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com>
<15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com>
<320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com>
<15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com>
<320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
<320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
<1250099579.4a83017b4e97c@webmail.upv.es>
Message-ID: <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com>
[Jose - you didn't CC the list with your reply]
On Wed, Aug 12, 2009 at 6:52 PM, Blanca Postigo Jose
Miguel wrote:
>
> Hi:
>
> I just love free software :) It's great to watch how the code is being improved
> by the work of so many people. I hope to get some time to get a look at the
> latest sff reader.
You'll probably be interested to know I've made some excellent progress
with the (optional) SFF index block. I note that the specifications (both
on the NCBI page and in the Roche manual) appear to suggest that the
index block could appear in the middle of the read data. However,
in all the examples I have looked at, the index is actually at the end.
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#sff
Sadly the format of the index isn't documented, but I think I have
reverse engineered the format that Roche SFF files are using. In a
slight twist of the specification they are actually using the index block
for both XML meta data AND an index of the read offsets.
This will dovetail nicely with the indexing support in Bio.SeqIO
which I am working on for Biopython 1.52, branch on github.
I expect to have fast random access to reads in an SFF file
very soon. See http://github.com/peterjc/biopython/tree/convert
>> > It looks like you (James) construct Seq objects using the full
>> > untrimmed sequence as is. I was undecided on if trimmed or
>> > untrimmed should be the default, but the idea of some kind of
>> > masked or trimmed Seq object had come up on the mailing list
>> > which might be useful here (and in contig alignments). i.e.
>> > something which acts like a Seq object giving the trimmed
>> > sequence, but which also contains the full sequence and trim
>> > positions.
>>
>> I'm still thinking about this. One simplistic option (as used on
>> my branch) would be to have two input formats in Bio.SeqIO,
>> one untrimmed and one trimmed, e.g. "sff" and "sff-trim".
>
> I think that some way to mask the SeqRecord or Seq object
> would be great. It would be useful for many tasks, not just this
> one.
Sure - if we can come up with a suitable design...
Peter
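[For anyone wanting to experiment with this, the documented part of the SFF
layout - the 31-byte big-endian common header, which records where the
optional index block lives - can be sketched in Python. This is an
illustrative sketch using the field names from the NCBI spec linked above,
not Biopython's actual parser:]

```python
import struct

def read_sff_common_header(handle):
    """Parse the 31-byte big-endian SFF common header (per the NCBI spec)."""
    data = handle.read(31)
    (magic, version, index_offset, index_length, number_of_reads,
     header_length, key_length, number_of_flows, flowgram_format) = \
        struct.unpack(">I4sQIIHHHB", data)
    # The magic number is the four bytes ".sff"
    assert magic == 0x2E736666, "Not an SFF file (bad magic number)"
    return {"index_offset": index_offset,
            "index_length": index_length,
            "number_of_reads": number_of_reads,
            "header_length": header_length,
            "key_length": key_length,
            "number_of_flows": number_of_flows,
            "flowgram_format": flowgram_format}
```

[With index_offset and index_length in hand you can seek straight to the
(undocumented) index block, wherever the file placed it.]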
From biopython at maubp.freeserve.co.uk Thu Aug 13 13:38:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Aug 2009 18:38:43 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
<320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
<21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com>
<15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com>
<320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com>
<15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com>
<320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
<320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
<1250099579.4a83017b4e97c@webmail.upv.es>
<320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com>
Message-ID: <320fb6e00908131038v567ed86fjb775d810fb69e7d@mail.gmail.com>
Peter wrote:
>
> Sadly the format of the index isn't documented, but I think I have
> reverse engineered the format that Roche SFF files are using. In a
> slight twist of the specification they are actually using the index block
> for both XML meta data AND an index of the read offsets.
I'm not the first to notice this, see for example the Celera Assembler
looks in a Roche SFF file's XML meta data to determine how the
quality scores were called:
http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Roche_454_Platforms
Peter
From jblanca at btc.upv.es Fri Aug 14 02:01:42 2009
From: jblanca at btc.upv.es (Blanca Postigo Jose Miguel)
Date: Fri, 14 Aug 2009 08:01:42 +0200
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00908131456o2103578cs3ed1130c622307a2@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
<21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com>
<15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com>
<320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com>
<15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com>
<320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
<320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
<1250099579.4a83017b4e97c@webmail.upv.es>
<320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com>
<1250192416.4a846c2045f94@webmail.upv.es>
<320fb6e00908131456o2103578cs3ed1130c622307a2@mail.gmail.com>
Message-ID: <1250229702.4a84fdc6c403a@webmail.upv.es>
Message quoted by Peter:
> On Thu, Aug 13, 2009 at 8:40 PM, Blanca Postigo Jose
> Miguel wrote:
> >
> >> This will dovetail nicely with the indexing support in Bio.SeqIO
> >> which I am working on for Biopython 1.52, branch on github.
> >> I expect to have fast random access to reads in an SFF file
> >> very soon. See http://github.com/peterjc/biopython/tree/convert
> >
> > I've written some code to solve a similar problem. Maybe you
> > could take a look to it. It's in the classes FileIndex and
> > FileSequenceIndex at:
> >
> >
> http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/biolib_seqio_utils.py
> >
>
> Did you see this thread?
> http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html
>
> The coding style is quite different, but it looks like the essential idea
> is the same - we both scan the file to find each record, and use
> a dictionary to record the offset. Interestingly, you and Peio also
> keep the record's length in the dictionary, which will double the
> memory requirements - for something you don't actually need.
>
> Peter
>
> P.S. You can forward or CC this back to the list if you like.
We keep the record length to be able to return the record without having to scan
the file again.
Jose Blanca
From biopython at maubp.freeserve.co.uk Fri Aug 14 05:36:31 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 14 Aug 2009 10:36:31 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <1250229702.4a84fdc6c403a@webmail.upv.es>
References: <200904161146.28203.jblanca@btc.upv.es>
<320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com>
<15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com>
<320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
<320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
<1250099579.4a83017b4e97c@webmail.upv.es>
<320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com>
<1250192416.4a846c2045f94@webmail.upv.es>
<320fb6e00908131456o2103578cs3ed1130c622307a2@mail.gmail.com>
<1250229702.4a84fdc6c403a@webmail.upv.es>
Message-ID: <320fb6e00908140236v547ea056g965c7b7cd61d555c@mail.gmail.com>
On Fri, Aug 14, 2009 at 7:01 AM, Blanca Postigo Jose
Miguel wrote:
>
>> The coding style is quite different, but it looks like the essential idea
>> is the same - we both scan the file to find each record, and use
>> a dictionary to record the offset. Interestingly, you and Peio also
>> keep the record's length in the dictionary, which will double the
>> memory requirements - for something you don't actually need.
>
> We keep the record length to be able to return the record without
> having to scan the file again.
If you want to be able to extract the raw record, that makes sense.
It is still a trade off between memory usage and speed of access,
and depending on your requirements either way makes sense.
For Bio.SeqIO, I want to parse the raw record on access via the
key in order to return a SeqRecord, so I have no need to keep
the raw record length in memory. I'm using this github branch:
http://github.com/peterjc/biopython/commits/index
Peter
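[The offset-only approach described above can be sketched with a toy FASTA
example - hypothetical minimal code, not the actual Bio.SeqIO indexing
branch. One scan records only each record's start offset; on access we seek
there and parse forward until the next record starts:]

```python
def index_fasta(handle):
    """One pass over the file, recording only each record's start offset."""
    offsets = {}
    offset = handle.tell()
    line = handle.readline()
    while line:
        if line.startswith(">"):
            key = line[1:].split(None, 1)[0]
            offsets[key] = offset
        offset = handle.tell()
        line = handle.readline()
    return offsets

def get_record(handle, offsets, key):
    """Seek to the stored offset and read until the next record begins."""
    handle.seek(offsets[key])
    record = [handle.readline()]  # the ">" title line
    line = handle.readline()
    while line and not line.startswith(">"):
        record.append(line)
        line = handle.readline()
    return "".join(record)
```

[Storing one integer per record halves the memory of an (offset, length)
scheme, at the cost of having to detect the end of each record on access.]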
From biopython at maubp.freeserve.co.uk Fri Aug 14 07:57:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 14 Aug 2009 12:57:26 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
<320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
<21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com>
<15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com>
<320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com>
<15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com>
<320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
<320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
<1250099579.4a83017b4e97c@webmail.upv.es>
<320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com>
Message-ID: <320fb6e00908140457k66e747dep881abf0b044ab9c1@mail.gmail.com>
On Thu, Aug 13, 2009 at 6:33 PM, Peter wrote:
>
> You'll probably be interested to know I've made some excellent progress
> with the (optional) SFF index block. I note that the specifications (both
> on the NCBI page and in the Roche manual) appear to suggest that the
> index block could appear in the middle of the read data. However,
> in all the examples I have looked at, the index is actually at the end.
>
> http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#sff
>
> Sadly the format of the index isn't documented, but I think I have
> reverse engineered the format that Roche SFF files are using. In
> a slight twist of the specification they are actually using the index
> block for both XML meta data AND an index of the read offsets.
>
> This will dovetail nicely with the indexing support in Bio.SeqIO
> which I am working on for Biopython 1.52, branch on github.
> I expect to have fast random access to reads in an SFF file
> very soon. See http://github.com/peterjc/biopython/tree/convert
Sorry, wrong branch - my "index" branch has the indexing (as well
as SFF files and the Bio.SeqIO.convert() functionality):
http://github.com/peterjc/biopython/tree/index
I've got this code working nicely for reading or indexing SFF files.
Testing with a 2GB SFF file with 660808 Roche 454 reads, using
the Roche index I can load this in under 3 seconds and retrieve
any single record almost instantly. If the index is missing (or not
in the expected format) I have to scan the file to build my own
index, and that takes about 11 seconds - which is still fine :)
Peter
From biopython at maubp.freeserve.co.uk Fri Aug 14 08:00:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 14 Aug 2009 13:00:15 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
<320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com>
<21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>
<320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
<21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com>
<15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com>
<320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com>
<15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com>
<320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
<320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
Message-ID: <320fb6e00908140500n56e7ccbcl7123099b8de06ccf@mail.gmail.com>
On Wed, Aug 12, 2009 at 1:54 PM, Peter wrote:
>
>> Jose's code uses seek/tell which means it has to have a handle
>> to an actual file. He also used binary read mode - I'm not sure if
>> this was essential or not.
>
> Binary mode was not essential - opening an SFF file in default
> mode also seemed to work fine with Jose's code.
Having worked on this more, default mode and binary mode are both fine.
However, as you might expect, you can't use Python's universal
newline mode when parsing SFF files.
Peter
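[The reason universal newline mode breaks SFF parsing is that it silently
rewrites "\r\n" byte pairs, shifting every subsequent offset in what is
really binary data. A quick illustration with a throwaway temp file,
nothing SFF-specific - note this uses today's Python 3 semantics, where
plain text mode applies the translation that was spelt "rU" in the
Python 2 of this thread:]

```python
import os
import tempfile

# Binary payload that happens to contain a \r\n byte pair.
payload = b"\x2esff\x00\x00\x0d\x0a\x00\x01"

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(payload)

with open(path, "rb") as handle:   # binary mode: bytes come back untouched
    assert handle.read() == payload

with open(path, "r") as handle:    # text mode translates \r\n to \n...
    text = handle.read()
assert len(text) != len(payload)   # ...so every later offset is wrong

os.remove(path)
```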
From biopython at maubp.freeserve.co.uk Fri Aug 14 09:25:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 14 Aug 2009 14:25:43 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <1250252775.4a8557e7d9ae4@webmail.upv.es>
References: <200904161146.28203.jblanca@btc.upv.es>
<320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
<320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
<1250099579.4a83017b4e97c@webmail.upv.es>
<320fb6e00908131033x3dae2bcclb36a74b6d84a9680@mail.gmail.com>
<1250192416.4a846c2045f94@webmail.upv.es>
<320fb6e00908131456o2103578cs3ed1130c622307a2@mail.gmail.com>
<1250229702.4a84fdc6c403a@webmail.upv.es>
<320fb6e00908140236v547ea056g965c7b7cd61d555c@mail.gmail.com>
<1250252775.4a8557e7d9ae4@webmail.upv.es>
Message-ID: <320fb6e00908140625v5f5bd338qc081e0e5091df9bf@mail.gmail.com>
Jose wrote:
>>> We keep the record length to be able to return the record without
>>> having to scan the file again.
Peter wrote:
>> If you want to be able to extract the raw record, that makes sense.
>> It is still a trade off between memory usage and speed of access,
>> and depending on your requirements either way makes sense.
>>
>> For Bio.SeqIO, I want to parse the raw record on access via the
>> key in order to return a SeqRecord, so I have no need to keep
>> the raw record length in memory. I'm using this github branch:
>> http://github.com/peterjc/biopython/commits/index
Jose wrote:
> We want the raw record because we plan to use this FileIndex on several
> different files, not just for sequences. In fact you have an example on how to
> use it for sequences in SequenceFileIndex, a class that uses the general
> FileIndex. I think that this FileIndex class will be able even to index xml
> files. This is the motivation for the design.
I see - that makes sense.
Peter
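[Jose's generality argument can be sketched as a tiny index class that only
needs a "does this line start a record?" test, storing (offset, length)
pairs so the raw record can be pulled back from any record-oriented file.
This is illustrative code, not the actual biolib FileIndex:]

```python
class FileIndex(object):
    """Index any record-oriented text file, storing (offset, length) pairs."""

    def __init__(self, handle, is_record_start, make_key):
        self._handle = handle
        self._index = {}
        start = key = None
        offset = handle.tell()
        line = handle.readline()
        while line:
            if is_record_start(line):
                if key is not None:
                    # Close off the previous record.
                    self._index[key] = (start, offset - start)
                start, key = offset, make_key(line)
            offset = handle.tell()
            line = handle.readline()
        if key is not None:
            self._index[key] = (start, offset - start)

    def __getitem__(self, key):
        start, length = self._index[key]
        self._handle.seek(start)
        return self._handle.read(length)
```

[Because the record-boundary logic is passed in as functions, the same
class could index FASTA, GenBank, or anything else with a recognisable
record-start line.]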
From biopython at maubp.freeserve.co.uk Fri Aug 14 11:20:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 14 Aug 2009 16:20:21 +0100
Subject: [Biopython-dev] Bio.SeqIO.convert function?
In-Reply-To: <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com>
References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com>
<20090728220943.GJ68751@sobchak.mgh.harvard.edu>
<320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com>
<320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com>
<320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com>
Message-ID: <320fb6e00908140820uf86603bh408dc93f99a3641a@mail.gmail.com>
On Mon, Aug 10, 2009 at 5:46 PM, Peter wrote:
> On Sat, Aug 8, 2009 at 12:14 PM, Peter wrote:
>> I've stuck a branch up on github which (thus far) simply defines
>> the Bio.SeqIO.convert and Bio.AlignIO.convert functions.
>> Adding optimised code can come later.
>>
>> http://github.com/peterjc/biopython/commits/convert
>
> There is now a new file Bio/SeqIO/_convert.py on this
> branch, and a few optimised conversions have been done.
> In particular GenBank/EMBL to FASTA, any FASTQ to
> FASTA, and inter-conversion between any of the three
> FASTQ formats.
>
> The current Bio/SeqIO/_convert.py file actually looks very
> long and complicated - but if you ignore the doctests (which
> I would probably move to a dedicated unit test), it isn't that
> much code at all.
I have now moved all the test code to a new unit test file,
test_SeqIO_convert.py, and think this code is ready for public
testing/review, with the aim of inclusion in Biopython 1.52
(i.e. it can wait until after 1.51 is done). I would still need
to add this to the tutorial, but that won't take very long.
Peter
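[For anyone curious what an "optimised conversion" means in practice, the
FASTQ to FASTA case is essentially record-by-record string handling with no
SeqRecord objects in between. A hypothetical minimal version - not the real
Bio/SeqIO/_convert.py code, and assuming the common four-line-per-record
FASTQ layout - might look like:]

```python
def fastq_to_fasta(in_handle, out_handle):
    """Convert FASTQ to FASTA by plain string handling; returns record count."""
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break
        seq = in_handle.readline().strip()
        in_handle.readline()  # the "+" separator line
        in_handle.readline()  # the quality string (discarded)
        out_handle.write(">%s\n%s\n" % (title[1:].strip(), seq))
        count += 1
    return count
```

[Skipping the object layer is what makes these special-cased conversions
so much faster than a generic parse-then-write loop.]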
From bugzilla-daemon at portal.open-bio.org Fri Aug 14 11:23:14 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 14 Aug 2009 11:23:14 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
files in Bio.SeqIO
In-Reply-To:
Message-ID: <200908141523.n7EFNExJ014906@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2837
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-14 11:23 EST -------
(In reply to comment #2)
> (From update of attachment 1303 [details])
> This file is already a tiny bit out of date - I've started working on this
> on a git branch.
>
> http://github.com/peterjc/biopython/commits/sff
Actually, I got rid of that branch after merging it into my work on Bio.SeqIO
indexing. I can now parse the Roche SFF index, allowing fast random access to
the reads. See:
http://github.com/peterjc/biopython/commits/index
http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006603.html
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From chapmanb at 50mail.com Fri Aug 14 17:08:32 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 14 Aug 2009 17:08:32 -0400
Subject: [Biopython-dev] Biopython 1.51 code freeze
Message-ID: <20090814210832.GL90165@sobchak.mgh.harvard.edu>
Hey all;
I'll be doing the 1.51 release this weekend, so am declaring an
official code freeze until things get finished. If you have any last
minute bugs or issues please check them in this evening; otherwise
no more CVS commits until 1.51 is officially rolled and announced.
Like, um, go outside this weekend or something.
David -- thanks for writing up the release announcement.
Everyone -- thanks for all your hard work on getting things ready for
the release. After this is rolled we should be able to start checking in
new functionality for 1.52 and beyond.
Have a great weekend,
Brad
From biopython at maubp.freeserve.co.uk Sat Aug 15 08:09:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 15 Aug 2009 13:09:39 +0100
Subject: [Biopython-dev] Biopython 1.51 code freeze
In-Reply-To: <20090814210832.GL90165@sobchak.mgh.harvard.edu>
References: <20090814210832.GL90165@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00908150509m322bfbd5yc55ab67b2af733a@mail.gmail.com>
On Fri, Aug 14, 2009 at 10:08 PM, Brad Chapman wrote:
> Hey all;
> I'll be doing the 1.51 release this weekend, so am declaring an
> official code freeze until things get finished. If you have any last
> minute bugs or issues please check them in this evening; otherwise
> no more CVS commits until 1.51 is officially rolled and announced.
> Like, um, go outside this weekend or something.
Cool - now that it has stopped raining, I might do that ;)
Peter
From tiagoantao at gmail.com Sat Aug 15 14:05:40 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Sat, 15 Aug 2009 19:05:40 +0100
Subject: [Biopython-dev] Biopython 1.51 code freeze
In-Reply-To: <20090814210832.GL90165@sobchak.mgh.harvard.edu>
References: <20090814210832.GL90165@sobchak.mgh.harvard.edu>
Message-ID: <6d941f120908151105l7144f806ub43b6aa761ed22a8@mail.gmail.com>
Outside in this case means 37C and planting trees under heavy sun
(with a short break to check email on my mobile in the shade).
Congratz on 1.51.
I intend to start checking in new functionality in around 2 weeks. If
someone wants to have a look at the code that is on git (genepop
branch) and criticize, feel free.
back to the trees now.
2009/8/14, Brad Chapman :
> Hey all;
> I'll be doing the 1.51 release this weekend, so am declaring an
> official code freeze until things get finished. If you have any last
> minute bugs or issues please check them in this evening; otherwise
> no more CVS commits until 1.51 is officially rolled and announced.
> Like, um, go outside this weekend or something.
>
> David -- thanks for writing up the release announcement.
>
> Everyone -- thanks for all your hard work on getting things ready for
> the release. After this is rolled we should be able to start checking in
> new functionality for 1.52 and beyond.
>
> Have a great weekend,
> Brad
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
--
Sent from my mobile device
"A man who dares to waste one hour of time has not discovered the
value of life" - Charles Darwin
From chapmanb at 50mail.com Sun Aug 16 20:48:26 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 16 Aug 2009 20:48:26 -0400
Subject: [Biopython-dev] Biopython 1.51 status
Message-ID: <20090817004826.GA4221@kunkel>
Hey all;
1.51 is all checked and prepped and ready to go. However, I don't
appear to have a user account on portal.open-bio.org, so can't
transfer the new tarballs and api over there. Peter, you had
mentioned you could do the windows installers. When you do those,
could you also transfer over these tarballs and stick them in the
right places:
http://chapmanb.50mail.com/biopython-1.51.tar.gz
http://chapmanb.50mail.com/biopython-1.51.zip
http://chapmanb.50mail.com/api.tar.gz
If you can do that I'll update the website and send out
announcements in the morning. Thanks much.
1.51 on the way,
Brad
From biopython at maubp.freeserve.co.uk Mon Aug 17 05:04:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Aug 2009 10:04:16 +0100
Subject: [Biopython-dev] Biopython 1.51 status
In-Reply-To: <20090817004826.GA4221@kunkel>
References: <20090817004826.GA4221@kunkel>
Message-ID: <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com>
On Mon, Aug 17, 2009 at 1:48 AM, Brad Chapman wrote:
> Hey all;
> 1.51 is all checked and prepped and ready to go. However, I don't
> appear to have a user account on portal.open-bio.org, so can't
> transfer the new tarballs and api over there.
Right - your old account probably has had its password reset
or something - do you want to contact the OBF or should I?
> Peter, you had mentioned you could do the windows installers.
> When you do those, could you also transfer over these tarballs
> and stick them in the right places:
>
> http://chapmanb.50mail.com/biopython-1.51.tar.gz
> http://chapmanb.50mail.com/biopython-1.51.zip
> http://chapmanb.50mail.com/api.tar.gz
Will do...
> If you can do that I'll update the website and send out
> announcements in the morning. Thanks much.
Give me an hour or so ;)
Peter
From biopython at maubp.freeserve.co.uk Mon Aug 17 06:01:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Aug 2009 11:01:47 +0100
Subject: [Biopython-dev] Biopython 1.51 status
In-Reply-To: <320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com>
References: <20090817004826.GA4221@kunkel>
<320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com>
Message-ID: <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com>
On Mon, Aug 17, 2009 at 10:04 AM, Peter wrote:
>> If you can do that I'll update the website and send out
>> announcements in the morning. Thanks much.
>
> Give me an hour or so ;)
OK, all uploaded, including the new tutorial. I also did the wiki
(as it was simple for me to get the new file sizes), and added
version 1.51 to bugzilla (not sure if you have the relevant
permissions there or not - could you check?).
Over to you now Brad for the release announcements (OBF
blog, email) and PyPI, http://pypi.python.org/pypi/biopython/
and anything else on the list.
Thanks,
Peter
From chapmanb at 50mail.com Mon Aug 17 08:16:18 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 17 Aug 2009 08:16:18 -0400
Subject: [Biopython-dev] Biopython 1.51 status
In-Reply-To: <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com>
References: <20090817004826.GA4221@kunkel>
<320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com>
<320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com>
Message-ID: <20090817121618.GD12768@sobchak.mgh.harvard.edu>
Peter;
Thanks for the help with this. Everything else is all finished up --
news posted and message sent to the lists. The announcement e-mail
only needs to be approved on biopython-announce.
I wrote a message to open-bio support to get my password reset on portal,
so hopefully we'll get that all sorted.
It's great to have this out. Thanks again to everyone for all the hard
work,
Brad
> On Mon, Aug 17, 2009 at 10:04 AM, Peter wrote:
> >> If you can do that I'll update the website and send out
> >> announcements in the morning. Thanks much.
> >
> > Give me an hour or so ;)
>
> OK, all uploaded, including the new tutorial. I also did the wiki
> (as it was simple for me to get the new file sizes), and added
> version 1.51 to bugzilla (not sure if you have the relevant
> permissions there or not - could you check?).
>
> Over to you now Brad for the release announcements (OBF
> blog, email) and PyPi, http://pypi.python.org/pypi/biopython/
> and anything else on the list.
>
> Thanks,
>
> Peter
From biopython at maubp.freeserve.co.uk Mon Aug 17 08:17:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Aug 2009 13:17:53 +0100
Subject: [Biopython-dev] Biopython 1.51 status
In-Reply-To: <20090817121618.GD12768@sobchak.mgh.harvard.edu>
References: <20090817004826.GA4221@kunkel>
<320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com>
<320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com>
<20090817121618.GD12768@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00908170517s7b1a6fb4h37bd2d22046dc3a@mail.gmail.com>
On Mon, Aug 17, 2009 at 1:16 PM, Brad Chapman wrote:
> Peter;
> Thanks for the help with this. Everything else is all finished up --
> news posted and message sent to the lists. The announcement e-mail
> only needs to be approved on biopython-announce.
Done.
> I wrote a message to open-bio support to get my password reset on portal,
> so hopefully we'll get that all sorted.
Cool.
> It's great to have this out. Thanks again to everyone for all the hard
> work,
> Brad
And thank you Brad :)
Peter
From biopython at maubp.freeserve.co.uk Mon Aug 17 08:43:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Aug 2009 13:43:01 +0100
Subject: [Biopython-dev] Moving from CVS to git
Message-ID: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
Hi all,
Now that Biopython 1.51 is out (thanks Brad), we should
discuss finally moving from CVS to git. This was something
we talked about at BOSC/ISMB 2009, but not everyone was
there. We have two main options:
(a) Move from CVS (on the OBF servers) to github. All our
developers will need to get github accounts, and be added
as "collaborators" to the existing github repository. I would
want a mechanism in place to backup the repository to the
OBF servers (Bartek already has something that should
work).
(b) Move from CVS to git (on the OBF servers). All our
developers can continue to use their existing OBF accounts.
Bartek's existing scripts could be modified to push the
updates from this OBF git repository onto github.
In either case, there will be some "plumbing" work required,
for example I'd like to continue to offer a recent source code
dump at http://biopython.open-bio.org/SRC/biopython/ etc.
Given we don't really seem to have the expertise "in house"
to run an OBF git server ourselves right now, option (a) is
simplest, and as I recall those of us at BOSC were OK
with this plan.
Assuming we go down this route (CVS to github), everyone
with an existing CVS account should setup a github account
if they want to continue to have commit access (e.g. Frank,
Iddo). I would suggest that initially you get used to working
with git and github BEFORE trying anything directly on what
would be the "official" repository. It took me a while and I'm
still learning ;)
Is this agreeable? Are there any other suggestions?
[Once this is settled, we can talk about things like merge
requests and if they should be accompanied by a Bugzilla
ticket or not.]
Peter
From eric.talevich at gmail.com Mon Aug 17 10:02:02 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 17 Aug 2009 10:02:02 -0400
Subject: [Biopython-dev] Biopython 1.51 status
In-Reply-To: <320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com>
References: <20090817004826.GA4221@kunkel>
<320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com>
<320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com>
Message-ID: <3f6baf360908170702y7fde75b5l491ddf0cba0143e3@mail.gmail.com>
On Mon, Aug 17, 2009 at 6:01 AM, Peter wrote:
> On Mon, Aug 17, 2009 at 10:04 AM, Peter
> wrote:
> >> If you can do that I'll update the website and send out
> >> announcements in the morning. Thanks much.
> >
> > Give me an hour or so ;)
>
> OK, all uploaded, including the new tutorial. I also did the wiki
> (as it was simple for me to get the new file sizes), and added
> version 1.51 to bugzilla (not sure if you have the relevant
> permissions there or not - could you check?).
>
> Over to you now Brad for the release announcements (OBF
> blog, email) and PyPi, http://pypi.python.org/pypi/biopython/
> and anything else on the list.
>
> Thanks,
>
> Peter
>
Great to see the release went smoothly! I'm probably being impatient here,
but was a tag created for v1.51 final? I don't see it in GitHub yet, and
it's been slightly over an hour since the last push.
Thanks,
Eric
From chapmanb at 50mail.com Mon Aug 17 10:17:58 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 17 Aug 2009 10:17:58 -0400
Subject: [Biopython-dev] Biopython 1.51 status
In-Reply-To: <3f6baf360908170702y7fde75b5l491ddf0cba0143e3@mail.gmail.com>
References: <20090817004826.GA4221@kunkel>
<320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com>
<320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com>
<3f6baf360908170702y7fde75b5l491ddf0cba0143e3@mail.gmail.com>
Message-ID: <20090817141758.GE12768@sobchak.mgh.harvard.edu>
Hi Eric;
> Great to see the release went smoothly! I'm probably being impatient here,
> but was a tag created for v1.51 final? I don't see it in GitHub yet, and
> it's been slightly over an hour since the last push.
It was tagged last evening as biopython-151:
> cvs log setup.py | head
RCS file: /home/repository/biopython/biopython/setup.py,v
Working file: setup.py
head: 1.171
branch:
locks: strict
access list:
symbolic names:
biopython-151: 1.171
biopython-151b: 1.168
Maybe there is an issue with tags pushing to Git. Bartek and Peter
were discussing this, but I don't remember the ultimate conclusion.
Brad
From bartek at rezolwenta.eu.org Mon Aug 17 10:29:31 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Mon, 17 Aug 2009 16:29:31 +0200
Subject: [Biopython-dev] Biopython 1.51 status
In-Reply-To: <20090817141758.GE12768@sobchak.mgh.harvard.edu>
References: <20090817004826.GA4221@kunkel>
<320fb6e00908170204m40ebd63cvb1371d7b86aab1fc@mail.gmail.com>
<320fb6e00908170301g789b5dacr1b3d49fa01a44601@mail.gmail.com>
<3f6baf360908170702y7fde75b5l491ddf0cba0143e3@mail.gmail.com>
<20090817141758.GE12768@sobchak.mgh.harvard.edu>
Message-ID: <8b34ec180908170729g6c13333dk98d722cdb1d54bf0@mail.gmail.com>
On Mon, Aug 17, 2009 at 4:17 PM, Brad Chapman wrote:
> Hi Eric;
>
>> Great to see the release went smoothly! I'm probably being impatient here,
>> but was a tag created for v1.51 final? I don't see it in GitHub yet, and
>> it's been slightly over an hour since the last push.
>
> Maybe there is an issue with tags pushing to Git. Bartek and Peter
> were discussing this, but I don't remember the ultimate conclusion.
The ultimate conclusion will be reached when we move to github... ;)
But for now, I'll just need to convert this tag manually.
Just give me a few hours
Bartek
From bartek at rezolwenta.eu.org Mon Aug 17 11:07:32 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Mon, 17 Aug 2009 17:07:32 +0200
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
Message-ID: <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
On Mon, Aug 17, 2009 at 2:43 PM, Peter wrote:
> Hi all,
>
> Given we don't really seem to have the expertise "in house"
> to run an OBF git server ourselves right now, option (a) is
> simplest, and as I recall those of us at BOSC were OK
> with this plan.
>
> Assuming we go down this route (CVS to github), everyone
> with an existing CVS account should setup a github account
> if they want to continue to have commit access (e.g. Frank,
> Iddo). I would suggest that initially you get used to working
> with git and github BEFORE trying anything directly on what
> would be the "official" repository. It took me a while and I'm
> still learning ;)
>
> Is this agreeable? Are there any other suggestions?
>
> [Once this is settled, we can talk about things like merge
> requests and if they should be accompanied by a Bugzilla
> ticket or not.]
>
Hi All,
I absolutely agree here with Peter, i.e. I would suggest we move now from
CVS to a git branch hosted on github.
Since I'm more involved in the technical setup we currently have, I'd also add
a few more technical arguments for this move:
- While the current setup is working, it is suboptimal because there is an
extra conversion step both for accepting changes done by people in git
(git to CVS) and propagating releases (CVS to github).
- Once we move to git as our version control system, we need to have a
"master" branch which will be easily available for viewing and branching:
we can't do it now on the open-bio servers (it requires a git installation
and some server-side scripts to have a browseable repository). Also,
"moving" to github is easier because it actually requires no physical
action; we just need to stop updating CVS.
- If anyone has fears of depending on github, I think it's much less of a
problem than with CVS: moving our "master" branch from github to somewhere
else is very easy and does not require any action on the side of github;
we just post the branch somewhere and start pushing there (you can find a
list of possible hosting solutions here: http://git.or.cz/gitwiki/GitHosting).
- Regarding the backups of the github branch: I'm already doing this.
If you have a shell account on dev.open-bio.org, you can get the current
git branch of biopython from /home/bartek/git_branch (location subject to
change), so this would require no additional work. It would be optimal,
though, to actually install git on an open-bio server, so that the updating
script can be run from there. If we had that, we could hook it up directly
to github, so that instead of running once an hour, it would be run
after each push to the branch (http://github.com/guides/post-receive-hooks)
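The hourly update script described here could be sketched along these lines (a minimal illustration, not Bartek's actual script; the remote URL is an assumption and the mirror path is taken from this email):

```python
# Sketch of a backup/update script: keep a bare mirror of the github
# branch fresh, suitable for an hourly cron job or, later, a github
# post-receive trigger. The remote URL below is an assumption.
import os
import subprocess

REMOTE = "git://github.com/biopython/biopython.git"  # assumed URL
MIRROR = "/home/bartek/git_branch"                   # path from this email

def update_mirror(remote=REMOTE, mirror=MIRROR):
    """Clone the remote as a bare mirror on first run; afterwards just
    fetch updates into the existing mirror."""
    if not os.path.isdir(mirror):
        subprocess.check_call(
            ["git", "clone", "--mirror", "--quiet", remote, mirror])
    else:
        subprocess.check_call(
            ["git", "--git-dir", mirror, "remote", "update"])
```

Because the mirror is bare, "moving" it elsewhere later is just a matter of pushing it to a new remote, which fits the portability argument above.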
To summarize, I'm ready to switch off the part of my script which is
updating the github branch from CVS.
For now, I would leave the part that is making backups of the github
branch on the open-bio server (via rsync).
Once we have git installed on dev.open-bio, I can hook it up to
notifications from github.
cheers
Bartek
From biopython at maubp.freeserve.co.uk Mon Aug 17 11:52:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Aug 2009 16:52:36 +0100
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
Message-ID: <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
On Mon, Aug 17, 2009 at 4:07 PM, Bartek
Wilczynski wrote:
>
> Hi All,
>
> I absolutely agree here with Peter, i.e. I would suggest we move now from
> CVS to a git branch hosted on github.
>
> Since I'm more involved in the technical setup we currently have, I'd also add
> a few more technical arguments for this move:
>
> - While current setup is working, it is suboptimal because there is an extra
> conversion step both for accepting changes done by people in git (git to CVS)
> and propagating releases (CVS to github).
Yeah - this works but for anything non-trivial it would be a pain.
> - Once we move to git as our version control system, we need to have a
> "master" branch which will be easily available for viewing and branching:
> we can't do it now on open-bio servers (it requires git installation and
> some server-side scripts to have a browseable repository), ...
My impression from talking to the OBF guys is that if we really want to
we can do this, but it would require us (Biopython) to take care of
installing and running git on an OBF machine.
> ... also "moving" to github is easier because it actually
> requires no physical action, we just need to stop updating CVS.
Yes - this is the big plus of "option (a)" over "option (b)" in my
earlier email.
> - If anyone have fears of depending on github, I think it's much less
> of a problem than with CVS, moving our "master" branch from
> github to somewhere else is very easy and does not require any
> action on the side of github, we just post the branch somewhere,
> and start pushing there (you can find a list of possible hosting
> solutions here: http://git.or.cz/gitwiki/GitHosting)
Yes, it is good to know we won't be tied to github (unless we start
using more of the tools they offer on top of git itself).
> - Regarding the backups of the github branch: I'm already doing this.
> If you have a shell account on dev.open-bio.org, you can get the
> current git branch of biopython from /home/bartek/git_branch
> (location subject to change), so this would require no additional
> work,
Yes - that is what I was hinting at in my email (trying to be brief).
> ... although it would be optimal, to actually install git on open-bio
> server, so that the updating script can be run from there.
Yes. Even something as simple as a cron job running on an
OBF server would satisfy me from a backup point of view.
> If we had that, we could actually hook it up directly to github,
> so that instead of running once in an hour, it would be run
> after each push to the branch (http://github.com/guides/post-receive-hooks)
More complex, but worth considering.
> To summarize, I'm ready to switch off the part of my script which is
> updating the github branch from CVS.
Good. We'll also want to ask the OBF admins to make CVS
read only once we move.
> For now, I would leave the part that is making backups of github
> branch on open-bio server (via rsync).
That would be my plan for the short term. We can then talk
to the OBF server admins about how we can do this better.
> Once we have git installed on dev.open-bio, I can hook it
> up to notifications from github.
If we go to the trouble of installing git on the OBF servers ;)
Peter
From mhampton at d.umn.edu Mon Aug 17 11:42:39 2009
From: mhampton at d.umn.edu (Marshall Hampton)
Date: Mon, 17 Aug 2009 10:42:39 -0500 (CDT)
Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion
In-Reply-To:
References:
Message-ID:
Hi,
I am preparing biopython-1.51 for inclusion as an optional package for
Sage (www.sagemath.org). I ran the test suite and got 8 errors; I am not
sure if these are all expected. The KDTree ones I have seen before, but
some look new. My test log is available at:
http://sage.math.washington.edu/home/mhampton/biopython-1.51-testlog.txt
in case anyone wants to take a look.
Biopython has been available in Sage for several years as an optional
package, but I would like to make it a standard component. This has
become much more likely since the clean-up of Numeric and mx-texttools
dependencies. I think the only real issue is setting up some testing
during the Sage package installation, which is my motivation for really
understanding the test failures.
Cheers,
Marshall Hampton
Department of Mathematics and Statistics
University of Minnesota, Duluth
From biopython at maubp.freeserve.co.uk Mon Aug 17 12:28:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Aug 2009 17:28:27 +0100
Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion
In-Reply-To:
References:
Message-ID: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com>
On Mon, Aug 17, 2009 at 4:42 PM, Marshall Hampton wrote:
>
> Hi,
>
> I am preparing biopython-1.51 for inclusion as an optional package for Sage
> (www.sagemath.org). I ran the test suite and got 8 errors; I am not sure if
> these are all expected.
I wouldn't have expected any failures.
> The KDTree ones I have seen before, but some look new. My test log is
> available at:
>
> http://sage.math.washington.edu/home/mhampton/biopython-1.51-testlog.txt
>
> in case anyone wants to take a look.
This one should be simple: test_EMBOSS.py
ValueError: Disagree on file ig IntelliGenetics/VIF_mase-pro.txt in
genbank format: 16 vs 1 records
This is a known regression in EMBOSS 6.1.0 which will be fixed
in their next release. Can you check this by running embossversion?
The others are all ImportErrors (e.g. cannot import name _CKDTree)
I rather suspect you are running the test suite BEFORE compiling
the C extensions, and that this may similarly affect Bio.Restriction.
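A quick way to confirm that diagnosis is to check whether the compiled module imports at all. A generic pre-flight helper (illustrative, not part of Biopython) might look like:

```python
import importlib

def extension_available(module_name):
    """Return True if the named (possibly compiled C) module imports
    cleanly - a cheap check to run before the full test suite."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False

# e.g. extension_available("Bio.KDTree._CKDTree") would return False in
# an installation where the C extensions were never compiled.
```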
> Biopython has been available in Sage for several years as an optional
> package, but I would like to make it a standard component. This has
> become much more likely since the clean-up of Numeric and
> mx-texttools dependencies.
Cool.
> I think the only real issue is setting up some testing during the Sage
> package installation, which is my motivation for really understanding
> the test failures.
I don't know anything about your test framework, but surely other
packages (e.g. NumPy) have a similar requirement (compile
before test) so this should be fixable.
Regards,
Peter
From biopython at maubp.freeserve.co.uk Mon Aug 17 12:35:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Aug 2009 17:35:53 +0100
Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion
In-Reply-To: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com>
References:
<320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com>
Message-ID: <320fb6e00908170935v1919a5b3h3497e12f3156a477@mail.gmail.com>
On Mon, Aug 17, 2009 at 5:28 PM, Peter wrote:
>
> The others are all ImportErrors (e.g. cannot import name _CKDTree)
> I rather suspect you are running the test suite BEFORE compiling
> the C extensions, and that this may similarly affect Bio.Restriction.
Also this line is interesting - it suggests you have not installed NumPy,
or not told Sage it is a dependency:
test_Cluster ... skipping. If you want to use Bio.Cluster, install
NumPy first and then reinstall Biopython
P.S. Why does this page talk about Biopython version "4.2b"?
http://wiki.sagemath.org/Sage_Spkg_Tracking
Peter
From matzke at berkeley.edu Mon Aug 17 15:48:33 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 17 Aug 2009 12:48:33 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <20090811131019.GW12604@sobchak.mgh.harvard.edu>
References: <20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
<4A8081B3.2080600@berkeley.edu>
<20090811131019.GW12604@sobchak.mgh.harvard.edu>
Message-ID: <4A89B411.4090501@berkeley.edu>
Pencils down update: I have uploaded the relevant test scripts and data
files to git, and deleted old loose files.
http://github.com/nmatzke/biopython/commits/Geography
Here is a simple draft tutorial:
http://biopython.org/wiki/BioGeography#Tutorial
Strangely, while working on the tutorial I discovered that I did
something somewhere in the last revision that is messing up the parsing
of automatically downloaded records from GBIF. I am tracking this down
and will upload a fix as soon as I find it.
I would like to thank everyone for the opportunity to participate in
GSoC, and to thank everyone for their help. For me, this summer turned
into more of a "growing from a scripter to a programmer" summer than I
expected initially. As a result I spent more time refactoring and
retracing my steps than I figured. However, I think the resulting main
product, a GBIF interface and associated tools, is much better than it
would have been without the advice & encouragement of Brad, Hilmar, etc.
I will be using this for my own research and will continue developing it.
Cheers!
Nick
Brad Chapman wrote:
> Hi Nick;
>
>> Summary: Major focus is getting the GBIF access/search/parse module into
>> "done"/submittable shape. This primarily requires getting the
>> documentation and testing up to biopython specs. I have a fair bit of
>> documentation and testing, need advice (see below) for specifics on what
>> it should look like.
>
> Awesome. Thanks for working on the cleanup for this.
>
>> OK, I will do this. Should I try and figure out the unittest stuff? I
>> could use a simple example of what this is supposed to look like.
>
> In addition to Peter's pointers, here is a simple example from a
> small thing I wrote:
>
> http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py
>
> You can copy/paste the unit test part to get a base, and then
> replace the t_* functions with your own real tests.
>
> Simple scripts that generate consistent output are also fine; that's
> the print and compare approach.
>
>>> - What is happening with the Nodes_v2 and Treesv2 files? They look
>>> like duplicates of the Nexus Nodes and Trees with some changes.
>>> Could we roll those changes into the main Nexus code to avoid
>>> duplication?
>> Yeah, these were just copies with your bug fix, and with a few mods I
>> used to track crashes. Presumably I don't need these with after a fresh
>> download of biopython.
>
> Cool. It would be great if we could weed these out as well.
>
>> The API is really just the interface with GBIF. I think developing a
>> cookbook entry is pretty easy, I assume you want something like one of
>> the entries in the official biopython cookbook?
>
> Yes, that would work great. What I was thinking of are some examples
> where you provide background and motivation: Describe some useful
> information you want to get from GBIF, and then show how to do it.
> This is definitely the most useful part as it gives people working
> examples to start with. From there they can usually browse the lower
> level docs or code to figure out other specific things.
>
>> Re: API documentation...are you just talking about the function
>> descriptions that are typically in """ """ strings beneath the function
>> definitions? I've got that done. Again, if there is more, an example
>> of what it should look like would be useful.
>
> That looks great for API level docs. You are right on here; for this
> week I'd focus on the cookbook examples and cleanup stuff.
>
> My other suggestion would be to rename these to follow Biopython
> conventions, something like:
>
> gbif_xml -> GbifXml
> shpUtils -> ShapefileUtils
> geogUtils -> GeographyUtils
> dbfUtils -> DbfUtils
>
> The *Utils might have underscores if they are not intended to be
> called directly.
>
> Thanks for all your hard work,
> Brad
>
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================
From mhampton at d.umn.edu Mon Aug 17 16:46:42 2009
From: mhampton at d.umn.edu (Marshall Hampton)
Date: Mon, 17 Aug 2009 15:46:42 -0500 (CDT)
Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion
In-Reply-To: <320fb6e00908170935v1919a5b3h3497e12f3156a477@mail.gmail.com>
References:
<320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com>
<320fb6e00908170935v1919a5b3h3497e12f3156a477@mail.gmail.com>
Message-ID:
On Mon, 17 Aug 2009, Peter wrote:
> On Mon, Aug 17, 2009 at 5:28 PM, Peter wrote:
>>
>> The others are all ImportErrors (e.g. cannot import name _CKDTree)
>> I rather suspect you are running the test suite BEFORE compiling
>> the C extensions, and that this may similarly affect Bio.Restriction.
>
> Also this line is interesting - it suggests you have not installed NumPy,
> or not told Sage it is a dependency:
> test_Cluster ... skipping. If you want to use Bio.Cluster, install
> NumPy first and then reinstall Biopython
Numpy is included in Sage, so I guess there is some sort of path problem.
I'll give it another look.
> P.S. Why does this page talk about Biopython version "4.2b"?
> http://wiki.sagemath.org/Sage_Spkg_Tracking
>
> Peter
>
I have no idea, that was simply wrong. I have corrected that wiki page.
Thanks for the feedback!
Marshall Hampton
From mhampton at d.umn.edu Mon Aug 17 17:25:28 2009
From: mhampton at d.umn.edu (Marshall Hampton)
Date: Mon, 17 Aug 2009 16:25:28 -0500 (CDT)
Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion
In-Reply-To: <320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com>
References:
<320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com>
Message-ID:
On Mon, 17 Aug 2009, Peter wrote:
> This one should be simple: test_EMBOSS.py
> ValueError: Disagree on file ig IntelliGenetics/VIF_mase-pro.txt in
> genbank format: 16 vs 1 records
> This is a known regression in EMBOSS 6.1.0 which will be fixed
> in their next release. Can you check this by running embossversion?
My emboss version is 6.1.0, so that explains that.
After copying the Tests folder from the source to my site-packages
directory, most of the errors go away, except for the one mentioned above
and this one:
ERROR: test_SeqIO_online
----------------------------------------------------------------------
Traceback (most recent call last):
File "run_tests.py", line 248, in runTest
suite = unittest.TestLoader().loadTestsFromName(name)
File "/Users/mh/sagestuff/sage-4.1/local/lib/python/unittest.py", line
576, in loadTestsFromName
module = __import__('.'.join(parts_copy))
File "test_SeqIO_online.py", line 62, in
record = SeqIO.read(handle, format) # checks there is exactly one
record
File
"/Users/mh/sagestuff/sage-4.1/local/lib/python2.6/site-packages/Bio/SeqIO/__init__.py",
line 485, in read
raise ValueError("No records found in handle")
ValueError: No records found in handle
...not sure what the problem might be with that.
-Marshall
From mhampton at d.umn.edu Mon Aug 17 17:31:43 2009
From: mhampton at d.umn.edu (Marshall Hampton)
Date: Mon, 17 Aug 2009 16:31:43 -0500 (CDT)
Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion
In-Reply-To:
References:
<320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com>
Message-ID:
I hope this isn't too much email, I can just post to the dev list if you'd
like. Anyway, I manually ran my last test failure, test_SeqIO_online.py,
and when I do that everything looks OK:
thorn:16:28:30:site-packages: sage -python Tests/test_SeqIO_online.py
Checking Bio.ExPASy.get_sprot_raw()
- Fetching O23729
Got MAPAMEEIRQAQRAEGPAA...GAE [5Y08l+HJRDIlhLKzFEfkcKd1dkM] len 394
Checking Bio.Entrez.efetch()
- Fetching X52960 from genome as fasta
Got TGGCTCGAACTGACTAGAA...GCT [Ktxz0HgMlhQmrKTuZpOxPZJ6zGU] len 248
- Fetching X52960 from genome as gb
Got TGGCTCGAACTGACTAGAA...GCT [Ktxz0HgMlhQmrKTuZpOxPZJ6zGU] len 248
- Fetching 6273291 from nucleotide as fasta
Got TATACATTAAAGGAGGGGG...AGA [bLhlq4mEFJOoS9PieOx4nhGnjAQ] len 902
- Fetching 6273291 from nucleotide as gb
Got TATACATTAAAGGAGGGGG...AGA [bLhlq4mEFJOoS9PieOx4nhGnjAQ] len 902
- Fetching 16130152 from protein as fasta
Got MKVKVLSLLVPALLVAGAA...YQF [fCjcjMFeGIrilHAn6h+yju267lg] len 367
- Fetching 16130152 from protein as gb
Got MKVKVLSLLVPALLVAGAA...YQF [fCjcjMFeGIrilHAn6h+yju267lg] len 367
Not sure where to go from here, but it seems that things are basically
working correctly.
-Marshall Hampton
On Mon, 17 Aug 2009, Marshall Hampton wrote:
>
> On Mon, 17 Aug 2009, Peter wrote:
>> This one should be simple: test_EMBOSS.py
>> ValueError: Disagree on file ig IntelliGenetics/VIF_mase-pro.txt in
>> genbank format: 16 vs 1 records
>> This is a known regression in EMBOSS 6.1.0 which will be fixed
>> in their next release. Can you check this by running embossversion?
>
> My emboss version is 6.1.0, so that explains that.
>
> After copying the Tests folder from the source to my site-packages directory,
> most of the errors go away, except for the one mentioned above and this one:
>
> ERROR: test_SeqIO_online
> ----------------------------------------------------------------------
> Traceback (most recent call last):
> File "run_tests.py", line 248, in runTest
> suite = unittest.TestLoader().loadTestsFromName(name)
> File "/Users/mh/sagestuff/sage-4.1/local/lib/python/unittest.py", line 576,
> in loadTestsFromName
> module = __import__('.'.join(parts_copy))
> File "test_SeqIO_online.py", line 62, in
> record = SeqIO.read(handle, format) # checks there is exactly one record
> File
> "/Users/mh/sagestuff/sage-4.1/local/lib/python2.6/site-packages/Bio/SeqIO/__init__.py",
> line 485, in read
> raise ValueError("No records found in handle")
> ValueError: No records found in handle
>
> ...not sure what the problem might be with that.
>
> -Marshall
>
From biopython at maubp.freeserve.co.uk Mon Aug 17 17:37:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Aug 2009 22:37:05 +0100
Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion
In-Reply-To:
References:
<320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com>
Message-ID: <320fb6e00908171437y1c4565f8jb1a19a369d389357@mail.gmail.com>
On Mon, Aug 17, 2009 at 10:25 PM, Marshall Hampton wrote:
>
> After copying the Tests folder from the source to my site-packages
> directory, most of the errors go away,
Well that does suggest some sort of path issue, but moving the
test directory around like that isn't a very good solution.
> except for the one mentioned above and this one:
Assuming the "one mentioned above" was the EMBOSS one, fine.
> ERROR: test_SeqIO_online
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>  File "run_tests.py", line 248, in runTest
>    suite = unittest.TestLoader().loadTestsFromName(name)
>  File "/Users/mh/sagestuff/sage-4.1/local/lib/python/unittest.py", line 576,
> in loadTestsFromName
>    module = __import__('.'.join(parts_copy))
>  File "test_SeqIO_online.py", line 62, in
>    record = SeqIO.read(handle, format) # checks there is exactly one record
>  File
> "/Users/mh/sagestuff/sage-4.1/local/lib/python2.6/site-packages/Bio/SeqIO/__init__.py",
> line 485, in read
>    raise ValueError("No records found in handle")
> ValueError: No records found in handle
>
> ...not sure what the problem might be with that.
That is an online test using the NCBI's web services. This could
be a transient failure due to the network.
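The check that raised the error is simple to sketch: SeqIO.read insists on exactly one record in the handle, so an empty download produces exactly this ValueError. A minimal illustrative re-implementation (not Biopython's actual code):

```python
def read_one(records):
    """Return the single item from an iterable, raising ValueError on
    zero or multiple records - the behaviour shown in the traceback."""
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found in handle") from None
    try:
        next(iterator)
    except StopIteration:
        return first
    raise ValueError("More than one record found in handle")
```

Seen this way, "No records found in handle" points at an empty network response rather than a parser bug, which fits the transient-failure explanation.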
Peter
From biopython at maubp.freeserve.co.uk Mon Aug 17 17:43:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Aug 2009 22:43:06 +0100
Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion
In-Reply-To:
References:
<320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com>
Message-ID: <320fb6e00908171443g7b9fe780h1024d2f584be7b18@mail.gmail.com>
On Mon, Aug 17, 2009 at 10:31 PM, Marshall Hampton wrote:
>
> I hope this isn't too much email, I can just post to the dev list if
> you'd like.
Doing it on the mailing list is fine, I'd read it either way ;)
> Anyway, I manually ran my last test failure, test_SeqIO_online.py,
> and when I do that everything looks OK:
>
> thorn:16:28:30:site-packages: sage -python Tests/test_SeqIO_online.py
> Checking Bio.ExPASy.get_sprot_raw()
> - Fetching O23729
> Got MAPAMEEIRQAQRAEGPAA...GAE [5Y08l+HJRDIlhLKzFEfkcKd1dkM] len 394
> Checking Bio.Entrez.efetch()
> - Fetching X52960 from genome as fasta
> Got TGGCTCGAACTGACTAGAA...GCT [Ktxz0HgMlhQmrKTuZpOxPZJ6zGU] len 248
> - Fetching X52960 from genome as gb
> Got TGGCTCGAACTGACTAGAA...GCT [Ktxz0HgMlhQmrKTuZpOxPZJ6zGU] len 248
> - Fetching 6273291 from nucleotide as fasta
> Got TATACATTAAAGGAGGGGG...AGA [bLhlq4mEFJOoS9PieOx4nhGnjAQ] len 902
> - Fetching 6273291 from nucleotide as gb
> Got TATACATTAAAGGAGGGGG...AGA [bLhlq4mEFJOoS9PieOx4nhGnjAQ] len 902
> - Fetching 16130152 from protein as fasta
> Got MKVKVLSLLVPALLVAGAA...YQF [fCjcjMFeGIrilHAn6h+yju267lg] len 367
> - Fetching 16130152 from protein as gb
> Got MKVKVLSLLVPALLVAGAA...YQF [fCjcjMFeGIrilHAn6h+yju267lg] len 367
>
> Not sure where to go from here, but it seems that things are basically
> working correctly.
>
> -Marshall Hampton
That fits with it being a transient network issue. Some of our
units tests like Tests/test_SeqIO_online.py are simple "print
and compare" scripts, which are intended to be run via the
run_tests.py script to validate their output. You can try this:
sage -python Tests/run_tests.py test_SeqIO_online.py
Or, manually compare that output to the expected output in
file Tests/output/test_SeqIO_online - but it looks fine to me
by eye.
Peter
From dalke at dalkescientific.com Mon Aug 17 17:36:41 2009
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 17 Aug 2009 23:36:41 +0200
Subject: [Biopython-dev] old Martel release
Message-ID:
Hi all,
Does anyone here have a copy of my *old* Martel code? Something
from the pre-1.0 days? I can't find it anywhere, and it looks like I
did things back then on the biopython.org machines. An example URL was:
http://www.biopython.org/~dalke/Martel/Martel-0.5.tar.gz
I'm specifically looking for the molfile format I developed. That was
9 years ago and several machines back in time.
Andrew
dalke at dalkescientific.com
From dalke at dalkescientific.com Mon Aug 17 17:40:11 2009
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 17 Aug 2009 23:40:11 +0200
Subject: [Biopython-dev] old Martel release
In-Reply-To:
References:
Message-ID: <0528A9A7-4DE9-4078-819F-4FD342B8D88D@dalkescientific.com>
On Aug 17, 2009, at 11:36 PM, Andrew Dalke wrote:
> Does anyone here have a copy of my *old* Martel code?
Ha! archive.org has it.
Didn't think they kept .tar.gz files, but they do!
Andrew
dalke at dalkescientific.com
From eric.talevich at gmail.com Mon Aug 17 17:47:22 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 17 Aug 2009 17:47:22 -0400
Subject: [Biopython-dev] GSoC final update: PhyloXML for Biopython
Message-ID: <3f6baf360908171447i2e3c592em5960269600e80f1b@mail.gmail.com>
Hi all,
Here's a final changelog for Aug. 10-14:
- Added a 'terminal' argument to the find() method on BaseTree.Tree, for
filtering internal/external nodes. This makes get_leaf_nodes() a trivial
function, and total_branch_length is pretty simple too.
- Updated the example phyloXML files to v1.10 schema-compliant copies from
phyloxml.org; couple bug fixes.
- Removed the project's README.rst file, so Bio/PhyloXML/ is no longer
controlled by Git. I'll merge any useful information from there into the
Biopython wiki documentation.
- Pulled the Biopython 1.51 release into my master branch, and merged that
into the phyloxml branch, so this branch (and the required GSoC patch
tarball) will apply cleanly to the publicly released Biopython 1.51 source
tree.
- Documented most of what's been done on the Biopython wiki:
http://www.biopython.org/wiki/PhyloXML
http://www.biopython.org/wiki/TreeIO
http://www.biopython.org/wiki/Tree
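The 'terminal' argument described above can be sketched roughly like this (a hypothetical minimal clade class, not the actual Bio.Tree code):

```python
# Minimal sketch of how a 'terminal' filter on a recursive find()
# makes get_leaf_nodes() trivial, and total_branch_length() simple.
class Clade:
    def __init__(self, name=None, branch_length=0.0, clades=None):
        self.name = name
        self.branch_length = branch_length
        self.clades = clades or []

    def is_terminal(self):
        return not self.clades

    def find(self, terminal=None):
        """Yield this clade and all descendants; with terminal=True or
        False, yield only external or internal nodes respectively."""
        if terminal is None or self.is_terminal() == terminal:
            yield self
        for child in self.clades:
            for node in child.find(terminal=terminal):
                yield node

    def get_leaf_nodes(self):
        return list(self.find(terminal=True))

    def total_branch_length(self):
        return sum(node.branch_length for node in self.find())
```

With such a filter both convenience methods reduce to one-liners over find(), which matches the simplification described in the changelog entry.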
*Future plans*
There are a few tangential projects that deserve more attention over the
next few months, and I'm going to create separate Git branches for each of
them, to make it easier to share:
- Port the Newick tree parser and methods from Bio.Newick to Bio.Tree and
TreeIO.
- Improve the graph drawing and networkx integration
- BioSQL adapter between Bio.Tree.BaseTree and PhyloDB tables
- Possibly, play with other tree representations -- nested-set, as PhyloDB
does, and relationship matrix, which could bring NumPy into play (in a
separate Bio.Tree.Matrix module)
Finally, massive thanks to Brad and Christian for mentoring, Hilmar for
overseeing the whole project, Peter and the Biopython folks for their
guidance, and the various BioPerl monks and BioRubyists who shared their
wisdom.
All the best,
Eric
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML
From biopython at maubp.freeserve.co.uk Mon Aug 17 17:48:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Aug 2009 22:48:19 +0100
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <4A89B411.4090501@berkeley.edu>
References: <20090708124841.GX17086@sobchak.mgh.harvard.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
<4A8081B3.2080600@berkeley.edu>
<20090811131019.GW12604@sobchak.mgh.harvard.edu>
<4A89B411.4090501@berkeley.edu>
Message-ID: <320fb6e00908171448l296abbb8yb509893cfbaaaa24@mail.gmail.com>
On Mon, Aug 17, 2009 at 8:48 PM, Nick Matzke wrote:
> I would like to thank everyone for the opportunity to participate in GSoC,
> and to thank everyone for their help. For me, this summer turned into more
> of a "growing from a scripter to a programmer" summer than I expected
> initially. As a result I spent more time refactoring and retracing my
> steps than I figured. However I think the resulting main product, a GBIF
> interface and associated tools, is much better than it would have been
> without the advice & encouragement of Brad, Hilmar, etc. I will be using
> this for my own research and will continue developing it.
That sounds like this has been a successful project, and from my
Biopython point of view the bit about you planning to continue using
and developing the code in your research is especially good news ;)
Cheers!
Peter
From biopython at maubp.freeserve.co.uk Mon Aug 17 17:54:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Aug 2009 22:54:45 +0100
Subject: [Biopython-dev] old Martel release
In-Reply-To: <0528A9A7-4DE9-4078-819F-4FD342B8D88D@dalkescientific.com>
References:
<0528A9A7-4DE9-4078-819F-4FD342B8D88D@dalkescientific.com>
Message-ID: <320fb6e00908171454k267ed02djc982bf312b6285bb@mail.gmail.com>
On Mon, Aug 17, 2009 at 10:40 PM, Andrew Dalke wrote:
> On Aug 17, 2009, at 11:36 PM, Andrew Dalke wrote:
>>
>> Does anyone here have a copy of my *old* Martel code?
>
> Ha! archive.org has it.
>
> Didn't think they kept .tar.gz files, but they do!
Lucky :)
I don't know what is in it, but your dalke user account is still
there on biopython.org - which would probably still have all
the http://www.biopython.org/~dalke website content. I
guess your password has expired or something. Give the
OBF guys an email? You might have some other bits and
pieces still there...
Peter
From mhampton at d.umn.edu Mon Aug 17 17:45:07 2009
From: mhampton at d.umn.edu (Marshall Hampton)
Date: Mon, 17 Aug 2009 16:45:07 -0500 (CDT)
Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion
In-Reply-To: <320fb6e00908171437y1c4565f8jb1a19a369d389357@mail.gmail.com>
References:
<320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com>
<320fb6e00908171437y1c4565f8jb1a19a369d389357@mail.gmail.com>
Message-ID:
Yep, I tried again and the test_SeqIO_online was ok, so I guess it was a
transient failure.
I agree that copying my Tests folder isn't a great solution. I will try
to increase my understanding of the biopython test framework - I am used
to the Sage method of mainly using docstring tests.
-Marshall
On Mon, 17 Aug 2009, Peter wrote:
> On Mon, Aug 17, 2009 at 10:25 PM, Marshall Hampton wrote:
>>
>> After copying the Tests folder from the source to my site-packages
>> directory, most of the errors go away,
>
> Well that does suggest some sort of path issue, but moving the
> test directory around like that isn't a very good solution.
>
>> except for the one mentioned above and this one:
>
> Assuming the "one mentioned above" was the EMBOSS one, fine.
>
>> ERROR: test_SeqIO_online
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>>  File "run_tests.py", line 248, in runTest
>>    suite = unittest.TestLoader().loadTestsFromName(name)
>>  File "/Users/mh/sagestuff/sage-4.1/local/lib/python/unittest.py", line 576,
>> in loadTestsFromName
>>    module = __import__('.'.join(parts_copy))
>>  File "test_SeqIO_online.py", line 62, in
>>    record = SeqIO.read(handle, format) # checks there is exactly one record
>>  File
>> "/Users/mh/sagestuff/sage-4.1/local/lib/python2.6/site-packages/Bio/SeqIO/__init__.py",
>> line 485, in read
>>    raise ValueError("No records found in handle")
>> ValueError: No records found in handle
>>
>> ...not sure what the problem might be with that.
>
> That is an online test using the NCBI's web services. This could
> be a transient failure due to the network.
>
> Peter
>
From biopython at maubp.freeserve.co.uk Mon Aug 17 17:57:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Aug 2009 22:57:40 +0100
Subject: [Biopython-dev] biopython-1.51 test log, sage inclusion
In-Reply-To:
References:
<320fb6e00908170928t1efc7778k672a5f6bb036cb06@mail.gmail.com>
<320fb6e00908171437y1c4565f8jb1a19a369d389357@mail.gmail.com>
Message-ID: <320fb6e00908171457w73ca3699y11dbe255cd2748df@mail.gmail.com>
On Mon, Aug 17, 2009 at 10:45 PM, Marshall Hampton wrote:
>
> Yep, I tried again and the test_SeqIO_online was ok, so I guess it was a
> transient failure.
Good :)
> I agree that copying my Tests folder isn't a great solution. I will try to
> increase my understanding of the biopython test framework - I am
> used to the Sage method of mainly using docstring tests.
If it helps, there is a whole chapter in our tutorial, but most
of this is aimed at people wanting to write unit tests for us.
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
Please point out any typos or things that can be clarified.
Thanks,
Peter
From bugzilla-daemon at portal.open-bio.org Tue Aug 18 06:01:25 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 18 Aug 2009 06:01:25 -0400
Subject: [Biopython-dev] [Bug 2619] Bio.PDB.MMCIFParser component MMCIFlex
commented out in setup.py
In-Reply-To:
Message-ID: <200908181001.n7IA1PWk030525@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2619
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-18 06:01 EST -------
Just to note the Ubuntu/Debian packages for Biopython list flex as a build
dependency, and patch our setup.py file to re-enable the Bio.PDB.mmCIF.MMCIFlex
extension. This is a neat solution until we can update our setup.py to detect
flex on its own.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From hlapp at gmx.net Tue Aug 18 12:09:15 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 18 Aug 2009 12:09:15 -0400
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
Message-ID:
On Aug 17, 2009, at 11:52 AM, Peter wrote:
> My impression from talking to OBF guys is that if we really want to we
> can do this, but it requires us (Biopython) to take care of installing
> and running git on an OBF machine.
That's how I would put it too. Moreover, if you as people who want
this and know more about it already than anyone else among root-l
can't be bothered to take the initiative to spearhead this on OBF
servers, the argument that OBF "sysadmins" (which in essence is all of
us who know how to do this) should do the work is a lot less strong
than it might have to be. I.e., if you don't feel this would be time
well invested for you, it is probably even less well invested for
other OBFers.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From biopython at maubp.freeserve.co.uk Tue Aug 18 12:39:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 18 Aug 2009 17:39:23 +0100
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To:
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
Message-ID: <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
On Tue, Aug 18, 2009 at 5:09 PM, Hilmar Lapp wrote:
>
> On Aug 17, 2009, at 11:52 AM, Peter wrote:
>
>> My impression from talking to OBF guys is that if we really want to we
>> can do this, but it requires us (Biopython) to take care of installing
>> and running git on an OBF machine.
>
> That's how I would put it too. Moreover, if you as people who want this and
> know more about it already than anyone else among root-l can't be bothered
> to take the initiative to spearhead this on OBF servers, the argument that
> OBF "sysadmins" (which in essence is all of us who know how to do this)
> should do the work is a lot less strong than it might have to be. I.e., if
> you don't feel this would be time well invested for you, it is probably even
> less well invested for other OBFers.
Sure. Right now I don't think anyone at Biopython knows exactly
what would be involved in running a gitserver, and it would take
some investment of time to get to that point.
In the long term I think running git on an OBF machine would be a
good idea, but I don't personally want to spend time learning how to
do that right now. By using github, we don't have to invest a lot of
upfront effort in configuring a git server right away.
I think it makes sense to just move Biopython to github in the short term,
in the medium term we can (expertise permitting) get a git mirror running
on an OBF machine, and then other tools like the git equivalent of
ViewCVS (and if need be then abandon github - we won't be locked
into anything permanent).
Peter
From fkauff at biologie.uni-kl.de Wed Aug 19 03:36:45 2009
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Wed, 19 Aug 2009 09:36:45 +0200
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
Message-ID: <4A8BAB8D.1000109@biologie.uni-kl.de>
Hi all,
On 08/17/2009 02:43 PM, Peter wrote:
> Hi all,
>
> Now that Biopython 1.51 is out (thanks Brad), we should
> discuss finally moving from CVS to git. This was something
> we talked about at BOSC/ISMB 2009, but not everyone was
> there. We have two main options:
>
> (a) Move from CVS (on the OBF servers) to github. All our
> developers will need to get github accounts, and be added
> as "collaborators" to the existing github repository. I would
> want a mechanism in place to backup the repository to the
> OBF servers (Bartek already has something that should
> work).
>
>
I agree, this sounds at this point like the most feasible way to go. In
the long run we can still reconsider running git on the OBF servers, but
at this point running such a server is an additional amount of work that
brings no additional benefit.
Cheers,
Frank
> (b) Move from CVS to git (on the OBF servers). All our
> developers can continue to use their existing OBF accounts.
> Bartek's existing scripts could be modified to push the
> updates from this OBF git repository onto github.
>
> In either case, there will be some "plumbing" work required,
> for example I'd like to continue to offer a recent source code
> dump at http://biopython.open-bio.org/SRC/biopython/ etc.
>
> Given we don't really seem to have the expertise "in house"
> to run an OBF git server ourselves right now, option (a) is
> simplest, and as I recall those of us at BOSC were OK
> with this plan.
>
> Assuming we go down this route (CVS to github), everyone
> with an existing CVS account should setup a github account
> if they want to continue to have commit access (e.g. Frank,
> Iddo). I would suggest that initially you get used to working
> with git and github BEFORE trying anything directly on what
> would be the "official" repository. It took me a while and I'm
> still learning ;)
>
> Is this agreeable? Are there any other suggestions?
>
> [Once this is settled, we can talk about things like merge
> requests and if they should be accompanied by a Bugzilla
> ticket or not.]
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
>
From matzke at berkeley.edu Wed Aug 19 04:56:59 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Wed, 19 Aug 2009 01:56:59 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <4A89B411.4090501@berkeley.edu>
References: <20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
<4A8081B3.2080600@berkeley.edu>
<20090811131019.GW12604@sobchak.mgh.harvard.edu>
<4A89B411.4090501@berkeley.edu>
Message-ID: <4A8BBE5B.10705@berkeley.edu>
OK, I nailed the bug, which stemmed from HTML links inside GBIF XML
results that in some situations were screwing up the parsing. So I've
updated the tutorial to add the chunk about downloading an arbitrarily
large number of records, in user-specified increments, with an
appropriate time-delay between server requests.
Also added a chunk on classifying records into user-specified geographic
areas based on their latitude/longitude.
Also updated the test scripts and test results files, and deleted some
remaining loose/unnecessary files.
Updated tutorial: http://biopython.org/wiki/BioGeography#Tutorial
Github commits: http://github.com/nmatzke/biopython/commits/Geography
I think I've reached a good stopping point for the moment, I welcome
comments on the tutorial and/or on the prospects for turning this into
an official biopython module, etc.
Thanks again, and cheers!
Nick
Nick Matzke wrote:
> Pencils down update: I have uploaded the relevant test scripts and data
> files to git, and deleted old loose files.
> http://github.com/nmatzke/biopython/commits/Geography
>
> Here is a simple draft tutorial:
> http://biopython.org/wiki/BioGeography#Tutorial
>
> Strangely, while working on the tutorial I discovered that I did
> something somewhere in the last revision that is messing up the parsing
> of automatically downloaded records from GBIF, I am tracking this down
> currently and will upload as soon as I find it.
>
> I would like to thank everyone for the opportunity to participate in
> GSoC, and to thank everyone for their help. For me, this summer turned
> into more of a "growing from a scripter to a programmer" summer than I
> expected initially. As a result I spent more time refactoring and
> retracing my steps than I figured. However I think the resulting main
> product, a GBIF interface and associated tools, is much better than it
> would have been without the advice & encouragement of Brad, Hilmar, etc.
> I will be using this for my own research and will continue developing it.
>
> Cheers!
> Nick
>
>
> Brad Chapman wrote:
>> Hi Nick;
>>
>>> Summary: Major focus is getting the GBIF access/search/parse module
>>> into "done"/submittable shape. This primarily requires getting the
>>> documentation and testing up to biopython specs. I have a fair bit
>>> of documentation and testing, need advice (see below) for specifics
>>> on what it should look like.
>>
>> Awesome. Thanks for working on the cleanup for this.
>>
>>> OK, I will do this. Should I try and figure out the unittest stuff?
>>> I could use a simple example of what this is supposed to look like.
>>
>> In addition to Peter's pointers, here is a simple example from a
>> small thing I wrote:
>>
>> http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py
>>
>> You can copy/paste the unit test part to get a base, and then
>> replace the t_* functions with your own real tests.
>>
>> Simple scripts that generate consistent output are also fine; that's
>> the print and compare approach.
>>
>>>> - What is happening with the Nodes_v2 and Treesv2 files? They look
>>>> like duplicates of the Nexus Nodes and Trees with some changes.
>>>> Could we roll those changes into the main Nexus code to avoid
>>>> duplication?
>>> Yeah, these were just copies with your bug fix, and with a few mods I
> used to track crashes. Presumably I don't need these after a
> fresh download of biopython.
>>
>> Cool. It would be great if we could weed these out as well.
>>
>>> The API is really just the interface with GBIF. I think developing a
>>> cookbook entry is pretty easy, I assume you want something like one
>>> of the entries in the official biopython cookbook?
>>
>> Yes, that would work great. What I was thinking of are some examples
>> where you provide background and motivation: Describe some useful
>> information you want to get from GBIF, and then show how to do it.
>> This is definitely the most useful part as it gives people working
>> examples to start with. From there they can usually browse the lower
>> level docs or code to figure out other specific things.
>>
>>> Re: API documentation...are you just talking about the function
>>> descriptions that are typically in """ """ strings beneath the
>>> function definitions? I've got that done. Again, if there is more,
>>> an example of what it should look like would be useful.
>>
>> That looks great for API level docs. You are right on here; for this
>> week I'd focus on the cookbook examples and cleanup stuff.
>>
>> My other suggestion would be to rename these to follow Biopython
>> conventions, something like:
>>
>> gbif_xml -> GbifXml
>> shpUtils -> ShapefileUtils
>> geogUtils -> GeographyUtils
>> dbfUtils -> DbfUtils
>>
>> The *Utils might have underscores if they are not intended to be
>> called directly.
>>
>> Thanks for all your hard work,
>> Brad
>>
>
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================
From bugzilla-daemon at portal.open-bio.org Wed Aug 19 05:29:36 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 19 Aug 2009 05:29:36 -0400
Subject: [Biopython-dev] [Bug 2619] Bio.PDB.MMCIFParser component MMCIFlex
commented out in setup.py
In-Reply-To:
Message-ID: <200908190929.n7J9TaR0006301@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2619
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-19 05:29 EST -------
(In reply to comment #5)
> Just to note the Ubuntu/Debian packages for Biopython list flex as a build
> dependency, and patch our setup.py file to re-enable the Bio.PDB.mmCIF.MMCIFlex
> extension. This is a neat solution until we can update our setup.py to detect
> flex on its own.
>
Alex Lancaster has kindly done the same for the latest Fedora RPM package
(Biopython 1.51). See
https://admin.fedoraproject.org/community/?package=python-biopython#package_maintenance
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bartek at rezolwenta.eu.org Wed Aug 19 05:45:20 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Wed, 19 Aug 2009 11:45:20 +0200
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
Message-ID: <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
Hi guys,
On Tue, Aug 18, 2009 at 6:39 PM, Peter wrote:
> On Tue, Aug 18, 2009 at 5:09 PM, Hilmar Lapp wrote:
>>
>> On Aug 17, 2009, at 11:52 AM, Peter wrote:
>>
>>> My impression from talking to OBF guys is that if we really want to we
>>> can do this, but it requires us (Biopython) to take care of installing
>>> and running git on an OBF machine.
>>
>> That's how I would put it too. Moreover, if you as people who want this and
>> know more about it already than anyone else among root-l can't be bothered
>> to take the initiative to spearhead this on OBF servers, the argument that
>> OBF "sysadmins" (which in essence is all of us who know how to do this)
>> should do the work is a lot less strong than it might have to be. I.e., if
>> you don't feel this would be time well invested for you, it is probably even
>> less well invested for other OBFers.
>
> Sure. Right now I don't think anyone at Biopython knows exactly
> what would be involved in running a gitserver, and it would take
> some investment of time to get to that point.
>
I think there is some grave misunderstanding here.
There is nothing magical or difficult in installing git on OBF
servers. It's just a package.
There is no effort to be spearheaded by anyone. The command "yum install git"
needs to be run by someone with root privileges. That's it. It's
absolutely enough
to allow people with obf developer accounts to use git for development.
As for running a git-protocol-server, this is a bit more complicated
and can be done in many more
ways than with CVS. I don't think that anyone is expecting OBF to
provide git repository
hosting in a standardized way (currently only BioRuby uses git and
they seem to be fine
with github, similar for biopython)
The importance of having git installed on OBF machines comes from the
fact that it can
be useful for many things even if we don't host the repository on OBF servers.
Most importantly, for doing regular backups of git branch from github
to OBF servers we
need a machine with git installed. Currently it's my work machine, but
I think it would be a
much better setup if we could do it directly from an OBF machine.
> In the long term I think running git on an OBF machine would be a
> good idea, but I don't personally want to spend time learning how to
> do that right now. By using github, we don't have to invest a lot of
> upfront effort in configuring a git server right away.
>
> I think it makes sense to just move Biopython to github in the short term,
> in the medium term we can (expertise permitting) get a git mirror running
> on an OBF machine, and then other tools like the git equivalent of
> ViewCVS (and if need be then abandon github - we won't be locked
> into anything permanent).
>
I don't quite understand what you mean by "running git". Once we
have git installed,
you can use push and pull over ssh to a branch sitting on OBF machine.
We can also
make the mirror available for people (read-only) through http (just
place the repo in a
directory published with apache, no extra software required). But I
don't think this makes
much sense if we actually want to use collaborative features of
github. In my opinion this
would only bring confusion: either we make the github branch official or not.
The most difficult part is the "viewCVS" replacement. There is the
gitweb.cgi script, which
is (in my opinion) inferior to github interface. Installing it
wouldn't be difficult (it's CGI) so
we could do it, but is it better than github here? I'm not sure.. (you
can see how it would look
on a slightly out-of-date biopython branch on my machine:
http://83.243.39.60/cgi-bin/gitweb.cgi?p=biopython.git;a=summary )
To summarize, I think that the only thing we really need from OBF is
to have git installed
(Hilmar, can you help with this? I tried to even compile it on
dev.open-bio.org but there it depends
on multiple libraries and I gave up...)
best regards
Bartek
From biopython at maubp.freeserve.co.uk Wed Aug 19 05:58:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 19 Aug 2009 10:58:12 +0100
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
Message-ID: <320fb6e00908190258u2262a3b9s256dff5db38ddd41@mail.gmail.com>
On Wed, Aug 19, 2009 at 10:45 AM, Bartek
Wilczynski wrote:
>> Sure. Right now I don't think anyone at Biopython knows exactly
>> what would be involved in running a gitserver, and it would take
>> some investment of time to get to that point.
>
> I think there is some grave misunderstanding here.
You have certainly clarified a few things for me ;)
> There is nothing magical or difficult in installing git on OBF
> servers. It's just a package. There is no effort to be spearheaded
> by anyone. The command "yum install git" needs to be run by
> someone with root privileges. That's it. It's absolutely enough
> to allow people with obf developer accounts to use git for
> development.
Oh. That is less complicated than I realised - assuming all the
existing dev accounts have SSH access.
> As for running a git-protocol-server, this is a bit more complicated
> and can be done in many more ways than with CVS. I don't think
> that anyone is expecting OBF to provide git repository hosting in
> a standardized way (currently only BioRuby uses git and they
> seem to be fine with github, similar for biopython)
>
> The importance of having git installed on OBF machines comes
> from the fact that it can be useful for many things even if we don't
> host the repository on OBF servers.
I had been assuming we would also need the git-protocol-server,
and to mess about with the firewall and perhaps webserver, but
if I understand you correctly even *just* the core git tool running
on the OBF would be useful (even if just for backups). So let's try
and do that...
> ...
> To summarize, I think that the only thing we really need from OBF is
> to have git installed
Any of the OBF server admins should be able to install the git
*package* for us (this should be trivial as long as the Linux OS
is fairly up to date). We should probably ask via a support
request on the root-l mailing list... let's just give Hilmar a
chance to reply first.
Peter
From hlapp at gmx.net Wed Aug 19 18:17:20 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 19 Aug 2009 18:17:20 -0400
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
Message-ID: <8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
On Aug 19, 2009, at 5:45 AM, Bartek Wilczynski wrote:
> To summarize, I think that the only thing we really need from OBF is
> to have git installed
> (Hilmar, can you help with this? I tried to even compile it on
> dev.open-bio.org but there it depends on multiple libraries and I
> gave up...)
Post to root-l (copied here, for convenience) and ask if someone can
set you up with the necessary privileges, assuming that you are
volunteering to do the installation?
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From biopython at maubp.freeserve.co.uk Thu Aug 20 06:01:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 11:01:54 +0100
Subject: [Biopython-dev] Deprecating Bio.AlignAce and Bio.MEME?
Message-ID: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com>
Hi Bartek,
With the introduction of Bio.Motif, we declared Bio.AlignAce and
Bio.MEME as obsolete as of release 1.50 in the DEPRECATED file. I note
we didn't update the module docstrings themselves to make this more
prominent.
Do you think we can officially deprecate Bio.AlignAce and Bio.MEME for
the next release (i.e. put this in their docstrings and issue
deprecation warnings)?
Peter
From barwil at gmail.com Thu Aug 20 06:10:23 2009
From: barwil at gmail.com (Bartek Wilczynski)
Date: Thu, 20 Aug 2009 12:10:23 +0200
Subject: [Biopython-dev] Deprecating Bio.AlignAce and Bio.MEME?
In-Reply-To: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com>
References: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com>
Message-ID: <8b34ec180908200310ue02430fsbd18116f3389bf89@mail.gmail.com>
On Thu, Aug 20, 2009 at 12:01 PM, Peter wrote:
> Hi Bartek,
>
> With the introduction of Bio.Motif, we declared Bio.AlignAce and
> Bio.MEME as obsolete as of release 1.50 in the DEPRECATED file. I note
> we didn't update the module docstrings themselves to make this more
> prominent.
>
> Do you think we can officially deprecate Bio.AlignAce and Bio.MEME for
> the next release (i.e. put this in their docstrings and issue
> deprecation warnings)?
I think so. Should I change something in the docstrings?
Bartek
From biopython at maubp.freeserve.co.uk Thu Aug 20 06:20:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 11:20:30 +0100
Subject: [Biopython-dev] Deprecating Bio.AlignAce and Bio.MEME?
In-Reply-To: <8b34ec180908200310ue02430fsbd18116f3389bf89@mail.gmail.com>
References: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com>
<8b34ec180908200310ue02430fsbd18116f3389bf89@mail.gmail.com>
Message-ID: <320fb6e00908200320k6d8902e0r4742a92a5956b1ed@mail.gmail.com>
On Thu, Aug 20, 2009 at 11:10 AM, Bartek Wilczynski wrote:
>> Do you think we can officially deprecate Bio.AlignAce and Bio.MEME for
>> the next release (i.e. put this in their docstrings and issue
>> deprecation warnings)?
>
> I think so. Should I change something in the docstrings?
>
The start of the module docstring should be a one line description of
the module - just include "(DEPRECATED)" at the end. Then it will
show up nicely in the API docs: http://biopython.org/DIST/docs/api/
If you look at that page you should be able to see entries like this:
* Bio.Fasta: Utilities for working with FASTA-formatted sequences (DEPRECATED)
* Bio.FilteredReader: Code for more fancy file handles (OBSOLETE)
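[Editor's note: for completeness, the deprecation warning mentioned here is typically issued at import time with the standard warnings module, along these lines (a sketch only; the exact message text and placement are illustrative, not the actual Biopython code):]

```python
import warnings

# Placed at the top of e.g. Bio/AlignAce/__init__.py, so that users
# see the warning as soon as they import the deprecated module.
warnings.warn(
    "Bio.AlignAce is deprecated; please use Bio.Motif instead.",
    DeprecationWarning,
)
```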
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 20 07:28:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 12:28:46 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
Message-ID: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
Hi all,
You may recall a thread back in June with Cedar Mckay (cc'd - not
sure if he follows the dev list or not) about indexing large sequence
files - specifically FASTA files but any sequential file format. I posted
some rough code which did this building on Bio.SeqIO:
http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html
I have since generalised this, and have something which I think
would be ready for merging into the main trunk for wider testing.
The code is on github on my "index" branch at the moment,
http://github.com/peterjc/biopython/commits/index
This would add a new function to Bio.SeqIO, provisionally called
indexed_dict, which takes two arguments: filename and format
name (e.g. "fasta"), plus an optional alphabet. This will return a
dictionary like object, using SeqRecord identifiers as keys, and
SeqRecord objects as values. There is (deliberately) no way to
allow the user to choose a different keying mechanism (although
I can see how to do this at a severe performance cost).
As with the Bio.SeqIO.convert() function, the new addition of
Bio.SeqIO.indexed_dict() will be the only public API change.
Everything else is deliberately private, allowing us the freedom
to change details if required in future.
The essential idea is the same as in my post in June. Nothing about
the existing SeqIO framework is changed (so this won't break
anything). For each file we scan through it looking for new records,
note the file offset, and extract the identifier string. These are stored
in a normal (private) Python dictionary. On requesting a record, we
seek to the appropriate offset and parse the data into a SeqRecord.
For simple file formats we can do this by calling Bio.SeqIO.parse().
For complex file formats (such as SFF files, or anything else with
important information in a header), the implementation is a little
more complicated - but we can provide the same API to the user.
Note that the indexing step does not fully parse the file, and
thus may ignore corrupt/invalid records. Only when (if) they are
accessed will this trigger a parser error. This is a shame, but
means the indexing can (in general) be done very fast.
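[Editor's note: the scan-and-seek idea described above can be sketched roughly as follows. This is a minimal FASTA-only illustration with hypothetical names (FastaIndex is not part of Biopython), not the actual Bio.SeqIO.indexed_dict implementation:]

```python
class FastaIndex(object):
    """Dict-like read-only access to a FASTA file via record offsets.

    On construction we scan the file once, storing the byte offset of
    each ">" header line keyed by the record identifier. Records are
    only parsed when accessed, by seeking back to the stored offset.
    """

    def __init__(self, filename):
        self._filename = filename
        self._offsets = {}
        with open(filename) as handle:
            while True:
                offset = handle.tell()
                line = handle.readline()
                if not line:
                    break
                if line.startswith(">"):
                    # First word after ">" is the record identifier.
                    key = line[1:].split(None, 1)[0]
                    self._offsets[key] = offset

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, key):
        with open(self._filename) as handle:
            handle.seek(self._offsets[key])
            header = handle.readline()[1:].strip()
            seq = []
            while True:
                line = handle.readline()
                if not line or line.startswith(">"):
                    break
                seq.append(line.strip())
            return header, "".join(seq)
```

[The real code additionally dispatches on format name, returns SeqRecord objects, and handles formats with important header information, but the offset dictionary plus seek-on-access pattern is the core of why indexing is fast and access is near-instant.]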
I am proposing to merge all of this (except the SFF file support),
but would welcome feedback (even after a merger). I already
have basic unit tests, covering the following SeqIO file formats:
"ace", "embl", "fasta", "fastq" (all three variants), "genbank"/"gb",
"ig", "phd", "pir", and "swiss" (plus "sff" but I don't think that
parser is ready to be checked in yet).
As an example using the new code, it takes just a few seconds
to index this 238MB GenBank file, and record access is almost
instant:
>>> from Bio import SeqIO
>>> gb_dict = SeqIO.indexed_dict("gbpln1.seq", "gb")
>>> len(gb_dict)
59918
>>> gb_dict.keys()[:5]
['AB246540.1', 'AB038764.1', 'AB197776.1', 'AB036027.1', 'AB161026.1']
>>> record = gb_dict["AB433451.1"]
>>> print record.id, len(record), len(record.features)
AB433451.1 590 2
And using a 1.3GB FASTQ file, indexing is about a minute, and
again, record access is almost instant:
>>> from Bio import SeqIO
>>> fq_dict = SeqIO.indexed_dict("SRR001666_1.fastq", "fastq")
>>> len(fq_dict)
7047668
>>> fq_dict.keys()[:4]
['SRR001666.2320093', 'SRR001666.2320092', 'SRR001666.1250635',
'SRR001666.2354360']
>>> record = fq_dict["SRR001666.2765432"]
>>> print record.id, record.seq
SRR001666.2765432 CTGGCGGCGGTGCTGGAAGGACTGACCCGCGGCATC
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 20 08:24:34 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 13:24:34 +0100
Subject: [Biopython-dev] Deprecating Bio.AlignAce and Bio.MEME?
In-Reply-To: <8b34ec180908200450r15823d18q87a8cbfccbdc9b13@mail.gmail.com>
References: <320fb6e00908200301l5c44cc43oc8b332b844ed6e16@mail.gmail.com>
<8b34ec180908200310ue02430fsbd18116f3389bf89@mail.gmail.com>
<320fb6e00908200320k6d8902e0r4742a92a5956b1ed@mail.gmail.com>
<8b34ec180908200450r15823d18q87a8cbfccbdc9b13@mail.gmail.com>
Message-ID: <320fb6e00908200524k126ca330n86c3e8516777113c@mail.gmail.com>
On Thu, Aug 20, 2009 at 12:50 PM, Bartek Wilczynski wrote:
>
> On Thu, Aug 20, 2009 at 12:20 PM, Peter wrote:
>
>> The start of the module docstring should be a one line description of
>> the module - just include "(DEPRECATED)" at the end. Then it will
>> show up nicely in the API docs: http://biopython.org/DIST/docs/api/
>
> Done. Should be in CVS now.
Sorry I was unclear - I was only talking about the docstrings. In
addition we need to actually issue a deprecation warning (via the
warnings module), and update the DEPRECATED file in the root
folder. I've done this in CVS - sorry for any confusion.
I've also tried to clarify the procedure on the wiki,
http://biopython.org/wiki/Deprecation_policy
If you can add a couple of examples to the AlignAce and MEME module
docstrings showing a short example using the deprecated module, and
the equivalent using Bio.Motif, that would be great.
Thanks,
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 20 08:43:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 13:43:07 +0100
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
<8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com>
Message-ID: <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com>
On Thu, Aug 20, 2009 at 10:14 AM, Bartek
Wilczynski wrote:
>
> Hi all,
>
> As the biopython project is moving now its development from CVS to
> git, it would be very helpful for us if git software was installed on
> dev.open-bio.org machine.
>
> The most convenient for us would be if someone with root privileges on
> this machine would install the package (it's in the centos
> repository). I can also do the installation myself, as suggested by
> Hilmar (assuming I get the permissions required for package
> installation, account=bartek).
Bartek - do you think we need git on any of the other OBF machines
in addition to dev.open-bio.org (current IP 207.154.17.71)?
However, I'd like to have http://biopython.org/SRC/biopython
kept up to date (also available via www.biopython.org and
biopython.open-bio.org - these are all the same machine,
IP 207.154.17.70). It might be easiest to do that with git
installed on that machine too - or do you think it would be
simpler to push the latest files from dev.open-bio.org instead?
There is also the public CVS server, cvs.biopython.org aka
cvs.open-bio.org (IP 207.154.17.75) but I doubt we will need
to worry about that one in future.
Peter
From bartek at rezolwenta.eu.org Thu Aug 20 09:06:53 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 20 Aug 2009 15:06:53 +0200
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
<8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com>
<320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com>
Message-ID: <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com>
On Thu, Aug 20, 2009 at 2:43 PM, Peter wrote:
> Bartek - do you think we need git on any of the other OBF machines
> in addition to dev.open-bio.org (current IP 207.154.17.71)?
What we _need_ is a single machine, where we can run scripts from cron
and where git is installed. That's why I requested the installation on
dev.open-bio machine (it happens to be the only one I have an account
on). The idea is to run something from cron and pull from github to
have a backup copy of an up-to-date branch. The scripts can (after
each update) push to other machines.
>
> However, I'd like to have http://biopython.org/SRC/biopython
> kept up to date (also available via www.biopython.org and
> biopython.open-bio.org - these are all the same machine,
> IP 207.154.17.70). It might be easiest to do that with git
> installed on that machine too - or do you think it would be
> simpler to push the latest files from dev.open-bio.org instead?
There is no need for git on the www-server machine if we only want to
publish the code, or a read-only git branch over http for download. I
think it's easier to have a single place where cron jobs are run.
However, if we wanted to hook the scripts to github notifications
rather than to cron, then we would need some way to trigger scripts by
a hit to a webpage, in which case it _might_ be easier to set things up
on the machine with a web server. But I think we should be fine with
the machinery running on the dev machine.
There is one remaining issue: we would need to have some directory
where the branch would be kept. Currently it sits in my home directory,
which probably should be changed to something like
/home/biopython/git_branch. I am in the biopython group, but currently
the permissions on /home/biopython do not even allow me to list its
contents, let alone write into it. I think it would be best to set the
scripts to run as the biopython user.
>
> There is also the public CVS server, cvs.biopython.org aka
> cvs.open-bio.org (IP 207.154.17.75) but I doubt we will need
> to worry about that one in future.
Certainly. I don't think we need to worry about this one.
Bartek
From biopython at maubp.freeserve.co.uk Thu Aug 20 09:24:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 14:24:15 +0100
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
<8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com>
<320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com>
<8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com>
Message-ID: <320fb6e00908200624i43b650e5q8478b0b5c12af67b@mail.gmail.com>
On Thu, Aug 20, 2009 at 2:06 PM, Bartek
Wilczynski wrote:
> On Thu, Aug 20, 2009 at 2:43 PM, Peter wrote:
>
>> Bartek - do you think we need git on any of the other OBF machines
>> in addition to dev.open-bio.org (current IP 207.154.17.71)?
>
> What we _need_ is a single machine, where we can run scripts from cron
> and where git is installed. That's why I requested the installation on
> dev.open-bio machine (it happens to be the only one I have an account
> on). The idea is to run something from cron and pull from github to
> have a backup copy of an up-to-date branch. The scripts can (after
> each update) push to other machines.
>
> ...
>
> There is no need for git on the www-server machine if we only want to
> publish the code, or a read-only git branch over http for download. I
> think it's easier to have a single place where cron jobs are run.
So just push a dump of the latest code to http://biopython.org/SRC/biopython
or push fresh epydoc api docs to http://biopython.org/DIST/docs/api-live/
or whatever from dev.open-bio.org. That sounds fine to me.
> There is one remaining issue: We would need to have some directory
> where the branch would be kept. Currently it sits in my home directory
> which probably should be changed to something like
> /home/biopython/git_branch. I am in biopython group, but currently
> /home/biopython does not even allow me to see /home/biopython, not to
> mention writing into it. I think it would be the best to set the
> scripts to run as biopython user.
Yes - we'll need some OBF admin input there...
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 20 09:28:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 14:28:22 +0100
Subject: [Biopython-dev] [Root-l] Moving from CVS to git
In-Reply-To:
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
Message-ID: <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com>
On Thu, Aug 20, 2009 at 2:15 PM, Chris Fields wrote:
>
> I would be interested in that as well.
>
> It appears dev.open-bio.org has apt (there is an /etc/apt directory), but
> I'm failing to find apt-get in my PATH. Haven't installed on it yet, but a
> packaged version would probably be easier.
If we can have a packaged version of git on dev.open-bio.org from the
Linux distro, that would be easiest (especially for keeping it up to date).
> Also, are we planning ro mirrors on portal for anon access, or should we
> (ab)use github for that purpose? To me a ro mirror sorta defeats the
> purpose of git...
For Biopython we plan to use github (initially at least) for committing
changes. This will also allow anonymous access.
A public OBF read only mirror of a git repository is still useful for people to
clone from, and keep the local copy up to date - plus as a backup for
if/when github is congested or unavailable. But not essential.
Peter
From dag at sonsorol.org Thu Aug 20 09:41:55 2009
From: dag at sonsorol.org (Chris Dagdigian)
Date: Thu, 20 Aug 2009 09:41:55 -0400
Subject: [Biopython-dev] [Root-l] Moving from CVS to git
In-Reply-To: <320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
<320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com>
Message-ID:
Git is now installed via 'yum' on dev.open-bio.org
Regards,
Chris
On Aug 20, 2009, at 9:28 AM, Peter wrote:
> On Thu, Aug 20, 2009 at 2:15 PM, Chris Fields
> wrote:
>>
>> I would be interested in that as well.
>>
>> It appears dev.open-bio.org has apt (there is an /etc/apt
>> directory), but
>> I'm failing to find apt-get in my PATH. Haven't installed on it
>> yet, but a
>> packaged version would probably be easier.
>
> If we can have a packaged version of git on dev.open-bio.org from the
> Linux distro, that would be easiest (especially for keeping it up to
> date).
>
>> Also, are we planning ro mirrors on portal for anon access, or
>> should we
>> (ab)use github for that purpose? To me a ro mirror sorta defeats the
>> purpose of git...
>
> For Biopython we plan to use github (initially at least) for
> committing
> changes. This will also allow anonymous access.
>
> A public OBF read only mirror of a git repository is still useful
> for people to
> clone from, and keep the local copy up to date - plus as a backup for
> if/when github is congested or unavailable. But not essential.
>
> Peter
>
> _______________________________________________
> Root-l mailing list
> Root-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/root-l
From mjldehoon at yahoo.com Thu Aug 20 09:58:03 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 20 Aug 2009 06:58:03 -0700 (PDT)
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
Message-ID: <818036.61284.qm@web62408.mail.re1.yahoo.com>
I just have two suggestions:
Since indexed_dict returns a dictionary-like object, it may make sense for the _IndexedSeqFileDict to inherit from a dict.
Another issue is whether we can fold indexed_dict and to_dict into one.
Right now we have
def to_dict(sequences, key_function=None) :
def indexed_dict(filename, format, alphabet=None) :
What if we have a single function "dictionary" that can take sequences, a handle, or a filename, and optionally the format, alphabet, key_function, and a parameter "indexed" that indicates if the file should be indexed or kept in memory? Or something like that.
Otherwise, the code looks really nice. Thanks!
--Michiel
--- On Thu, 8/20/09, Peter wrote:
> From: Peter
> Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
> To: "Biopython-Dev Mailing List"
> Cc: "Cedar McKay"
> Date: Thursday, August 20, 2009, 7:28 AM
> Hi all,
>
> You may recall a thread back in June with Cedar Mckay (cc'd
> - not
> sure if he follows the dev list or not) about indexing
> large sequence
> files - specifically FASTA files but any sequential file
> format. I posted
> some rough code which did this building on Bio.SeqIO:
> http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html
>
> I have since generalised this, and have something which I
> think
> would be ready for merging into the main trunk for wider
> testing.
> The code is on github on my "index" branch at the moment,
> http://github.com/peterjc/biopython/commits/index
>
> This would add a new function to Bio.SeqIO, provisionally
> called
> indexed_dict, which takes two arguments: filename and
> format
> name (e.g. "fasta"), plus an optional alphabet. This will
> return a
> dictionary like object, using SeqRecord identifiers as
> keys, and
> SeqRecord objects as values. There is (deliberately) no way
> to
> allow the user to choose a different keying mechanism
> (although
> I can see how to do this at a severe performance cost).
>
> As with the Bio.SeqIO.convert() function, the new addition
> of
> Bio.SeqIO.indexed_dict() will be the only public API
> change.
> Everything else is deliberately private, allowing us the
> freedom
> to change details if required in future.
>
> The essential idea is the same as my post in June. Nothing about
> the existing SeqIO framework is changed (so this won't break
> anything). For each file we scan through it looking for new records,
> noting the file offset, and extracting the identifier string.
> These are stored
> in a normal (private) Python dictionary. On requesting a
> record, we
> seek to the appropriate offset and parse the data into a
> SeqRecord.
> For simple file formats we can do this by calling
> Bio.SeqIO.parse().
>
> For complex file formats (such as SFF files, or anything
> else with
> important information in a header), the implementation is a
> little
> more complicated - but we can provide the same API to the
> user.
>
> Note that the indexing step does not fully parse the file,
> and
> thus may ignore corrupt/invalid records. Only when (if)
> they are
> accessed will this trigger a parser error. This is a shame,
> but
> means the indexing can (in general) be done very fast.
>
> I am proposing to merge all of this (except the SFF file
> support),
> but would welcome feedback (even after a merger). I
> already
> have basic unit tests, covering the following SeqIO file
> formats:
> "ace", "embl", "fasta", "fastq" (all three variants),
> "genbank"/"gb",
> "ig", "phd", "pir", and "swiss" (plus "sff" but I don't
> think that
> parser is ready to be checked in yet).
>
> An example using the new code, this takes just a few
> seconds
> to index this 238MB GenBank file, and record access is
> almost
> instant:
>
> >>> from Bio import SeqIO
> >>> gb_dict = SeqIO.indexed_dict("gbpln1.seq",
> "gb")
> >>> len(gb_dict)
> 59918
> >>> gb_dict.keys()[:5]
> ['AB246540.1', 'AB038764.1', 'AB197776.1', 'AB036027.1',
> 'AB161026.1']
> >>> record = gb_dict["AB433451.1"]
> >>> print record.id, len(record),
> len(record.features)
> AB433451.1 590 2
>
> And using a 1.3GB FASTQ file, indexing is about a minute,
> and
> again, record access is almost instant:
>
> >>> from Bio import SeqIO
> >>> fq_dict =
> SeqIO.indexed_dict("SRR001666_1.fastq", "fastq")
> >>> len(fq_dict)
> 7047668
> >>> fq_dict.keys()[:4]
> ['SRR001666.2320093', 'SRR001666.2320092',
> 'SRR001666.1250635',
> 'SRR001666.2354360']
> >>> record = fq_dict["SRR001666.2765432"]
> >>> print record.id, record.seq
> SRR001666.2765432 CTGGCGGCGGTGCTGGAAGGACTGACCCGCGGCATC
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
From biopython at maubp.freeserve.co.uk Thu Aug 20 10:13:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 15:13:00 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <818036.61284.qm@web62408.mail.re1.yahoo.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
<818036.61284.qm@web62408.mail.re1.yahoo.com>
Message-ID: <320fb6e00908200713q3fddd010x8f355260bb98b063@mail.gmail.com>
On Thu, Aug 20, 2009 at 2:58 PM, Michiel de Hoon wrote:
>
> I just have two suggestions:
>
> Since indexed_dict returns a dictionary-like object, it may make sense
> for the _IndexedSeqFileDict to inherit from a dict.
We'd have to override things like values() to prevent explosions in
memory, and just raise a NotImplementedError. But yes, good point.
> Another issue is whether we can fold indexed_dict and to_dict into one.
> Right now we have
>
> def to_dict(sequences, key_function=None) :
>
> def indexed_dict(filename, format, alphabet=None) :
>
> What if we have a single function "dictionary" that can take sequences, a
> handle, or a filename, and optionally the format, alphabet, key_function,
> and a parameter "indexed" that indicates if the file should be indexed or
> kept into memory? Or something like that.
I wondered about this, but there are a couple of important differences
between my file indexer and the existing to_dict function.
For the Bio.SeqIO.to_dict() function, the optional key_function argument
maps a SeqRecord to the desired index (by default the record's id is used).
Supporting a key_function for indexing files in the same way would mean
every single record in the file must be parsed into a SeqRecord while
building the index. This is possible, but would really really slow things
down - and while I considered it, I don't like this idea at all. Instead each
format indexer has essentially got a "mini parser" which just extracts
the id string, so things are much much faster.
Also, the to_dict function can be used on any sequences - not
just from a file. They could be a list of SeqRecords, or a generator
expression filtering output from Bio.SeqIO.parse(). Anything at all
really.
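[To illustrate the distinction being drawn here, a minimal hypothetical
stand-in for the Bio.SeqIO.to_dict behaviour (not the actual Biopython
code) shows why key_function forces every record to be fully parsed,
which is exactly what the offset index avoids:]

```python
# Toy re-implementation of the to_dict idea, for illustration only.
# Because key_function receives a whole record object, every record in
# the input must already be parsed before the dictionary can be built.
def to_dict(records, key_function=None):
    """Build an in-memory dict from any iterable of record objects."""
    if key_function is None:
        key_function = lambda rec: rec.id
    d = {}
    for record in records:
        key = key_function(record)
        if key in d:
            raise ValueError("Duplicate key '%s'" % key)
        d[key] = record
    return d
```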
Finally, I had better explain my thoughts on indexing and handles versus
filenames. For the SeqIO (and AlignIO etc) parsers, any handle which
supports the basic read/readline/iteration functionality can be used.
For the indexed_dict() function as written, we need to keep the handle
open for as long as the dictionary is kept in memory. We also must have
a handle which supports seek and tell (e.g. not a urllib handle, or
compressed files). Finally, the mode the file was opened in can be
important (e.g. for SFF files universal newlines mode must not be
used). So while indexed_dict could take a file handle (instead of a
filename) there are a lot of provisos. I felt just taking a filename was
the simplest solution here.
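[The scan-and-seek scheme discussed in this thread can be sketched for
the simplest case, a plain FASTA file. This is a hypothetical toy
version for illustration, not the Bio.SeqIO internals: the "mini
parser" records only each id and its tell() offset, and full parsing
happens on access via seek():]

```python
# Toy offset index for a plain FASTA file (illustration only).
def build_fasta_index(filename):
    """Map record id -> file offset of its '>' header line."""
    index = {}
    with open(filename) as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(">"):
                # The first whitespace-separated word after '>' is the id.
                index[line[1:].split(None, 1)[0]] = offset
    return index

def fetch_fasta_record(filename, index, key):
    """Seek to the stored offset and parse just that one record."""
    with open(filename) as handle:
        handle.seek(index[key])
        title = handle.readline()[1:].rstrip()
        seq_lines = []
        for line in handle:
            if line.startswith(">"):
                break  # start of the next record
            seq_lines.append(line.strip())
    return title, "".join(seq_lines)
```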
> Otherwise, the code looks really nice. Thanks!
Great - thanks for your comments.
Peter
From bartek at rezolwenta.eu.org Thu Aug 20 10:19:50 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 20 Aug 2009 16:19:50 +0200
Subject: [Biopython-dev] [Root-l] Moving from CVS to git
In-Reply-To:
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
<320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com>
Message-ID: <8b34ec180908200719k74ebe1ccqa9cdf61684963997@mail.gmail.com>
On Thu, Aug 20, 2009 at 3:41 PM, Chris Dagdigian wrote:
>
> Git is now installed via 'yum' on dev.open-bio.org
>
Wonderful, thanks a lot.
Bartek
From biopython at maubp.freeserve.co.uk Thu Aug 20 10:42:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 15:42:35 +0100
Subject: [Biopython-dev] [Root-l] Moving from CVS to git
In-Reply-To: <28E30EC3-5E07-4AB1-A60A-155A5D179223@illinois.edu>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
<320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com>
<28E30EC3-5E07-4AB1-A60A-155A5D179223@illinois.edu>
Message-ID: <320fb6e00908200742u2e0fbc16v3f17f1e00b13f634@mail.gmail.com>
On Thu, Aug 20, 2009 at 3:24 PM, Chris Fields wrote:
>
> Thanks Chris D! Not sure, but can we view repos on dev similar to portal
> (via gitweb or similar)? Or should we mirror these over to portal for that
> purpose?
>
> chris
Again, this falls into the "nice to have" category for the medium/long term,
but is not essential in the short term (for Biopython to move from CVS to git). We
can manage with the github web interface for history etc.
Peter
From dag at sonsorol.org Thu Aug 20 11:30:31 2009
From: dag at sonsorol.org (Chris Dagdigian)
Date: Thu, 20 Aug 2009 11:30:31 -0400
Subject: [Biopython-dev] [Root-l] Moving from CVS to git
In-Reply-To: <28E30EC3-5E07-4AB1-A60A-155A5D179223@illinois.edu>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
<320fb6e00908200628n7c8fb573p5bbeb65ebc2da533@mail.gmail.com>
<28E30EC3-5E07-4AB1-A60A-155A5D179223@illinois.edu>
Message-ID: <7A8D8D5E-4C39-4713-B92F-3A384374DCAC@sonsorol.org>
Sure, just need informed advice on the 'best' packages to install and
possibly some install help if I get stuck somewhere.
-Chris
On Aug 20, 2009, at 10:24 AM, Chris Fields wrote:
> Thanks Chris D! Not sure, but can we view repos on dev similar to
> portal (via gitweb or similar)? Or should we mirror these over to
> portal for that purpose?
>
> chris
>
> On Aug 20, 2009, at 8:41 AM, Chris Dagdigian wrote:
>
>>
>> Git is now installed via 'yum' on dev.open-bio.org
>>
>> Regards,
>> Chris
>>
>>
>> On Aug 20, 2009, at 9:28 AM, Peter wrote:
>>
>>> On Thu, Aug 20, 2009 at 2:15 PM, Chris
>>> Fields wrote:
>>>>
>>>> I would be interested in that as well.
>>>>
>>>> It appears dev.open-bio.org has apt (there is an /etc/apt
>>>> directory), but
>>>> I'm failing to find apt-get in my PATH. Haven't installed on it
>>>> yet, but a
>>>> packaged version would probably be easier.
>>>
>>> If we can have a packaged version of git on dev.open-bio.org from
>>> the
>>> Linux distro, that would be easiest (especially for keeping it up
>>> to date).
>>>
>>>> Also, are we planning ro mirrors on portal for anon access, or
>>>> should we
>>>> (ab)use github for that purpose? To me a ro mirror sorta defeats
>>>> the
>>>> purpose of git...
>>>
>>> For Biopython we plan to use github (initially at least) for
>>> committing
>>> changes. This will also allow anonymous access.
>>>
>>> A public OBF read only mirror of a git repository is still useful
>>> for people to
>>> clone from, and keep the local copy up to date - plus as a backup
>>> for
>>> if/when github is congested or unavailable. But not essential.
>>>
>>> Peter
>>>
>>> _______________________________________________
>>> Root-l mailing list
>>> Root-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/root-l
>>
From biopython at maubp.freeserve.co.uk Thu Aug 20 12:19:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 17:19:19 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00908200713q3fddd010x8f355260bb98b063@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
<818036.61284.qm@web62408.mail.re1.yahoo.com>
<320fb6e00908200713q3fddd010x8f355260bb98b063@mail.gmail.com>
Message-ID: <320fb6e00908200919o6721161bie98951da2e89af9c@mail.gmail.com>
Peter wrote:
> Michiel wrote:
>>
>> I just have two suggestions:
>>
>> Since indexed_dict returns a dictionary-like object, it may make sense
>> for the _IndexedSeqFileDict to inherit from a dict.
>
> We'd have to override things like values() to prevent explosions in memory,
> and just raise a NotImplementedError. But yes, good point.
Done on github - I also had to override all the writeable dict methods like
pop and clear which don't make sense here. The code for the class is now
a bit longer, but is certainly more dict-like. I also had to implement __str__
and __repr__ to do something I think is useful and sensible.
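[The shape of such a class might look something like this minimal
sketch - hypothetical names, not the actual _IndexedSeqFileDict code.
The dict itself stores only the keys, values are fetched on demand,
and anything that would mutate the dictionary or load everything into
memory raises NotImplementedError:]

```python
# Toy read-only, lazy-loading dict subclass (illustration only).
class ReadOnlyLazyDict(dict):
    def __init__(self, loader, keys):
        # Store a placeholder for each key; real values come from loader(key).
        dict.__init__(self, ((key, None) for key in keys))
        self._loader = loader

    def __getitem__(self, key):
        dict.__getitem__(self, key)  # raises KeyError for unknown keys
        return self._loader(key)

    def values(self):
        # Loading every record at once could exhaust memory.
        raise NotImplementedError("iterate over the keys instead")

    def items(self):
        raise NotImplementedError("iterate over the keys instead")

    def _readonly(self, *args, **kwargs):
        raise NotImplementedError("this dictionary is read-only")

    __setitem__ = __delitem__ = pop = popitem = clear = _readonly
    update = setdefault = _readonly
```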
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 20 14:07:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 19:07:38 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00908200919o6721161bie98951da2e89af9c@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
<818036.61284.qm@web62408.mail.re1.yahoo.com>
<320fb6e00908200713q3fddd010x8f355260bb98b063@mail.gmail.com>
<320fb6e00908200919o6721161bie98951da2e89af9c@mail.gmail.com>
Message-ID: <320fb6e00908201107u4c09fd7dj1bcc60ceabe0ecf9@mail.gmail.com>
On Thu, Aug 20, 2009 at 5:19 PM, Peter wrote:
>
> Done on github - I also had to override all the writeable dict methods like
> pop and clear which don't make sense here. The code for the class is now
> a bit longer, but is certainly more dict-like. I also had to implement __str__
> and __repr__ to do something I think is useful and sensible.
>
I have checked this new indexing functionality into CVS, but the github branch
is still there for the SFF file support (parsing and indexing). We can of course
still easily tweak the naming or the public side of the API. In the meantime
I'll think about updating the tutorial...
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 20 14:11:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 19:11:16 +0100
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908170807m7ac97ecbvb1c74f2f9194b262@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
<8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com>
<320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com>
<8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com>
Message-ID: <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com>
On Thu, Aug 20, 2009 at 2:06 PM, Bartek
Wilczynski wrote:
> On Thu, Aug 20, 2009 at 2:43 PM, Peter wrote:
>
>> Bartek - do you think we need git on any of the other OBF machines
>> in addition to dev.open-bio.org (current IP 207.154.17.71)?
>
> What we _need_ is a single machine, where we can run scripts from cron
> and where git is installed. That's why I requested the installation on
> dev.open-bio machine (it happens to be the only one I have an account
> on). The idea is to run something from cron and pull from github to
> have a backup copy of an up-to-date branch. The scripts can (after
> each update) push to other machines.
Bartek, now that Chris D has kindly installed git on dev.open-bio.org,
can you look into backing up our github repository onto dev.open-bio.org?
Initially just running a cron job using your own user account should be fine.
Thanks,
Peter
From bartek at rezolwenta.eu.org Thu Aug 20 17:07:06 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 20 Aug 2009 23:07:06 +0200
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<320fb6e00908170852y2e8158t320c60d6d6a4370c@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
<8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com>
<320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com>
<8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com>
<320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com>
Message-ID: <8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com>
On Thu, Aug 20, 2009 at 8:11 PM, Peter wrote:
> Bartek, now that Chris D has kindly installed git on dev.open-bio.org,
> can you look into backing up our github repository onto dev.open-bio.org?
> Initially just running a cron job using your own user account should be fine.
I've only quickly tested git, and I was able to pull from github with
no problems. I will try porting the scripts from my machine to
dev.open-bio tomorrow.
In the meantime, I've checked that the biopython account on the
dev.open-bio machine is assigned to Brad Marshall. I haven't seen him
posting to the list lately. Does anyone have access to this account?
cheers
Bartek
--
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1,
69012 Heidelberg,
Germany
tel: +49 6221 387 8433
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 08:26:45 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 21 Aug 2009 08:26:45 -0400
Subject: [Biopython-dev] [Bug 2867] Bio.PDB.PDBList.update_pdb calls invalid
os.cmd
In-Reply-To:
Message-ID: <200908211226.n7LCQjX7025910@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2867
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 08:26 EST -------
I'm going to assume the attempted fix worked (included with Biopython 1.51
final), and close this bug.
Please reopen it if there is still a problem.
Thanks,
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 08:52:24 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 21 Aug 2009 08:52:24 -0400
Subject: [Biopython-dev] [Bug 2544] Bio.GenBank and SeqFeature improvements
In-Reply-To:
Message-ID: <200908211252.n7LCqOOt026458@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2544
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 08:52 EST -------
(In reply to comment #5)
>
> I'm leaving this bug open for defining __repr__ for the
> Bio.SeqFeature.Reference object ... ONLY.
>
Done in CVS, marking as fixed.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 09:07:23 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 21 Aug 2009 09:07:23 -0400
Subject: [Biopython-dev] [Bug 2711] GenomeDiagram.py: write() and
write_to_string() are inefficient and don't check inputs
In-Reply-To:
Message-ID: <200908211307.n7LD7NoU026962@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2711
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|REOPENED |RESOLVED
Resolution| |FIXED
------- Comment #28 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 09:07 EST -------
(In reply to comment #27)
> So the only remaining issue is a unit test involving at least checks for
> the presence of renderPM due to versions of reportlab less than 2.2.
Added test_GraphicsBitmaps.py to CVS which will make sure we can output a
bitmap, and flag renderPM as a missing (optional) dependency if not found.
Marking issue as fixed.
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 09:11:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 21 Aug 2009 09:11:48 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
bioentry_id
In-Reply-To:
Message-ID: <200908211311.n7LDBmUT027199@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2833
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #26 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 09:11 EST -------
Marking this old bug as fixed, given the work around.
Peter
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 09:24:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 21 Aug 2009 09:24:56 -0400
Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq +
SeqRecord objects / define __contains__ method
In-Reply-To:
Message-ID: <200908211324.n7LDOuP7027608@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2853
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 09:24 EST -------
(In reply to comment #3)
> Patch for Seq object checked in.
>
> Leaving bug open for possible similar addition to the SeqRecord object.
>
Done in Bio/SeqRecord.py CVS revision 1.43, marking as fixed.
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 09:24:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 21 Aug 2009 09:24:59 -0400
Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string,
even subclass string?
In-Reply-To:
Message-ID: <200908211324.n7LDOxRW027624@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2351
Bug 2351 depends on bug 2853, which changed state.
Bug 2853 Summary: Support the "in" keyword with Seq + SeqRecord objects / define __contains__ method
http://bugzilla.open-bio.org/show_bug.cgi?id=2853
What |Old Value |New Value
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 09:55:58 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 21 Aug 2009 09:55:58 -0400
Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO
In-Reply-To:
Message-ID: <200908211355.n7LDtwvC028668@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2865
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 09:55 EST -------
I've checked in a slightly revised version of Cymon's patch to allow
Bio.SeqIO to write "phd" files.
Checking in Tests/test_SeqIO_QualityIO.py;
/home/repository/biopython/biopython/Tests/test_SeqIO_QualityIO.py,v <--
test_SeqIO_QualityIO.py
new revision: 1.14; previous revision: 1.13
done
Checking in Tests/output/test_SeqIO;
/home/repository/biopython/biopython/Tests/output/test_SeqIO,v <-- test_SeqIO
new revision: 1.51; previous revision: 1.50
done
Checking in Bio/SeqIO/__init__.py;
/home/repository/biopython/biopython/Bio/SeqIO/__init__.py,v <-- __init__.py
new revision: 1.58; previous revision: 1.57
done
Checking in Bio/SeqIO/PhdIO.py;
/home/repository/biopython/biopython/Bio/SeqIO/PhdIO.py,v <-- PhdIO.py
new revision: 1.8; previous revision: 1.7
done
Cymon - could you double check this please? I made one change regarding
the filename/record description, and also you hadn't rounded the Solexa
scores to the nearest integer value after they were converted to PHRED
scores.
Peter
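[Editor's note: the rounding issue Peter mentions comes from the standard Solexa-to-PHRED quality conversion, which is not integer valued. A minimal sketch of the mapping follows; this is the textbook formula, not a quote of the PhdIO code, and the function name is illustrative.]

```python
from math import log10

def solexa_to_phred(q_solexa):
    # Standard Solexa -> PHRED quality mapping; phd files expect
    # integer PHRED scores, hence the rounding step at the end.
    return int(round(10 * log10(10 ** (q_solexa / 10.0) + 1)))

print(solexa_to_phred(40))  # high-quality Solexa scores map almost unchanged
```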
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 10:21:42 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 21 Aug 2009 10:21:42 -0400
Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch
In-Reply-To:
Message-ID: <200908211421.n7LELgY3029289@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2891
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-21 10:21 EST -------
This should be fixed in CVS (pushed to github hourly), although I used a
slightly different style to break up the long test methods.
Please reopen this bug if the problem persists.
Thanks,
Peter
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 10:21:44 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 21 Aug 2009 10:21:44 -0400
Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary
Jython Error Fix+Patch
In-Reply-To:
Message-ID: <200908211421.n7LELi5g029306@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2895
Bug 2895 depends on bug 2891, which changed state.
Bug 2891 Summary: Jython test_NCBITextParser fix+patch
http://bugzilla.open-bio.org/show_bug.cgi?id=2891
What |Old Value |New Value
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 10:21:47 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 21 Aug 2009 10:21:47 -0400
Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch
In-Reply-To:
Message-ID: <200908211421.n7LELll4029321@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2893
Bug 2893 depends on bug 2891, which changed state.
Bug 2891 Summary: Jython test_NCBITextParser fix+patch
http://bugzilla.open-bio.org/show_bug.cgi?id=2891
What |Old Value |New Value
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
From bugzilla-daemon at portal.open-bio.org Fri Aug 21 10:21:50 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 21 Aug 2009 10:21:50 -0400
Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch
In-Reply-To:
Message-ID: <200908211421.n7LELobS029336@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2892
Bug 2892 depends on bug 2891, which changed state.
Bug 2891 Summary: Jython test_NCBITextParser fix+patch
http://bugzilla.open-bio.org/show_bug.cgi?id=2891
What |Old Value |New Value
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
From dmikewilliams at gmail.com Sun Aug 23 13:47:53 2009
From: dmikewilliams at gmail.com (Mike Williams)
Date: Sun, 23 Aug 2009 13:47:53 -0400
Subject: [Biopython-dev] how to determine BioPython version number
Message-ID:
Hi there. About a year ago a message was posted that suggested using
Martel.__version__ to determine the BioPython version number. A
couple of weeks ago the draft announcement for BioPython 1.51 said that
Martel is no longer included.
If Martel is no longer included, is there some other way for a program
to determine the version number of BioPython that is installed?
I tried searching for this, but found nothing relevant.
Mike
Below are snippets from the two messages referred to above:
subject: [Biopython-dev] determining the version
Peter biopython at maubp.freeserve.co.uk
Wed Sep 24 17:12:24 EDT 2008
> Somewhat related to this, what is the appropriate way to find the version of
> BioPython installed within Python?
So I'm not the only person to have wondered about this. For now, I
can only suggest an ugly workaround:
import Martel
print Martel.__version__
Since Biopython 1.45, by convention the Martel version has been
incremented to match that of Biopython. Of course, in a few releases
time we probably won't be including Martel any more.
On Thu, Aug 13, 2009 at 6:10 AM, Peter wrote:
subject: [Biopython-dev] Draft announcement for Biopython 1.51
... we no longer include Martel/Mindy, and thus don't have
any dependence on mxTextTools.
From biopython at maubp.freeserve.co.uk Sun Aug 23 15:58:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 23 Aug 2009 20:58:07 +0100
Subject: [Biopython-dev] how to determine BioPython version number
In-Reply-To:
References:
Message-ID: <320fb6e00908231258w73f9d38fo9b726fa2fb7dcec@mail.gmail.com>
On Sun, Aug 23, 2009 at 6:47 PM, Mike Williams wrote:
>
> Hi there. About a year ago a message was posted that suggested using
> Martel.__version__ to determine the BioPython version number.
You are looking at an old thread, and missed what happened since; try:
import Bio
Bio.__version__
I think this deserves a FAQ entry in the next release of the tutorial...
The Martel version "trick" was a work around for determining the
version which worked for a few moderately old versions of Biopython
(prior to us adding Bio.__version__).
Peter
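[Editor's note: combining the two approaches discussed above, a defensive version check might look like the sketch below. The "unknown" fallback is an illustrative assumption, not something Biopython itself provides.]

```python
# Sketch of a defensive Biopython version check, per the thread above:
# recent releases expose Bio.__version__, while some older releases
# only shipped a matching Martel.__version__ (the "trick" mentioned).
try:
    import Bio
    version = Bio.__version__
except (ImportError, AttributeError):
    try:
        import Martel  # bundled with older Biopython releases
        version = Martel.__version__
    except ImportError:
        version = "unknown"  # hypothetical fallback, not a Biopython value
print(version)
```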
From dmikewilliams at gmail.com Sun Aug 23 16:14:51 2009
From: dmikewilliams at gmail.com (Mike Williams)
Date: Sun, 23 Aug 2009 16:14:51 -0400
Subject: [Biopython-dev] how to determine BioPython version number
In-Reply-To: <320fb6e00908231258w73f9d38fo9b726fa2fb7dcec@mail.gmail.com>
References:
<320fb6e00908231258w73f9d38fo9b726fa2fb7dcec@mail.gmail.com>
Message-ID:
On Sun, Aug 23, 2009 at 3:58 PM, Peter wrote:
> You are looking at an old thread, and missed what happened since; try:
>
> import Bio
> Bio.__version__
>
> I think this deserves a FAQ entry in the next release of the tutorial...
>
> The Martel version "trick" was a work around for determining the
> version which worked for a few moderately old versions of Biopython
> (prior to us adding Bio.__version__).
>
Thanks Peter. I had two problems: looking at an old thread, and having
older versions of BioPython (1.48 and 1.49, on Fedora 10 and 11).
The method you supplied works fine with the 1.51b version I just got
from cvs.
Mike
From bugzilla-daemon at portal.open-bio.org Mon Aug 24 11:15:44 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 24 Aug 2009 11:15:44 -0400
Subject: [Biopython-dev] [Bug 2904] New: Interface for Novoalign
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2904
Summary: Interface for Novoalign
Product: Biopython
Version: 1.51
Platform: Macintosh
OS/Version: Mac OS
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: osvaldo.zagordi at bsse.ethz.ch
Hi,
I wrote an interface for the short sequence alignment program Novoalign
(www.novocraft.com). All I did was modify the interface for Muscle. I might
cover some other aligners in the near future.
Hope it's useful to someone.
Best,
Osvaldo
From bugzilla-daemon at portal.open-bio.org Mon Aug 24 11:16:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 24 Aug 2009 11:16:48 -0400
Subject: [Biopython-dev] [Bug 2904] Interface for Novoalign
In-Reply-To:
Message-ID: <200908241516.n7OFGm98032344@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2904
------- Comment #1 from osvaldo.zagordi at bsse.ethz.ch 2009-08-24 11:16 EST -------
Created an attachment (id=1361)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1361&action=view)
Interface to run novoalign (www.novocraft.com)
From bugzilla-daemon at portal.open-bio.org Mon Aug 24 11:21:07 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 24 Aug 2009 11:21:07 -0400
Subject: [Biopython-dev] [Bug 2905] New: Short read alignment format
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2905
Summary: Short read alignment format
Product: Biopython
Version: 1.51
Platform: All
OS/Version: Mac OS
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: osvaldo.zagordi at bsse.ethz.ch
Hi again,
is there any plan to develop parsers for short read alignments? There are
a lot of formats around, and the most serious proposal for a format I've seen
is SAM (http://samtools.sourceforge.net/). I should start writing something to
parse this output soon. Any suggestion on where to start from (in order not to
depend on some module that will be soon obsolete)?
Thanks,
Osvaldo
From bugzilla-daemon at portal.open-bio.org Mon Aug 24 21:34:02 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 24 Aug 2009 21:34:02 -0400
Subject: [Biopython-dev] [Bug 2907] New: When a genomic record has been
loaded using eFetch,
if it is written to genbank format the header line refers to 'aa'
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2907
Summary: When a genomic record has been loaded using eFetch, if
it is written to genbank format the header line refers
to 'aa'
Product: Biopython
Version: 1.51b
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: david.wyllie at ndm.ox.ac.uk
When a genomic record has been loaded using eFetch, if it is written to genbank
format the header line refers to 'aa' not 'bp' although the .seq.alphabet is
set (correctly, I think) to generic_dna.
The background here is that we're annotating some viral genomes computationally
(however, the annotation isn't necessary for the problem here, see below) and
then writing the output to .gb format. After this we load the file using
LaserGene (a commercial sequence editing program) to have a look at it etc.
This doesn't work terribly well because of the 'aa' designation in the header
line. Apart from this, the export seems ok.
I'm using a git download from mid-June 09.
Here is an example which illustrates this:
# load dependencies
from Bio import Entrez
from Bio import SeqIO
from Bio import SeqRecord
from Bio.Alphabet import generic_protein, generic_dna
# get a sequence from Genbank
print "going to recover a sequence from genbank...."
ifh = Entrez.efetch(db="nucleotide", id="DQ923122", rettype="gb")
# parse the file handle
recordlist = []
print "OK, got the records from genbank, parsing ..."
for record in SeqIO.parse(ifh, "genbank"):
    recordlist.append(record)
ifh.close()
# write it to a file
for thisrecord in recordlist:
    # confirm it's dna
    assert (type(thisrecord.seq.alphabet) == type(generic_dna)), "We are supposed to be dealing with a DNA sequence, but we aren't, can't continue."
    # write to gb
    ofn = thisrecord.id + ".gb"
    print "Writing thisrecord to ", ofn
    ofh = open(ofn, "w")
    SeqIO.write([thisrecord], ofh, "gb")
    ofh.close()  # note: needs the call parentheses, or the file is never closed
exit()
# top lines of the genbank file read as follows
#
#LOCUS       DQ923122     34250 aa    DNA   VRL   01-JAN-1980
#DEFINITION  Human adenovirus 52 isolate T03-2244, complete genome.
#ACCESSION   DQ923122
#VERSION     DQ923122.2  GI:124375632
#KEYWORDS
#SOURCE      Human adenovirus 52
#  ORGANISM  Human adenovirus 52
#            Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Mastadenovirus;
#            unclassified Human adenoviruses
#FEATURES             Location/Qualifiers
#     source          1..34250
#                     /country="USA"
#                     /isolate="T03-2244"
#                     /mol_type="genomic DNA"
#                     /organism="Human adenovirus 52"
#                     /db_xref="taxon:332179"
Thank you for any advice you have to offer.
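[Editor's note: the bug boils down to the unit chosen for the LOCUS line. A minimal illustration of the rule at issue follows; the helper function is hypothetical, not Biopython's actual implementation.]

```python
# Hypothetical helper illustrating the expected LOCUS line unit:
def locus_unit(molecule_type):
    # Nucleotide records are measured in base pairs ("bp"),
    # protein records in amino acids ("aa").
    return "bp" if molecule_type in ("DNA", "RNA") else "aa"

# The record above is genomic DNA, so the header should read "34250 bp":
line = "LOCUS       DQ923122     34250 %s    DNA   VRL   01-JAN-1980" % locus_unit("DNA")
print(line)
```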
From bugzilla-daemon at portal.open-bio.org Mon Aug 24 21:36:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 24 Aug 2009 21:36:48 -0400
Subject: [Biopython-dev] [Bug 2907] When a genomic record has been loaded
using eFetch,
if it is written to genbank format the header line refers to 'aa'
In-Reply-To:
Message-ID: <200908250136.n7P1amWC017814@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2907
------- Comment #1 from david.wyllie at ndm.ox.ac.uk 2009-08-24 21:36 EST -------
Created an attachment (id=1362)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1362&action=view)
test case, which is the same as that pasted into the message
From bugzilla-daemon at portal.open-bio.org Mon Aug 24 21:37:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 24 Aug 2009 21:37:48 -0400
Subject: [Biopython-dev] [Bug 2907] When a genomic record has been loaded
using eFetch,
if it is written to genbank format the header line refers to 'aa'
In-Reply-To:
Message-ID: <200908250137.n7P1bm9c017839@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2907
------- Comment #2 from david.wyllie at ndm.ox.ac.uk 2009-08-24 21:37 EST -------
Created an attachment (id=1363)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1363&action=view)
example of the genbank file written
From bugzilla-daemon at portal.open-bio.org Tue Aug 25 05:40:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 25 Aug 2009 05:40:29 -0400
Subject: [Biopython-dev] [Bug 2907] When a genomic record has been loaded
using eFetch,
if it is written to genbank format the header line refers to 'aa'
In-Reply-To:
Message-ID: <200908250940.n7P9eT7w001376@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2907
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-25 05:40 EST -------
Hi David,
I spotted this (aa/bp mix up in the LOCUS line) after the beta was out, and it
should already be fixed in Biopython 1.51 final. Please update and retest, and
if there is still a problem please reopen this bug. Thanks!
Note that unless I was going to modify the annotation (which the background use
case suggests you are), I would save the raw GenBank record from Entrez
directly to disk (since parsing it and then writing it back out with SeqIO
isn't yet perfect - e.g. the date in the LOCUS line).
Peter
From bugzilla-daemon at portal.open-bio.org Tue Aug 25 06:09:50 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 25 Aug 2009 06:09:50 -0400
Subject: [Biopython-dev] [Bug 2907] When a genomic record has been loaded
using eFetch,
if it is written to genbank format the header line refers to 'aa'
In-Reply-To:
Message-ID: <200908251009.n7PA9o4T002461@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2907
------- Comment #4 from david.wyllie at ndm.ox.ac.uk 2009-08-25 06:09 EST -------
thank you - this is indeed fixed in the latest git version.
Best wishes
David
(In reply to comment #3)
> Hi David,
>
> I spotted this (aa/bp mix up in the LOCUS line) after the beta was out, and it
> should already be fixed in Biopython 1.51 final. Please update and retest, and
> if there is still a problem please reopen this bug. Thanks!
>
> Note that unless I was going to modify the annotation (which the background use
> case suggests you are), I would save the raw GenBank record from Entrez
> directly to disk (since parsing it and then writing it back out with SeqIO
> isn't yet perfect - e.g. the date in the LOCUS line).
>
> Peter
>
From biopython at maubp.freeserve.co.uk Tue Aug 25 06:33:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 25 Aug 2009 11:33:56 +0100
Subject: [Biopython-dev] Command line wrappers for assembly tools
Message-ID: <320fb6e00908250333m6dc4b8eew475dc309f1e3ddb4@mail.gmail.com>
Hi all,
Osvaldo Zagordi has recently offered a Bio.Application style command line
wrapper for Novoalign (a commercial short read aligner from Novocraft), see
enhancement Bug 2904, and the Novocraft website:
http://bugzilla.open-bio.org/show_bug.cgi?id=2904
http://www.novocraft.com/products.html
Note that Novocraft do offer a trial/evaluation version, but I have no idea
what the terms and conditions are, and I personally do not have access
to the commercial tool (e.g. for testing the wrapper). Nevertheless, this
would be a nice addition to Biopython.
I personally would like to have wrappers for some of the "off instrument"
applications from Roche 454 (e.g. the Newbler assembler, read mapper
and perhaps their SFF tools), which I have been using. These are Linux
only (which is a pain as Windows and Mac OS X are out), but Roche
seem relatively relaxed about making the software available to any
academics using their sequencer (I'd suggest anyone interested
contact your local sequencing centre for this).
While some of these tools would fit under Bio.Align.Applications, does
creating a similar collection at Bio.Sequencing.Applications make more
sense? For example, the Roche sffinfo tool isn't in itself an alignment
application - but it is related to DNA sequencing.
Peter
From mjldehoon at yahoo.com Tue Aug 25 06:41:20 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 25 Aug 2009 03:41:20 -0700 (PDT)
Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores)
In-Reply-To:
Message-ID: <54938.41623.qm@web62405.mail.re1.yahoo.com>
I did (3) and (4) below, and I added a __str__ method but I didn't touch the other print functions (2).
For (1), maybe a better way is to subclass the SeqMat class for each of the matrix types instead of storing the matrix type in self.mat_type. Any comments or objections (especially Iddo)?
--Michiel.
--- On Sat, 7/25/09, Iddo Friedberg wrote:
> I'm the author of subsmat IIRC.
> Everything sounds good, but I would not make 2.6 changes
> that will break on 2.5. Ubuntu still uses 2.5 and I imagine
> other linux distros do too.
> 1) The matrix types (NOTYPE = 0, ACCREP = 1, OBSFREQ = 2,
> SUBS = 3, EXPFREQ = 4, LO = 5) are now global variables (at
> the level of Bio.SubsMat). I think that these should be
> class variables of the Bio.SubsMat.SeqMat class.
>
> 2) The print_mat method. It would be more Pythonic to use
> __str__, __format__ for this, though the latter is only
> available for Python versions >= 2.6.
>
> 3) The __sum__ method. I guess that this was intended to be
> __add__?
>
> 4) The sum_letters attribute. To calculate the sum of all
> values for a given letter, currently the following two
> functions are involved:
>
>   def all_letters_sum(self):
>       for letter in self.alphabet.letters:
>           self.sum_letters[letter] = self.letter_sum(letter)
>
>   def letter_sum(self, letter):
>       assert letter in self.alphabet.letters
>       sum = 0.
>       for i in self.keys():
>           if letter in i:
>               if i[0] == i[1]:
>                   sum += self[i]
>               else:
>                   sum += (self[i] / 2.)
>       return sum
>
> As you can see, the result is not returned, but stored in
> an attribute called sum_letters. I suggest replacing this
> with the following:
>
>   def sum(self):
>       result = {}
>       for letter in self.alphabet.letters:
>           result[letter] = 0.0
>       for pair, value in self:
>           i1, i2 = pair
>           if i1 == i2:
>               result[i1] += value
>           else:
>               result[i1] += value / 2
>               result[i2] += value / 2
>       return result
>
> so without storing the result in an attribute.
>
> Any comments, objections?
>
> --Michiel
>
> --- On Fri, 7/24/09, Michiel de Hoon wrote:
>
> > From: Michiel de Hoon
> > Subject: Re: [Biopython-dev] Calculating motif scores
> > To: "Bartek Wilczynski"
> > Cc: biopython-dev at biopython.org
> > Date: Friday, July 24, 2009, 5:34 AM
> >
> > > As for the PWM being a separate class and used by the motif:
> > > I don't know. I'm using Bio.SubsMat.FreqTable for implementing
> > > frequency table, so I understand that the new PWM class would
> > > be basically a "smarter" FreqTable. I'm not sure whether it
> > > solves any problems...
> >
> > Wow, I didn't even know the Bio.SubsMat module existed.
> > As we have several different but related modules
> > (Bio.Motif, Bio.SubstMat, Bio.Align), I think we should
> > define the purpose and scope of each of these modules.
> > Maybe a good way to start is the documentation. Bio.SubsMat
> > is currently divided into two chapters (14.4 and 16.2). I'll
> > have a look at this over the weekend to see if this can be
> > cleaned up a bit.
> >
> > --Michiel.
> >
> > _______________________________________________
> > Biopython-dev mailing list
> > Biopython-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
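[Editor's note: Michiel's proposed sum() (point 4 above) can be exercised standalone with a plain dict standing in for Bio.SubsMat.SeqMat; the names and sample matrix below are illustrative, not from the library.]

```python
# Standalone sketch of the proposed per-letter summation: diagonal
# entries count fully toward their letter, off-diagonal values are
# split evenly between the two letters of the pair.
def sum_letters(matrix, letters):
    result = dict.fromkeys(letters, 0.0)
    for (i1, i2), value in matrix.items():
        if i1 == i2:
            result[i1] += value
        else:
            result[i1] += value / 2.0
            result[i2] += value / 2.0
    return result

mat = {("A", "A"): 4.0, ("A", "C"): 1.0, ("C", "C"): 9.0}
print(sum_letters(mat, "AC"))  # {'A': 4.5, 'C': 9.5}
```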
From bartek at rezolwenta.eu.org Tue Aug 25 06:52:24 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 25 Aug 2009 12:52:24 +0200
Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores)
In-Reply-To: <54938.41623.qm@web62405.mail.re1.yahoo.com>
References:
<54938.41623.qm@web62405.mail.re1.yahoo.com>
Message-ID: <8b34ec180908250352r259a310egbf19963cff43e099@mail.gmail.com>
On Tue, Aug 25, 2009 at 12:41 PM, Michiel de Hoon wrote:
> I did (3) and (4) below, and I added a __str__ method but I didn't touch the other print functions (2).
>
> For (1), maybe a better way is to subclass the SeqMat class for each of the matrix types instead of storing the matrix type in self.mat_type. Any comments or objections (especially Iddo)?
>
Hi,
I don't have any objections here. Just for clarification: is it now in
CVS or on some git branch?
cheers
Bartek
From biopython at maubp.freeserve.co.uk Tue Aug 25 06:59:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 25 Aug 2009 11:59:38 +0100
Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores)
In-Reply-To: <8b34ec180908250352r259a310egbf19963cff43e099@mail.gmail.com>
References:
<54938.41623.qm@web62405.mail.re1.yahoo.com>
<8b34ec180908250352r259a310egbf19963cff43e099@mail.gmail.com>
Message-ID: <320fb6e00908250359i2c35d8b0pe84d590a9527b8bb@mail.gmail.com>
On Tue, Aug 25, 2009 at 11:52 AM, Bartek
Wilczynski wrote:
> I don't have any objections here. Just for clarification: is it now in
> CVS or on some git branch?
All on CVS still (and thus being pushed to github). Do you want to
give us a git status update on the other thread?
http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006620.html
Peter
From bartek at rezolwenta.eu.org Tue Aug 25 07:58:05 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 25 Aug 2009 13:58:05 +0200
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
<8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com>
<320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com>
<8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com>
<320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com>
<8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com>
Message-ID: <8b34ec180908250458p22339d96uda0251eb29f031b6@mail.gmail.com>
Hi all,
Time for an update on how things are with git and biopython.
On Thu, Aug 20, 2009 at 11:07 PM, Bartek
Wilczynski wrote:
> I've only quickly tested git, and I was able to pull from github with
> no problems. I will try porting the scripts from my machine to
> dev.open-bio tomorrow.
That works fine. I've set up a crontab script
(/home/bartek/github_backup.sh) on dev.open-bio machine which fetches
the current github branch and saves it to
/home/bartek/biopython_from_github. Then it creates a "bare
repository" (/home/bartek/biopython.git) which can be then used by
others. If you have a shell account on the dev machine, you should be
able to clone it over ssh with the following command:
git clone ssh://_YOUR_USERNAME_ at dev.open-bio.org/~bartek/biopython.git
If this is put into a directory accessible via http, one can also clone
(anonymously) over http. I don't have an account on biopython www
server, but I was able to put it on my server (just to check if it
works). You can fetch it like this:
git clone http://bartek.rezolwenta.eu.org/biopython.git
In conclusion: it works. I would say that the next important step is
to decide when to stop committing to CVS... I'm just waiting for a
signal to terminate the updates from CVS to github and we are done.
In the meantime, it would make sense to make it more stable, which
involves some technical details (mostly related to user accounts).
Namely, we need to:
- set up these scripts on biopython account instead of my own (see below)
- decide whether we want other things to be done by these scripts
(generating src tarballs, etc)
>
> In the meantime, I've checked that biopython account on dev.open-bio
> machine is assigned to Brad Marshall. I haven't seen him posting to
> the list lately. Does anyone have access to this account?
This would come in handy now. Does anybody know how to access this account?
cheers
Bartek
From biopython at maubp.freeserve.co.uk Tue Aug 25 08:13:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 25 Aug 2009 13:13:24 +0100
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <8b34ec180908250458p22339d96uda0251eb29f031b6@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<320fb6e00908180939t544c7913j919795ff6e35d42f@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
<8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com>
<320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com>
<8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com>
<320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com>
<8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com>
<8b34ec180908250458p22339d96uda0251eb29f031b6@mail.gmail.com>
Message-ID: <320fb6e00908250513o54bad8beo43b5c82a84579120@mail.gmail.com>
On Tue, Aug 25, 2009 at 12:58 PM, Bartek
Wilczynski wrote:
> Hi all,
>
> Time for an update on how things are with git and biopython.
>
> On Thu, Aug 20, 2009 at 11:07 PM, Bartek
> Wilczynski wrote:
>> I've only quickly tested git, and I was able to pull from github with
>> no problems. I will try porting the scripts from my machine to
>> dev.open-bio tomorrow.
>
> That works fine. I've set up a crontab script
> (/home/bartek/github_backup.sh) on dev.open-bio machine which fetches
> the current github branch and saves it to
> /home/bartek/biopython_from_github. Then it creates a "bare
> repository" (/home/bartek/biopython.git) which can be then used by
> others. If you have a shell account on the dev machine, you should be
> able to clone it over ssh with the following command:
> git clone ssh://_YOUR_USERNAME_ at dev.open-bio.org/~bartek/biopython.git
Yes, that works for me (and thus in theory for anyone with a dev
account).
> If this is put into a directory accessible via http, one can also clone
> (anonymously) over http. I don't have an account on biopython www
> server, but I was able to put it on my server (just to check if it
> works). You can fetch it like this:
> git clone http://bartek.rezolwenta.eu.org/biopython.git
Excellent. We can ask the OBF to give you access to biopython.org
(and Brad too since it would have helped when he did the recent
release) which would help setting this stuff up [and see below]
> In conclusion: it works. I would say that the next important step is
> to decide when to stop committing to CVS... I'm just waiting for a
> signal to terminate the updates from CVS to github and we are done.
OK - so the basics are ready (backing up from github to an OBF
machine). Good job.
> In the meantime, it would make sense to make it more stable, which
> involves some technical details (mostly related to user accounts).
> Namely, we need to:
> - set up these scripts on biopython account instead of my own (see below)
> - decide whether we want other things to be done by these scripts
> (generating src tarballs, etc)
>
>> In the meantime, I've checked that biopython account on dev.open-bio
>> machine is assigned to Brad Marshall. I haven't seen him posting to
>> the list lately. Does anyone have access to this account?
>
> This would come in handy now. Does anybody know how to access this account?
I have no idea who Brad Marshall is. We'll have to take this up with
the OBF. I'll email you off list...
Peter
From biopython at maubp.freeserve.co.uk Tue Aug 25 09:23:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 25 Aug 2009 14:23:15 +0100
Subject: [Biopython-dev] Moving from CVS to git
In-Reply-To: <320fb6e00908250513o54bad8beo43b5c82a84579120@mail.gmail.com>
References: <320fb6e00908170543t57db8e32v12c37240e83070ed@mail.gmail.com>
<8b34ec180908190245u70a888ecx7076e1d249e07c04@mail.gmail.com>
<8AE0AE2F-A91A-46A3-860E-D450C07ED4F9@gmx.net>
<8b34ec180908200214waab8c73xffcc2355363e0724@mail.gmail.com>
<320fb6e00908200543o4f3d38d3t2667959527382db0@mail.gmail.com>
<8b34ec180908200606o6670fdberd11913f3d922beb4@mail.gmail.com>
<320fb6e00908201111m7295aa3bme237f84a517498f4@mail.gmail.com>
<8b34ec180908201407y130d8ec1w32dde7e5baef77ae@mail.gmail.com>
<8b34ec180908250458p22339d96uda0251eb29f031b6@mail.gmail.com>
<320fb6e00908250513o54bad8beo43b5c82a84579120@mail.gmail.com>
Message-ID: <320fb6e00908250623j19daa0cey429265f8c2bcb4ff@mail.gmail.com>
>>> In the meantime, I've checked that biopython account on dev.open-bio
>>> machine is assigned to Brad Marshall. I haven't seen him posting to
>>> the list lately. Does anyone have access to this account?
>>
>> This would come in handy now. Does anybody know how to access this account?
>
> I have no idea who Brad Marshall is. We'll have to take this up with
> the OBF. I'll email you off list...
Just for the record, on closer inspection, Brad Marshall has/had a
separate account but it included "biopython" in the user's name.
I presume he was another former contributor to the project.
Peter
From biopython at maubp.freeserve.co.uk Tue Aug 25 11:11:34 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 25 Aug 2009 16:11:34 +0100
Subject: [Biopython-dev] Fwd: More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
Message-ID: <320fb6e00908250811r18aaec6fj6c2f0e40996fda0a@mail.gmail.com>
Hi all,
This was posted to the OBF cross project mailing list, but if any of
you guys have some sample FASTQ data please consider sharing
a small sample (e.g. the first ten reads). We would need this to be
"no-strings attached" so that it could be used in any of the OBF
projects under their assorted open source licences.
In addition to the notes below, I would be interested in any
FASTQ files from your local sequence centre, which may use
their own conventions for the record title lines (e.g. record names).
Thanks,
Peter
P.S. Rather than trying to send any attachments to the mailing
list, please email me personally.
---------- Forwarded message ----------
From: Peter
Date: Tue, Aug 25, 2009 at 12:24 PM
Subject: More FASTQ examples for cross project testing
To: open-bio-l at lists.open-bio.org
Cc: Peter Rice , Chris Fields
Hi all,
I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
off list about this plan. I'm going to co-ordinate putting together a
set of valid FASTQ files for shared testing (to supplement the
existing set of invalid FASTQ files already done and being used in
Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
What I have in mind is:
XXX_original_YYY.fastq - sample input
XXX_as_sanger.fastq - reference output
XXX_as_solexa.fastq - reference output
XXX_as_illumina.fastq - reference output
where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
longreads, sanger_full_range, solexa_full_range ...) and YYY is the
FASTQ variant (sanger, solexa or illumina) for the "input" file.
For example, we might have:
wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping,
perhaps repeating the title on the plus lines
wrapped1_as_sanger.fastq - The same data but using the consensus of no
line wrapping and omitting the repeated title on the plus lines.
wrapped1_as_solexa.fastq - As above, but converted to Solexa scores
(ASCII offset 64), with capping at Solexa 62 (ASCII 126).
wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII
offset 64, with capping at PHRED 62 (ASCII 126).
Here "wrapped1" would be a Sanger FASTQ file with some line wrapping
(e.g. at 60 characters). I will include "sanger_full_range" which
would cover all the valid PHRED scores from 0 to 93, and similarly for
Solexa and Illumina files - these are important for testing the score
conversions. I have some ideas for deliberately tricky (but valid)
files which should properly test any parser.
The point is we have "perhaps odd but valid" originals, plus the
"cleaned up" versions (using the same FASTQ variant), and "cleaned up"
versions in the other two FASTQ variants.
Ideally asking Biopython/BioPerl/EMBOSS to convert the
XXX_original_YYY.fastq files into any of the three FASTQ variants will
give exactly the same as the reference outputs.
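To make the score re-encoding concrete, here is a minimal pure-Python
sketch of the Sanger to Illumina conversion described above (both use
PHRED scores, but with ASCII offsets 33 versus 64, capping at PHRED 62
so no character exceeds ASCII 126). The helper name is illustrative,
not any project's API:

```python
def sanger_to_illumina(qual):
    """Re-encode a Sanger FASTQ quality string in Illumina 1.3+ style."""
    out = []
    for char in qual:
        phred = ord(char) - 33       # Sanger encoding: ASCII offset 33
        phred = min(phred, 62)       # cap at PHRED 62 (ASCII 126)
        out.append(chr(phred + 64))  # Illumina encoding: ASCII offset 64
    return "".join(out)
```

For example, PHRED 0 maps from "!" to "@", while the top Sanger
character "~" (PHRED 93) is capped at 62 and stays "~".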
If anyone has any comments or suggestions please speak up (e.g. my
suggested naming conventions).
Real life examples of FASTQ files anyone has had trouble parsing (even
with 3rd party tools) would be particularly useful - although we'd
probably want to cut down big example files in order to keep the
dataset to a reasonable size.
Thanks,
Peter
From biopython at maubp.freeserve.co.uk Wed Aug 26 07:36:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 26 Aug 2009 12:36:36 +0100
Subject: [Biopython-dev] [Biopython] Filtering SeqRecord feature list /
nested SeqFeatures
Message-ID: <320fb6e00908260436wbbd461bt205ada4fcc5c802c@mail.gmail.com>
Hi all,
I've retitled this thread (originally on the main list) to focus on the
more general idea of filtering the SeqRecord feature list (as that has
very little to do with SQLAlchemy) and how this interacts with
nested SeqFeature objects.
On Wed, Aug 26, 2009, Peter wrote:
> On Wed, Aug 26, 2009, Kyle Ellrott wrote:
>> I've added a new database function lookupFeature to quickly search for
>> sequence features without having to load all of them for any particular
>> sequence.
>> ...
>
> Interesting - and potentially useful if you are interested in just
> part of the genome (e.g. an operon).
>
> Have you tested this on composite features (e.g. a join)?
> Without looking into the details of your code this isn't clear.
>
> I wonder how well this would scale with a big BioSQL database
> ...
>
> On the other hand, if all the record's features have already been
> loaded into memory, there would just be thousands of locations
> to look at - it might be quicker.
>
> This brings me to another idea for how this interface might work,
> via the SeqRecord - how about adding a method like this:
>
> def filtered_features(self, start=None, end=None, type=None):
>
> Note I think it would also be nice to filter on the feature type (e.g.
> CDS or gene). This method would return a sublist of the full
> feature list (i.e. a list of those SeqFeature objects within the
> range given, and of the appropriate type). This could initially
> be implemented with a simple loop, but there would be scope
> for building an index or something more clever.
>
> [Note we are glossing over some potentially ambiguous
> cases with complex composite locations, where the "start"
> and "end" may differ from the "span" of the feature.]
>
> The DBSeqRecord would be able to do the same (just inherit
> the method), but you could try doing this via an SQL query, ...
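As an illustration of the filtered_features idea quoted above, a
simple-loop sketch. The start/end/type semantics and the stand-in
Feature class are assumptions here (and the composite-location
ambiguities mentioned are deliberately ignored); this is not proposed
Biopython code, just the shape of the idea:

```python
from collections import namedtuple

# Stand-in for SeqFeature with just the fields this filter needs.
Feature = namedtuple("Feature", "type start end")

def filtered_features(features, start=None, end=None, type=None):
    """Return features of the given type lying within [start, end]."""
    result = []
    for f in features:
        if type is not None and f.type != type:
            continue
        if start is not None and f.start < start:
            continue
        if end is not None and f.end > end:
            continue
        result.append(f)
    return result
```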
Brad, it occurred to me this idea (a filtered_features method
on the SeqRecord) might cause trouble with what I believe you
have in mind for parsing GFF files into nested SeqFeatures.
Is that still your plan?
In particular, if you have saved a CDS feature within a gene
feature, and the user asked for all the CDS features, simply
scanning the top level features list would miss it.
Would it be safe to assume (or even enforce) that subfeatures
are always *within* the location spanned by the parent feature?
Even with this proviso, a daughter feature may still be small
enough to pass a start/end filter, even if the parent feature
is not. Again, scanning the top level features list would miss
it.
All these issues go away if we continue to treat the SeqRecord
features list as a flat list, and use the SeqFeature
subfeatures list purely for storing composite locations (i.e.
sub regions of the parent feature - not for true subfeatures).
There are other downsides to using nested SubFeatures,
it will probably require a lot of reworking of the GenBank
output due to how composite features like joins are
currently stored, and I haven't even looked at the BioSQL
side of things. You may have looked at that already
though, so I may just be worrying about nothing.
Peter
From eoc210 at googlemail.com Sun Aug 30 15:33:59 2009
From: eoc210 at googlemail.com (Ed Cannon)
Date: Sun, 30 Aug 2009 20:33:59 +0100
Subject: [Biopython-dev] OBO2OWL parser / converter
Message-ID: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com>
Hi All,
I would like to thank you guys for all your hard work and effort in making
biopython a great piece of open software.
I would also like to introduce myself: my name is Ed Cannon, and I am a postdoc
at Cambridge University working in the fields of chemo/bioinformatics and
semantic web technologies in the group of Peter Murray-Rust.
Since a fair amount of my work involves ontologies, I have written an open
biomedical ontology (.obo) to web ontology language (.owl) converter. The
resultant file can be loaded and used from Protege. I was wondering if this
software would be of any interest to the biopython community? I have just
sent a pull request to biopython on github. The code is located at my branch
on my account: http://github.com/eoc21/biopython/tree/eoc21Branch.
Thanks,
Ed
From krother at rubor.de Mon Aug 31 07:19:07 2009
From: krother at rubor.de (Kristian Rother)
Date: Mon, 31 Aug 2009 13:19:07 +0200
Subject: [Biopython-dev] RNA module contributions
Message-ID: <4A9BB1AB.1070608@rubor.de>
Hi,
To start work on RNA modules, I'd like to contribute some of our tested
modules to BioPython. Before I place them into my GIT branch, it would
be great to get some comments:
Bio.RNA.SecStruc
- represents RNA secondary structures
- recognizes SSEs (helix, loop, bulge, junction)
- recognizes pseudoknots
Bio.RNA.ViennaParser
- parses RNA secondary structures in the Vienna format into SecStruc
objects.
Bio.RNA.BpseqParser
- parses RNA secondary structures in the Bpseq format into SecStruc
objects.
Connected to RNA, but with a wider focus:
Bio.???.ChemicalGroupFinder
- identifies chemical groups (ribose, carboxyl, etc) in a molecule
graph (place yet to be defined)
There is a contribution from Bjoern Gruening as well:
Bio.PDB.PDBMLParser
- creates PDB.Structure objects from PDB-XML files.
Comments and suggestions welcome!
Best Regards,
Kristian Rother
From hlapp at gmx.net Mon Aug 31 08:17:43 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 31 Aug 2009 08:17:43 -0400
Subject: [Biopython-dev] OBO2OWL parser / converter
In-Reply-To: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com>
References: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com>
Message-ID: <3AA994B7-B2FB-4D3B-A929-D6F5A9297BB2@gmx.net>
Hi Ed -
is your converter operating in a way that is congruent with (or even
utilizing) the mapping and the converter provided by the NCBO and
Berkeley Ontology projects?
http://www.bioontology.org/wiki/index.php/OboInOwl:Main_Page
If not, I'm not sure how beneficial it is for users to have multiple
and possibly conflicting mappings.
-hilmar
On Aug 30, 2009, at 3:33 PM, Ed Cannon wrote:
> Hi All,
>
> I would like to thank you guys for all your hard work and effort in
> making
> biopython a great piece of open software.
>
> I would also like to introduce myself: my name is Ed Cannon, and I am
> a postdoc
> at Cambridge University working in the fields of chemo/
> bioinformatics and
> semantic web technologies in the group of Peter Murray-Rust.
>
> Since a fair amount of my work involves ontologies, I have written
> an open
> biomedical ontology (.obo) to web ontology language (.owl)
> converter. The
> resultant file can be loaded and used from Protege. I was wondering
> if this
> software would be of any interest to the biopython community? I
> have just
> sent a pull request to biopython on github. The code is located at
> my branch
> on my account: http://github.com/eoc21/biopython/tree/eoc21Branch.
>
> Thanks,
>
> Ed
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From biopython at maubp.freeserve.co.uk Mon Aug 31 08:42:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 13:42:52 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
Message-ID: <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
On Thu, Aug 20, 2009 at 12:28 PM, Peter wrote:
> Hi all,
>
> You may recall a thread back in June with Cedar Mckay (cc'd - not
> sure if he follows the dev list or not) about indexing large sequence
> files - specifically FASTA files but any sequential file format. I posted
> some rough code which did this building on Bio.SeqIO:
> http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html
The Bio.SeqIO.indexed_dict() functionality is in CVS/github now
as I would like some wider testing. My earlier email explained the
implementation approach, and gave some example code:
http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006654.html
This aims to solve a fairly narrow problem - dictionary like random
access to any record in a sequence file as a SeqRecord via the
record id string as the key. It should work on any sequential file
format, and can even work on binary SFF files (code still on a
branch in github).
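For anyone curious, the implementation approach can be pictured in
plain Python for the FASTA case: scan the file once, keeping only each
record id and its file offset in memory, then seek back and parse a
single record on demand. This is only a sketch of the concept (the
function names are ours), not the actual Bio.SeqIO code:

```python
def index_fasta(filename):
    """Map record id -> file offset of its '>' header line."""
    offsets = {}
    with open(filename) as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(">"):
                offsets[line[1:].split()[0]] = offset
    return offsets

def fetch(filename, offsets, key):
    """Return (id, sequence) for one record via its stored offset."""
    with open(filename) as handle:
        handle.seek(offsets[key])
        title = handle.readline()[1:].strip()
        seq = []
        for line in handle:
            if line.startswith(">"):
                break
            seq.append(line.strip())
    return title.split()[0], "".join(seq)
```

Only the id/offset table lives in memory; each record is re-parsed from
disk when requested, which is the trade-off discussed below.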
Bio.SeqIO.to_dict() has always offered a very simple in memory
solution (a python dictionary of SeqRecord objects) which is fine
for small files (e.g. a few thousand FASTA entries), but won't scale
much more than that.
Using a BioSQL database would also allow random access to any
SeqRecord (and not just by looking it up by the identifier), but I
doubt it would scale well to 10s of millions of short read sequences.
It is also non-trivial to install the DB itself, the schema and the
Python bindings.
The new Bio.SeqIO.indexed_dict() code offers a read only
dictionary interface which does work for millions of reads. As
implemented, there is still a memory bound as all the keys and
their associated file offsets are held in memory. For example, a
7 million record FASTQ file taking 1.3GB on disk seems to need
almost 700MB in memory (just a very crude measurement).
Although clearly this is much more capable than the naive full
dictionary in memory approach (which is out of the question
here), this too could become a real bottleneck before long.
Biopython's old Martel/Mindy code used to build an on disk index,
which avoided this memory constraint. However, we've removed
that (due to mxTextTools breakage etc). In any case, it was also
much much slower:
http://lists.open-bio.org/pipermail/biopython/2009-June/005309.html
Using a Bio.SeqIO.indexed_dict() like API, we could of course
build an index file on disk to avoid this potential memory problem.
As Cedar suggested, this index file could be handled transparently
(created and deleted automatically), or indeed could be explicitly
persisted/reloaded to avoid re-indexing unnecessarily:
http://lists.open-bio.org/pipermail/biopython/2009-June/005265.html
Sticking to the narrow use case of (read only) random access to
a sequence file, all we really need to store is the lookup table of
keys (or their Python hash) and offsets in the original file. If they
are fast enough, we might even be able to reuse the old Martel/
Mindy index file format... or the OBDA specification if that is still
in use:
http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html
Another option (like the shelve idea we talked about last month)
is to parse the sequence file with SeqIO, and serialise all the
SeqRecord objects to disk, e.g. with pickle or some key/value
database. This is potentially very complex (e.g. arbitrary Python
objects in the annotation), and could lead to a very large "index"
file on disk. On the other hand, some possible back ends would
allow editing the database... which could be very useful.
Brad - do you have any thoughts? I know you did some work
with key/value indexers:
http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/
Peter
From chapmanb at 50mail.com Mon Aug 31 08:54:52 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 31 Aug 2009 08:54:52 -0400
Subject: [Biopython-dev] [Biopython] Filtering SeqRecord feature list /
nested SeqFeatures
In-Reply-To: <320fb6e00908260436wbbd461bt205ada4fcc5c802c@mail.gmail.com>
References: <320fb6e00908260436wbbd461bt205ada4fcc5c802c@mail.gmail.com>
Message-ID: <20090831125452.GA75451@sobchak.mgh.harvard.edu>
Peter and Kyle;
> I've retitled this thread (originally on the main list) to focus on the
> more general idea of filtering the SeqRecord feature list (as that has
> very little to do with SQLAlchemy) and how this interacts with
> nested SeqFeature objects.
Sorry to have missed this thread in real time; I was out of town
last week. Generally, it is great we are focusing on standard
queries and building up APIs to make them more intuitive. Nice.
> Brad, it occurred to me this idea (a filtered_features method
> on the SeqRecord) might cause trouble with what I believe you
> have in mind for parsing GFF files into nested SeqFeatures.
> Is that still your plan?
Yes, that was still the idea although I haven't dug into it much
beyond last time we discussed this. This is the direct translation
of the GFF way of handling multiple transcripts and coding features,
and seems like the intuitive way to handle the problem.
> In particular, if you have saved a CDS feature within a gene
> feature, and the user asked for all the CDS features, simply
> scanning the top level features list would miss it.
I think we'll be okay here. With nesting, everything would still be
stored in the seqfeature table. The seqfeature_relationship table
defines the nesting relationship but for the sake of queries all of
the features can be treated as flat, directly related to the bioentry
of interest.
Secondarily, you would need to reconstitute the nested relationship
if that is of interest, but for the query example of "give me all
features of this type in this region" you could return a simple flat
iterator of them.
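That flat treatment can be pictured as a small recursive walk over the
nested features (the sub_features attribute name just mimics the
SeqFeature convention; this is an illustration, not BioSQL query code):

```python
def iter_flat(features):
    """Yield every feature, including nested subfeatures, as one flat stream."""
    for feature in features:
        yield feature
        # recurse into any nested subfeatures
        for sub in iter_flat(getattr(feature, "sub_features", [])):
            yield sub
```

A type/region query can then run over this flat stream without caring
whether a CDS was stored inside a gene feature or at the top level.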
> Would it be safe to assume (or even enforce) that subfeatures
> are always *within* the location spanned by the parent feature?
> Even with this proviso, a daughter feature may still be small
> enough to pass a start/end filter, even if the parent feature
> is not. Again, scanning the top level features list would miss
> it.
The within assumption makes sense to me here. There may be
pathological cases that fall outside of this, but no examples are
coming to mind right now.
> There are other downsides to using nested SubFeatures,
> it will probably require a lot of reworking of the GenBank
> output due to how composite features like joins are
> currently stored, and I haven't even looked at the BioSQL
> side of things. You may have looked at that already
> though, so I may just be worrying about nothing.
Agreed. My thought was to prototype this with GFF and then think
further about GenBank features. Initially, I just want to get the
GFF parsing documented and in the Biopython repository, and then the
BioSQL storage would be a logical next step.
Brad
From chapmanb at 50mail.com Mon Aug 31 08:58:54 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 31 Aug 2009 08:58:54 -0400
Subject: [Biopython-dev] Command line wrappers for assembly tools
In-Reply-To: <320fb6e00908250333m6dc4b8eew475dc309f1e3ddb4@mail.gmail.com>
References: <320fb6e00908250333m6dc4b8eew475dc309f1e3ddb4@mail.gmail.com>
Message-ID: <20090831125854.GB75451@sobchak.mgh.harvard.edu>
Hi all;
> Osvaldo Zagordi has recently offered a Bio.Application style command line
> wrapper for Novoalign (a commercial short read aligner from Novocraft), see
> enhancement Bug 2904, and the Novocraft website:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2904
> http://www.novocraft.com/products.html
Very nice. I've been meaning to play with Novoalign and have heard
some good things.
> While some of these tools would fit under Bio.Align.Applications, does
> creating a similar collection at Bio.Sequencing.Applications make more
>> sense? For example, the Roche sffinfo tool isn't in itself an alignment
> application - but it is related to DNA sequencing.
I like the idea of a Sequencing namespace or at least something
different than the current Align, which implicitly refers mostly to
multiple alignment programs.
Brad
From biopython at maubp.freeserve.co.uk Mon Aug 31 09:11:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 14:11:19 +0100
Subject: [Biopython-dev] Command line wrappers for assembly tools
In-Reply-To: <20090831125854.GB75451@sobchak.mgh.harvard.edu>
References: <320fb6e00908250333m6dc4b8eew475dc309f1e3ddb4@mail.gmail.com>
<20090831125854.GB75451@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00908310611i2ce6a639i550631cb47a02050@mail.gmail.com>
On Mon, Aug 31, 2009 at 1:58 PM, Brad Chapman wrote:
> Hi all;
>
>> Osvaldo Zagordi has recently offered a Bio.Application style command line
>> wrapper for Novoalign (a commercial short read aligner from Novocraft), see
>> enhancement Bug 2904, and the Novocraft website:
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2904
>> http://www.novocraft.com/products.html
>
> Very nice. I've been meaning to play with Novoalign and have heard
> some good things.
Cool. Do you think you'll be able to try that out, and test Osvaldo's
wrapper at the same time?
>> While some of these tools would fit under Bio.Align.Applications, does
>> creating a similar collection at Bio.Sequencing.Applications make more
>> sense? For example, the Roche sffinfo tool isn't in itself an alignment
>> application - but it is related to DNA sequencing.
>
> I like the idea of a Sequencing namespace or at least something
> different than the current Align, which implicitly refers mostly to
> multiple alignment programs.
That sounds like a plan then...
Peter
From biopython at maubp.freeserve.co.uk Mon Aug 31 09:15:42 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 14:15:42 +0100
Subject: [Biopython-dev] [Biopython] Filtering SeqRecord feature list /
nested SeqFeatures
In-Reply-To: <20090831125452.GA75451@sobchak.mgh.harvard.edu>
References: <320fb6e00908260436wbbd461bt205ada4fcc5c802c@mail.gmail.com>
<20090831125452.GA75451@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00908310615y23051634sbe6076fa9667296b@mail.gmail.com>
On Mon, Aug 31, 2009 at 1:54 PM, Brad Chapman wrote:
>> There are other downsides to using nested SubFeatures,
>> it will probably require a lot of reworking of the GenBank
>> output due to how composite features like joins are
>> currently stored, and I haven't even looked at the BioSQL
>> side of things. You may have looked at that already
>> though, so I may just be worrying about nothing.
>
> Agreed. My thought was to prototype this with GFF and then
> think further about GenBank features. Initially, I just want to
> get the GFF parsing documented and in the Biopython
> repository, and then the BioSQL storage would be a logical
> next step.
If (as Michiel and I suggested) your GFF parser returns some
generic object (e.g. a GFF record class, or a tuple of basic
python types including a dictionary of annotation), then yes,
that can be checked in without side effects.
However, if your code goes straight to SeqRecord and
SeqFeature objects, we are going to have to deal with
how BioSQL and the existing SeqIO output code will
react (e.g. the GenBank output).
Peter
From chapmanb at 50mail.com Mon Aug 31 09:24:51 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 31 Aug 2009 09:24:51 -0400
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
Message-ID: <20090831132451.GD75451@sobchak.mgh.harvard.edu>
Hi Peter;
> The Bio.SeqIO.indexed_dict() functionality is in CVS/github now
> as I would like some wider testing. My earlier email explained the
> implementation approach, and gave some example code:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006654.html
Sweet. I pulled this from your branch earlier for something I was
doing at work and it's great stuff. My only suggestion would be to
change the function name to make it clear it's an in memory index.
This will leave us room for similar file-based index functions.
> Another option (like the shelve idea we talked about last month)
> is to parse the sequence file with SeqIO, and serialise all the
> SeqRecord objects to disk, e.g. with pickle or some key/value
> database. This is potentially very complex (e.g. arbitrary Python
> objects in the annotation), and could lead to a very large "index"
> file on disk. On the other hand, some possible back ends would
> allow editing the database... which could be very useful.
My thought here was to use BioSQL and the SQLite mappings for
serializing. We build off a tested and existing serialization, and
also guide people into using BioSQL for larger projects.
Essentially, we would build an API on top of existing BioSQL
functionality that creates the index by loading the SQL and then
pushes the parsed records into it.
> Brad - do you have any thoughts? I know you did some work
> with key/value indexers:
> http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/
I've been using MongoDB (http://www.mongodb.org/display/DOCS/Home)
extensively and it rocks; it's fast and scales well. The bit of work
that is needed is translating objects into JSON representations. There
are object mappers like MongoKit (http://bitbucket.org/namlook/mongokit/)
that help with this.
Connecting these thoughts together, a rough two step development plan
would be:
- Modify the underlying Biopython BioSQL representation to be object
based, using SQLAlchemy. This is essentially what I'd suggested as
a building block from Kyle's implementation.
- Use this to provide object mappings for object-based stores, like
MongoDB/MongoKit or Google App Engine.
Brad
From biopython at maubp.freeserve.co.uk Mon Aug 31 09:49:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 14:49:40 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <20090831132451.GD75451@sobchak.mgh.harvard.edu>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
<20090831132451.GD75451@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
On Mon, Aug 31, 2009 at 2:24 PM, Brad Chapman wrote:
>
> Hi Peter;
>
>> The Bio.SeqIO.indexed_dict() functionality is in CVS/github now
>> as I would like some wider testing. My earlier email explained the
>> implementation approach, and gave some example code:
>> http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006654.html
>
> Sweet. I pulled this from your branch earlier for something I was
> doing at work and it's great stuff.
Thanks :)
What file formats were you working on, and how many records?
> My only suggestion would be to
> change the function name to make it clear it's an in-memory index.
> This will free us up for similar file-based index functions.
True. Have you got any bright ideas for a better name? While the
index is in memory, the SeqRecord objects are not (unlike the
original Bio.SeqIO.to_dict() function).
Or we could have one function, Bio.SeqIO.indexed_dict(), which can
use either an in-memory index or an on-disk index, offering
the same functionality.
>> Another option (like the shelve idea we talked about last month)
>> is to parse the sequence file with SeqIO, and serialise all the
>> SeqRecord objects to disk, e.g. with pickle or some key/value
>> database. This is potentially very complex (e.g. arbitrary Python
>> objects in the annotation), and could lead to a very large "index"
>> file on disk. On the other hand, some possible back ends would
>> allow editing the database... which could be very useful.
>
> My thought here was to use BioSQL and the SQLite mappings for
> serializing. We build off a tested and existing serialization, and
> also guide people into using BioSQL for larger projects.
> Essentially, we would build an API on top of existing BioSQL
> functionality that creates the index by loading the SQL and then
> pushes the parsed records into it.
Using BioSQL in this way is a much more general tool than
simply "indexing a sequence file". It feels like a sledgehammer
to crack a nut. Also, do you expect it to scale well for 10 million
plus short reads? It may do, but on the other hand it may not.
You will also face the (file format specific but potentially significant)
up front cost of parsing the full file in order to get the SeqRecord
objects which are then mapped into the database. My new
Bio.SeqIO.indexed_dict() code (whatever we call it) avoids this
and the speed up is very nice (file format specific of course).
Also while the current BioSQL mappings are "tried and tested",
they don't cover everything, in particular per-letter-annotation
such as a set of quality scores (something that needs addressing
anyway, probably with JSON or XML serialisation).
All the above make me lean towards a less ambitious target
(read only dictionary access to a sequence file), which just
requires having an (on disk) index of file offsets (which could
be done with SQLite or anything else suitable). This choice
could even be done on the fly at run time (e.g. we look at the
size of the file to decide if we should use an in memory index
or on disk - or start out in memory and if the number of records
gets too big, switch to on disk).
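The offset-index idea discussed in this thread can be illustrated with a toy FASTA example (hypothetical helper names, not the actual Bio.SeqIO.indexed_dict() implementation): scan the file once to record where each record starts, then seek back and read a record only when it is requested.

```python
# Toy illustration of a key -> file offset index (not Biopython code):
# scan once, remember where each record's ">" header begins, then
# seek back and re-read a single record on demand.

def index_fasta_offsets(filename):
    """Map each record id to the byte offset of its '>' header line."""
    offsets = {}
    with open(filename, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                key = line[1:].split()[0].decode()
                offsets[key] = offset
    return offsets

def fetch_raw(filename, offsets, key):
    """Read one record's raw text from disk, using the stored offset."""
    with open(filename, "rb") as handle:
        handle.seek(offsets[key])
        lines = [handle.readline()]
        while True:
            line = handle.readline()
            if not line or line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines).decode()
```

Only the dictionary of offsets lives in memory; the (potentially large) sequence data stays on disk until fetched.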
Peter
From mjldehoon at yahoo.com Mon Aug 31 09:50:37 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 31 Aug 2009 06:50:37 -0700 (PDT)
Subject: [Biopython-dev] Fw: Re: RNA module contributions
Message-ID: <444088.91207.qm@web62408.mail.re1.yahoo.com>
Forgot to forward this to the list.
--- On Mon, 8/31/09, Michiel de Hoon wrote:
> From: Michiel de Hoon
> Subject: Re: [Biopython-dev] RNA module contributions
> To: "Kristian Rother"
> Date: Monday, August 31, 2009, 9:49 AM
> Hi Kristian,
>
> As I am working in transcriptomics, I'll be happy to see
> some more RNA modules in Biopython. Thanks!
> Just one comment for now:
> Recent parsers in Biopython use a function rather than a
> class.
> So instead of
>
> from Bio import ThisOrThatModule
> handle = open("myinputfile")
> parser = ThisOrThatModule.Parser()
> record = parser.parse(handle)
>
> you would have
>
> from Bio import ThisOrThatModule
> handle = open("myinputfile")
> record = ThisOrThatModule.read(handle)
>
> This assumes that myinputfile contains only one record. If
> you have input files with multiple records, you can use
>
> from Bio import ThisOrThatModule
> handle = open("myinputfile")
> records = ThisOrThatModule.parse(handle)
>
> where the parse function is a generator function.
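The read/parse convention described above can be sketched against a toy one-record-per-'>' format (illustrative code only, not any real Bio module; the (title, sequence) tuple stands in for a real record object):

```python
# Sketch of the read/parse convention: parse() is a generator yielding
# every record in the handle, while read() insists on exactly one.

def parse(handle):
    """Yield (title, sequence) tuples, one per record."""
    title, seq = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield (title, "".join(seq))
            title, seq = line[1:], []
        else:
            seq.append(line)
    if title is not None:
        yield (title, "".join(seq))

def read(handle):
    """Return the single record in the handle, or raise ValueError."""
    records = list(parse(handle))
    if len(records) != 1:
        raise ValueError("Expected one record, found %i" % len(records))
    return records[0]
```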
>
> How about the following for the RNA module?
>
> from Bio import RNA
> handle = open("myinputfile")
> record = RNA.read(handle, format="vienna")
> # or format="bpseq", as appropriate
>
> where record will be a Bio.RNA.SecStruc object.
>
> For consistency with other Biopython modules, you might
> also consider to rename Bio.RNA.SecStruc as Bio.RNA.Record.
> On the other hand, the name SecStruc is more informative,
> and maybe some day there will be other kinds of records in
> Bio.RNA.
>
> Thanks!
>
> --Michiel.
>
> --- On Mon, 8/31/09, Kristian Rother
> wrote:
>
> > From: Kristian Rother
> > Subject: [Biopython-dev] RNA module contributions
> > To: "Biopython-Dev Mailing List"
> > Date: Monday, August 31, 2009, 7:19 AM
> >
> > Hi,
> >
> > to start work on RNA modules, I'd like to contribute
> some
> > of our tested modules to BioPython. Before I place
> them into
> > my GIT branch, it would be great to get some
> comments:
> >
> > Bio.RNA.SecStruc
> >   - represents RNA secondary structures
> >   - recognizing SSEs (helix, loop, bulge, junction)
> >   - recognizing pseudoknots
> >
> > Bio.RNA.ViennaParser
> >   - parses RNA secondary structures in the Vienna format into
> >     SecStruc objects.
> >
> > Bio.RNA.BpseqParser
> >   - parses RNA secondary structures in the Bpseq format into
> >     SecStruc objects.
> >
> > Connected to RNA, but with a wider focus:
> >
> > Bio.???.ChemicalGroupFinder
> >   - identifies chemical groups (ribose, carboxyl, etc) in a
> >     molecule graph (place to be defined yet)
> >
> > There is a contribution from Bjoern Gruening as well:
> >
> > Bio.PDB.PDBMLParser
> >   - creates PDB.Structure objects from PDB-XML files.
> >
> >
> > Comments and suggestions welcome!
> >
> > Best Regards,
> >    Kristian Rother
> >
> >
> > _______________________________________________
> > Biopython-dev mailing list
> > Biopython-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython-dev
> >
>
>
>
>
From biopython at maubp.freeserve.co.uk Mon Aug 31 13:44:44 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 18:44:44 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
<20090831132451.GD75451@sobchak.mgh.harvard.edu>
<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
Message-ID: <320fb6e00908311044h24cd62d9n809582c7d32e5824@mail.gmail.com>
On Mon, Aug 31, 2009 at 2:49 PM, Peter wrote:
> All the above make me lean towards a less ambitious target
> (read only dictionary access to a sequence file), which just
> requires having an (on disk) index of file offsets (which could
> be done with SQLite or anything else suitable). This choice
> could even be done on the fly at run time (e.g. we look at the
> size of the file to decide if we should use an in memory index
> or on disk - or start out in memory and if the number of records
> gets too big, switch to on disk).
With the current code (in memory dictionary mapping keys to
file offsets), the 7 million record FASTQ file (1.3GB on disk)
required almost 700MB in memory. Indexing took about 1 min.
This is probably OK for many potential uses.
I just did a quick hack to use shelve (default settings) to hold the
key to file offset mapping. RAM usage was about 10MB, the
index file about 320MB (could have been a little more, my code
cleaned up after itself), but indexing took about 12 minutes.
http://github.com/peterjc/biopython/tree/index-shelve
I also did a proof of principle implementation using SQLite to
hold the key to file offset mapping. This also needed only about
10MB of RAM, the SQLite index file was about 400MB and
indexing took about 8 minutes. Perhaps this can be sped up...
http://github.com/peterjc/biopython/tree/index-sqlite
On the bright side, these all work for all the previously supported
indexable file formats, even SFF - which is pretty cool.
The trade off of 1 minute and 700MB RAM (in memory) versus
8 minutes but only 10MB RAM (using SQLite) means neither
solution will suit every use case. So unless the SQLite dict
approach can be sped up, it may be worthwhile to support
both this and the in memory index - although I haven't worked
out how best to arrange my code to achieve this elegantly.
Anyway, using SQLite like this seems workable (especially
since for Python 2.5+ it is included in the standard library).
Another option is the Berkeley DB library (especially if we can
do this following the OBF OBDA standard for the index file),
but while bsddb was included in Python 2.x, it has been
deprecated since Python 2.6 and removed in Python 3.0.
It is still available as a third party install though...
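The SQLite-backed key-to-offset mapping described above can be as simple as the sketch below (an invented schema for illustration, not the code on the index-sqlite branch):

```python
import sqlite3

# Minimal shape of an SQLite-backed key -> file offset index: one table,
# keyed lookups, and roughly constant memory use however many records.

def build_index(index_filename, offsets):
    """Create (or extend) an index file from a {key: offset} mapping."""
    con = sqlite3.connect(index_filename)
    con.execute("CREATE TABLE IF NOT EXISTS offset_data "
                "(key TEXT PRIMARY KEY, offset INTEGER)")
    con.executemany("INSERT INTO offset_data VALUES (?, ?)",
                    offsets.items())
    con.commit()
    return con

def lookup(con, key):
    """Return the stored file offset for a key, or raise KeyError."""
    row = con.execute("SELECT offset FROM offset_data WHERE key=?",
                      (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]
```

Batching the inserts into a single transaction (as the executemany call above does) is one obvious lever for speeding up the indexing time.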
Peter
From bugzilla-daemon at portal.open-bio.org Sat Aug 1 20:46:38 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 16:46:38 -0400
Subject: [Biopython-dev] [Bug 2894] New: Jython List difference causes
failed assertion in CondonTable Fix+Patch
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2894
Summary: Jython List difference causes failed assertion in
CondonTable Fix+Patch
Product: Biopython
Version: 1.51b
Platform: Other
OS/Version: Other
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: kellrott at ucsd.edu
Different list behaviour in Jython causes an assertion to fail because the
last two elements of the produced list are swapped. I haven't taken the time
to figure out if this is caused by sloppy list usage or Jython list weirdness.
At this point, I will assume that list order doesn't matter and simply expand
the assertion to allow both cases...
list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values)
Python : ['TGA', 'TAA', 'TAG', 'TAR', 'TRA']
Jython : ['TGA', 'TAA', 'TAG', 'TRA', 'TAR']
NOTE: Fixing this bug causes setup.py to fail (java.lang.ClassFormatError:
Invalid method Code length) because it exposes previously untested bugs
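If the element order genuinely doesn't matter, an alternative to enumerating both orderings in the assertion is an order-insensitive comparison, e.g.:

```python
# The two orderings seen on CPython and Jython contain the same codons;
# comparing sorted copies (or sets) makes the assertion portable.
cpython_order = ['TGA', 'TAA', 'TAG', 'TAR', 'TRA']
jython_order = ['TGA', 'TAA', 'TAG', 'TRA', 'TAR']
assert sorted(cpython_order) == sorted(jython_order)
assert set(cpython_order) == set(jython_order)
```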
*** biopython-1.51b_orig/Bio/Data/CodonTable.py 2009-05-08 14:20:19.000000000
-0700
--- biopython-1.51b/Bio/Data/CodonTable.py 2009-08-01 13:30:46.000000000
-0700
***************
*** 615,621 ****
assert list_ambiguous_codons(['TAG', 'TGA'],IUPACData.ambiguous_dna_values)
== ['TAG', 'TGA']
assert list_ambiguous_codons(['TAG', 'TAA'],IUPACData.ambiguous_dna_values)
== ['TAG', 'TAA', 'TAR']
assert list_ambiguous_codons(['UAG', 'UAA'],IUPACData.ambiguous_rna_values)
== ['UAG', 'UAA', 'UAR']
! assert list_ambiguous_codons(['TGA', 'TAA',
'TAG'],IUPACData.ambiguous_dna_values) == ['TGA', 'TAA', 'TAG', 'TAR', 'TRA']
# Forward translation is "onto", that is, any given codon always maps
# to the same protein, or it doesn't map at all. Thus, I can build
--- 615,623 ----
assert list_ambiguous_codons(['TAG', 'TGA'],IUPACData.ambiguous_dna_values)
== ['TAG', 'TGA']
assert list_ambiguous_codons(['TAG', 'TAA'],IUPACData.ambiguous_dna_values)
== ['TAG', 'TAA', 'TAR']
assert list_ambiguous_codons(['UAG', 'UAA'],IUPACData.ambiguous_rna_values)
== ['UAG', 'UAA', 'UAR']
! #Jython BUG? For some order Jython swaps the order of the last two
elements...
! assert list_ambiguous_codons(['TGA', 'TAA',
'TAG'],IUPACData.ambiguous_dna_values) == ['TGA', 'TAA', 'TAG', 'TAR', 'TRA']
or\
! list_ambiguous_codons(['TGA', 'TAA',
'TAG'],IUPACData.ambiguous_dna_values) == ['TGA', 'TAA', 'TAG', 'TRA', 'TAR']
# Forward translation is "onto", that is, any given codon always maps
# to the same protein, or it doesn't map at all. Thus, I can build
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sat Aug 1 21:16:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 17:16:48 -0400
Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed
assertion in CondonTable Fix+Patch
In-Reply-To:
Message-ID: <200908012116.n71LGmgG031493@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2894
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-01 17:16 EST -------
(In reply to comment #0)
> Different list behaviour in Jython causes an assertion to fail because the
> last two elements of the produced list are swapped. I haven't taken the time
> to figure out if this is caused by sloppy list usage or Jython list weirdness. ...
Are you using Biopython 1.51b, or the latest code from CVS/github? This sounds
like a duplicate of Bug 2887 (set order is Python implementation dependent).
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sun Aug 2 02:46:47 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 22:46:47 -0400
Subject: [Biopython-dev] [Bug 2895] New:
Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2895
Summary: Bio.Restriction.Restriction_Dictionary Jython Error
Fix+Patch
Product: Biopython
Version: 1.51b
Platform: Other
OS/Version: Other
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: kellrott at ucsd.edu
BugsThisDependsOn: 2891,2892,2893,2894
Jython is limited by JVM method sizes; overly large methods cause JVM
exceptions (java.lang.ClassFormatError: Invalid method Code length ...).
The Bio.Restriction.Restriction_Dictionary module defines too much data in
the base method. By breaking the defined dicts into pieces held in separate
methods, then merging them, the code will compile correctly under Jython.
Patch:
11,12c11,14
< rest_dict = \
< {'AarI': {'charac': (11, 8, None, None, 'CACCTGC'),
---
>
>
> def RestDict1():
> return {'AarI': {'charac': (11, 8, None, None, 'CACCTGC'),
1503,1504c1505,1508
< 'suppl': ('I',)},
< 'BbvCI': {'charac': (2, -2, None, None, 'CCTCAGC'),
---
> 'suppl': ('I',)} }
>
> def RestDict2():
> return { 'BbvCI': {'charac': (2, -2, None, None, 'CCTCAGC'),
3500c3504,3508
< 'suppl': ('X',)},
---
> 'suppl': ('X',)} }
>
>
> def RestDict3():
> return {
4497c4505,4508
< 'suppl': ('I',)},
---
> 'suppl': ('I',)} }
>
> def RestDict4():
> return {
5494,5495c5505,5508
< 'suppl': ('E', 'G', 'I', 'M', 'N', 'V')},
< 'DrdI': {'charac': (7, -7, None, None, 'GACNNNNNNGTC'),
---
> 'suppl': ('E', 'G', 'I', 'M', 'N', 'V')} }
>
> def RestDict5():
> return { 'DrdI': {'charac': (7, -7, None, None, 'GACNNNNNNGTC'),
6479c6492,6495
< 'suppl': ('N',)},
---
> 'suppl': ('N',)} }
>
> def RestDict6():
> return {
7194,7195c7210,7214
< 'suppl': ('N',)},
< 'Hpy8I': {'charac': (3, -3, None, None, 'GTNNAC'),
---
> 'suppl': ('N',)} }
>
>
> def RestDict7():
> return { 'Hpy8I': {'charac': (3, -3, None, None, 'GTNNAC'),
8491c8510,8513
< 'suppl': ()},
---
> 'suppl': ()} }
>
> def RestDict8():
> return {
9608c9630,9634
< 'suppl': ('F',)},
---
> 'suppl': ('F',)} }
>
>
> def RestDict9():
> return {
11992,11993c12018,12051
< suppliers = \
< {'A': ('Amersham Pharmacia Biotech',
---
>
>
> rest_dict = {}
> tmp = RestDict1()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict2()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict3()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict4()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict5()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict6()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict7()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict8()
> for a in tmp:
> rest_dict[a] = tmp[a]
> tmp = RestDict9()
> for a in tmp:
> rest_dict[a] = tmp[a]
>
>
> def Suppliers():
> return {'A': ('Amersham Pharmacia Biotech',
13626,13627c13684,13692
< typedict = \
< {'type145': (('NonPalindromic',
---
>
>
> suppliers = Suppliers()
>
>
>
>
> def TypeDict():
> return {'type145': (('NonPalindromic',
14498a14564,14567
>
> typedict = TypeDict()
>
>
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sun Aug 2 02:46:49 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 22:46:49 -0400
Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch
In-Reply-To:
Message-ID: <200908020246.n722knhV005000@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2891
kellrott at ucsd.edu changed:
What |Removed |Added
----------------------------------------------------------------------------
OtherBugsDependingO| |2895
nThis| |
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sun Aug 2 02:46:50 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 22:46:50 -0400
Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch
In-Reply-To:
Message-ID: <200908020246.n722koqM005006@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2892
kellrott at ucsd.edu changed:
What |Removed |Added
----------------------------------------------------------------------------
OtherBugsDependingO| |2895
nThis| |
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sun Aug 2 02:46:51 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 22:46:51 -0400
Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch
In-Reply-To:
Message-ID: <200908020246.n722kpGh005015@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2893
kellrott at ucsd.edu changed:
What |Removed |Added
----------------------------------------------------------------------------
OtherBugsDependingO| |2895
nThis| |
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sun Aug 2 02:46:52 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 1 Aug 2009 22:46:52 -0400
Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed
assertion in CondonTable Fix+Patch
In-Reply-To:
Message-ID: <200908020246.n722kq8g005021@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2894
kellrott at ucsd.edu changed:
What |Removed |Added
----------------------------------------------------------------------------
OtherBugsDependingO| |2895
nThis| |
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From eric.talevich at gmail.com Mon Aug 3 14:57:59 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 3 Aug 2009 10:57:59 -0400
Subject: [Biopython-dev] GSoC Weekly Update 11: PhyloXML for Biopython
Message-ID: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
Hi all,
Previously (July 27-31) I:
- Added the remaining checks for restricted tokens
- Modified the tree, parser and writer for phyloXML 1.10 support -- it
  validates now, and unit tests pass. PhyloXML 1.00 validation breaks,
  but that won't affect anyone except BioPerl, and they said they can
  deal with it on their end
- Changed how the Parser and Writer classes work to resemble other
  Biopython parser classes more closely
- Picked standard attributes for BaseTree's Tree and Node objects
  (informed by PhyloDB, though the names are slightly different); added
  properties to PhyloXML's Clade to mimic both types
- Made SeqRecord conversion actually work (with reasonable
  round-tripping capability); added a unit test
- Changed __str__ methods to not include the object's class name if
  there's another representative label to use (e.g. name) -- that's
  easy enough to add in the caller
- Sorted out the TreeIO read/parse/write API and added some support for
  the Newick format, as recommended by Peter on biopython-dev
- Split some "plumbing" (depth_first_search) off from the Tree.find()
  method. Since there are a lot of potentially useful methods to have on
  phylogenetic tree objects, I think it's best to distinguish between
  "porcelain" (specific, easy-to-use methods for common operations) and
  "plumbing" (generalized or low-level methods/algorithms that porcelain
  can rely on) in the Tree class in Bio.Tree.BaseTree.
- Started a function for networkx export. The edges are screwy right
  now, so I haven't checked it in yet.
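The plumbing/porcelain split mentioned for Tree.find() can be sketched like this (illustrative stand-alone functions, not the actual Bio.Tree.BaseTree API):

```python
# Plumbing: a generic depth-first traversal, parameterised on how to
# obtain a node's children.  Porcelain: find(), a convenience built
# on top of it.

def depth_first_search(node, children):
    """Yield node and all of its descendants, depth first."""
    yield node
    for child in children(node):
        for descendant in depth_first_search(child, children):
            yield descendant

def find(root, predicate, children):
    """Return the first node matching predicate, or None."""
    for node in depth_first_search(root, children):
        if predicate(node):
            return node
    return None
```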
This week (Aug. 3-7) I will:
Scan the code base for lingering TODO/ENH/XXX comments
Discuss merging back upstream
Work on enhancements (time permitting):
- Clean up the Parser class a bit more, to resemble Writer
- Finish networkx export
- Port common methods to Bio.Tree.BaseTree (from Bio.Nexus.Trees and
  other packages)
Run automated testing:
- Re-run performance benchmarks
- Run tests and benchmarks on alternate platforms
- Check epydoc's generated API documentation and fix docstrings
Update wiki documentation with new features:
- Tree: base classes, find() etc.
- TreeIO: 'phyloxml', 'nexus', 'newick' wrappers; PhyloXMLIO extras;
  warn that Nexus/Newick wrappers don't return Bio.Tree objects yet
- PhyloXML: singular properties, improved str()
Remarks:
- Most of the work done this week and last, shuffling base classes and
  adding various checks, actually made the I/O functions a little
  slower. I don't think this will be a big deal, and the changes were
  necessary, but it's still a little disappointing.
- The networkx export will look pretty cool. After exporting a Biopython
  tree to a networkx graph, it takes a couple more imports and commands
  to draw the tree to the screen or a file. Would anyone find it handy
  to have a short function in Bio.Tree or Bio.Graphics to go straight
  from a tree to a PNG or PDF? (Dependencies: networkx, matplotlib or
  maybe graphviz)
- I have to admit this: I don't know anything about BioSQL. How would I
  use and test the PhyloDB extension, and what's involved in writing a
  Biopython interface for it?
Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML
From krother at rubor.de Mon Aug 3 15:11:15 2009
From: krother at rubor.de (Kristian Rother)
Date: Mon, 03 Aug 2009 17:11:15 +0200
Subject: [Biopython-dev] RFC: RNAStructure package for BioPython
Message-ID: <4A76FE13.6050203@rubor.de>
Hi,
We have created a lot of code that works on RNA structures in Poznan,
Poland. There are some jewels that I consider useful and mature enough
to meet a wider audience. I'd be interested in refactoring and
packaging them as an RNAStructure package and contributing it to BioPython.
I just discussed the possibilities with Magdalena Musielak & Tomasz
Puton who wrote & tested significant portions of the code. They came up
with a list of 'most wanted' Use Cases:
- Calculate RNA base pairs
- Generate RNA secondary structures from 3D structures
- Recognize pseudoknots
- Recognize modified nucleotides in RNA 3D structures.
- Superimpose two RNA molecules.
The existing code already makes heavy use of Bio.PDB, and has few
dependencies apart from that.
Any comments on how this kind of functionality would fit into BioPython
are welcome.
Best Regards,
Kristian Rother
www.rubor.de
Structural Bioinformatics Group
UAM Poznan
From bugzilla-daemon at portal.open-bio.org Mon Aug 3 16:28:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 3 Aug 2009 12:28:39 -0400
Subject: [Biopython-dev] [Bug 2896] New: BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
Summary: BLAST XML parser: stripped leading/trailing spaces in
Hsp_midline
Product: Biopython
Version: 1.50
Platform: All
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: volkmer at mpi-cbg.de
Parsing an XML output file from NCBI BLAST (blastp with complexity filters on)
omits the leading/trailing spaces in the hsp match line:
hsp.query
u'XXXXPSPTSLATSHPPLSSMSPYMTI------PQQYLYISKIRSKLSQCALT-RHHH-RELDLRKMV'
hsp.match
u'P+ T L S PPL S+S + PQ+ L+ + R+K+ + + RHHH R LDL ++V'
This makes it more awkward to evaluate the alignment. It would be best if
query, subject and alignment always had the same length. The BLAST XML output
file at least has the correct Hsp_midline:
<Hsp_qseq>XXXXPSPTSLATSHPPLSSMSPYMTI------PQQYLYISKIRSKLSQCALT-RHHH-RELDLRKMV</Hsp_qseq>
<Hsp_hseq>EFFEPAITGLYYS-PPLFSVSRLTGLLHLLERPQETLF-TNYRNKIKRLDIPLRHHHIRHLDLEQLV</Hsp_hseq>
<Hsp_midline> P+ T L S PPL S+S + PQ+ L+ + R+K+ + + RHHH R
LDL ++V</Hsp_midline>
And as the plaintext parser gives the complete alignment line it would be nice
to get the same behaviour.
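Until the parser preserves the whitespace, one workaround is to rebuild a full-length midline from the query and subject strings themselves (a hypothetical helper; note it can only mark identities, since recovering BLAST's '+' positive-score marks would require the scoring matrix):

```python
# Rebuild an identity-only midline of the full alignment length.
# Identical, non-gap columns get the residue letter; everything else
# (mismatches, gaps, and BLAST's '+' positives) becomes a space.

def rebuild_midline(query, sbjct):
    if len(query) != len(sbjct):
        raise ValueError("query and subject must be aligned (same length)")
    return "".join(q if (q == s and q != "-") else " "
                   for q, s in zip(query, sbjct))
```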
Thanks,
Michael
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Aug 3 17:20:24 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 3 Aug 2009 13:20:24 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908031720.n73HKOFr019079@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-03 13:20 EST -------
Could you attach a complete XML file we could use for a unit test please?
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Mon Aug 3 20:48:49 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 3 Aug 2009 21:48:49 +0100
Subject: [Biopython-dev] Deprecating Bio.Fasta?
In-Reply-To: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com>
References: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com>
Message-ID: <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com>
On 22 June 2009, I wrote:
> ...
> I'd like to officially deprecate Bio.Fasta for the next release (Biopython
> 1.51), which means you can continue to use it for a couple more
> releases, but at import time you will see a warning message. See also:
> http://biopython.org/wiki/Deprecation_policy
>
> Would this cause anyone any problems? If you are still using Bio.Fasta,
> it would be interesting to know if this is just some old code that hasn't
> been updated, or if there is some stronger reason for still using it.
No one replied, so I plan to make this change in CVS shortly, meaning
that Bio.Fasta will be deprecated in Biopython 1.51, i.e. it will still work
but will trigger a deprecation warning at import.
Please speak up ASAP if this concerns you.
Thanks,
Peter
From chapmanb at 50mail.com Mon Aug 3 22:38:47 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 3 Aug 2009 18:38:47 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11:
PhyloXML for Biopython
In-Reply-To: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
Message-ID: <20090803223847.GM8112@sobchak.mgh.harvard.edu>
Hi Eric;
Thanks for the update. Things are looking in great shape as we get
towards the home stretch.
> - Most of the work done this week and last, shuffling base classes and
> adding various checks, actually made the I/O functions a little slower.
> I don't think this will be a big deal, and the changes were necessary,
> but it's still a little disappointing.
The unfortunate influence of generalization. I think the adjustment
to the generalized Tree is a big win and gives a solid framework for
any future phylogenetic modules. I don't know what the numbers are
but as long as performance is reasonable, few people will complain.
This is always something to go back around on if it becomes a hangup
in the future.
> - The networkx export will look pretty cool. After exporting a Biopython
> tree to a networkx graph, it takes a couple more imports and commands to
> draw the tree to the screen or a file. Would anyone find it handy to have
> a short function in Bio.Tree or Bio.Graphics to go straight from a tree
> to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe graphviz)
Awesome. Looking forward to seeing some trees that come out of this.
It's definitely worthwhile to formalize the functionality to go
straight from a tree to png or pdf. This will add some more
localized dependencies, so I'm torn as to whether it would be best
as a utility function or an example script. Peter might have an
opinion here.
Either way, this would be really useful as a cookbook example with a
final figure. Being able to produce something pretty is a good way to
convince people to store trees in a reasonable format like PhyloXML.
> - I have to admit this: I don't know anything about BioSQL. How would I use
> and test the PhyloDB extension, and what's involved in writing a
> Biopython interface for it?
BioSQL and the PhyloDB extension are a set of relational database
tables. Looking at the SVN logs, it appears as if the main work on
PhyloDB has occurred on PostgreSQL with the MySQL tables perhaps
lagging behind, so my suggestion is to start with PostgreSQL.
Hilmar, please feel free to correct me here.
The schemas are available from SVN:
http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/sql
You'd want biosqldb-pg.sql and presumably also biosqldb-views-pg.sql
for BioSQL and biosql-phylodb-pg.sql and biosql-phylodata-pg.sql.
The Biopython docs are pretty nice on this -- you create the empty tables:
http://biopython.org/wiki/BioSQL#PostgreSQL
From there you should be able to browse to get a sense of what is
there. In terms of writing an interface, the first step is loading
the data, where you can mimic what is done with SeqIO and BioSQL:
http://biopython.org/wiki/BioSQL#Loading_Sequences_into_a_database
Pass the database an iterator of trees and they are stored.
The second step is retrieving and querying persisted trees. Here you
would want TreeDB objects that act like standard trees, but
retrieve information from the database on demand. The Seq/SeqRecord
models in BioSQL show the pattern:
http://github.com/biopython/biopython/tree/master/BioSQL/BioSeq.py
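The on-demand retrieval idea can be sketched in a few lines. Everything here is hypothetical illustration (TreeDBNode and the FAKE_DB dictionary stand in for real queries against the PhyloDB tables); the BioSeq code linked above is the real model to mimic:

```python
# Hypothetical sketch of a database-backed tree node: children are
# fetched only on first access and then cached. FAKE_DB stands in
# for real SQL queries against the PhyloDB tables.
FAKE_DB = {
    1: {"name": "root", "children": [2, 3]},
    2: {"name": "A", "children": []},
    3: {"name": "B", "children": []},
}

class TreeDBNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self._children = None            # not loaded yet

    @property
    def name(self):
        return FAKE_DB[self.node_id]["name"]

    @property
    def children(self):
        if self._children is None:       # load on demand, then cache
            rows = FAKE_DB[self.node_id]["children"]
            self._children = [TreeDBNode(i) for i in rows]
        return self._children

root = TreeDBNode(1)
print([child.name for child in root.children])
```

The point of the property is that a caller can treat the object exactly like an in-memory tree, while the database is only touched when a branch is actually visited.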
So it's a bit of an extended task. Time frames being what they are,
any steps in this direction are useful. If you haven't played with
BioSQL before, it's worth a look for your own interest. The underlying
key/value model is really flexible and kind of models RDF triples. I've
used BioSQL here recently as the backend for a web app that differs a
bit from the standard GenBank-like thing, and found it very flexible.
Again, great stuff. Let me know if I can add to any of that,
Brad
From bugzilla-daemon at portal.open-bio.org Tue Aug 4 08:45:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 4 Aug 2009 04:45:03 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908040845.n748j36R015856@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
------- Comment #2 from volkmer at mpi-cbg.de 2009-08-04 04:45 EST -------
Created an attachment (id=1353)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1353&action=view)
blastp xml sample
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From chapmanb at 50mail.com Tue Aug 4 12:32:39 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 4 Aug 2009 08:32:39 -0400
Subject: [Biopython-dev] RFC: RNAStructure package for BioPython
In-Reply-To: <4A76FE13.6050203@rubor.de>
References: <4A76FE13.6050203@rubor.de>
Message-ID: <20090804123239.GN8112@sobchak.mgh.harvard.edu>
Hi Kristian;
> We have created a lot of code that works on RNA structures in Poznan,
> Poland. There are some jewels that I consider useful and mature enough
> to meet a wider audience. I'd be interested in refactoring and
> packaging them as an RNAStructure package and contributing it to BioPython.
This sounds great. I don't know enough about the area to comment
directly on your use cases -- my experience is limited to folding
structures with RNAFold and the like -- but it sounds like a solid
feature set.
> I just discussed the possibilities with Magdalena Musielak & Tomasz
> Puton who wrote & tested significant portions of the code. They came up
> with a list of 'most wanted' Use Cases:
>
> - Calculate RNA base pairs
> - Generate RNA secondary structures from 3D structures
> - Recognize pseudoknots
> - Recognize modified nucleotides in RNA 3D structures.
> - Superimpose two RNA molecules.
>
> The existing code massively uses Bio.PDB already, and has few
> dependencies apart from that.
You may also want to have a look at PyCogent, which has wrappers and
parsers for several command line programs involved with RNA structure,
along with a representation of RNA secondary structure:
http://pycogent.svn.sourceforge.net/viewvc/pycogent/trunk/cogent/struct/rna2d.py?view=markup
It would be great to complement this functionality, and interact
with PyCogent where feasible.
We could offer more specific suggestions as you get rolling with this
and there is code to review. Glad to have you interested,
Brad
From tiagoantao at gmail.com Tue Aug 4 15:29:36 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 4 Aug 2009 16:29:36 +0100
Subject: [Biopython-dev] 1.52
Message-ID: <6d941f120908040829g6531804dpe51e9f24720dab78@mail.gmail.com>
Hi,
I am currently working on the implementation of Genepop support on Bio.PopGen.
Genepop support will allow calculation of basic frequentist
statistics. This is the biggest addition to Bio.PopGen and makes the
module useful for a wide range of applications. In fact I never tried
to publicize Bio.PopGen in the population genetics community, but with
this addition, that will change.
The status is as follows:
1. Code: 90% done.
Check http://github.com/tiagoantao/biopython/tree/genepop
2. Test code: around 30% coverage
3. Documentation: 50% done
Check http://biopython.org/wiki/PopGen_dev_Genepop for a tutorial
under development.
This will be ready for 1.52, and I would like to make the code
available after the summer vacation. It is 1.52 that this mail
is about ;)
I remember Peter writing about 1.52 being ad-hoc scheduled for fall. I
have September blocked with work, but I managed to have October clear
mostly just for this. So my request is: if there is indeed a Fall
release, please don't schedule it for the first week of Fall (which
is still in September) ;) Mid-October or somewhere around that time
would be good.
Thanks a lot,
Tiago
--
"A man who dares to waste one hour of time has not discovered the
value of life" - Charles Darwin
From matzke at berkeley.edu Tue Aug 4 17:01:34 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Tue, 04 Aug 2009 10:01:34 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To:
References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel>
<86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org>
<20090707130248.GM17086@sobchak.mgh.harvard.edu>
<3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com>
<320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com>
<20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
Message-ID: <4A78696E.8010808@berkeley.edu>
Hi all, update:
Major improvements/fixes:
- removed any reliance on lagrange tree module, refactored all phylogeny
code to use the revised Bio.Nexus.Tree module
- tree functions put in TreeSum (tree summary) class
- added functions for calculating phylodiversity measures, including
necessary subroutines like subsetting trees, randomly selecting tips
from a larger pool
- Code dealing with GBIF xml output completely refactored into the
following classes:
* ObsRecs (observation records & search results/summary)
* ObsRec (an individual observation record)
* XmlString (functions for cleaning xml returned by Gbif)
* GbifXml (extension of capabilities for ElementTree xml trees, parsed
from GBIF xml returns)
- another suggestion implemented: dependencies on temp files eliminated
by using cStringIO file_str objects (in-memory file-like strings, never
written to disk) instead
- another suggestion implemented: the _open method from Biopython's NCBI
WWW functionality has been copied & modified so that it is now a method
of ObsRecs, and doesn't contain NCBI-specific defaults etc. (it does
still include a 3-second waiting time between GBIF requests, figuring
that is good practice).
- function to download large numbers of records in increments
implemented as method of ObsRecs.
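The tempfile-to-cStringIO swap mentioned above amounts to this small pattern (shown with io.StringIO, the Python 3 equivalent of Python 2's cStringIO; the payload string is made up for illustration):

```python
# In-memory file-like object instead of a temporary file on disk.
from io import StringIO  # the 2009 code used cStringIO on Python 2

xml_text = "<gbifResponse>example payload</gbifResponse>"  # made-up data
handle = StringIO(xml_text)

# Anything written against a readable file handle works unchanged:
assert handle.read() == xml_text
handle.seek(0)  # and it can be rewound like a real file
```

Since nothing touches the filesystem, there is no temp file to create, name, or clean up afterwards.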
This week:
- Put GIS functions in a class (easy), allowing each ObsRec to be
classified into an area (easy)
- Improve extraction of data from GBIF xmltree -- my Utricularia
"practice XML file" didn't have problems, but with running online
searches, I am discovering some fields are not always filled in, etc.
This shouldn't be too hard, using the GbifXml xmltree searching
functions, and including defaults for exceptions.
- Function for converting points to KML for Google Earth display.
Code uploaded here:
http://github.com/nmatzke/biopython/commits/Geography
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================
From matzke at berkeley.edu Tue Aug 4 18:28:33 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Tue, 04 Aug 2009 11:28:33 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <58AA6396-760D-40BB-B07A-EF22282E78D5@duke.edu>
References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel>
<86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org>
<20090707130248.GM17086@sobchak.mgh.harvard.edu>
<3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com>
<320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com>
<20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<58AA6396-760D-40BB-B07A-EF22282E78D5@duke.edu>
Message-ID: <4A787DD1.40301@berkeley.edu>
Hilmar Lapp wrote:
>
> On Aug 4, 2009, at 1:01 PM, Nick Matzke wrote:
>
>> * ObsRecs (observation records & search results/summary)
>> * ObsRec (an individual observation record)
>
>
> I'll let the Biopython folks make the call on this, but in general I'd
> recommend to everyone trying to write reusable code to spell out names,
> especially non-local names.
>
> The days in which the length of a variable or class name was somehow
> limited or affected the speed of a program have been over for more
> than a decade. I know the temptation is great to save on a few
> keystrokes every time you have to type the name, but the time that you
> will cause your fellow programmers who will later try to understand your
> code is vastly greater. What prevents me from thinking that ObsRec is a
> class for an obsolete recording?
Good point, this is easy to fix, I will put it on the list. Cheers!
Nick
>
> Just my $0.02 :-)
>
> -hilmar
--
Nicholas J. Matzke
matzke at berkeley.edu
From biopython at maubp.freeserve.co.uk Tue Aug 4 18:44:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 4 Aug 2009 19:44:29 +0100
Subject: [Biopython-dev] RFC: RNAStructure package for BioPython
In-Reply-To: <4A76FE13.6050203@rubor.de>
References: <4A76FE13.6050203@rubor.de>
Message-ID: <320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com>
On Mon, Aug 3, 2009 at 4:11 PM, Kristian Rother wrote:
> Hi,
>
> We have created a lot of code that works on RNA structures in Poznan,
> Poland. There are some jewels that I consider useful and mature enough to
> meet a wider audience. I'd be interested in refactoring and packaging them
> as an RNAStructure package and contributing it to BioPython.
I remember we talked about this briefly at BOSC/ISMB - it sounds good.
Did you get a chance to talk to Thomas Hamelryck about this?
> I just discussed the possibilities with Magdalena Musielak & Tomasz Puton
> who wrote & tested significant portions of the code. They came up with a
> list of 'most wanted' Use Cases:
>
> - Calculate RNA base pairs
> - Generate RNA secondary structures from 3D structures
> - Recognize pseudoknots
> - Recognize modified nucleotides in RNA 3D structures.
> - Superimpose two RNA molecules.
>
> The existing code massively uses Bio.PDB already, and has few
> dependencies apart from that.
>
> Any comments how this kind of functionality would fit into BioPython are
> welcome.
I see you have already started a github branch, which is great:
http://github.com/krother/biopython/tree/rol
Am I right in thinking all of this code is for 3D RNA work? Maybe that
might give a good module name... Bio.RNA3D? Or Bio.PDB.RNA?
Did you have something in mind?
Peter
P.S. Who won the ISMB Art and Science Exhibition prize?
http://www.iscb.org/ismbeccb2009/artscience.php
From biopython at maubp.freeserve.co.uk Tue Aug 4 19:29:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 4 Aug 2009 20:29:47 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>
<8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com>
<20090706220453.GI17086@sobchak.mgh.harvard.edu>
<8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com>
<320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com>
<320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com>
<20090708130649.GY17086@sobchak.mgh.harvard.edu>
<320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com>
Message-ID: <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com>
On Thu, Jul 9, 2009 at 10:18 AM, Peter wrote:
> On Wed, Jul 8, 2009 at 2:06 PM, Brad Chapman wrote:
>> How about adding a function like "run_arguments" to the
>> commandlines that returns the commandline as a list.
>
> That would be a simple alternative to my vague idea "Maybe we
> can make the command line wrapper object more list like to make
> subprocess happy without needing to create a string?", which may
> not be possible. Either way, this will require a bit of work on the
> Bio.Application parameter objects...
By defining an __iter__ method, we can make the Biopython
application wrapper object sufficiently list-like that it can be
passed directly to subprocess. I think I have something working
(only tested on Linux so far), at least for the case where none
of the arguments have spaces or quotes in them.
If this works, it should make things a little easier in that we don't
have to do str(cline), and also I think it avoids the OS-specific
behaviour of the shell argument as Brad noted earlier:
>> This avoids the shell nastiness with the argument list, is as
>> simple as it gets with subprocess, and gives users an easy
>> path to getting stdout, stderr and the return codes.
i.e. I am hoping we can replace this:
child = subprocess.Popen(str(cline), shell=(sys.platform!="win32"), ...)
with just:
child = subprocess.Popen(cline, ...)
where the "..." represents any messing about with stdin, stdout
and stderr.
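The intended end state can be sketched as follows. CommandLineSketch is a made-up stand-in for the real Bio.Application wrapper, just to show why an __iter__ method is enough to satisfy subprocess:

```python
import subprocess
import sys

class CommandLineSketch:
    """Made-up stand-in for a Bio.Application command line wrapper."""
    def __init__(self, program, *args):
        self.program = program
        self.args = list(args)

    def __str__(self):
        # The old usage: build one shell-style string.
        return " ".join([self.program] + self.args)

    def __iter__(self):
        # Yielding the arguments one by one makes the object acceptable
        # to subprocess.Popen as an argument sequence, with no shell and
        # no quoting worries.
        return iter([self.program] + self.args)

# Using the Python interpreter itself keeps the example portable.
cline = CommandLineSketch(sys.executable, "-c", "print('hello')")
child = subprocess.Popen(cline, stdout=subprocess.PIPE)
stdout, _ = child.communicate()
```

On POSIX, Popen converts a non-string argument with list(args), so any iterable works; this is what lets the wrapper be passed in directly without str() or shell=True.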
Peter
From chapmanb at 50mail.com Tue Aug 4 22:27:31 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 4 Aug 2009 18:27:31 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <4A78696E.8010808@berkeley.edu>
References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com>
<20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
Message-ID: <20090804222731.GA12604@sobchak.mgh.harvard.edu>
Hi Nick;
Thanks for the update -- great to see things moving along.
> - removed any reliance on lagrange tree module, refactored all phylogeny
> code to use the revised Bio.Nexus.Tree module
Awesome -- glad this worked for you. Are the lagrange_* files in
Bio.Geography still necessary? If not, we should remove them from
the repository to clean things up.
More generally, it would be really helpful if we could do a bit of
housekeeping on the repository. The Geography namespace has a lot of
things in it which belong in different parts of the tree:
- The test code should move to the 'Tests' directory as a set of
test_Geography* files that we can use for unit testing the code.
- Similarly there are a lot of data files in there which appear
to be test related; these could move to Tests/Geography
- What is happening with the Nodes_v2 and Treesv2 files? They look
like duplicates of the Nexus Nodes and Trees with some changes.
Could we roll those changes into the main Nexus code to avoid
duplication?
> - Code dealing with GBIF xml output completely refactored into the
> following classes:
>
> * ObsRecs (observation records & search results/summary)
> * ObsRec (an individual observation record)
> * XmlString (functions for cleaning xml returned by Gbif)
> * GbifXml (extension of capabilities for ElementTree xml trees, parsed
> from GBIF xml returns)
I agree with Hilmar -- the user classes would probably benefit from expanded
naming. There is an art to naming: landing somewhere between the hideous
RidiculouslyLongNamesWithEverythingSpecified style and short truncated names.
Specifically, you've got a lot of filler in the names -- dbfUtils,
geogUtils, shpUtils. The Utils probably doesn't tell the user much
and makes all of the names sort of blend together, just as the Rec/Recs
pluralization hides quite a large difference in what the classes hold.
Something like Observation and ObservationSearchResult would make it
clear immediately what they do and the information they hold.
> This week:
What are your thoughts on documentation? As a naive user of these
tools without much experience with the formats, I could offer better
feedback if I had an idea of the public APIs and how they are
expected to be used. Moreover, cookbook and API documentation is something
we will definitely need to integrate into Biopython. How does this fit
in your timeline for the remaining weeks?
Thanks again. Hope this helps,
Brad
From hlapp at gmx.net Tue Aug 4 23:34:26 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 4 Aug 2009 19:34:26 -0400
Subject: [Biopython-dev] RFC: RNAStructure package for BioPython
In-Reply-To: <320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com>
References: <4A76FE13.6050203@rubor.de>
<320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com>
Message-ID:
On Aug 4, 2009, at 2:44 PM, Peter wrote:
> P.S. Who won the ISMB Art and Science Exhibition prize?
> http://www.iscb.org/ismbeccb2009/artscience.php
Guess who - Kristian did :-)
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
From krother at rubor.de Wed Aug 5 08:07:12 2009
From: krother at rubor.de (Kristian Rother)
Date: Wed, 05 Aug 2009 10:07:12 +0200
Subject: [Biopython-dev] RFC: RNAStructure package for BioPython
Message-ID: <4A793DB0.5000805@rubor.de>
Hi Peter,
> I remember we talked about this briefly at BOSC/ISMB - it sounds good.
> Did you get a chance to talk to Thomas Hamelryck about this?
We talked at ISMB, but no details yet.
> Am I right in thinking all of this code is for 3D RNA work? Maybe that
> might give a good module name... Bio.RNA3D? Or Bio.PDB.RNA?
> Did you have something in mind?
I was thinking of 'RNAStructure' - I also like 'RNA' as long as it
does not violate any claims.
> P.S. Who won the ISMB Art and Science Exhibition prize?
> http://www.iscb.org/ismbeccb2009/artscience.php
The winning picture can be found here:
http://www.rubor.de/twentycharacters_en.html
Best Regards,
Kristian
From biopython at maubp.freeserve.co.uk Wed Aug 5 08:15:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 5 Aug 2009 09:15:36 +0100
Subject: [Biopython-dev] RFC: RNAStructure package for BioPython
In-Reply-To:
References: <4A76FE13.6050203@rubor.de>
<320fb6e00908041144s55fd6176n4aa97102f6deb22c@mail.gmail.com>
Message-ID: <320fb6e00908050115y612d89b2h757f5aa59fbb99ed@mail.gmail.com>
On Wed, Aug 5, 2009 at 12:34 AM, Hilmar Lapp wrote:
>
> On Aug 4, 2009, at 2:44 PM, Peter wrote:
>
>> P.S. Who won the ISMB Art and Science Exhibition prize?
>> http://www.iscb.org/ismbeccb2009/artscience.php
>
> Guess who - Kristian did :-)
>
> -hilmar
Ha! That's cool. Congratulations Kristian!
Peter
From biopython at maubp.freeserve.co.uk Wed Aug 5 10:29:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 5 Aug 2009 11:29:45 +0100
Subject: [Biopython-dev] Deprecating Bio.Fasta?
In-Reply-To: <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com>
References: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com>
<320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com>
Message-ID: <320fb6e00908050329m44fa2596ife06917306ae44ab@mail.gmail.com>
On Mon, Aug 3, 2009 at 9:48 PM, Peter wrote:
> On 22 June 2009, I wrote:
>> ...
>> I'd like to officially deprecate Bio.Fasta for the next release (Biopython
>> 1.51), which means you can continue to use it for a couple more
>> releases, but at import time you will see a warning message. See also:
>> http://biopython.org/wiki/Deprecation_policy
>> ...
>
> No one replied, so I plan to make this change in CVS shortly, meaning
> that Bio.Fasta will be deprecated in Biopython 1.51, i.e. it will still work
> but will trigger a deprecation warning at import.
>
> Please speak up ASAP if this concerns you.
I've just committed the deprecation of Bio.Fasta to CVS. This could be
reverted if anyone has a compelling reason (and tells us before we do
the final release of Biopython 1.51).
The docstring for Bio.Fasta should cover the typical situations for moving
from Bio.Fasta to Bio.SeqIO, but please feel free to ask on the mailing
list if you have a more complicated bit of old code that needs to be ported.
Thanks,
Peter
From bugzilla-daemon at portal.open-bio.org Wed Aug 5 11:29:41 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 5 Aug 2009 07:29:41 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908051129.n75BTf8i026537@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-05 07:29 EST -------
Thanks for the sample XML file. I could reproduce this, and I think I
have fixed it.
hsp.query, hsp.match and hsp.sbjct should all be the same length.
Previously, at the end of each tag our XML parser stripped the
leading/trailing white space from the tag's value before processing it.
In the case of Hsp_midline this is a very bad idea. However, the reason
it did this was that the way the current tag value was built up wasn't
context aware. In this particular case, there was white space outside
tags like Hsp_midline, which really belongs to the parent tag (Hsp),
but was wrongly being combined.
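The pitfall can be reproduced with a few lines of stdlib SAX code (a toy fragment, not the real NCBIXML parser): text between child tags belongs to the parent element, so a context-aware handler resets its character buffer per element instead of stripping every value.

```python
import xml.sax

# Toy Hsp fragment: Hsp_midline legitimately starts with spaces, and
# the newlines/indentation between tags belong to the parent <Hsp>.
XML = (b"<Hsp>\n"
       b"  <Hsp_qseq>ACGT</Hsp_qseq>\n"
       b"  <Hsp_midline>  ||</Hsp_midline>\n"
       b"</Hsp>")

class Handler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.values = {}
        self._chars = []

    def startElement(self, name, attrs):
        # Context aware: discard the parent's inter-tag whitespace
        # but keep whatever the element itself contains.
        self._chars = []

    def characters(self, content):
        # May be called several times per element; accumulate.
        self._chars.append(content)

    def endElement(self, name):
        self.values[name] = "".join(self._chars)

handler = Handler()
xml.sax.parseString(XML, handler)
assert handler.values["Hsp_midline"] == "  ||"  # spaces preserved
```

With per-element resets there is no need to strip anything, so hsp.query, hsp.match and hsp.sbjct keep the same length.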
Would you be able to test this please? All you really need to try this is the
new Bio/Blast/NCBIXML.py file (CVS revision 1.23). It might be easiest just to
update to the latest code in CVS (or on github), but I could attach the file
here if you like.
Peter
From bugzilla-daemon at portal.open-bio.org Wed Aug 5 13:13:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 5 Aug 2009 09:13:40 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908051313.n75DDeFt031305@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
------- Comment #4 from volkmer at mpi-cbg.de 2009-08-05 09:13 EST -------
Hi Peter,
could you please attach the file?
The latest version of NCBIXML.py I get from cvs at code.open-bio.org still seems
to be from April 2009. When I try to specify revision 1.23 I get a checkout
warning and no file. Or is there a testing branch for this?
Michael
From bugzilla-daemon at portal.open-bio.org Wed Aug 5 13:27:45 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 5 Aug 2009 09:27:45 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908051327.n75DRjjg031915@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-05 09:27 EST -------
Created an attachment (id=1357)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1357&action=view)
Updated version of NCBIXML.py as in CVS revision 1.23
(In reply to comment #4)
> Hi Peter,
>
> could you please attach the file?
Sure.
> The latest version of NCBIXML.py I get from cvs at code.open-bio.org
> still seems to be from April 2009. When I try to specify revision
> 1.23 I get a checkout warning and no file. Or is there a testing
> branch for this?
Using code.open-bio.org (or its various aliases like cvs.biopython.org)
actually gives you access to a read only mirror of the real CVS data,
which is on dev.open-bio.org (for use by those with commit rights).
I'm not sure exactly how often the public mirror is updated, but I would
guess hourly. I would guess if you try again later it would work, but
in the meantime I have attached the new file to this bug.
Peter
From eric.talevich at gmail.com Wed Aug 5 22:31:31 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 5 Aug 2009 18:31:31 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11:
PhyloXML for Biopython
In-Reply-To: <20090803223847.GM8112@sobchak.mgh.harvard.edu>
References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
<20090803223847.GM8112@sobchak.mgh.harvard.edu>
Message-ID: <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com>
On Mon, Aug 3, 2009 at 6:38 PM, Brad Chapman wrote:
> Hi Eric;
> Thanks for the update. Things are looking in great shape as we get
> towards the home stretch.
>
> > - Most of the work done this week and last, shuffling base classes and
> > adding various checks, actually made the I/O functions a little slower.
> > I don't think this will be a big deal, and the changes were necessary,
> > but it's still a little disappointing.
>
> The unfortunate influence of generalization. I think the adjustment
> to the generalized Tree is a big win and gives a solid framework for
> any future phylogenetic modules. I don't know what the numbers are
> but as long as performance is reasonable, few people will complain.
> This is always something to go back around on if it becomes a hangup
> in the future.
>
The complete unit test suite used to take about 4.5 seconds, and now it
takes 5.8 seconds, though I've added a few more tests since then. I don't
think it will feel like it's hanging for most operations, except when
parsing or searching a huge tree.
> > - The networkx export will look pretty cool. After exporting a Biopython
> > tree to a networkx graph, it takes a couple more imports and commands to
> > draw the tree to the screen or a file. Would anyone find it handy to have
> > a short function in Bio.Tree or Bio.Graphics to go straight from a tree
> > to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe graphviz)
>
> Awesome. Looking forward to seeing some trees that come out of this.
> It's definitely worthwhile to formalize the functionality to go
> straight from a tree to png or pdf. This will add some more
> localized dependencies, so I'm torn as to whether it would be best
> as a utility function or an example script. Peter might have an
> opinion here.
>
> Either way, this would be really useful as a cookbook example with a
> final figure. Being able to produce some pretty figures is a good way to
> convince people to store trees in a reasonable format like PhyloXML.
>
OK, it works now but the resulting trees look a little odd. The options
needed to get a reasonable tree representation are fiddly, so I made
draw_graphviz() a separate function that basically just handles the RTFM
work (not trivial), while the graph export still happens in to_networkx().
Here are a few recipes and a taste of each dish. The matplotlib engine seems
usable for interactive exploration, albeit cluttered -- I can't hide the
internal clade identifiers since graphviz needs unique labels, though maybe
I could make them less prominent. Drawing directly to PDF gets cluttered for
big files, and if you stray from the default settings (I played with it a
bit to get it right), it can look surreal. There would still be some benefit
to having a reportlab-based tree module in Bio.Graphics, and maybe one day
I'll get around to that.
$ ipython -pylab
from Bio import Tree, TreeIO
apaf = TreeIO.read('apaf.xml', 'phyloxml')
Tree.draw_graphviz(apaf)
# http://etal.myweb.uga.edu/phylo-nx-apaf.png
Tree.draw_graphviz(apaf, 'apaf.pdf')
# http://etal.myweb.uga.edu/apaf.pdf
Tree.draw_graphviz(apaf, 'apaf.png', format='png', prog='dot')
# http://etal.myweb.uga.edu/apaf.png -- why it's best to leave the defaults alone
Thoughts: the internal node labels could be clear instead of red; if a node
doesn't have a name, it could check its taxonomy attribute to see if
anything's there; there's probably a way to make pygraphviz understand
distinct nodes that happen to have the same label, although I haven't found
it yet. Is PDF a good default format, or would PNG or PostScript be better?
> > - I have to admit this: I don't know anything about BioSQL. How would I use
> > and test the PhyloDB extension, and what's involved in writing a
> > Biopython interface for it?
>
> BioSQL and the PhyloDB extension are a set of relational database
> tables. Looking at the SVN logs, it appears as if the main work on
> PhyloDB has occurred on PostgreSQL with the MySQL tables perhaps
> lagging behind, so my suggestion is to start with PostgreSQL.
> Hilmar, please feel free to correct me here.
>
> [...]
>
> So it's a bit of an extended task. Time frames being what they are,
> any steps in this direction are useful. If you haven't played with
> BioSQL before, it's worth a look for your own interest. The underlying
> key/value model is really flexible and kind of models RDF triplets. I've
> used BioSQL here recently as the backend for a web app that differs a
> bit from the standard GenBank like thing, and found it very flexible.
>
>
I think I've seen that app, but I thought it was backed by AppEngine. Neat
stuff. I will learn BioSQL for my own benefit, but I don't think there's
enough time left in GSoC for me to add a useful PhyloDB adapter to
Biopython. So that, along with refactoring Nexus.Trees to use
Bio.Tree.BaseTree, would be a good project to continue with in the fall, at
a slower pace and with more discussion along the way.
Cheers,
Eric
From bugzilla-daemon at portal.open-bio.org Thu Aug 6 07:56:25 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 6 Aug 2009 03:56:25 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908060756.n767uPk1031552@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
------- Comment #6 from volkmer at mpi-cbg.de 2009-08-06 03:56 EST -------
(In reply to comment #3)
> I could reproduce this, I think I have fixed
> it.
> hsp.query, hsp.match and hsp.sbjct should all be the same length.
>
> Previously, at the end of each tag, our XML parser stripped the leading/trailing
> white space from the tag's value before processing it. In the case of
> Hsp_midline this is a very bad idea.
Ok, the fix seems to solve the problem.
Well I guess the only time when this problem appears is when you have
filtered/masked residues at the beginning/end of the query hsp. Otherwise the
hsp would just start with the first match and end with the last one.
Thanks,
Michael
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Aug 6 08:03:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 6 Aug 2009 04:03:03 -0400
Subject: [Biopython-dev] [Bug 2896] BLAST XML parser: stripped
leading/trailing spaces in Hsp_midline
In-Reply-To:
Message-ID: <200908060803.n76833YJ032257@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2896
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-06 04:03 EST -------
(In reply to comment #6)
>
> Ok, the fix seems to solve the problem.
>
Great - I'm marking this bug as fixed, thanks for your time reporting and then
testing this.
> Well I guess the only time when this problem appears is when you have
> filtered/masked residues at the beginning/end of the query hsp. Otherwise
> the hsp would just start with the first match and end with the last one.
I suspect there are other situations it might happen, but the fix is
general.
Cheers,
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Thu Aug 6 08:06:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 Aug 2009 09:06:43 +0100
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11:
PhyloXML for Biopython
In-Reply-To: <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com>
References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
<20090803223847.GM8112@sobchak.mgh.harvard.edu>
<3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com>
Message-ID: <320fb6e00908060106h5b10442djea3f52fe9827108f@mail.gmail.com>
On Wed, Aug 5, 2009 at 11:31 PM, Eric Talevich wrote:
> OK, it works now but the resulting trees look a little odd. The options
> needed to get a reasonable tree representation are fiddly, so I made
> draw_graphviz() a separate function that basically just handles the RTFM
> work (not trivial), while the graph export still happens in to_networkx().
>
> Here are a few recipes and a taste of each dish. The matplotlib engine seems
> usable for interactive exploration, albeit cluttered -- I can't hide the
> internal clade identifiers since graphviz needs unique labels, though maybe
> I could make them less prominent. ...
Graphviz does need unique names, and the node labels default to the
node name - but you can override this and use a blank label if you want.
How are you calling Graphviz? There are several Python wrappers out
there, or you could just write a dot file directly and call the graphviz
command line tools.
Peter
From eric.talevich at gmail.com Thu Aug 6 12:47:47 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 6 Aug 2009 08:47:47 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11:
PhyloXML for Biopython
In-Reply-To: <320fb6e00908060106h5b10442djea3f52fe9827108f@mail.gmail.com>
References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
<20090803223847.GM8112@sobchak.mgh.harvard.edu>
<3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com>
<320fb6e00908060106h5b10442djea3f52fe9827108f@mail.gmail.com>
Message-ID: <3f6baf360908060547r8f299dao413b3657966fe9f4@mail.gmail.com>
On Thu, Aug 6, 2009 at 4:06 AM, Peter wrote:
> On Wed, Aug 5, 2009 at 11:31 PM, Eric Talevich
> wrote:
>
> > OK, it works now but the resulting trees look a little odd. The options
> > needed to get a reasonable tree representation are fiddly, so I made
> > draw_graphviz() a separate function that basically just handles the RTFM
> > work (not trivial), while the graph export still happens in
> to_networkx().
> >
> > Here are a few recipes and a taste of each dish. The matplotlib engine
> seems
> > usable for interactive exploration, albeit cluttered -- I can't hide the
> > internal clade identifiers since graphviz needs unique labels, though
> maybe
> > I could make them less prominent. ...
>
> Graphviz does need unique names, and the node labels default to the
> node name - but you can override this and use a blank label if you want.
> How are you calling Graphviz? There are several Python wrappers out
> there, or you could just write a dot file directly and call the graphviz
> command line tools.
>
I'm using the networkx and pygraphviz wrappers, since networkx already
partly wraps pygraphviz.
The direct networkx->matplotlib rendering engine figures out the
associations correctly when I pass a LabeledDiGraph instance, using Clade
objects as nodes and the str() representation as the label -- so
networkx.draw(tree) shows a tree with the internal nodes all labeled as
"Clade". But networkx.draw_graphviz(tree), while otherwise working the same
as the other networkx drawing functions, seems to convert nodes to strings
earlier, and then treats all "Clade" strings as the same node.
Surely there's a way to fix this through the networkx or pygraphviz API, but
I couldn't figure it out yesterday from the documentation and source code.
I'll poke at it some more today and try using blank labels.
Thanks,
Eric
From chapmanb at 50mail.com Thu Aug 6 13:14:42 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 6 Aug 2009 09:14:42 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11:
PhyloXML for Biopython
In-Reply-To: <3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com>
References: <3f6baf360908030757u6820d221pfcbb5847c33dd2ca@mail.gmail.com>
<20090803223847.GM8112@sobchak.mgh.harvard.edu>
<3f6baf360908051531u30133d3bpd90a92e338e2b887@mail.gmail.com>
Message-ID: <20090806131442.GG12604@sobchak.mgh.harvard.edu>
Hi Eric;
> OK, it works now but the resulting trees look a little odd. The options
> needed to get a reasonable tree representation are fiddly, so I made
> draw_graphviz() a separate function that basically just handles the RTFM
> work (not trivial), while the graph export still happens in to_networkx().
>
> Here are a few recipes and a taste of each dish. The matplotlib engine seems
> usable for interactive exploration, albeit cluttered -- I can't hide the
> internal clade identifiers since graphviz needs unique labels, though maybe
> I could make them less prominent. Drawing directly to PDF gets cluttered for
> big files, and if you stray from the default settings (I played with it a
> bit to get it right), it can look surreal. There would still be some benefit
> to having a reportlab-based tree module in Bio.Graphics, and maybe one day
> I'll get around to that.
This is a great start. I remember pygraphviz and the networkx
representation being a bit finicky last I used it. In the end, I ended
up making a pygraphviz AGraph directly. Either way, if you can remove
the unneeded labels and change the colorization as you suggested, this
makes for great quick visualizations of trees.
Something reportlab based that looks like biologists expect a
phylogenetic tree to look would also be very useful. There is a
benefit in familiarity of display. Building something generally
usable like that is a longer term project.
> I think I've seen that app, but I thought it was backed by AppEngine. Neat
> stuff. I will learn BioSQL for my own benefit, but I don't think there's
> enough time left in GSoC for me to add a useful PhyloDB adapter to
> Biopython. So that, along with refactoring Nexus.Trees to use
> Bio.Tree.BaseTree, would be a good project to continue with in the fall, at
> a slower pace and with more discussion along the way.
Yes, the AppEngine display is also BioSQL on the backend; I ported over
some of the tables to the object representation used in AppEngine. I
also have used the relational schema in work projects -- it
generally is just a good place to get started.
Agreed on the timelines for GSoC. We'd be very happy to have you
continue on those projects into the fall. Both are very useful
additions to the great work you've already done.
Brad
From biopython at maubp.freeserve.co.uk Thu Aug 6 14:39:33 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 Aug 2009 15:39:33 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>
<8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com>
<20090706220453.GI17086@sobchak.mgh.harvard.edu>
<8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com>
<320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com>
<320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com>
<20090708130649.GY17086@sobchak.mgh.harvard.edu>
<320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com>
<320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com>
Message-ID: <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com>
On Tue, Aug 4, 2009 at 8:29 PM, Peter wrote:
> On Thu, Jul 9, 2009 at 10:18 AM, Peter wrote:
>> On Wed, Jul 8, 2009 at 2:06 PM, Brad Chapman wrote:
>>> How about adding a function like "run_arguments" to the
>>> commandlines that returns the commandline as a list.
>>
>> That would be a simple alternative to my vague idea "Maybe we
>> can make the command line wrapper object more list like to make
>> subprocess happy without needing to create a string?", which may
>> not be possible. Either way, this will require a bit of work on the
>> Bio.Application parameter objects...
>
> By defining an __iter__ method, we can make the Biopython
> application wrapper object sufficiently list-like that it can be
> passed directly to subprocess. I think I have something working
> (only tested on Linux so far), at least for the case where none
> of the arguments have spaces or quotes in them.
The current Bio.Application code works around generating command line
strings, and works fine cross platform. Making the Bio.Application
objects "list like" and getting this to work cross platform isn't
looking easy. Spaces on Windows are causing me big headaches.
Switching to lists of arguments appears to work fine on Unix
(specifically tested on Linux and Mac OS X), but things are more
complicated on Windows. Basically using an array/list of arguments is
normal on Unix, but on Windows things get passed as strings. The
upshot is different Windows tools (or libraries used to compile them)
have to parse their command line string themselves, so different tools
do it differently. The result is you *may* need to adopt different
spaces/quotes escaping for different command line tools on Windows.
Now, if you give subprocess a list, on Windows it must first be turned
into a string, before subprocess can use the Windows API to run it.
The subprocess function list2cmdline does this, but the conventions it
follows are not universal.
I have examples of working command line strings for ClustalW and PRANK
where both the executable and some of the arguments have spaces in
them. It seems the quoting I was using to make ClustalW (or PRANK)
happy cannot be achieved via subprocess.list2cmdline (and I suspect
this applies to other tools too).
I will try and look into this further. However, even if it is
possible, I don't think we can implement the list approach in time for
Biopython 1.51, as there are just too many potential pitfalls.
I have in the meantime extended the command line tool unit tests
somewhat to include more examples with spaces in the filenames.
[I'm beginning to think replacing Bio.Application.generic_run with a
simpler helper function would be easier in the short term, continuing
to just use a string with subprocess, but I haven't given up yet.]
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 6 15:48:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 Aug 2009 16:48:12 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>
<8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com>
<20090706220453.GI17086@sobchak.mgh.harvard.edu>
<8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com>
<320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com>
<320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com>
<20090708130649.GY17086@sobchak.mgh.harvard.edu>
<320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com>
<320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com>
<320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com>
Message-ID: <320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com>
On Thu, Aug 6, 2009 at 3:39 PM, Peter wrote:
> Now, if you give subprocess a list, on Windows it must first be turned
> into a string, before subprocess can use the Windows API to run it.
> The subprocess function list2cmdline does this, but the conventions it
> follows are not universal.
>
> I have examples of working command line strings for ClustalW and PRANK
> where both the executable and some of the arguments have spaces in
> them. It seems the quoting I was using to make ClustalW (or PRANK)
> happy cannot be achieved via subprocess.list2cmdline (and I suspect
> this applies to other tools too).
e.g. This is a valid and working command line for PRANK, which works
both at the command line, or in Python via subprocess when given as
a string:
C:\repository\biopython\Tests>"C:\Program Files\prank.exe"
-d=Quality/example.fasta -o="temp with space" -f=11 -convert
Now, breaking up the arguments according to the description given in
the subprocess.list2cmdline docstring, I think the arguments are:
"C:\Program Files\prank.exe"
-d=Quality/example.fasta
-o="temp with space"
-f=11
-convert
Of these, the middle guy causes problems. By my reading of
the subprocess.list2cmdline docstring this is valid:
>> 2) A string surrounded by double quotation marks is
>> interpreted as a single argument, regardless of white
>> space or pipe characters contained within. A quoted
>> string can be embedded in an argument.
The example -o="temp with space" is a string surrounded by
double quotes, "temp with space", embedded in an argument.
Unfortunately, giving these five strings to subprocess.list2cmdline
results in a mess as it never checks to see if the arguments are
already quoted (as we have done for the program name and also
the output filename base). We can pass the program name in
without the quotes, and list2cmdline will do the right thing. But
there is no way for the -o argument to be handled that I can see.
This may be a bug in subprocess.list2cmdline, but it is certainly
a real limitation in my opinion.
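The mismatch is easy to demonstrate with the standard library alone; this snippet (using the same five PRANK arguments as above) shows how list2cmdline quotes the whole argument rather than just the value:

```python
import subprocess

# The five arguments from the PRANK example above, as a Python list.
args = [r"C:\Program Files\prank.exe",
        "-d=Quality/example.fasta",
        "-o=temp with space",
        "-f=11",
        "-convert"]

# list2cmdline wraps any argument containing whitespace in double
# quotes around the ENTIRE argument, so the -o option comes out as
# "-o=temp with space" rather than the -o="temp with space" form
# that PRANK accepted at the command line.
print(subprocess.list2cmdline(args))
```

Running this prints `"C:\Program Files\prank.exe" -d=Quality/example.fasta "-o=temp with space" -f=11 -convert`, i.e. the quotes end up in the wrong place for the -o option.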
So, it would appear that (on Windows) making our command line
wrappers act like lists (by defining __iter__) will not work in general.
The other approach which would allow our command line wrappers
to be passed directly to subprocess is to make them more string
like - but the subprocess code checks for string command lines
using isinstance(args, types.StringTypes) which means we would
have to subclass str (or unicode). I'm not sure if this can be made
to work yet...
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 6 16:05:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 Aug 2009 17:05:24 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>
<20090706220453.GI17086@sobchak.mgh.harvard.edu>
<8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com>
<320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com>
<320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com>
<20090708130649.GY17086@sobchak.mgh.harvard.edu>
<320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com>
<320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com>
<320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com>
<320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com>
Message-ID: <320fb6e00908060905i4a326327t504385ec55b0230c@mail.gmail.com>
On Thu, Aug 6, 2009 at 4:48 PM, Peter wrote:
> The other approach which would allow our command line wrappers
> to be passed directly to subprocess is to make them more string
> like - but the subprocess code checks for string command lines
> using isinstance(args, types.StringTypes) which means we would
> have to subclass str (or unicode). I'm not sure if this can be made
> to work yet...
Thinking about it a bit more, str and unicode are immutable objects,
but we want the command line wrapper to be mutable (e.g. to add,
change or remove parameters and arguments). So it won't work.
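To make that concrete, here is a minimal sketch of the __iter__ approach (the class and parameter handling are hypothetical, not the real Bio.Application code), which shows why it keeps the wrapper mutable in a way a str subclass couldn't:

```python
class CommandLineSketch:
    """Hypothetical mutable command line wrapper (illustration only).

    Defining __iter__ makes the object sufficiently list-like for
    subprocess on Unix, while the object itself stays mutable so
    parameters can still be added or changed after construction.
    """

    def __init__(self, program):
        self.program = program
        self.parameters = []  # list of (name, value) pairs

    def set_parameter(self, name, value=None):
        self.parameters.append((name, value))

    def __iter__(self):
        yield self.program
        for name, value in self.parameters:
            # Switches have no value; options use the name=value style.
            yield name if value is None else "%s=%s" % (name, value)


cline = CommandLineSketch("prank")
cline.set_parameter("-d", "example.fasta")
cline.set_parameter("-convert")
print(list(cline))  # ['prank', '-d=example.fasta', '-convert']
```

On Windows, subprocess would still have to join these arguments into one string via list2cmdline, which is exactly where the quoting problems in this thread come from.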
Going back to my original email, we could replace
Bio.Application.generic_run instead:
http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006344.html
>
> Possible helper functions that come to mind are:
> (a) Returns the return code (integer) only. This would basically
> be a cross-platform version of os.system using the subprocess
> module internally.
> (b) Returns the return code (integer) plus the stdout and stderr
> (which would have to be StringIO handles, with the data in
> memory). This would be a direct replacement for the current
> Bio.Application.generic_run function.
> (c) Returns the stdout (and stderr) handles. This basically is
> recreating a deprecated Python popen*() function, which seems
> silly.
Or we just declare both Bio.Application.generic_run and
ApplicationResult obsolete, and simply recommend using subprocess with
str(cline) as before. Would someone like to proof read (and test) the
tutorial in CVS where I switched all the generic_run usage to
subprocess?
Peter
From biopython at maubp.freeserve.co.uk Sat Aug 8 11:14:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 12:14:18 +0100
Subject: [Biopython-dev] Bio.SeqIO.convert function?
In-Reply-To: <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com>
References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com>
<20090728220943.GJ68751@sobchak.mgh.harvard.edu>
<320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com>
Message-ID: <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com>
On Wed, Jul 29, 2009 at 8:43 AM, Peter wrote:
> On Tue, Jul 28, 2009 at 11:09 PM, Brad Chapman wrote:
>> Extending this to AlignIO and TreeIO as Eric suggested is
>> also great.
>
> Whatever we do for Bio.SeqIO, we can follow the same pattern
> for Bio.AlignIO etc.
>
>> So +1 from me,
>> Brad
>
> And we basically had a +0 from Michiel, and a +1 from Eric.
> And I like the idea but am not convinced we need it. Maybe
> we should put the suggestion forward on the main discussion
> list for debate?
I've stuck a branch up on github which (thus far) simply defines
the Bio.SeqIO.convert and Bio.AlignIO.convert functions.
Adding optimised code can come later.
http://github.com/peterjc/biopython/commits/convert
Right now (based on the other thread), I've experimented
with making the convert functions accept either handles
or filenames. This will make the convert function even
more of a convenience wrapper, in addition to its role as a
standardised API to allow file format specific optimisations.
Taking handles and/or filenames does rather complicate
things, and not just for remembering to close the handles.
There are issues like should we silently replace any existing
output file (I went for yes), and should the output file be
deleted if the conversion fails part way (I went for no)?
Dealing with just handles would free us from all these
considerations.
You could even consider using Python's temporary file support
to write the file to a temp location, and only at the end move
it to the desired location. However that is getting far too
complicated for my liking (and may run into permissions
issues on Unix). If anyone wants to do this, they can do it
explicitly in the calling script.
How does this look so far?
Peter
From biopython at maubp.freeserve.co.uk Sat Aug 8 19:41:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 20:41:20 +0100
Subject: [Biopython-dev] Unit tests for deprecated modules?
In-Reply-To: <320fb6e00808190704p4d19eb27if2927466a27f9b2a@mail.gmail.com>
References: <320fb6e00808190352sd6437e0qb2898e39b15287b3@mail.gmail.com>
<48AACE23.3050107@biologie.uni-kl.de>
<320fb6e00808190704p4d19eb27if2927466a27f9b2a@mail.gmail.com>
Message-ID: <320fb6e00908081241r1b23498du43fe19a6cc349c97@mail.gmail.com>
Last year we talked about what to do with the unit tests for deprecated modules,
http://lists.open-bio.org/pipermail/biopython-dev/2008-August/004137.html
On Tue, Aug 19, 2008, Peter wrote:
> Are there any strong views about when to remove unit tests for
> deprecated modules? I can see two main approaches:
>
> (a) Remove the unit test when the code is deprecated, as this avoids
> warning messages from the test suite.
> (b) Remove the unit test only when the deprecated code is actually
> removed, as continuing to test the code will catch any unexpected
> breakage of the deprecated code.
>
> I lean towards (b), but wondered what other people think.
>
> Peter
On Tue, Aug 19, 2008, Michiel de Hoon wrote:
> I would say (a). In my opinion, deprecated means that the module
> is in essence no longer part of Biopython; we just keep it around
> to give people time to change. Also, deprecation warnings distract
> from real warnings and errors in the unit tests, are likely to confuse
> users, and give the impression that Biopython is not clean. I don't
> remember a case where we had to resurrect a deprecated module,
> so we may as well remove the unit test right away.
>
> --Michiel
On Tue, Aug 19, 2008, Frank Kauff wrote:
> I favor option a. Deprecated modules are no longer under development,
> so there's not much need for a unit test. A failed test would probably
> not trigger any action anyway, because nobody's going to do much
> bugfixing in deprecated modules.
>
> Frank
So, what we agreed last year was to remove tests for deprecated
modules. This issue has come up again with the deprecation of
Bio.Fasta, and the question of what to do with test_Fasta.py
I'd like to suggest a third option: Keep the tests for deprecated
modules, but silence the deprecation warning, e.g. make
test_Fasta.py silence the Bio.Fasta deprecation warning. Hiding
the warning would prevent the likely user confusion on running
the test suite (an issue Michiel pointed out last year). Keeping
the test will prevent us accidentally breaking Bio.Fasta during
the phasing out period.
Any thoughts?
Peter
From biopython at maubp.freeserve.co.uk Sat Aug 8 19:50:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 20:50:47 +0100
Subject: [Biopython-dev] Unit tests for deprecated modules?
In-Reply-To: <320fb6e00908081241r1b23498du43fe19a6cc349c97@mail.gmail.com>
References: <320fb6e00808190352sd6437e0qb2898e39b15287b3@mail.gmail.com>
<48AACE23.3050107@biologie.uni-kl.de>
<320fb6e00808190704p4d19eb27if2927466a27f9b2a@mail.gmail.com>
<320fb6e00908081241r1b23498du43fe19a6cc349c97@mail.gmail.com>
Message-ID: <320fb6e00908081250j189ba590o5cd9c6e98f596193@mail.gmail.com>
On Sat, Aug 8, 2009 at 8:41 PM, Peter wrote:
> Last year we talked about what to do with the unit tests for deprecated modules,
> http://lists.open-bio.org/pipermail/biopython-dev/2008-August/004137.html
> ...
> I'd like to suggest a third option: Keep the tests for deprecated
> modules, but silence the deprecation warning. e.g. make
> test_Fasta.py silence the Bio.Fasta deprecation warning.
I've done that in CVS as a proof of principle, replacing:
from Bio import Fasta
with:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from Bio import Fasta
warnings.resetwarnings()
There may be a more elegant way to do this, but it works.
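One slightly tidier option (assuming Python 2.6 or later) is the warnings.catch_warnings context manager, which saves and restores the filter state automatically rather than calling resetwarnings(); shown here with a stand-in warning, since the point is the pattern, not Bio.Fasta itself:

```python
import warnings

# The context manager snapshots the filter list on entry and restores
# it on exit, so any filters the caller had set are left untouched.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    # from Bio import Fasta   # the deprecated import would go here
    warnings.warn("stand-in deprecated import", DeprecationWarning)
# Here the previous warning filters are back in force.
```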
Peter
From bugzilla-daemon at portal.open-bio.org Mon Aug 10 13:43:15 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 10 Aug 2009 09:43:15 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
files in Bio.SeqIO
In-Reply-To:
Message-ID: <200908101343.n7ADhF4c020240@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2837
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #1303 is|0 |1
obsolete| |
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-08-10 09:43 EST -------
(From update of attachment 1303)
This file is already a tiny bit out of date - I've started working on this on a
git branch.
http://github.com/peterjc/biopython/commits/sff
See also James Casbon's parser, also on github:
http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006456.html
http://github.com/jamescasbon/biopython/tree/sff
It looks like we could try and merge the two. James' code looks like it doesn't
need seek/tell, which means it should work on any input handle (not just an
open file).
Note neither parser yet copes with paired end data (and I have not yet found
any test files to work on).
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Mon Aug 10 16:46:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Aug 2009 17:46:16 +0100
Subject: [Biopython-dev] Bio.SeqIO.convert function?
In-Reply-To: <320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com>
References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com>
<20090728220943.GJ68751@sobchak.mgh.harvard.edu>
<320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com>
<320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com>
Message-ID: <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com>
On Sat, Aug 8, 2009 at 12:14 PM, Peter wrote:
> I've stuck a branch up on github which (thus far) simply defines
> the Bio.SeqIO.convert and Bio.AlignIO.convert functions.
> Adding optimised code can come later.
>
> http://github.com/peterjc/biopython/commits/convert
There is now a new file Bio/SeqIO/_convert.py on this
branch, and a few optimised conversions have been done.
In particular GenBank/EMBL to FASTA, any FASTQ to
FASTA, and inter-conversion between any of the three
FASTQ formats.
In terms of speed, this new code takes under a minute to
convert a 7 million short read FASTQ file to another FASTQ
variant, or to a (line wrapped) FASTA file. In comparison,
using Bio.SeqIO parse/write takes over five minutes.
In terms of code organisation within Bio/SeqIO/_convert.py
I am (as with Bio.SeqIO etc for parsing and writing) just
using a dictionary of functions, keyed on the format names.
Initially, as you can tell from the code history, I was thinking
about having each sub-function potentially dealing with more
than one conversion (e.g. GenBank to anything not needing
features), but have removed this level of complication in the
most recent commit.
The current Bio/SeqIO/_convert.py file actually looks very
long and complicated - but if you ignore the doctests (which
I would probably move to a dedicated unit test), it isn't that
much code at all.
Would anyone like to try this out?
Peter
From eric.talevich at gmail.com Mon Aug 10 17:44:31 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 10 Aug 2009 13:44:31 -0400
Subject: [Biopython-dev] GSoC Weekly Update 12: PhyloXML for Biopython
Message-ID: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com>
Hi folks,
Previously I (Aug. 3-7):
- Refactored the PhyloXML parser somewhat, to behave more like the other
Biopython parsers and also handle 'other' elements better
- Reorganized Bio.Tree a bit, generalizing the Tree base class and
improving BaseTree-PhyloXML interoperability
- Worked on networkx export and graphviz display
- Added some more tests (thanks, Diana!)
- Added TreeIO.convert(), to match the AlignIO and SeqIO modules
Next week (Aug. 10-14) I will:
- Update the wiki documentation
- Fix any surprises that come up during testing
Automated testing:
- Check unit tests for complete coverage
- Re-run performance benchmarks
- Run tests and benchmarks on alternate platforms
- Check epydoc's generated API documentation
Remarks:
- Performance of the I/O functions is close to what it was before, in the
best of times; parsing Taxonomy nodes incrementally seems to have helped.
- Drawing trees with Graphviz is still ugly. Hopefully I can fix it this
week, but if not, I'll probably do it after GSoC because I like pretty
things.
- Presumably, any discussion of merging with Biopython will have to wait
until after the biopython-1.51 release. I'll be around. For GSoC
requirements, I'm planning on just dumping the Bio.Tree and Bio.TreeIO
modules along with the unit test suite as standalone files, rather than
as a patch set, since the last upstream revision I pulled was just a
random untagged one around the time of the last beta release.
Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML
From matzke at berkeley.edu Mon Aug 10 20:23:15 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 10 Aug 2009 13:23:15 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <20090804222731.GA12604@sobchak.mgh.harvard.edu>
References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com>
<20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
Message-ID: <4A8081B3.2080600@berkeley.edu>
Hi all...updates...
Summary: Major focus is getting the GBIF access/search/parse module into
"done"/submittable shape. This primarily requires getting the
documentation and testing up to biopython specs. I have a fair bit of
documentation and testing, but need advice (see below) on the specifics of
what it should look like.
Brad Chapman wrote:
> Hi Nick;
> Thanks for the update -- great to see things moving along.
>
>> - removed any reliance on lagrange tree module, refactored all phylogeny
>> code to use the revised Bio.Nexus.Tree module
>
> Awesome -- glad this worked for you. Are the lagrange_* files in
> Bio.Geography still necessary? If not, we should remove them from
> the repository to clean things up.
Ah, they had been deleted locally but it took an extra command to delete
on git. Done.
>
> More generally, it would be really helpful if we could do a bit of
> housekeeping on the repository. The Geography namespace has a lot of
> things in it which belong in different parts of the tree:
>
> - The test code should move to the 'Tests' directory as a set of
> test_Geography* files that we can use for unit testing the code.
OK, I will do this. Should I try and figure out the unittest stuff? I
could use a simple example of what this is supposed to look like.
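Since a simple example was requested: a minimal unittest-style test file in the usual Tests/ shape might look like the sketch below. All the names here (the file, the test class, the parsing logic) are invented for illustration, not existing Biopython code.

```python
# Hypothetical Tests/test_Geography.py skeleton (all names invented).
import unittest

class LatLongParsingTest(unittest.TestCase):

    def test_parse_latlong_line(self):
        # Stand-in for parsing a tab-delimited "species<TAB>lat<TAB>long"
        # line, the kind of thing latlong_to_obj() would be tested on.
        line = "Genus_species\t37.87\t-122.26"
        species, lat, lon = line.split("\t")
        self.assertEqual(species, "Genus_species")
        self.assertAlmostEqual(float(lat), 37.87)
        self.assertAlmostEqual(float(lon), -122.26)

# Run the suite explicitly (Biopython's run_tests.py drives this part
# for the real test files):
suite = unittest.TestLoader().loadTestsFromTestCase(LatLongParsingTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The key convention is one TestCase subclass per feature, with small, independent `test_*` methods.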
> - Similarly there are a lot of data files in there which appear
> to be test related; these could move to Tests/Geography
Will do.
> - What is happening with the Nodes_v2 and Treesv2 files? They look
> like duplicates of the Nexus Nodes and Trees with some changes.
> Could we roll those changes into the main Nexus code to avoid
> duplication?
Yeah, these were just copies with your bug fix, and with a few mods I
used to track crashes. Presumably I don't need these after a fresh
download of biopython.
>> - Code dealing with GBIF xml output completely refactored into the
>> following classes:
>>
>> * ObsRecs (observation records & search results/summary)
>> * ObsRec (an individual observation record)
>> * XmlString (functions for cleaning xml returned by Gbif)
>> * GbifXml (extension of capabilities for ElementTree xml trees, parsed
>> from GBIF xml returns)
>
> I'm agreed with Hilmar -- the user classes would probably benefit from expanded
> naming. There is an art to naming to get them somewhere between the hideous
> RidiculouslyLongNamesWithEverythingSpecified names and short truncated names.
> Specifically, you've got a lot of filler in the names -- dbfUtils,
> geogUtils, shpUtils. The Utils probably doesn't tell the user much
> and makes all of the names sort of blend together, just as the Rec/Recs
> pluralization hides a quite large difference in what the classes hold.
Will work on this; these should be made part of the
GbifObservationRecord() object or be accessed by it. Basically they only
exist to classify lat/long points into user-specified areas.
> Something like Observation and ObservationSearchResult would make it
> clear immediately what they do and the information they hold.
Agreed, here is a new scheme for the names (changes already made):
=============
class GbifSearchResults():
GbifSearchResults is a class for holding a series of
GbifObservationRecord records, and processing them e.g. into classified
areas.
Also can hold a GbifDarwincoreXmlString record (the raw output returned
from a GBIF search) and a GbifXmlTree (a class for holding/processing
the ElementTree object returned by parsing the GbifDarwincoreXmlString).
class GbifObservationRecord():
GbifObservationRecord is a class for holding an individual observation
at an individual lat/long point.
class GbifDarwincoreXmlString(str):
GbifDarwincoreXmlString is a class for holding the xmlstring returned by
a GBIF search, & processing it to plain text, then an xmltree (an
ElementTree).
GbifDarwincoreXmlString inherits string methods from str (class String).
class GbifXmlTree():
GbifXmlTree is a class for holding and processing xmltrees of GBIF records.
=============
...description of methods below...
>
>> This week:
>
> What are your thoughts on documentation? As a naive user of these
> tools without much experience with the formats, I could offer better
> feedback if I had an idea of the public APIs and how they are
> expected to be used. Moreover, cookbook and API documentation is something
> we will definitely need to integrate into Biopython. How does this fit
> in your timeline for the remaining weeks?
The API is really just the interface with GBIF. I think developing a
cookbook entry is pretty easy, I assume you want something like one of
the entries in the official biopython cookbook?
Re: API documentation...are you just talking about the function
descriptions that are typically in """ """ strings beneath the function
definitions? I've got that done. Again, if there is more, an example
of what it should look like would be useful.
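As a concrete illustration of the """ """-style function descriptions being discussed, a docstring in the usual summary-line-plus-details shape (which both help() and epydoc pick up) could look like this. The wording and the standalone function are invented here purely to show the layout:

```python
def get_numhits(params):
    """Get the number of hits a given GBIF search will return.

    params is a dictionary of key/value search parameters.  Knowing
    the hit count up front allows gradual downloading of searches
    larger than e.g. 1000 records.
    """
    pass  # body omitted; this sketch only illustrates docstring layout

# help() and epydoc both read the text from __doc__:
print(get_numhits.__doc__.splitlines()[0])
```

The first line is a short imperative summary; the blank line before the details is what documentation tools use to separate summary from description.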
Documentation for the GBIF stuff below.
============
gbif_xml.py
Functions for accessing GBIF, downloading records, processing them into
a class, and extracting information from the xmltree in that class.
class GbifObservationRecordError(Exception): pass
class GbifObservationRecord():
GbifObservationRecord is a class for holding an individual observation
at an individual lat/long point.
__init__(self):
This is the instantiation method for setting up new objects of this class.
latlong_to_obj(self, line):
Read in a string, read species/lat/long to GbifObservationRecord object
This can be slow, e.g. 10 seconds for even just ~1000 records.
parse_occurrence_element(self, element):
Parse a TaxonOccurrence element, store in OccurrenceRecord
fill_occ_attribute(self, element, el_tag, format='str'):
Return the text found in matching element matching_el.text.
find_1st_matching_subelement(self, element, el_tag, return_element):
Burrow down into the XML tree, retrieve the first element with the
matching tag.
record_to_string(self):
Print the attributes of a record to a string
class GbifDarwincoreXmlStringError(Exception): pass
class GbifDarwincoreXmlString(str):
GbifDarwincoreXmlString is a class for holding the xmlstring returned by
a GBIF search, & processing it to plain text, then an xmltree (an
ElementTree).
GbifDarwincoreXmlString inherits string methods from str (class String).
__init__(self, rawstring=None):
This is the instantiation method for setting up new objects of this class.
fix_ASCII_lines(self, endline=''):
Convert each line in an input string into pure ASCII
(This avoids crashes when printing to screen, etc.)
_fix_ASCII_line(self, line):
Convert a single string line into pure ASCII
(This avoids crashes when printing to screen, etc.)
_unescape(self, text):
Removes HTML or XML character references and entities from a text string.
@param text The HTML (or XML) source text.
@return The plain text, as a Unicode string, if necessary.
source: http://effbot.org/zone/re-sub.htm#unescape-html
_fix_ampersand(self, line):
Replaces "&amp;" with "&" in a string; this is otherwise
not caught by the unescape and unicodedata.normalize functions.
class GbifXmlTreeError(Exception): pass
class GbifXmlTree():
GbifXmlTree is a class for holding and processing xmltrees of GBIF records.
__init__(self, xmltree=None):
This is the instantiation method for setting up new objects of this class.
print_xmltree(self):
Prints all the elements & subelements of the xmltree to screen (may require
fix_ASCII to input file to succeed)
print_subelements(self, element):
Takes an element from an XML tree and prints the subelements tag & text, and
the within-tag items (key/value or whatnot)
_element_items_to_dictionary(self, element_items):
If the XML tree element has items encoded in the tag, e.g. key/value or
whatever, this function puts them in a python dictionary and returns
them.
extract_latlongs(self, element):
Create a temporary pseudofile, extract lat longs to it,
return results as string.
Inspired by: http://www.skymind.com/~ocrow/python_string/
(Method 5: Write to a pseudo file)
_extract_latlong_datum(self, element, file_str):
Searches an element in an XML tree for lat/long information, and the
complete name. Searches recursively, if there are subelements.
file_str is a string created by StringIO in extract_latlongs() (i.e., a
temp filestr)
extract_all_matching_elements(self, start_element, el_to_match):
Returns a list of the elements, picking elements by TaxonOccurrence; this
should return a list of elements equal to the number of hits.
_recursive_el_match(self, element, el_to_match, output_list):
Search recursively through xmltree, starting with element, recording all
instances of el_to_match.
find_to_elements_w_ancs(self, el_tag, anc_el_tag):
Burrow into XML to get an element with tag el_tag, return only those
el_tags underneath a particular ancestor element anc_el_tag
xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag,
match_el_list):
Recursively burrows down to find whatever elements with el_tag exist
inside an anc_el_tag.
create_sub_xmltree(self, element):
Create a subset xmltree (to avoid going back to irrelevant parents)
_xml_burrow_up(self, element, anc_el_tag, found_anc):
Burrow up xml to find anc_el_tag
_xml_burrow_up_cousin(element, cousin_el_tag, found_cousin):
Burrow up from element of interest, until a cousin is found with
cousin_el_tag
_return_parent_in_xmltree(self, child_to_search_for):
Search through an xmltree to get the parent of child_to_search_for
_return_parent_in_element(self, potential_parent, child_to_search_for,
returned_parent):
Search through an XML element to return parent of child_to_search_for
find_1st_matching_element(self, element, el_tag, return_element):
Burrow down into the XML tree, retrieve the first element with the
matching tag
extract_numhits(self, element):
Search an element of a parsed XML string and find the
number of hits, if it exists. Recursively searches,
if there are subelements.
class GbifSearchResultsError(Exception): pass
class GbifSearchResults():
GbifSearchResults is a class for holding a series of
GbifObservationRecord records, and processing them e.g. into classified
areas.
__init__(self, gbif_recs_xmltree=None):
This is the instantiation method for setting up new objects of this class.
print_records(self):
Print all records in tab-delimited format to screen.
print_records_to_file(self, fn):
Print the attributes of a record to a file with filename fn
latlongs_to_obj(self):
Takes the string from extract_latlongs, puts each line into a
GbifObservationRecord object.
Return a list of the objects
Functions devoted to accessing/downloading GBIF records
access_gbif(self, url, params):
Helper function to access various GBIF services
choose the URL ("url") from here:
http://data.gbif.org/ws/rest/occurrence
params are a dictionary of key/value pairs
"self._open" is from Bio.Entrez.self._open, online here:
http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#self._open
Get the handle of results (a file-like object;
open it with results_handle.read() )
_get_hits(self, params):
Get the actual hits that will be returned by a given search
(this allows parsing & gradual downloading of searches larger
than e.g. 1000 records)
It will return the LAST non-None instance (in a standard search result
there should be only one, anyway).
get_xml_hits(self, params):
Returns hits like _get_hits, but returns a parsed XML tree.
get_record(self, key):
Given the key, get a single record, return xmltree for it.
get_numhits(self, params):
Get the number of hits that will be returned by a given search
(this allows parsing & gradual downloading of searches larger
than e.g. 1000 records)
It will return the LAST non-None instance (in a standard search result
there should be only one, anyway).
xmlstring_to_xmltree(self, xmlstring):
Take the text string returned by GBIF and parse to an XML tree using
ElementTree.
Requires the intermediate step of saving to a temporary file (required to
make ElementTree.parse work, apparently)
tempfn = 'tempxml.xml'
fh = open(tempfn, 'w')
fh.write(xmlstring)
fh.close()
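As an aside, the temporary-file step described here can likely be dropped: ElementTree can parse a string directly with fromstring(), or parse() will accept a StringIO pseudo-file. A sketch, with a made-up XML string standing in for a GBIF response:

```python
# Parsing an XML string without writing a temporary file.
from xml.etree import ElementTree
try:
    from StringIO import StringIO   # Python 2, current at the time
except ImportError:
    from io import StringIO         # Python 3

xmlstring = "<gbifResponse><summary totalMatched='2'/></gbifResponse>"

root = ElementTree.fromstring(xmlstring)       # returns the root Element
tree = ElementTree.parse(StringIO(xmlstring))  # parse() via a pseudo-file
assert root.tag == tree.getroot().tag == "gbifResponse"
```

Either form avoids the disk round-trip and the leftover tempxml.xml file.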
get_all_records_by_increment(self, params, inc):
Download all of the records in stages, store in list of elements.
Increments of e.g. 100 to not overload server
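The incremental-download logic described above can be sketched generically as follows; fetch_page is a caller-supplied stand-in for the real GBIF request, not an actual function in the module:

```python
def fetch_in_increments(numhits, inc, fetch_page):
    """Download numhits records in pages of at most inc, so that no
    single request overloads the server.  fetch_page(start, count) is
    a hypothetical callback performing one actual download."""
    records = []
    for start in range(0, numhits, inc):
        count = min(inc, numhits - start)       # last page may be short
        records.extend(fetch_page(start, count))
    return records

# Fake page fetcher returning record indices, just to show the paging:
records = fetch_in_increments(250, 100, lambda s, c: list(range(s, s + c)))
```

With numhits=250 and inc=100 this issues three requests (100, 100, 50 records) and concatenates the results in order.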
extract_occurrences_from_gbif_xmltree_list(self, gbif_xmltree):
Extract all of the 'TaxonOccurrence' elements to a list, store them in a
GbifObservationRecord.
_paramsdict_to_string(self, params):
Converts the python dictionary of search parameters into a text
string for submission to GBIF
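For the dictionary-to-query-string conversion, the standard library's urlencode already does the heavy lifting (Bio.Entrez uses it for the same purpose); a sketch with made-up parameter names:

```python
# Converting a search-parameter dict to a CGI query string via the stdlib.
try:
    from urllib import urlencode         # Python 2, current in 2009
except ImportError:
    from urllib.parse import urlencode   # Python 3

params = {"scientificname": "Puma concolor", "maxresults": 10}
# Sorting the items gives a stable, reproducible ordering:
query = urlencode(sorted(params.items()))
```

urlencode handles the escaping (spaces become "+", etc.), so no hand-rolled quoting is needed.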
_open(self, cgi, params={}):
Function for accessing online databases.
Modified from:
http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html
Helper function to build the URL and open a handle to it (PRIVATE).
Open a handle to GBIF. cgi is the URL for the cgi script to access.
params is a dictionary with the options to pass to it. Does some
simple error checking, and will raise an IOError if it encounters one.
This function also enforces the "three second rule" to avoid abusing
the GBIF servers (modified after NCBI requirement).
============
>
> Thanks again. Hope this helps,
> Brad
Very much, thanks!!
Nick
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================
From matzke at berkeley.edu Mon Aug 10 20:25:10 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 10 Aug 2009 13:25:10 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <4A8081B3.2080600@berkeley.edu>
References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com>
<20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
<4A8081B3.2080600@berkeley.edu>
Message-ID: <4A808226.5020302@berkeley.edu>
PS: Evidence of interest in this GBIF functionality already, see fwd
below...
PPS: Commit with updated names and deleted old files here:
http://github.com/nmatzke/biopython/commits/Geography
-------- Original Message --------
Subject: Re: biogeopython
Date: Fri, 07 Aug 2009 16:34:26 -0700
From: Nick Matzke
Reply-To: matzke at berkeley.edu
Organization: Dept. Integ. Biology, UC Berkeley
To: James Pringle
References:
<4A7C6DEE.1000305 at berkeley.edu>
Coolness, let me know how it works for you, feedback appreciated at this
stage. Cheers!
Nick
James Pringle wrote:
> Thanks!
> Jamie
>
> On Fri, Aug 7, 2009 at 2:09 PM, Nick Matzke > wrote:
>
> Hi Jamie!
>
> It's still under development, eventually it will be a biopython
> module, but what I've got should do exactly what you need.
>
> Just take the files from the most recent commit here:
> http://github.com/nmatzke/biopython/commits/Geography
>
> ...and run test_gbif_xml.py to get the idea, it will search on a
> taxon name, count/download all hits, parse the xml to a set of
> record objects, output each record to screen or tab-delimited file,
> etc.
>
> Cheers!
> Nick
>
>
>
>
>
> James Pringle wrote:
>
> Dear Mr. Matzke--
>
> I am an oceanographer at the University of New Hampshire, and
> with my colleagues John Wares and Jeb Byers am looking at the
> interaction of ocean circulation and species ranges. As part
> of that effort, I am using GBIF data, and was looking at your
> Summer-of-Code project. I want to start from a species name
> and get lat/long of occurance data. Is you toolbox in usable
> shape (I am an ok pythonista)? What is the best way to download
> a tested version of it (I can figure out how to get code from
> CVS/GIT, etc, so I am just looking for a pointer to a stable-ish
> tree)?
>
> Cheers,
> & Thanks
> Jamie Pringle
>
>
Nick Matzke wrote:
> Hi all...updates...
>
> Summary: Major focus is getting the GBIF access/search/parse module into
> "done"/submittable shape. This primarily requires getting the
> documentation and testing up to biopython specs. I have a fair bit of
> documentation and testing, need advice (see below) for specifics on what
> it should look like.
>
>
> Brad Chapman wrote:
>> Hi Nick;
>> Thanks for the update -- great to see things moving along.
>>
>>> - removed any reliance on lagrange tree module, refactored all
>>> phylogeny code to use the revised Bio.Nexus.Tree module
>>
>> Awesome -- glad this worked for you. Are the lagrange_* files in
>> Bio.Geography still necessary? If not, we should remove them from
>> the repository to clean things up.
>
>
> Ah, they had been deleted locally but it took an extra command to delete
> on git. Done.
>
>>
>> More generally, it would be really helpful if we could do a bit of
>> housekeeping on the repository. The Geography namespace has a lot of
>> things in it which belong in different parts of the tree:
>>
>> - The test code should move to the 'Tests' directory as a set of
>> test_Geography* files that we can use for unit testing the code.
>
> OK, I will do this. Should I try and figure out the unittest stuff? I
> could use a simple example of what this is supposed to look like.
>
>
>> - Similarly there are a lot of data files in there which are
>> appear to be test related; these could move to Tests/Geography
>
> Will do.
>
>> - What is happening with the Nodes_v2 and Treesv2 files? They look
>> like duplicates of the Nexus Nodes and Trees with some changes.
>> Could we roll those changes into the main Nexus code to avoid
>> duplication?
>
> Yeah, these were just copies with your bug fix, and with a few mods I
> used to track crashes. Presumably I don't need these with after a fresh
> download of biopython.
>
>
>
>>> - Code dealing with GBIF xml output completely refactored into the
>>> following classes:
>>>
>>> * ObsRecs (observation records & search results/summary)
>>> * ObsRec (an individual observation record)
>>> * XmlString (functions for cleaning xml returned by Gbif)
>>> * GbifXml (extention of capabilities for ElementTree xml trees,
>>> parsed from GBIF xml returns.
>>
>> I'm agreed with Hilmar -- the user classes would probably benefit from
>> expanded
>> naming. There is a art to naming to get them somewhere between the
>> hideous RidicuouslyLongNamesWithEverythingSpecified names and short
>> truncated names.
>> Specifically, you've got a lot of filler in the names -- dbfUtils,
>> geogUtils, shpUtils. The Utils probably doesn't tell the user much
>> and makes all of the names sort of blend together, just as the
>> Rec/Recs pluralization hides a quite large difference in what the
>> classes hold.
>
> Will work on this, these should be made part of the
> GbifObservationRecord() object or be accessed by it, basically they only
> exist to classify lat/long points into user-specified areas.
>
>> Something like Observation and ObservationSearchResult would make it
>> clear immediately what they do and the information they hold.
>
>
> Agreed, here is a new scheme for the names (changes already made):
>
> =============
> class GbifSearchResults():
>
> GbifSearchResults is a class for holding a series of
> GbifObservationRecord records, and processing them e.g. into classified
> areas.
>
> Also can hold a GbifDarwincoreXmlString record (the raw output returned
> from a GBIF search) and a GbifXmlTree (a class for holding/processing
> the ElementTree object returned by parsing the GbifDarwincoreXmlString).
>
>
>
> class GbifObservationRecord():
>
> GbifObservationRecord is a class for holding an individual observation
> at an individual lat/long point.
>
>
>
> class GbifDarwincoreXmlString(str):
>
> GbifDarwincoreXmlString is a class for holding the xmlstring returned by
> a GBIF search, & processing it to plain text, then an xmltree (an
> ElementTree).
>
> GbifDarwincoreXmlString inherits string methods from str (class String).
>
>
>
> class GbifXmlTree():
> gbifxml is a class for holding and processing xmltrees of GBIF records.
> =============
>
> ...description of methods below...
>
>
>>
>>> This week:
>>
>> What are your thoughts on documentation? As a naive user of these
>> tools without much experience with the formats, I could offer better
>> feedback if I had an idea of the public APIs and how they are
>> expected to be used. Moreover, cookbook and API documentation is
>> something we will definitely need to integrate into Biopython. How
>> does this fit in your timeline for the remaining weeks?
>
> The API is really just the interface with GBIF. I think developing a
> cookbook entry is pretty easy, I assume you want something like one of
> the entries in the official biopython cookbook?
>
> Re: API documentation...are you just talking about the function
> descriptions that are typically in """ """ strings beneath the function
> definitions? I've got that done. Again, if there is more, an example
> of what it should look like would be useful.
>
> Documentation for the GBIF stuff below.
>
> ============
> gbif_xml.py
> Functions for accessing GBIF, downloading records, processing them into
> a class, and extracting information from the xmltree in that class.
>
>
> class GbifObservationRecord(Exception): pass
> class GbifObservationRecord():
> GbifObservationRecord is a class for holding an individual observation
> at an individual lat/long point.
>
>
> __init__(self):
>
> This is an instantiation class for setting up new objects of this class.
>
>
>
> latlong_to_obj(self, line):
>
> Read in a string, read species/lat/long to GbifObservationRecord object
> This can be slow, e.g. 10 seconds for even just ~1000 records.
>
>
> parse_occurrence_element(self, element):
>
> Parse a TaxonOccurrence element, store in OccurrenceRecord
>
>
> fill_occ_attribute(self, element, el_tag, format='str'):
>
> Return the text found in matching element matching_el.text.
>
>
>
> find_1st_matching_subelement(self, element, el_tag, return_element):
>
> Burrow down into the XML tree, retrieve the first element with the
> matching tag.
>
>
> record_to_string(self):
>
> Print the attributes of a record to a string
>
>
>
>
>
>
>
> class GbifDarwincoreXmlString(Exception): pass
>
> class GbifDarwincoreXmlString(str):
> GbifDarwincoreXmlString is a class for holding the xmlstring returned by
> a GBIF search, & processing it to plain text, then an xmltree (an
> ElementTree).
>
> GbifDarwincoreXmlString inherits string methods from str (class String).
>
>
>
> __init__(self, rawstring=None):
>
> This is an instantiation class for setting up new objects of this class.
>
>
>
> fix_ASCII_lines(self, endline=''):
>
> Convert each line in an input string into pure ASCII
> (This avoids crashes when printing to screen, etc.)
>
>
> _fix_ASCII_line(self, line):
>
> Convert a single string line into pure ASCII
> (This avoids crashes when printing to screen, etc.)
>
>
> _unescape(self, text):
>
> #
> Removes HTML or XML character references and entities from a text string.
>
> @param text The HTML (or XML) source text.
> @return The plain text, as a Unicode string, if necessary.
> source: http://effbot.org/zone/re-sub.htm#unescape-html
>
>
> _fix_ampersand(self, line):
>
> Replaces "&" with "&" in a string; this is otherwise
> not caught by the unescape and unicodedata.normalize functions.
>
>
>
>
>
>
>
> class GbifXmlTreeError(Exception): pass
> class GbifXmlTree():
> gbifxml is a class for holding and processing xmltrees of GBIF records.
>
> __init__(self, xmltree=None):
>
> This is an instantiation class for setting up new objects of this class.
>
>
> print_xmltree(self):
>
> Prints all the elements & subelements of the xmltree to screen (may require
> fix_ASCII to input file to succeed)
>
>
> print_subelements(self, element):
>
> Takes an element from an XML tree and prints the subelements tag & text,
> and
> the within-tag items (key/value or whatnot)
>
>
> _element_items_to_dictionary(self, element_items):
>
> If the XML tree element has items encoded in the tag, e.g. key/value or
> whatever, this function puts them in a python dictionary and returns
> them.
>
>
> extract_latlongs(self, element):
>
> Create a temporary pseudofile, extract lat longs to it,
> return results as string.
>
> Inspired by: http://www.skymind.com/~ocrow/python_string/
> (Method 5: Write to a pseudo file)
>
>
>
>
> _extract_latlong_datum(self, element, file_str):
>
> Searches an element in an XML tree for lat/long information, and the
> complete name. Searches recursively, if there are subelements.
>
> file_str is a string created by StringIO in extract_latlongs() (i.e., a
> temp filestr)
>
>
>
> extract_all_matching_elements(self, start_element, el_to_match):
>
> Returns a list of the elements, picking elements by TaxonOccurrence;
> this should
> return a list of elements equal to the number of hits.
>
>
>
> _recursive_el_match(self, element, el_to_match, output_list):
>
> Search recursively through xmltree, starting with element, recording all
> instances of el_to_match.
>
>
> find_to_elements_w_ancs(self, el_tag, anc_el_tag):
>
> Burrow into XML to get an element with tag el_tag, return only those
> el_tags underneath a particular parent element parent_el_tag
>
>
> xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag,
> match_el_list):
>
> Recursively burrows down to find whatever elements with el_tag exist
> inside a parent_el_tag.
>
>
>
> create_sub_xmltree(self, element):
>
> Create a subset xmltree (to avoid going back to irrelevant parents)
>
>
>
> _xml_burrow_up(self, element, anc_el_tag, found_anc):
>
> Burrow up xml to find anc_el_tag
>
>
>
> _xml_burrow_up_cousin(element, cousin_el_tag, found_cousin):
>
> Burrow up from element of interest, until a cousin is found with
> cousin_el_tag
>
>
>
>
> _return_parent_in_xmltree(self, child_to_search_for):
>
> Search through an xmltree to get the parent of child_to_search_for
>
>
>
> _return_parent_in_element(self, potential_parent, child_to_search_for,
> returned_parent):
>
> Search through an XML element to return parent of child_to_search_for
>
>
> find_1st_matching_element(self, element, el_tag, return_element):
>
> Burrow down into the XML tree, retrieve the first element with the
> matching tag
>
>
>
>
> extract_numhits(self, element):
>
> Search an element of a parsed XML string and find the
> number of hits, if it exists. Recursively searches,
> if there are subelements.
>
>
>
>
>
>
>
>
>
>
>
>
> class GbifSearchResults(Exception): pass
>
> class GbifSearchResults():
>
> GbifSearchResults is a class for holding a series of
> GbifObservationRecord records, and processing them e.g. into classified
> areas.
>
>
>
> __init__(self, gbif_recs_xmltree=None):
>
> This is an instantiation class for setting up new objects of this class.
>
>
>
> print_records(self):
>
> Print all records in tab-delimited format to screen.
>
>
>
>
> print_records_to_file(self, fn):
>
> Print the attributes of a record to a file with filename fn
>
>
>
> latlongs_to_obj(self):
>
> Takes the string from extract_latlongs, puts each line into a
> GbifObservationRecord object.
>
> Return a list of the objects
>
>
> Functions devoted to accessing/downloading GBIF records
> access_gbif(self, url, params):
>
> Helper function to access various GBIF services
>
> choose the URL ("url") from here:
> http://data.gbif.org/ws/rest/occurrence
>
> params are a dictionary of key/value pairs
>
> "self._open" is from Bio.Entrez.self._open, online here:
> http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#self._open
>
> Get the handle of results
> (looks like e.g.: object at 0x48117f0>> )
>
> (open with results_handle.read() )
>
>
> _get_hits(self, params):
>
> Get the actual hits that are be returned by a given search
> (this allows parsing & gradual downloading of searches larger
> than e.g. 1000 records)
>
> It will return the LAST non-none instance (in a standard search result
> there
> should be only one, anyway).
>
>
>
>
> get_xml_hits(self, params):
>
> Returns hits like _get_hits, but returns a parsed XML tree.
>
>
>
>
> get_record(self, key):
>
> Given the key, get a single record, return xmltree for it.
>
>
>
> get_numhits(self, params):
>
> Get the number of hits that will be returned by a given search
> (this allows parsing & gradual downloading of searches larger
> than e.g. 1000 records)
>
> It will return the LAST non-none instance (in a standard search result
> there
> should be only one, anyway).
>
>
> xmlstring_to_xmltree(self, xmlstring):
>
> Take the text string returned by GBIF and parse to an XML tree using
> ElementTree.
> Requires the intermediate step of saving to a temporary file (required
> to make
> ElementTree.parse work, apparently)
>
>
>
> tempfn = 'tempxml.xml'
> fh = open(tempfn, 'w')
> fh.write(xmlstring)
> fh.close()
>
>
>
>
>
> get_all_records_by_increment(self, params, inc):
>
> Download all of the records in stages, store in list of elements.
> Increments of e.g. 100 so as not to overload the server
>
>
>
> extract_occurrences_from_gbif_xmltree_list(self, gbif_xmltree):
>
> Extract all of the 'TaxonOccurrence' elements to a list, store them in a
> GbifObservationRecord.
>
>
>
> _paramsdict_to_string(self, params):
>
> Converts the python dictionary of search parameters into a text
> string for submission to GBIF
>
>
>
> _open(self, cgi, params={}):
>
> Function for accessing online databases.
>
> Modified from:
> http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html
>
> Helper function to build the URL and open a handle to it (PRIVATE).
>
> Open a handle to GBIF. cgi is the URL for the cgi script to access.
> params is a dictionary with the options to pass to it. Does some
> simple error checking, and will raise an IOError if it encounters one.
>
> This function also enforces the "three second rule" to avoid abusing
> the GBIF servers (modified after NCBI requirement).
> ============
>
>
>>
>> Thanks again. Hope this helps,
>> Brad
>
> Very much, thanks!!
> Nick
>
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================
From biopython at maubp.freeserve.co.uk Mon Aug 10 20:49:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Aug 2009 21:49:29 +0100
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <4A8081B3.2080600@berkeley.edu>
References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
<4A8081B3.2080600@berkeley.edu>
Message-ID: <320fb6e00908101349o46162d5n4b91819895c32f8f@mail.gmail.com>
On Mon, Aug 10, 2009 at 9:23 PM, Nick Matzke wrote:
> Hi all...updates...
>
> Summary: Major focus is getting the GBIF access/search/parse module into
> "done"/submittable shape. This primarily requires getting the documentation
> and testing up to biopython specs. I have a fair bit of documentation and
> testing, need advice (see below) for specifics on what it should look like.
>
>> - The test code should move to the 'Tests' directory as a set of
>> test_Geography* files that we can use for unit testing the code.
>
> OK, I will do this. Should I try and figure out the unittest stuff? I
> could use a simple example of what this is supposed to look like.
You can either go for "unittest" based tests (generally better, though more
of a learning curve, and useful for any Python project), or our own
Biopython specific "print and compare" tests (basically sample scripts
with their expected output).
Read the tests chapter in the Biopython Tutorial if you haven't already.
(And if you think anything could be clearer, or you spot a typo, let us
know please - feedback would be great).
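A minimal unittest-style test file looks something like this (a hypothetical
sketch only - the class name and the helper logic below are made up for
illustration, not taken from any real test_Geography* file):

```python
import unittest

class GeographyTest(unittest.TestCase):
    """Hypothetical sketch of a unittest-based Biopython test file."""

    def test_params_to_query_string(self):
        # Stand-in for whatever the module under test would do, e.g.
        # turning a params dict into a GBIF query string.
        params = {"format": "darwin", "maxresults": "10"}
        query = "&".join("%s=%s" % kv for kv in sorted(params.items()))
        self.assertEqual(query, "format=darwin&maxresults=10")

if __name__ == "__main__":
    unittest.main()
```

Running the file directly then reports pass/fail per test method, and the
Tests directory runner can pick it up automatically.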
Peter
From matzke at berkeley.edu Mon Aug 10 21:10:26 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 10 Aug 2009 14:10:26 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <320fb6e00908101349o46162d5n4b91819895c32f8f@mail.gmail.com>
References: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
<4A8081B3.2080600@berkeley.edu>
<320fb6e00908101349o46162d5n4b91819895c32f8f@mail.gmail.com>
Message-ID: <4A808CC2.6000308@berkeley.edu>
Peter wrote:
> On Mon, Aug 10, 2009 at 9:23 PM, Nick Matzke wrote:
>> Hi all...updates...
>>
>> Summary: Major focus is getting the GBIF access/search/parse module into
>> "done"/submittable shape. This primarily requires getting the documentation
>> and testing up to biopython specs. I have a fair bit of documentation and
>> testing, need advice (see below) for specifics on what it should look like.
>>
>>> - The test code should move to the 'Tests' directory as a set of
>>> test_Geography* files that we can use for unit testing the code.
>> OK, I will do this. Should I try and figure out the unittest stuff? I
>> could use a simple example of what this is supposed to look like.
>
> You can either go for "unittest" based tests (generally better, though more
> of a learning curve, and useful for any Python project), or our own
> Biopython specific "print and compare" tests (basically sample scripts
> with their expected output).
>
> Read the tests chapter in the Biopython Tutorial if you haven't already.
> (And if you think anything could be clearer, or you spot a typo, let us
> know please - feedback would be great).
Thanks!
Nick
>
> Peter
>
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================
From biopython at maubp.freeserve.co.uk Tue Aug 11 12:19:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Aug 2009 13:19:25 +0100
Subject: [Biopython-dev] Bio.SeqIO.convert function?
In-Reply-To: <320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com>
References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com>
<20090728220943.GJ68751@sobchak.mgh.harvard.edu>
<320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com>
<320fb6e00908080414k792fca87x946ac3b9d59eaa2b@mail.gmail.com>
<320fb6e00908100946r48e96d38h28916d70b8a54683@mail.gmail.com>
Message-ID: <320fb6e00908110519k313d6d34g40502fd2578326e1@mail.gmail.com>
On Mon, Aug 10, 2009 at 5:46 PM, Peter wrote:
> In terms of speed, this new code takes under a minute to
> convert a 7 million short read FASTQ file to another FASTQ
> variant, or to a (line wrapped) FASTA file. In comparison,
> using Bio.SeqIO parse/write takes over five minutes.
If anyone is interested in the details, here I am using a 7 million
entry FASTQ file of short reads (length 36bp) from a Solexa FASTQ
format file (downloaded from the NCBI and then converted from the
Sanger FASTQ format). I'm timing conversion from Solexa to Sanger
FASTQ as it is a more common operation, and I can include the MAQ
script for comparison. I pipe the output via grep and word count as a
check on the conversion.
Using a (patched) version of MAQ's fq_all2std.pl we get about 4 mins:
$ time perl ../biopython/Tests/Quality/fq_all2std.pl sol2std
SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l
7047668
real 3m58.978s
user 4m13.475s
sys 0m3.705s
And using a patched version of EMBOSS 6.1.0 (without the optimisations
Peter Rice has mentioned), we get 3m42s.
$ time seqret -filter -sformat fastq-solexa -osformat fastq-sanger <
SRR001666_1.fastq_solexa | grep "^@SRR" | wc -l
7047668
real 3m41.625s
user 3m56.753s
sys 0m4.091s
Using the latest Biopython in CVS (or the git master branch), with
Bio.SeqIO.parse/write, takes about twice this, 7m11s:
$ time python biopython_solexa2sanger.py < SRR001666_1.fastq_solexa |
grep "^@SRR" | wc -l
7047668
real 7m10.706s
user 7m27.597s
sys 0m3.850s
This is at least a marked improvement over Biopython 1.51b with
Bio.SeqIO.parse/write, which took about 17 minutes! The bad news is
that while the Bio.SeqIO FASTQ read/write in CVS is faster than in
Biopython 1.51b, it is also much less elegant. I think once I've
finished adding test cases (and probably after 1.51 is out) it might
be worthwhile trying to make it more beautiful without sacrificing
too much of the speed gain.
Now to the good news, using my github branch with the convert function
we get a massive reduction to under a minute (52s):
$ time python convert_solexa2sanger.py < SRR001666_1.fastq_solexa |
grep "^@SRR" | wc -l
7047668
real 0m51.618s
user 1m7.735s
sys 0m3.162s
We have a winner! Assuming of course there are no mistakes ;)
In fact, these measurements are a little misleading because I am
including grep (to check the record count) and the output isn't
actually going to disk. Doing the grep on its own takes about 15s:
$ time grep "^@SRR" SRR001666_1.fastq_solexa | wc -l
7047668
real 0m15.318s
user 0m17.890s
sys 0m1.087s
However, if you actually output to a file the disk speed itself
becomes important when the conversion is this fast:
$ time python convert_solexa2sanger.py < SRR001666_1.fastq_solexa > temp.fastq
real 1m3.448s
user 0m49.672s
sys 0m4.826s
$ time seqret -filter -sformat fastq-solexa -osformat fastq-sanger <
SRR001666_1.fastq_solexa > temp.fastq
real 3m55.086s
user 3m39.548s
sys 0m5.998s
$ time perl ../biopython/Tests/Quality/fq_all2std.pl sol2std
SRR001666_1.fastq_solexa > temp.fastq
real 4m10.245s
user 3m54.880s
sys 0m5.085s
$ time python ../biopython/Tests/Quality/biopython_solexa2sanger.py <
SRR001666_1.fastq_solexa > temp.fastq
real 7m27.879s
user 7m9.084s
sys 0m6.008s
Nevertheless, the Bio.SeqIO.convert(...) function still wins for now.
Peter
For those interested, here are the tiny little Biopython scripts I'm using:
# biopython_solexa2sanger.py
#FASTQ conversion using Bio.SeqIO, needs Biopython 1.50 or later.
import sys
from Bio import SeqIO
records = SeqIO.parse(sys.stdin, "fastq-solexa")
SeqIO.write(records, sys.stdout, "fastq")
and:
#convert_solexa2sanger.py
#High performance FASTQ conversion using Bio.SeqIO.convert(...)
#function likely to be in Biopython 1.52 onwards.
import sys
from Bio import SeqIO
SeqIO.convert(sys.stdin, "fastq-solexa", sys.stdout, "fastq")
From chapmanb at 50mail.com Tue Aug 11 13:10:19 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 11 Aug 2009 09:10:19 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography
update/BioPython tree module discussion
In-Reply-To: <4A8081B3.2080600@berkeley.edu>
References: <20090708124841.GX17086@sobchak.mgh.harvard.edu>
<4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu>
<20090714123534.GQ17086@sobchak.mgh.harvard.edu>
<4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu>
<4A78696E.8010808@berkeley.edu>
<20090804222731.GA12604@sobchak.mgh.harvard.edu>
<4A8081B3.2080600@berkeley.edu>
Message-ID: <20090811131019.GW12604@sobchak.mgh.harvard.edu>
Hi Nick;
> Summary: Major focus is getting the GBIF access/search/parse module into
> "done"/submittable shape. This primarily requires getting the
> documentation and testing up to biopython specs. I have a fair bit of
> documentation and testing, need advice (see below) for specifics on what
> it should look like.
Awesome. Thanks for working on the cleanup for this.
> OK, I will do this. Should I try and figure out the unittest stuff? I
> could use a simple example of what this is supposed to look like.
In addition to Peter's pointers, here is a simple example from a
small thing I wrote:
http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py
You can copy/paste the unit test part to get a base, and then
replace the t_* functions with your own real tests.
Simple scripts that generate consistent output are also fine; that's
the print and compare approach.
> > - What is happening with the Nodes_v2 and Treesv2 files? They look
> > like duplicates of the Nexus Nodes and Trees with some changes.
> > Could we roll those changes into the main Nexus code to avoid
> > duplication?
>
> Yeah, these were just copies with your bug fix, and with a few mods I
> used to track crashes. Presumably I don't need these after a fresh
> download of biopython.
Cool. It would be great if we could weed these out as well.
> The API is really just the interface with GBIF. I think developing a
> cookbook entry is pretty easy, I assume you want something like one of
> the entries in the official biopython cookbook?
Yes, that would work great. What I was thinking of are some examples
where you provide background and motivation: Describe some useful
information you want to get from GBIF, and then show how to do it.
This is definitely the most useful part as it gives people working
examples to start with. From there they can usually browse the lower
level docs or code to figure out other specific things.
> Re: API documentation...are you just talking about the function
> descriptions that are typically in """ """ strings beneath the function
> definitions? I've got that done. Again, if there is more, an example
> of what it should look like would be useful.
That looks great for API level docs. You are right on here; for this
week I'd focus on the cookbook examples and cleanup stuff.
My other suggestion would be to rename these to follow Biopython
conventions, something like:
gbif_xml -> GbifXml
shpUtils -> ShapefileUtils
geogUtils -> GeographyUtils
dbfUtils -> DbfUtils
The *Utils might have underscores if they are not intended to be
called directly.
Thanks for all your hard work,
Brad
From chapmanb at 50mail.com Tue Aug 11 13:20:57 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 11 Aug 2009 09:20:57 -0400
Subject: [Biopython-dev] GSoC Weekly Update 12: PhyloXML for Biopython
In-Reply-To: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com>
References: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com>
Message-ID: <20090811132057.GX12604@sobchak.mgh.harvard.edu>
Hi Eric;
All sounds great -- looks like you are in good shape for finishing
things up this week. Really great work.
> - Presumably, any discussion of merging with Biopython will have to wait
> until after the biopython-1.51 release. I'll be around. For GSoC
> requirements, I'm planning on just dumping the Bio.Tree and Bio.TreeIO
> modules along with the unit test suite as standalone files, rather than
> as a patch set since the last upstream revision I pulled was just a
> random untagged one around the time of the last beta release.
We were discussing a release at the end of this week or over the
weekend. I think we should roll this in soon after that so anyone
can get it from the main trunk. I don't see any major issues with
integrating it.
How did you like the Git/GitHub experience? One thing we should push
after this release is moving over to that as the official
repository. Since you have been doing full time Git work this
summer, your experience will be really helpful. I still rely on CVS
as a bit of a crutch, but should learn to do things fully in Git.
Brad
From biopython at maubp.freeserve.co.uk Tue Aug 11 16:13:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Aug 2009 17:13:58 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <320fb6e00908060905i4a326327t504385ec55b0230c@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>
<8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com>
<320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com>
<320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com>
<20090708130649.GY17086@sobchak.mgh.harvard.edu>
<320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com>
<320fb6e00908041229t1a2cc4dawd29ce53ddb0a75eb@mail.gmail.com>
<320fb6e00908060739s7418757cwa179b7955645428a@mail.gmail.com>
<320fb6e00908060848m3d6bff40tf56c765f0e288fb9@mail.gmail.com>
<320fb6e00908060905i4a326327t504385ec55b0230c@mail.gmail.com>
Message-ID: <320fb6e00908110913x6cfe7826xa683a6dc130da26e@mail.gmail.com>
On Thu, Aug 6, 2009 at 5:05 PM, Peter wrote:
> Or we just declare both Bio.Application.generic_run and
> ApplicationResult obsolete, and simply recommend using
> subprocess with str(cline) as before. Would someone like to
> proof read (and test) the tutorial in CVS where I switched all
> the generic_run usage to subprocess?
>
I've just marked Bio.Application.generic_run and ApplicationResult as
obsolete in CVS.
I am content to wait for a consensus about any replacement for
generic_run once more people have tried using subprocess directly.
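For reference, the pattern being recommended looks roughly like this - a
sketch using a stand-in command so it is self-contained; in real use cline
would be one of the Bio.Application wrapper objects and str(cline) the
tool's full command string:

```python
import subprocess
import sys

# Stand-in for str(cline) from a Bio.Application command line wrapper;
# here we just invoke the Python interpreter so the example runs anywhere.
cline = '"%s" -c "print(2 + 2)"' % sys.executable

# shell=True lets us pass the assembled command string as-is, matching
# how str(cline) would be used with a real wrapper object.
child = subprocess.Popen(str(cline), shell=True,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         universal_newlines=True)
stdout, stderr = child.communicate()
return_code = child.returncode
```

The stdout/stderr strings and return code give you everything
ApplicationResult used to capture, without the extra wrapper layer.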
Peter
From biopython at maubp.freeserve.co.uk Tue Aug 11 16:44:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Aug 2009 17:44:11 +0100
Subject: [Biopython-dev] Drafting announcement for Biopython 1.51?
Message-ID: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
Hi David & John,
Would either of you be able to draft a release announcement for
Biopython 1.51? We're aiming for the end of this week... touch wood.
I'm pretty sure the NEWS and DEPRECATED files are up to date (if
anyone can spot any omissions, please let us know), these try and
summarise changes for each release. Unless you have CVS or git
installed, the easiest way to read these files is currently from the
github website:
http://github.com/biopython/biopython/tree/master
Thanks,
Peter
P.S. Don't be afraid to repeat things from the Biopython 1.51 beta announcement:
http://news.open-bio.org/news/2009/06/biopython-151-beta-released/
http://lists.open-bio.org/pipermail/biopython-announce/2009-June/000057.html
From eric.talevich at gmail.com Tue Aug 11 18:50:02 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 11 Aug 2009 14:50:02 -0400
Subject: [Biopython-dev] GSoC Weekly Update 12: PhyloXML for Biopython
In-Reply-To: <20090811132057.GX12604@sobchak.mgh.harvard.edu>
References: <3f6baf360908101044i143bd421w27b89ba881a1ff00@mail.gmail.com>
<20090811132057.GX12604@sobchak.mgh.harvard.edu>
Message-ID: <3f6baf360908111150q495e541bv405b25f0d74127fd@mail.gmail.com>
On Tue, Aug 11, 2009 at 9:20 AM, Brad Chapman wrote:
>
> How did you like the Git/GitHub experience? One thing we should push
> after this release is moving over to that as the official
> repository. Since you have been doing full time Git work this
> summer, your experience will be really helpful. I still rely on CVS
> as a bit of a crutch, but should learn to do things fully in Git.
>
>
I liked it a lot! I've spent some time with Subversion, Bazaar, Mercurial
and Git now, and I'm confident that Git was the right choice for Biopython.
My commit history shows a quick flurry of activity on each of the past few
Fridays -- that's from a couple days of exploration toward the end of the
week, then repeated calls to "git add -i" to pick out the parts that are
worth keeping. I'm careful with git-rebase, but "git commit --amend" gets a
fair amount of use. I could add a section on the Biopython wiki's GitUsage
page, called something like "Managing Commits", giving some examples of
this.
GitHub has been down briefly a few times. It was only a problem because it
happened on Monday mornings, when I wanted to push an updated README to my
public fork at the same time as my weekly update e-mail to this list. Having
a mirror on GitHub is great for getting started with Biopython development,
but I'm still unclear on how changes should propagate back upstream after
Biopython switches from CVS to Git. Pull requests? Core devs pushing to a
central Git repository on OBF servers? Maybe the BioRuby folks have advice;
if this has been settled on biopython-dev, I've missed it.
Anyway.
To create the final patch tarball next Monday for GSoC, I believe the right
incantation looks like this:
git format-patch -o gsoc-phyloxml master...phyloxml
tar czf gsoc-phyloxml.tgz gsoc-phyloxml
That's cleaner than I expected it to be. Neat.
Cheers,
Eric
From winda002 at student.otago.ac.nz Wed Aug 12 05:47:13 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 12 Aug 2009 17:47:13 +1200
Subject: [Biopython-dev] Drafting announcement for Biopython 1.51?
In-Reply-To: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
Message-ID: <4A825761.10106@student.otago.ac.nz>
Peter wrote:
> Hi David & John,
>
> Would either of you be able to draft a release announcement for
> Biopython 1.51? We're aiming for the end of this week... touch wood.
>
We'll definitely aim to have something for the list to check out in the
next 24hrs. I guess the main points are all the Cool New Stuff from the
beta being in a stable release for the first time, FASTQ has been shown
to play nicely across a bunch of projects, and
Application.generic_run() is now on the deprecation path?
On that note, would it be useful to have a cookbook example or even a
blog-post ready to go showing a few of the ways one might use subprocess
to run commands defined with Biopython? I'm happy to put something
together that others can evaluate.
Cheers,
David
From biopython at maubp.freeserve.co.uk Wed Aug 12 09:49:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 Aug 2009 10:49:50 +0100
Subject: [Biopython-dev] Drafting announcement for Biopython 1.51?
In-Reply-To: <4A825761.10106@student.otago.ac.nz>
References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
<4A825761.10106@student.otago.ac.nz>
Message-ID: <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
On Wed, Aug 12, 2009 at 6:47 AM, David
Winter wrote:
>
> Peter wrote:
>>
>> Hi David & John,
>>
>> Would either of you be able to draft a release announcement for
>> Biopython 1.51? We're aiming for the end of this week... touch wood.
>
> We'll definitely aim to have something for the list to check out in the next
> 24hrs. I guess the main points are all the Cool New Stuff from the beta
> being in a stable release for the first time, FASTQ has been shown to play
> nicely across a bunch of projects, and Application.generic_run() is now
> on the deprecation path?
Historically we haven't made a big thing about deprecations in the
release announcements. Maybe we should - in which case also
note that Bio.Fasta has finally been deprecated.
> On that note, would it be useful to have a cookbook example or even a
> blog-post ready to go showing a few of the ways one might use subprocess to
> run commands defined with Biopython? I'm happy to put something together
> that others can evaluate.
The tutorial has several examples at the end of the chapter on
alignments (because lots of the wrappers at the moment are for
alignment tools). I've just updated the copy online to the current
version from CVS (dated 10 August 2009). If you can spot any
errors in the next couple of days we can get them fixed before
the release.
Peter
From biopython at maubp.freeserve.co.uk Wed Aug 12 12:54:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 Aug 2009 13:54:15 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
<21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com>
<320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com>
<21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>
<320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
<21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com>
<15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com>
<320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com>
<15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com>
<320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
Message-ID: <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
On Thu, Jul 23, 2009 at 10:34 AM, Peter wrote:
> On Wed, Jul 22, 2009 at 9:51 PM, James Casbon wrote:
>> I don't think there is much in it really. You have a factored
>> BinaryFile class, I have classes for the components of the SFF file.
>> Both are based around struct.
I have now written a third variant (loosely based on Jose's code).
This is just a single generator function (also based on struct).
Right now it is a slightly long function, but it can be refactored
easily enough. Is also a lot faster than Jose's code which is a
big plus point for large files. See:
http://github.com/peterjc/biopython/tree/sff
I haven't compared my new code against yours for speed yet
James, because your parser didn't like my large SFF file. You
have hard coded it to expect read names of length 14, and
400 flows per read. I have some data from Sanger where the
read names are length 14, but there are 800 flows per read.
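As an aside for anyone following along, the fixed part of the SFF common
header (which is where the flows-per-read value lives) can be unpacked with
struct along these lines - a sketch based on the published 454/Roche SFF
layout, not on any of the three parsers discussed here:

```python
import struct

# Fixed-size portion of the SFF common header (big-endian), following the
# published 454/Roche layout: magic number, version, index offset/length,
# number of reads, header length, key length, flows per read, format code.
SFF_HEADER_FMT = ">I4sQIIHHHB"  # 31 bytes in total

def parse_sff_common_header(data):
    """Unpack the 31-byte fixed portion of an SFF common header."""
    (magic, version, index_offset, index_length, number_of_reads,
     header_length, key_length, flows_per_read,
     flowgram_format) = struct.unpack_from(SFF_HEADER_FMT, data)
    if magic != 0x2E736666:  # b".sff" read as a big-endian uint32
        raise ValueError("Not an SFF file (bad magic number)")
    return {"number_of_reads": number_of_reads,
            "key_length": key_length,
            "flows_per_read": flows_per_read}

# Demo with a synthetic header, so no real SFF file is needed:
raw = struct.pack(SFF_HEADER_FMT, 0x2E736666, b"\x00\x00\x00\x01",
                  0, 0, 7047668, 440, 4, 800, 1)
header = parse_sff_common_header(raw)
```

A parser that reads flows_per_read from this header, rather than hard
coding 400, copes with both the 400-flow and 800-flow files mentioned above.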
Having the two reference parsers to look at was educational,
so thank you both (James and Jose) for sharing your code.
I now understand the SFF file format much better, and am now
confident I could design an indexer to provide dictionary like
access to it - a possible addition to Bio.SeqIO - see this thread:
http://lists.open-bio.org/pipermail/biopython/2009-June/005312.html
> Jose's code uses seek/tell which means it has to have a handle
> to an actual file. He also used binary read mode - I'm not sure if
> this was essential or not.
Binary mode was not essential - opening an SFF file in default
mode also seemed to work fine with Jose's code.
> James' code seems to make a single pass though the file handle,
> without using seek/tell to jump about. I think this is nicer, as it is
> consistent with the other SeqIO parsers, and should work on
> more types of handles (e.g. from gzip, StringIO, or even a
> network connection).
I've also avoided using seek/tell in my rewrite.
> It looks like you (James) construct Seq objects using the full
> untrimmed sequence as is. I was undecided on if trimmed or
> untrimmed should be the default, but the idea of some kind of
> masked or trimmed Seq object had come up on the mailing list
> which might be useful here (and in contig alignments). i.e.
> something which acts like a Seq object giving the trimmed
> sequence, but which also contains the full sequence and trim
> positions.
I'm still thinking about this. One simplistic option (as used on
my branch) would be to have two input formats in Bio.SeqIO,
one untrimmed and one trimmed, e.g. "sff" and "sff-trim".
Peter
From winda002 at student.otago.ac.nz Thu Aug 13 00:32:55 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Thu, 13 Aug 2009 12:32:55 +1200
Subject: [Biopython-dev] Draft announcement for Biopython 1.51
In-Reply-To: <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
<4A825761.10106@student.otago.ac.nz>
<320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
Message-ID: <4A835F37.5040907@student.otago.ac.nz>
Hi all, here is a draft announcement to go out when 1.51 is built and
ready to go. Comments and corrections are very welcome (should we keep
the deprecation paragraph in?)
I've also added a draft post to the OBF blog with this text marked up
with links and ready to go, hopefully that way whoever builds the
release can just ask someone with an account there (Brad and Peter at
least) to push post once everything is ready.
++
We are pleased to announce the release of Biopython 1.51. This new stable
release enhances version 1.50 (released in April) by extending the
functionality of existing modules, adding a set of application wrappers
for popular alignment programs and fixing a number of minor bugs.
In particular, the SeqIO module can now write GenBank files that include
features, and deal with FASTQ files created by Illumina 1.3+. Support for
this format allows interconversion between FASTQ files using Solexa,
Sanger and Illumina quality scores, and has been validated against the
BioPerl and EMBOSS implementations of this format.
Biopython 1.51 is the first stable release to include the
Align.Applications module which allows users to define command line
wrappers for popular alignment programs including ClustalW, Muscle and
T-Coffee.
This new release also spells the beginning of the end for some of
Biopython's older tools. Bio.Fasta and the application tools
ApplicationResult and generic_run() have been marked as deprecated, which
means they can still be imported, but doing so will warn the user that these
functions will be removed in the future. Bio.Fasta has been superseded
by SeqIO's support for the FASTA format, while we now suggest using the
subprocess module from the Python Standard Library to call applications
- use of this module is extensively documented in section 6.3 of the
Biopython Tutorial and Cookbook.
As always, the Tutorial and Cookbook has been updated to document the
other changes made since the last release.
Thank you to everyone who tested our 1.51 beta or submitted bugs since
our last stable release, and to all of our contributors.
Sources and a Windows installer for the new release are available from the
downloads page.
++
From winda002 at student.otago.ac.nz Thu Aug 13 00:37:12 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Thu, 13 Aug 2009 12:37:12 +1200
Subject: [Biopython-dev] Drafting announcement for Biopython 1.51?
In-Reply-To: <320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
<4A825761.10106@student.otago.ac.nz>
<320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
Message-ID: <4A836038.1060609@student.otago.ac.nz>
>> On that note, would it be useful to have a cookbook example or even a
>> blog-post ready to go showing a few of the ways one might use subprocess to
>> run commands defined with Biopython? I'm happy to put something together
>> that others can evaluate.
>>
>
> The tutorial has several examples at the end of the chapter on
> alignments (because lots of the wrappers at the moment are for
> alignment tools). I've just updated the copy online to the current
> version from CVS (dated 10 August 2009). If you can spot any
> errors in the next couple of days we can get them fixed before
> the release.
>
> Peter
>
>
OK, I had only looked at the doc strings (my editor chokes on long
text files and I don't have anything to typeset TeX docs with) so didn't
know that existed. That looks really good (and the feeding output into
handles bit is pretty wizardly!)
Cheers,
David
From biopython at maubp.freeserve.co.uk Thu Aug 13 10:00:49 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Aug 2009 11:00:49 +0100
Subject: [Biopython-dev] Drafting announcement for Biopython 1.51?
In-Reply-To: <4A836038.1060609@student.otago.ac.nz>
References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
<4A825761.10106@student.otago.ac.nz>
<320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
<4A836038.1060609@student.otago.ac.nz>
Message-ID: <320fb6e00908130300x3b4f1eb7m7711b76e0e03fd8a@mail.gmail.com>
On Thu, Aug 13, 2009 at 1:37 AM, David
Winter wrote:
>
> OK, I had only looked at the doc strings (my editor chokes on long
> text files and I don't have anything to typeset TeX docs with) so didn't
> know that existed.
TeX or LaTeX files are just plain text with some magic markup
e.g. \emph{text to emphasise}. Any decent text editor should
be able to load them, and some will even colour code things.
Even if you don't understand the markup, most of the time you
can actually read the raw files directly and understand them.
But yeah, the PDF or HTML output is what most people will
want to look at ;)
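As a small illustration of the point above, a hypothetical fragment of LaTeX source shows that the raw file is ordinary readable text even before it is typeset (the section title and wording here are invented for the example):

```latex
% A raw LaTeX file is plain text: even without knowing the markup,
% the content can be read directly in any text editor.
\section{Running external tools}
Biopython now recommends the \emph{subprocess} module for
calling command line applications from your scripts.
```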
> That looks really good (and the feeding output into handles
> bit is pretty wizardly!)
Yeah - it is pretty cool. Sadly not all command line tools will
accept input via stdin, so this kind of thing isn't always
possible.
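For tools that do accept stdin, the "feeding output into handles" idea can be sketched like this; "tr" (a standard Unix filter that here uppercases its input) stands in for a real tool that reads a sequence record from stdin:

```python
import subprocess

# Sketch of piping data to a tool's stdin instead of writing a
# temporary input file. This only works for tools that read stdin;
# "tr" is used as a stand-in filter that uppercases the text.
proc = subprocess.Popen(
    ["tr", "a-z", "A-Z"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = proc.communicate(">seq1\nacgt\n")  # send a FASTA record on stdin
print(out)
```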
Peter
From biopython at maubp.freeserve.co.uk Thu Aug 13 10:10:44 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Aug 2009 11:10:44 +0100
Subject: [Biopython-dev] Draft announcement for Biopython 1.51
In-Reply-To: <4A835F37.5040907@student.otago.ac.nz>
References: <320fb6e00908110944m5323d781l54db0e1745076ed5@mail.gmail.com>
<4A825761.10106@student.otago.ac.nz>
<320fb6e00908120249l3d23a307te6e36dc620c99f9b@mail.gmail.com>
<4A835F37.5040907@student.otago.ac.nz>
Message-ID: <320fb6e00908130310n8efa09dv81963277e607da52@mail.gmail.com>
Thanks for the first draft David,
On Thu, Aug 13, 2009 at 1:32 AM, David
Winter