From biopython at maubp.freeserve.co.uk Thu Oct 1 05:04:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 1 Oct 2009 10:04:17 +0100
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
Message-ID: <320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>

On Wed, Sep 30, 2009 at 4:27 PM, Peter wrote:
>
> This has meant that generally the current status quo isn't
> a problem (at least for me). However, what prompted me
> to work on this issue was a real world example.
>
> We have a draft genome where after doing a basic
> annotation, it would make sense to flip the strands. I
> want to be able to load our current GenBank file, apply
> the reverse complement, and have all the annotated
> features recalculated to match. With more and more
> sequencing projects, this isn't such an odd thing to
> want to do.
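To make "recalculated to match" concrete, here is a minimal sketch of the coordinate arithmetic for exact (non-fuzzy) locations. It uses plain tuples rather than Biopython objects; the function names and the toy sequence are assumptions made for illustration, not the code on the branch.

```python
# Sketch only: plain (start, end, strand) tuples stand in for feature
# locations, using Python-style half-open coordinates [start, end).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def reverse_complement_location(start, end, strand, seq_length):
    """Map a feature at [start, end) onto the reverse-complemented sequence."""
    if not 0 <= start <= end <= seq_length:
        raise ValueError("location lies outside the sequence")
    # The old end becomes the new start (counted from the other end of the
    # sequence), and the strand flips.
    return seq_length - end, seq_length - start, -strand


seq = "AACCCGTTTT"
feature = (2, 6, +1)  # covers "CCCG" on the forward strand
rc_seq = reverse_complement(seq)
new_start, new_end, new_strand = reverse_complement_location(
    *feature, seq_length=len(seq)
)
# Extracting the flipped feature and reverse complementing it (because it is
# now on the minus strand) gives back the original feature sequence.
recovered = reverse_complement(rc_seq[new_start:new_end])
print(new_start, new_end, new_strand, recovered)
```

Fuzzy locations (such as "<5" or "5..>10") need more care than this, which is presumably why the branch tests them separately.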

The github branch has SeqRecord reverse complement working pretty well (with plenty of tests covering the fuzzy locations), and a first attempt at SeqRecord addition too:
http://github.com/peterjc/biopython/commits/seqrecords

This lets me solve my motivating example like so:

from Bio import SeqIO
# SeqIO.read returns the single record in the file
old_record = SeqIO.read(open("pBAD30.gb"), "gb")
handle = open("pBAD30_rc.gb", "w")
SeqIO.write([old_record.reverse_complement(...)], handle, "gb")
handle.close()

If I wanted to shift the origin, this would be possible by combining SeqRecord slicing and addition:

from Bio import SeqIO
cut = 3765
old_record = SeqIO.read(open("pBAD30.gb"), "gb")
handle = open("pBAD30_rc.gb", "w")
new_record = old_record[cut:] + old_record[:cut]
SeqIO.write([new_record], handle, "gb")
handle.close()

And of course you can do both (which is probably what I will be using for the real task from work that this example is based on):

from Bio import SeqIO
cut = 3765
old_record = SeqIO.read(open("pBAD30.gb"), "gb")
handle = open("pBAD30_rc.gb", "w")
new_record = (old_record[cut:] + old_record[:cut]).reverse_complement(...)
SeqIO.write([new_record], handle, "gb")
handle.close()

The general scheme is nice and simple I think, but the trouble is in the details. For this particular example, it makes sense for all the annotation to be preserved. For the reverse complement this is possible (although currently on my branch, this is not the default - hence the dot dot dot in the example above, where right now this needs to be requested explicitly).

However, currently on SeqRecord slicing we take the cautious approach to the annotation, and the annotation dictionary and dbxrefs list are lost. On reflection, perhaps the more liberal, straightforward approach is more useful: copy all the annotation (and leave it to the user to remove anything that becomes inappropriate).
Then this code would "work":

new_record = old_record[cut:] + old_record[:cut]

Right now, based on the current slicing in the trunk, you have to copy these annotations manually:

new_record = old_record[cut:] + old_record[:cut]
new_record.annotations = old_record.annotations.copy()
new_record.dbxrefs = old_record.dbxrefs[:]

The question is which is preferable? The current slicing makes the user think about their annotation explicitly. The alternative is to blindly copy it, knowing that in some cases it will not be appropriate to the sub-record.

Peter

P.S. For those of you interested in multiple sequence alignments, once SeqRecord addition is dealt with, adding alignments becomes practical, i.e. taking two gene alignments for N species, and then concatenating them as discussed on Bug 2552:
http://bugzilla.open-bio.org/show_bug.cgi?id=2552

From bugzilla-daemon at portal.open-bio.org Tue Oct 6 02:31:44 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 6 Oct 2009 02:31:44 -0400
Subject: [Biopython-dev] [Bug 2924] New: memory leak in cnexus.c
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2924

Summary: memory leak in cnexus.c
Product: Biopython
Version: 1.52
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: jheled at gmail.com

There seems to be a serious leak in cnexus. The python documentation says,

  When memory buffers are passed as parameters to supply data to build objects, as for the `s' and `s#' formats, the required data is copied. Buffers provided by the caller are never referenced by the objects created by `Py_BuildValue()'. In other words, if your code invokes `malloc()' and passes the allocated memory to `Py_BuildValue()', your code is responsible for calling `free()' for that memory once `Py_BuildValue()' returns.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Tue Oct 6 04:18:45 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 6 Oct 2009 04:18:45 -0400
Subject: [Biopython-dev] [Bug 2924] memory leak in cnexus.c
In-Reply-To:
Message-ID: <200910060818.n968Ij2i002732@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2924

------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2009-10-06 04:18 EST -------
I think this really is a bug. The problem is in line 91:

    return Py_BuildValue("s",scanned_start);

This should be something like:

    PyObject* result = Py_BuildValue("s",scanned_start);
    free(scanned);
    return result;

Frank, if you agree, could you make this change?

From bugzilla-daemon at portal.open-bio.org Tue Oct 6 04:33:17 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 6 Oct 2009 04:33:17 -0400
Subject: [Biopython-dev] [Bug 2924] memory leak in cnexus.c
In-Reply-To:
Message-ID: <200910060833.n968XHCf003024@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2924

------- Comment #2 from jheled at gmail.com 2009-10-06 04:33 EST -------
Created an attachment (id=1370)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1370&action=view)
fix the memory leak

bug fix

From bioinformed at gmail.com Tue Oct 6 17:08:38 2009
From: bioinformed at gmail.com (Kevin Jacobs)
Date: Tue, 6 Oct 2009 17:08:38 -0400
Subject: [Biopython-dev] population genetics, SNP data management and more
In-Reply-To: <2e1434c10910061355q1098459crabf2a850a7bcaa1c@mail.gmail.com>
References: <2e1434c10910061355q1098459crabf2a850a7bcaa1c@mail.gmail.com>
Message-ID: <2e1434c10910061408j64576828m8a34579d628e443c@mail.gmail.com>

Hi all,

I'm the primary author of a suite of tools called GLU (Genotype Library & Utilities) that seems to have some features that may be of interest to BioPython developers. It is implemented in Python, uses NumPy, SciPy, PyTables (or h5py) and a few other common Python libraries, has the performance critical portions transcribed in C, and is available as open source under a BSD-like license.

GLU implements a robust set of data management features for large SNP and general polymorphism data (human/mammalian for now, since we only support diploid and haploid genotypes). We regularly use it to manage datasets with 50 billion SNP genotypes (>50k samples & >1M SNPs). We define our own on-disk data representations in text, compressed text, and optimized binary formats, plus support PLINK and about a dozen other common formats. Our native binary storage is based on HDF5 and is quite robust and scalable. As a point of reference, the Phase I-III International HapMap data is ~13 GB in their text format, 1.3 GB with gzip compression, and 472 MB in GLU's HDF5-based LBAT format.

GLU includes modules that compute a range of descriptive statistics on genotype data quality, concordance, Mendelian consistency, relationship testing, consistency with Hardy-Weinberg proportions, and more.
In addition, GLU includes modules to explore population structure, including estimation of admixture coefficients (like STRUCTURE, but with fixed source populations and frequencies) and principal components based on genetic correlations (like EIGENSTRAT and its ilk). GLU also supports high-throughput association testing between dichotomous, polytomous, and continuous (Gaussian) variables and genetic effects (numerous models), covariates, and arbitrary interactions. Also supported is the rapid evaluation of pairwise linkage disequilibrium statistics and an advanced pairwise SNP tagging algorithm.

There are many other features in GLU, though it is not yet feature complete and the documentation is currently a bit of a work in progress. Feel free to take a look at:
http://code.google.com/p/glu-genetics

From the PopGen wiki, it seems that there is a desire to implement some of these features within BioPython. I'm happy to help, contribute code from GLU where applicable, or at minimum share some of my experiences.

Best regards,
-Kevin Jacobs

From bugzilla-daemon at portal.open-bio.org Tue Oct 6 23:57:14 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 6 Oct 2009 23:57:14 -0400
Subject: [Biopython-dev] [Bug 2925] New: false exception in Bio.PDB.NeighborSearch.search
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2925

Summary: false exception in Bio.PDB.NeighborSearch.search
Product: Biopython
Version: 1.52
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: 2lizhenhua at gmail.com

In Bio.PDB.NeighborSearch.search, if there is no atom nearby (within a distance smaller than radius) and the level is not "A", an exception will be thrown. It would be better to return an empty list.

From bugzilla-daemon at portal.open-bio.org Wed Oct 7 05:27:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 7 Oct 2009 05:27:09 -0400
Subject: [Biopython-dev] [Bug 2925] false exception in Bio.PDB.NeighborSearch.search
In-Reply-To:
Message-ID: <200910070927.n979R9Aa003927@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2925

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-07 05:27 EST -------
Could you show us a short script that demonstrates this problem please?

From bartek at rezolwenta.eu.org Mon Oct 12 14:08:26 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Mon, 12 Oct 2009 20:08:26 +0200
Subject: [Biopython-dev] [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences.
In-Reply-To: <8b34ec180909240551v649769f3keeae64f1ef31a633@mail.gmail.com>
References: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com> <8b34ec180909240446u212a6112tde9279eb96f5a70a@mail.gmail.com> <320fb6e00909240509h1efb0a46y222067c77b77aa68@mail.gmail.com> <20090924122730.GL13500@sobchak.mgh.harvard.edu> <8b34ec180909240551v649769f3keeae64f1ef31a633@mail.gmail.com>
Message-ID: <8b34ec180910121108l52ed4ec7l1461f6d3e1c00f62@mail.gmail.com>

Hi all,

On Thu, Sep 24, 2009 at 2:51 PM, Bartek Wilczynski wrote:
> On Thu, Sep 24, 2009 at 2:27 PM, Brad Chapman wrote:
>>
>> A separate news post mentioning the C option speed and showing usage
>> examples from both is a great idea. Responsiveness to new methods is
>> the fun part of science.
>>
> I'll try to write that up and send it to the list.

This took me, unfortunately, longer than I thought it would... The reasons are partially unrelated (finishing a paper) and partially related to the matter. To put it short, my original plan was to include a wrapper for MOODS as a patch to biopython (if it is in the system -> use it) and include that information in this blog post.

However, as I performed more tests of MOODS, I found out that it might not be such a great idea. While the C module written for biopython by Michiel is working like a breeze, the MOODS package is a bit more moody... I needed to tweak the makefile to compile it on my mac, but it was working (most of the time) afterwards. Then I wanted to try on my linux box, where it compiled with no problems, but it was giving me segfaults on my scripts which ran fine on a mac (it did run the simple examples though...).

In addition to that, I found that the performance of MOODS was not always better than that of the brute force algorithm, which is already in Biopython. At the same time, Michiel's code is far easier to maintain than the complex code they have.

In conclusion, I don't think it is worth putting too much effort into the integration now. I will try to contact the MOODS team about the issues I encountered and see whether they are interested in getting it integrated into biopython. If so, I can try to help with this, but it might be that we will just provide a function for getting a properly formatted log-odds matrix from a Biopython motif for usage in MOODS. After all, I think that not that many applications require the performance gains of MOODS over our current implementation.

For the purpose of the blog post I've written a short script:
http://github.com/barwil/biopython/blob/df0dfa8feeb15ce50d027d1492913f2d8920c9b3/Tests/Motif/moods_motif_benchmark.py
which can be run assuming you have Biopython 1.51 or later and the MOODS Python bindings installed.
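For readers unfamiliar with the "brute force algorithm" mentioned above, the following is a minimal pure-Python sketch of a sliding-window log-odds scan. It is an illustration under stated assumptions (uniform background of 0.25, a pseudocount of 1, made-up counts), not the Bio.Motif C implementation and not the MOODS algorithm; the function names are hypothetical.

```python
import math


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2 odds against a flat background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append(
            {
                base: math.log(
                    (column.get(base, 0) + pseudocount) / total / background, 2
                )
                for base in "ACGT"
            }
        )
    return matrix


def scan(sequence, matrix, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    hits = []
    width = len(matrix)
    for pos in range(len(sequence) - width + 1):
        score = 0.0
        for offset, column in enumerate(matrix):
            base = sequence[pos + offset]
            if base not in column:
                break  # skip windows containing ambiguous bases such as N
            score += column[base]
        else:
            if score >= threshold:
                hits.append((pos, score))
    return hits


# Made-up counts: eight aligned sites that all read TATA.
counts = [{"T": 8}, {"A": 8}, {"T": 8}, {"A": 8}]
matrix = log_odds_matrix(counts)
hits = scan("GGTATACC", matrix, threshold=5.0)
print(hits)
```

The cost is proportional to sequence length times motif width, which is why a plain C loop over this logic is already fast; cleverer algorithms try to avoid scoring most windows in full.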
The script uses two different motifs showing possible behavior of Bio.Motif._pwm and MOODS. The output from my machine is as follows:

reading the sequence took 0.603 seconds
First motif: SRF
MOODS calculation took 0.768 seconds on average
Bio.Motif fast calculation took 2.407 seconds on average
Second motif: Broad complex II
MOODS calculation took 5.72 seconds on average
Bio.Motif fast calculation took 2.687 seconds on average

The averages are calculated from 10 runs, and they do not change substantially across different executions.

I've made a biopython branch including this script and the additional function in Bio.Motif (for extracting log-odds in MOODS compatible format).

I've also drafted a blog post, but I would greatly appreciate any help from people who are more skilled in writing. Here's how it goes:

"""
In a recent article, Janne Korhonen et al. (Bioinformatics, 2009) introduce a new fast software library for finding motif occurrences in DNA sequences. They also compare the performance of their tool with currently available solutions from Bioperl and Biopython. Unfortunately, biopython is the only tool in the comparison whose performance is measured based on a solution written in an interpreted language, while both MOODS and bioperl are written in compiled languages (C++ and C, respectively). This, not surprisingly, shows biopython as by far the slowest of the three. Since the authors made their comparisons, however, we have moved on, and thanks to the C code contributed by Michiel de Hoon and included in the 1.51 release, Biopython's motif finding library improved greatly and is performing comparably to the MOODS package. The results of a quick benchmark script (http://github.com/barwil/biopython/blob/df0dfa8feeb15ce50d027d1492913f2d8920c9b3/Tests/Motif/moods_motif_benchmark.py) indicate that a simple algorithm implemented in C is able to scan a whole chromosome (>23Mb) in less than 3s for a typical DNA motif.
Depending on the motif, the advanced linear algorithm from the MOODS package can decrease (or in some cases even increase) this running time by a few seconds.
"""

It sounds quite dull to me, so I would greatly appreciate ideas on improving the text and making it less formal and boring...

Cheers
Bartek

From bugzilla-daemon at portal.open-bio.org Tue Oct 13 02:58:37 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 13 Oct 2009 02:58:37 -0400
Subject: [Biopython-dev] [Bug 2927] New: Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2927

Summary: Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser
Product: Biopython
Version: 1.52
Platform: Macintosh
OS/Version: Mac OS
Status: NEW
Severity: blocker
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: ibdeno at gmail.com

This is a problem with NCBIStandalone.PSIBlastParser, which I need to use instead of NCBIXML since the latter lacks some record properties that I need. My code used to work until recently (say three months ago) and now it seems something has changed in the latest biopython (1.52-1, I install it on an intel OSX 10.5.8 via fink). I get the same problem irrespective of whether I use python 2.5 or 2.6, and also the same for blastpgp 2.2.18 and 2.2.22.

Here follows the relevant part of the code:

####
blast_out, error_info = NCBIStandalone.blastpgp(
    blastcmd='/usr/local/blast-2.2.18/bin/blastpgp',
    database='/opt/BlastDBs/' + db,
    infile=file,
    npasses=passes,
    program='blastpgp',
    descriptions='500',
    alignments='1000',
    align_view='0',
    matrix_outfile=outbase + '.' + db + '.' + str(passes) + '.pssm')
b_parser = NCBIStandalone.PSIBlastParser()
b_record = b_parser.parse(blast_out)
####

And this is the error that I now get:

####
  File "/Users/mol/bin/lpbl.py", line 64, in doblast
    b_record = b_parser.parse(blast_out)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 777, in parse
    self._scanner.feed(handle, self._consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 97, in feed
    self._scan_rounds(uhandle, consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 234, in _scan_rounds
    self._scan_alignments(uhandle, consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 376, in _scan_alignments
    self._scan_pairwise_alignments(uhandle, consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 386, in _scan_pairwise_alignments
    self._scan_one_pairwise_alignment(uhandle, consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 398, in _scan_one_pairwise_alignment
    self._scan_hsp(uhandle, consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 433, in _scan_hsp
    self._scan_hsp_alignment(uhandle, consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 464, in _scan_hsp_alignment
    read_and_call(uhandle, consumer.query, start='Query')
  File "/sw/lib/python2.6/site-packages/Bio/ParserSupport.py", line 303, in read_and_call
    method(line)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 1138, in query
    raise ValueError("I could not find the query in line\n%s" % line)
ValueError: I could not find the query in line
Query: 0 -
####

Now, the interesting thing is that if I run blastpgp directly and capture the output to a file, this file never includes such a line as:

Query: 0 -

Actually, if I modify my code so it reads this output file, the PSIBlastParser processes it without error.

Not sure if this is relevant, but I have found that something may have changed in NCBIStandalone recently, namely, this bit:

    _query_re = re.compile(r"Query(:?) \s*(\d+)\s*(.+) (\d+)")
    def query(self, line):
        m = self._query_re.search(line)
        if m is None:
            raise ValueError("I could not find the query in line\n%s" % line)

I will post log files in plain text and xml after submitting this bug report.

From bugzilla-daemon at portal.open-bio.org Tue Oct 13 05:20:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 13 Oct 2009 05:20:29 -0400
Subject: [Biopython-dev] [Bug 2927] Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser
In-Reply-To:
Message-ID: <200910130920.n9D9KTVH018797@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2927

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-13 05:20 EST -------
Miguel has sent me his example text and XML output files directly by email (Bugzilla said they were too big).

From bugzilla-daemon at portal.open-bio.org Tue Oct 13 06:59:11 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 13 Oct 2009 06:59:11 -0400
Subject: [Biopython-dev] [Bug 2927] Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser
In-Reply-To:
Message-ID: <200910131059.n9DAxBR4022148@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2927

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-13 06:59 EST -------
I have tried parsing your sample output, and it seems fine:

from Bio.Blast.NCBIStandalone import PSIBlastParser
b_parser = PSIBlastParser()
handle = open("Q3V4Q0.psiblast.txt")
b_record = b_parser.parse(handle)
handle.close()
for b_round in b_record.rounds :
    print "Round %i has %i alignments" \
          % (b_round.number, len(b_round.alignments))

Gives:

Round 1 has 385 alignments
Round 2 has 1000 alignments
Round 3 has 1000 alignments
Round 4 has 1000 alignments
Round 5 has 1000 alignments

I don't think the real problem is in the parser, but I will reply with more details on the mailing list:
http://lists.open-bio.org/pipermail/biopython/2009-October/005660.html

Peter

From bugzilla-daemon at portal.open-bio.org Wed Oct 14 08:41:23 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 14 Oct 2009 08:41:23 -0400
Subject: [Biopython-dev] [Bug 2927] Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser
In-Reply-To:
Message-ID: <200910141241.n9ECfNBP029019@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2927

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-14 08:41 EST -------
As discussed on the mailing list, something about how blastpgp was being called via subprocess could lead to strange unparseable output. There may be some subtle issue with Bio.Blast.NCBIStandalone.blastpgp here, but in the long term that function will be phased out anyway.

Miguel is now using Bio.Blast.Applications and subprocess to get BLAST to record its output directly to a file, and the problem has gone away. I'm closing this bug by marking it as "WORKSFORME" rather than "FIXED".
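The workflow described in Comment #3 (have the tool write its own output file, then parse that file) can be sketched with the standard library alone. This is not Miguel's script and does not use Bio.Blast.Applications; the child process below is a stand-in for the real search tool, and `run_and_collect` is a hypothetical helper name.

```python
import os
import subprocess
import sys
import tempfile


def run_and_collect(command, out_path):
    """Run a tool that writes its own report to out_path, then read it back.

    No pipe is attached to the child process; the parent only opens the
    finished file after the child has exited successfully.
    """
    subprocess.check_call(command)
    with open(out_path) as handle:
        return handle.read()


# Stand-in for a real search tool (assumption): a child Python process that
# writes a one-line placeholder report to the path given as its argument.
child_code = "import sys; open(sys.argv[1], 'w').write('placeholder report\\n')"
out_path = os.path.join(tempfile.mkdtemp(), "search_output.txt")
report = run_and_collect([sys.executable, "-c", child_code, out_path], out_path)
print(report.strip())
```

With a real tool, the command list would carry the tool's own output-file option, and the file handle would be passed to the parser instead of being read into a string.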

From bugzilla-daemon at portal.open-bio.org Thu Oct 15 11:17:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 15 Oct 2009 11:17:29 -0400
Subject: [Biopython-dev] [Bug 2927] Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser
In-Reply-To:
Message-ID: <200910151517.n9FFHTGn004094@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2927

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|WORKSFORME                  |

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-15 11:17 EST -------
After further discussion on the mailing list, it is still not clear what triggers these "Query: 0" lines, but it affects multiple versions of blastpgp and can apparently be seen at the command line (which means it isn't Biopython's fault in any way).

We should update Bio.Blast.NCBIStandalone.PSIBlastParser to cope (probably by ignoring the "Query: 0" lines).

Peter
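As a rough illustration of the "ignore the Query: 0 lines" idea, the sketch below uses the regular expression quoted in the original report and a second, hypothetical pattern that recognises the empty lines so they can be filtered out. It is not the change that went into Bio.Blast.NCBIStandalone; the helper name, the second pattern and the sample lines are assumptions, and whether neighbouring alignment lines would also need skipping is not established in this thread.

```python
import re

# Pattern quoted in the original report for this bug.
QUERY_RE = re.compile(r"Query(:?) \s*(\d+)\s*(.+) (\d+)")

# Hypothetical pattern for the problem lines: position 0, a lone dash,
# and no end coordinate.
EMPTY_QUERY_RE = re.compile(r"^Query:?\s+0\s+-\s*$")


def drop_empty_query_lines(lines):
    """Return the lines with 'Query: 0 -' style entries removed."""
    return [line for line in lines if not EMPTY_QUERY_RE.match(line)]


sample = [
    "Query: 1   MKTAYIAKQR 10",
    "Query: 0   -",
    "Sbjct: 5   MKTAYIAKQR 14",
]

# The original pattern needs a trailing end coordinate, so it finds nothing
# in the problem line; this is what triggers the ValueError in the report.
assert QUERY_RE.search(sample[1]) is None

kept = drop_empty_query_lines(sample)
match = QUERY_RE.search(kept[0])
print(kept)
print(match.group(2), match.group(3).strip(), match.group(4))
```

Filtering before the line reaches the consumer keeps the existing pattern unchanged, at the cost of silently discarding whatever BLAST meant by those lines.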

From bugzilla-daemon at portal.open-bio.org Thu Oct 15 12:19:38 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 15 Oct 2009 12:19:38 -0400
Subject: [Biopython-dev] [Bug 2929] New: NCBIXML PSI-Blast parser should gather all information from XML blastgpg output
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2929

Summary: NCBIXML PSI-Blast parser should gather all information from XML blastgpg output
Product: Biopython
Version: 1.52
Platform: Macintosh
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: ibdeno at gmail.com

With the problems encountered while parsing plain text output from blastpgp, perhaps an answer would be to use the XML output of this program. The XML output seems to have evolved in recent versions of blastpgp and now all the info gets in a single proper XML file (not several concatenated files) and, in principle, it would seem that all the information in the plain text format can also be found in the XML one.

I will attach an XML output for a PSI-Blast search that converges after 3 passes.

From bugzilla-daemon at portal.open-bio.org Thu Oct 15 12:20:33 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 15 Oct 2009 12:20:33 -0400
Subject: [Biopython-dev] [Bug 2929] NCBIXML PSI-Blast parser should gather all information from XML blastgpg output
In-Reply-To:
Message-ID: <200910151620.n9FGKX1j006052@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2929

------- Comment #1 from ibdeno at gmail.com 2009-10-15 12:20 EST -------
Created an attachment (id=1374)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1374&action=view)
XML output from a converged run of blastpgp

From jblanca at btc.upv.es Fri Oct 16 05:02:33 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Fri, 16 Oct 2009 11:02:33 +0200
Subject: [Biopython-dev] [Biopython] Adaptor trimmer and dimers
In-Reply-To: <320fb6e00910150920y6ada0463s60ce0b4e5f788449@mail.gmail.com>
References: <355533.31188.qm@web52001.mail.re2.yahoo.com> <320fb6e00910150920y6ada0463s60ce0b4e5f788449@mail.gmail.com>
Message-ID: <200910161102.33821.jblanca@btc.upv.es>

We also have some code to do that using exonerate. Take a look at the function create_vector_striper_by_alignment in
http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/seq_cleaner.py

Jose Blanca

On Thursday 15 October 2009 18:20:47 Peter wrote:
> On Thu, Oct 15, 2009 at 5:00 PM, natassa wrote:
> > Hallo Biopythoners,
> > I followed a recent thread conversation about adaptor trimming,
> > which I intend to do on Illumina runs, and I am not sure I know
> > where exactly in github I could find Brad Chapman's code for
> > trimming AFTER modifications that he has done based on the
> > thread conversation. ...
>
> I guess you mean Brad's August Blog Post:
> http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/
> and the following mailing list thread which included some tips on
> speeding up the Biopython side of things:
> http://lists.open-bio.org/pipermail/biopython/2009-August/005417.html
>
> For anyone else interested, there are some simple examples in the
> tutorial (using SeqRecord slicing - elegant and simple, but a bit slow):
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-primer
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-adaptor
>
> And I did a blog post about low level FASTQ handling for speed
> at the cost of flexibility and simplicity (using some of the same
> ideas from the August mailing list discussion):
> http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
>
> Peter
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From chris.lasher at gmail.com Sun Oct 18 01:22:29 2009
From: chris.lasher at gmail.com (Chris Lasher)
Date: Sun, 18 Oct 2009 01:22:29 -0400
Subject: [Biopython-dev] Building Gene Ontology support into Biopython
Message-ID: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com>

I have a need to work with the gene ontology (GO) and gene ontology annotations (GOAs) for my research. It seems Biopython still lacks GO support despite a few threads from several years ago. I'd like to make GO support in Biopython a reality now. I would really appreciate any help and suggestions.

Bioperl has solid GO support.
I don't find their code straightforward at all; I haven't picked out what component is responsible for what task. Nonetheless, it could provide starting points to build support for Biopython.

Beyond looking through Bioperl code, though, I have several questions and I really welcome suggestions:

1) First off, does anyone have any gene ontology Python code lying around?

2) What is the Biopython stance on introducing third-party dependencies? The gene ontology is represented as a directed acyclic graph (DAG) and I want to use an existing graph library rather than roll our own. What would be the aversion to requiring either NetworkX or igraph as a dependency for the GO library? (I have experience with NetworkX and would prefer it, though I imagine igraph would be very similar for nearly all the methods we'd need access to in order to construct the DAG.)

3) What are parsers written using these days? I checked the tutorial section on them (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc209) but this wasn't explicitly covered. Any pointers to recently written parsers? I seem to recall Biopython has moved away from Martel parsers, correct? Has anything been done with pyparsing or some other parser, or is it strictly manual now? Also, I'm welcoming tips on the architecture of parsers in general.

4) Tying the GO Annotations to a fundamental Biopython data structure. This can't really be a SeqRecord object. SeqRecord.annotations makes sense; however, I can't guarantee a SeqRecord object will exist because the annotations don't come with the sequence itself. (A sequence is required to instantiate a SeqRecord object.) Any suggestions on this?

5) BioSQL support. Not having used BioSQL in the past, I'm a bit wary of adding this feature, but it is implemented in Bioperl. I haven't yet figured out if it's used as the default data store for their parsers or if it is only an optional store.

Comments most welcome.
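To give question 2 something concrete to weigh, here is a minimal sketch of how far plain dictionaries go for the most common DAG query, collecting every ancestor of a term. The term identifiers and edges are invented for illustration and are not real GO terms; this is neither NetworkX, igraph, nor any existing Biopython code.

```python
# Invented identifiers and edges: each term maps to its direct parents.
PARENTS = {
    "X:0001": [],                      # root
    "X:0002": ["X:0001"],
    "X:0003": ["X:0001"],
    "X:0004": ["X:0002", "X:0003"],    # two parents, so a DAG and not a tree
    "X:0005": ["X:0004"],
}


def ancestors(term, parents):
    """Return the set of all terms reachable by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(annotations, parents):
    """Expand direct gene-to-term annotations to include ancestor terms."""
    expanded = {}
    for gene, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene] = full
    return expanded


direct = {"geneA": {"X:0005"}, "geneB": {"X:0002"}}
expanded = propagate(direct, PARENTS)
print(sorted(ancestors("X:0005", PARENTS)))
print(sorted(expanded["geneB"]))
```

A graph library earns its place once queries go beyond this, for example shortest paths, topological ordering or cycle checks on a freshly parsed ontology file.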
Best, Chris From mjldehoon at yahoo.com Sun Oct 18 04:05:10 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 18 Oct 2009 01:05:10 -0700 (PDT) Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> Message-ID: <426498.6116.qm@web62403.mail.re1.yahoo.com> --- On Sun, 10/18/09, Chris Lasher wrote: > I'd like to make GO support in Biopython a reality now. That would be nice. > Bioperl has solid GO support. I don't find their code > straightforward at all; I haven't picked out what component is > responsible for what task. To arrive at a good design of a Biopython module, sometimes it helps to write its documentation first, before writing the actual code. > 2) What is the Biopython stance on introducing third-party > dependencies? I think we should avoid them as much as possible. In addition to the additional hassle for users and developers, unforeseen changes in third-party dependencies may break your module. > What would be the aversion to requiring either NetworkX or igraph > as a dependency for the GO library. Are these Python modules or C software? Do NetworkX or igraph have their own third-party dependencies? Do we need the full NetworkX or igraph or just a part of it? In the latter case, assuming that these are open-source software packages, we may simply include the parts we need into Biopython. Also, how far do you get by using NumPy? > 3) What are parsers written using these days? Current parsers typically work as follows, assuming that a data file contains exactly one record: >>> handle = open("mydatafile") >>> from Bio import SomeModule >>> record = SomeModule.read(handle) # record is now a SomeModule.Record object If one data file typically contains multiple records, use a "parse" function to return an iterator: >>> handle = open("mydatafile") >>> from Bio import SomeModule >>> records = SomeModule.parse(handle) >>> for record in records: ... 
# record is now a SomeModule.Record object > Any pointers to recently written parsers? Bio.SeqIO.read and parse are good examples. Also you can look at Bio.Medline for a simple parser using this approach. > I seem to recall Biopython has moved away from Martel > parsers, correct? Yes. > Has anything been done with pyparsing or some other > parser, or is it strictly manual now? Not as far as I know. > Also, I'm welcoming tips on the > architecture of parsers in general. See above. Also note that few parsers nowadays use Bio.ParserSupport. This was previously used to implement parsers in Biopython (with parsers, scanners, and consumers). I would avoid Bio.ParserSupport and simply write a straightforward parser using the Python standard library. > 4) Tying the GO Annotations to a fundamental Biopython data > structure. > Any suggestions on this? A SeqRecord doesn't seem to be appropriate for gene ontology. How about a Record class specifically for GO? Also, what should such a class contain? Best, --Michiel. From biopython at maubp.freeserve.co.uk Sun Oct 18 06:34:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 18 Oct 2009 11:34:21 +0100 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> Message-ID: <320fb6e00910180334ke404ea3gad7e466e5d76c072@mail.gmail.com> On Sun, Oct 18, 2009 at 6:22 AM, Chris Lasher wrote: > I have a need to work with the gene ontology (GO) and gene ontology > annotations (GOAs) for my research. It seems Biopython still lacks GO > support despite a few threads from several years ago. I'd like to make > GO support in Biopython a reality now. I would really appreciate any > help and suggestions. 
In terms of missing functionality, it would help me greatly if you could describe the kind of things you want to achieve (and therefore how it may or may not need to connect to existing code like the SeqRecord and SeqFeature objects). > Bioperl has solid GO support. I don't find their code straightforward > at all; I haven't picked out what component is responsible for what > task. Nonetheless, it could provide starting points to build support > for Biopython. Yeah - I think Hilmar commented on some of these threads. Doing ontologies properly is hard work. > Beyond looking through Bioperl code, though, I have several questions > and I really welcome suggestions: > > 1) First off, does anyone have any gene ontology Python code > laying around? Not quite what you wanted, but Ed Cannon has an OBO to OWL parser in his github repository, http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006701.html > 2) What is the Biopython stance on introducing third-party > dependencies? The gene ontology is represented a directed acyclic > graph (DAG) and I want to use an existing graph library rather than > roll our own. What would be the aversion to requiring either NetworkX > or igraph as a dependency for the GO library. (I have experience with > NetworkX and would prefer it, though I imagine igraph would be very > similar for nearly all the methods we'd need access to to construct > the DAG) As Michiel said, we prefer to avoid 3rd party dependencies, *especially* build time ones. Wrappers for 3rd party command line tools are fine. Currently we do have a number of optional python dependencies for specific functionality - e.g. ReportLab for graphics, and assorted SQL database backends. The python library NetworkX may fall into this category. Adding another dependency should not be done lightly. > 3) What are parsers written using these days?
I checked the tutorial > section on them > (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc209) but > this wasn't explicitly covered. Any pointers to recently written > parsers? I seem to recall Biopython has moved away from Martel > parsers, correct? Has anything been done with pyparsing or some other > parser, or is it strictly manual now? Also, I'm welcoming tips on the > architecture of parsers in general. Martel is gone. Everything is done in plain python these days. The coding styles vary - some are scanner/consumer, but using iterators for large files (returning natural chunks of data in steps) is normal. For things like XML, there are (several) parsers in the python standard libraries. > 4) Tying the GO Annotations to a fundamental Biopython data structure. > This can't really be a SeqRecord object. SeqRecord.annotations makes > sense, however, I can't guarantee a SeqRecord object will exist > because the annotations don't come with the sequence itself. (A > sequence is required to instantiate a SeqRecord object). Any > suggestions on this? Background to the task would help. Note you can create a SeqRecord without a sequence, but it may not be sensible. See for example the QUAL file parser which uses the new UnknownSeq object where we just know the sequence length. > 5) BioSQL support. Not having used BioSQL in the past, I'm a bit wary > of adding this feature, but it is implemented in Bioperl. I haven't > yet figured out if it's used as the default data store for their > parsers or if it is only an optional store. I would describe BioSQL as an optional data store, particularly suited to holding GenBank or EMBL files. Biopython has BioSQL support (as do BioJava etc). We follow BioPerl and use a loose ad-hoc ontology, but the BioSQL schema is designed to allow proper ontologies. This is something I have raised on the BioSQL mailing list. 
Related to this, EMBOSS have done a lot of work mapping between the ontologies used in GenBank, EMBL, UniProt and the standard sequence ontology - something I'm hoping we may be able to re-use in our planned support for GFF3 files. Peter From chapmanb at 50mail.com Sun Oct 18 12:34:36 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 18 Oct 2009 12:34:36 -0400 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> Message-ID: <20091018163436.GA66322@kunkel> Hi Chris; > I'd like to make GO support in Biopython a reality now. Awesome. Great to have you working on this. > Bioperl has solid GO support. I don't find their code straightforward > at all; I haven't picked out what component is responsible for what > task. Nonetheless, it could provide starting points to build support > for Biopython. Agreed. I worked on a tiny bit of Gene Ontology stuff a while ago and Chris Mungall was very helpful in explaining some of the high level decisions. > 1) First off, does anyone have any gene ontology Python code laying around? I have a couple of things here: http://bioinformatics.org/cgi-bin/cvsweb.cgi/biopy-pgml/Bio/PGML/GO/ CVS says they haven't been touched in 7 years. Feel free to use it if it's helpful. I took the approach of working directly off an installed database as opposed to flat files. > 2) What is the Biopython stance on introducing third-party > dependencies? I think Michiel and Peter tackled this, but generally the approach has been to keep Biopython as a base library that doesn't require a lot of installs to get going. As far as graph libraries go, networkx is good and Eric did some work with it for the PhyloXML library this summer. 
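For a sense of scale, a DAG of this kind can be held in a plain dictionary when only simple traversals are needed. The following is a minimal sketch, not code from any of the libraries mentioned above; the `parents` mapping, the `ancestors` helper, and the term identifiers are all hypothetical placeholders rather than real GO accessions.

```python
# Hypothetical sketch: a tiny ontology-like DAG stored as a plain dict.
# Keys are term identifiers; values list the direct parents of each term.
# The identifiers are placeholders, not real GO accessions.
parents = {
    "TERM:root": [],
    "TERM:a": ["TERM:root"],
    "TERM:b": ["TERM:root"],
    "TERM:c": ["TERM:a", "TERM:b"],  # a term may have more than one parent
}


def ancestors(term, parents):
    """Return the set of all terms reachable by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen
```

A graph library such as NetworkX would offer far more than this; the sketch only shows how little is needed for the ancestor lookup itself.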
Thanks again for taking this on, Brad From chris.lasher at gmail.com Mon Oct 19 00:26:48 2009 From: chris.lasher at gmail.com (Chris Lasher) Date: Mon, 19 Oct 2009 00:26:48 -0400 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <20091018163436.GA66322@kunkel> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> <20091018163436.GA66322@kunkel> Message-ID: <128a885f0910182126i74f7712bo2cb6e7d612532e5@mail.gmail.com> On Sun, Oct 18, 2009 at 12:34 PM, Brad Chapman wrote: > > Hi Chris; > > > I'd like to make GO support in Biopython a reality now. > > Awesome. Great to have you working on this. > > > Bioperl has solid GO support. I don't find their code straightforward > > at all; I haven't picked out what component is responsible for what > > task. Nonetheless, it could provide starting points to build support > > for Biopython. > > Agreed. I worked on a tiny bit of Gene Ontology stuff a while ago > and Chris Mungall was very helpful in explaining some of the high > level decisions. > > > 1) First off, does anyone have any gene ontology Python code laying around? > > I have a couple of things here: > > http://bioinformatics.org/cgi-bin/cvsweb.cgi/biopy-pgml/Bio/PGML/GO/ > > CVS says they haven't been touched in 7 years. Feel free to use it > if it's helpful. I took the approach of working directly off an > installed database as opposed to flat files. > > > 2) What is the Biopython stance on introducing third-party > > dependencies? > > I think Michiel and Peter tackled this, but generally the approach > has been to keep Biopython as a base library that doesn't require a > lot of installs to get going. > > As far as graph libraries go, networkx is good and Eric did some > work with it for the PhyloXML library this summer. > > Thanks again for taking this on, > Brad Right, well, first off, thanks for your input so far, guys. 
I don't have time tonight to reply to individual points but I went ahead and started a wiki page to coordinate this. http://biopython.org/wiki/Gene_Ontology It's a wiki, so you know what to do if you have an idea or a question. I'm going to go ahead and make the executive decision to use NetworkX. I think BioPerl's Ontology framework has both third-party dependency-based (Graph.pm) and non-dependency-based solutions. Maybe we can figure out something similar, but NetworkX is such an easy dependency to satisfy that I'm going with it. Looks like this is going to be a busy week. Chris From dalloliogm at gmail.com Mon Oct 19 04:32:31 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 19 Oct 2009 10:32:31 +0200 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> Message-ID: <5aa3b3570910190132p1f8ef258uff95f912f50d9ea5@mail.gmail.com> On Sun, Oct 18, 2009 at 7:22 AM, Chris Lasher wrote: > 2) What is the Biopython stance on introducing third-party > dependencies? The gene ontology is represented a directed acyclic > graph (DAG) and I want to use an existing graph library rather than > roll our own. What would be the aversion to requiring either NetworkX > or igraph as a dependency for the GO library. (I have experience with > NetworkX and would prefer it, though I imagine igraph would be very > similar for nearly all the methods we'd need access to to construct > the DAG) > introducing networkx as a dependency would also open the road to modules to work with pathways and networkx with biopython. For example, I have a partially complete script to parse Kegg's KGML files for pathway and put them into a networkx object. The problem is that biopython is a monolitic packages - you have to install it all or nothing. 
Maybe in a future (it is just a tought) it would be better to have it as a repository of packages, like BioConductor. > 3) What are parsers written using these days? I checked the tutorial > section on them > (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc209) but > this wasn't explicitly covered. Any pointers to recently written > parsers? I seem to recall Biopython has moved away from Martel > parsers, correct? Has anything been done with pyparsing or some other > parser, or is it strictly manual now? Also, I'm welcoming tips on the > architecture of parsers in general. > > 4) Tying the GO Annotations to a fundamental Biopython data structure. > This can't really be a SeqRecord object. SeqRecord.annotations makes > sense, however, I can't guarantee a SeqRecord object will exist > because the annotations don't come with the sequence itself. (A > sequence is required to instantiate a SeqRecord object). Any > suggestions on this? > What about using zope component and zope interface? It is an alternative approach to object programming, based on the experience that zope developers matured with Zope 2, which was a mess of similar classes and objects that became too difficult to maintain. - http://wiki.zope.org/zope3/ComponentArchitectureApproach Comments most welcome. 
> > Best, > Chris > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Oct 19 04:55:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Oct 2009 09:55:59 +0100 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <5aa3b3570910190132p1f8ef258uff95f912f50d9ea5@mail.gmail.com> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> <5aa3b3570910190132p1f8ef258uff95f912f50d9ea5@mail.gmail.com> Message-ID: <320fb6e00910190155j33394e43v2577427c6375a077@mail.gmail.com> On Mon, Oct 19, 2009 at 9:32 AM, Giovanni Marco Dall'Olio wrote: > On Sun, Oct 18, 2009 at 7:22 AM, Chris Lasher wrote: > >> 2) What is the Biopython stance on introducing third-party >> dependencies? The gene ontology is represented a directed acyclic >> graph (DAG) and I want to use an existing graph library rather than >> roll our own. What would be the aversion to requiring either NetworkX >> or igraph as a dependency for the GO library. (I have experience with >> NetworkX and would prefer it, though I imagine igraph would be very >> similar for nearly all the methods we'd need access to to construct >> the DAG) > > introducing networkx as a dependency would also open the road to > modules to work with pathways and networkx with biopython. > For example, I have a partially complete script to parse Kegg's KGML > files for pathway and put them into a networkx object. I've not used NetworkX personally, but it looks cool. The only network analysis I've done in Python used NumPy for adjacency matrices, and GraphViz via pydot for graphical output. 
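As a rough illustration of the adjacency-matrix approach, here is a minimal sketch that uses plain Python lists in place of NumPy and writes Graphviz DOT text directly instead of going through pydot. Both substitutions, and the helper names `adjacency_matrix` and `to_dot`, are assumptions made for the example.

```python
def adjacency_matrix(nodes, edges):
    """Build a square 0/1 matrix; matrix[i][j] == 1 means nodes[i] -> nodes[j]."""
    index = dict((name, i) for i, name in enumerate(nodes))
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


def to_dot(nodes, matrix):
    """Render the directed graph described by the matrix as Graphviz DOT text."""
    lines = ["digraph G {"]
    for i, source in enumerate(nodes):
        for j, target in enumerate(nodes):
            if matrix[i][j]:
                lines.append('    "%s" -> "%s";' % (source, target))
    lines.append("}")
    return "\n".join(lines)


# Example with made-up node names.
nodes = ["a", "b", "c"]
matrix = adjacency_matrix(nodes, [("a", "b"), ("b", "c")])
dot_text = to_dot(nodes, matrix)
```

The resulting text could be saved to a file and passed to the Graphviz `dot` command for drawing.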
> The problem is that biopython is a monolitic packages - you have > to install it all or nothing. And why is that a problem? This is a serious question. NumPy is a build time dependency (due to the C code), but pure python dependencies like MySQLdb (or potentially NetworkX) can be installed after Biopython, if and when then are needed. > Maybe in a future (it is just a tought) it would be better to > have it as a repository of packages, like BioConductor. If/when PyPI becomes the standard way to deal with Python packages and interdependencies, then that might be workable. But without some system like that in place, you'll only make installation harder. Out of interest, have you ever tried installing BioPerl (via CPAN)? There is a lot to be said for a single simple to install package as now. Peter From dalloliogm at gmail.com Mon Oct 19 05:46:43 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 19 Oct 2009 11:46:43 +0200 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <320fb6e00910190155j33394e43v2577427c6375a077@mail.gmail.com> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> <5aa3b3570910190132p1f8ef258uff95f912f50d9ea5@mail.gmail.com> <320fb6e00910190155j33394e43v2577427c6375a077@mail.gmail.com> Message-ID: <5aa3b3570910190246u7a175626xe9e97781dee460a3@mail.gmail.com> On Mon, Oct 19, 2009 at 10:55 AM, Peter wrote: > On Mon, Oct 19, 2009 at 9:32 AM, Giovanni Marco Dall'Olio > wrote: > > On Sun, Oct 18, 2009 at 7:22 AM, Chris Lasher >wrote: > > > >> 2) What is the Biopython stance on introducing third-party > >> dependencies? > > > > > The problem is that biopython is a monolitic packages - you have > > to install it all or nothing. > > And why is that a problem? This is a serious question. NumPy > is a build time dependency (due to the C code), but pure python > dependencies like MySQLdb (or potentially NetworkX) can be > installed after Biopython, if and when then are needed. 
> I didn't really mean to say it is a problem, I think it has some disadvantages and some advantages, as everything :-) Biopython now is easier to install, because people can just download a package or a module or use easy_install. and some common guidelines, like how to write documentation, the SeqRecord system, make it easier to maintain; but on the other hand, when you propose a new module you have to pay attention to not adding new dependencies, which is something that the bioConductor's developers don't have to care of. Anyway, it is true that without a good system to download and install packages automatically, BioConductor and CRAN would have been different. > > Maybe in a future (it is just a tought) it would be better to > > have it as a repository of packages, like BioConductor. > > If/when PyPI becomes the standard way to deal with Python packages > and interdependencies, then that might be workable. But without some > system like that in place, you'll only make installation harder. I agree with that. By the way, I have heard that there is a lot of discussion in python-dev mailing list about a package called 'distribute' ( http://packages.python.org/distribute/setuptools.html) which in the future may replace setuptools while remaining compatible with it. In fact, setuptools and easy_install have not been updated for a long time now, let's see if with this something will improve soon.... > Out of > interest, have you ever tried installing BioPerl (via CPAN)? There is > a lot to be said for a single simple to install package as now. > No, I didn't.... but I am scared of perl in general :-) > Peter > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From kellrott at gmail.com Mon Oct 19 13:18:03 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Mon, 19 Oct 2009 10:18:03 -0700 Subject: [Biopython-dev] Pfam24/HMMER3 (and GO terms...) 
Message-ID: Pfam24 was published last week ( http://pfam.sanger.ac.uk/ ) , it utilizes HMMER3 to do some rather fast HMM based protein identification (of about 11,912 families). I've gotten an initial port of the PfamScan perl script found at ftp://ftp.sanger.ac.uk/pub/rdf/PfamScanBeta/ ported to BioPython. Currently the layout somewhat mirrors the Perl module layout, but that can be evolved to be more 'pythonesque'. The interface is not yet done (it mainly works just to print out results, internal data structures aren't very clear). Thoughts and suggestions on how people would use this in their Python Scripts would be helpful. And in regards to the current GO conversation that is going on, there is a table that connects Pfam families to GO terms ( ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases/Pfam24.0/database_files/gene_ontology.sql.gz ), so connecting this work to the suggested GO modules would probably be beneficial. You can find the work at http://github.com/kellrott/biopython/, under the Bio.Pfam module. Kyle From biopython at maubp.freeserve.co.uk Mon Oct 19 13:29:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Oct 2009 18:29:26 +0100 Subject: [Biopython-dev] Pfam24/HMMER3 (and GO terms...) In-Reply-To: References: Message-ID: <320fb6e00910191029k14b1ae56gd7dcd5db93f9c598@mail.gmail.com> On Mon, Oct 19, 2009 at 6:18 PM, Kyle Ellrott wrote: > Pfam24 was published last week ( http://pfam.sanger.ac.uk/ ) , it > utilizes HMMER3 to do some rather fast HMM based protein > identification (of about 11,912 families). I've gotten an initial > port of the PfamScan perl script found at > ftp://ftp.sanger.ac.uk/pub/rdf/PfamScanBeta/ ported to BioPython. Perhaps I have misunderstood you (and I have not looked at the code yet), but have you just re-written the PFAM perl script pfam_scan.pl in python? If so, what is the aim?
OK, it might be a bit faster - but you would be duplicating the work of the PFAM team and creating a long term maintenance burden. I can see the value of having an HMMER3 output parser, and a command line wrapper for calling it. This will be useful for things outside of PFAM. I can see the value of having a pfam_scan.pl output parser (XML, CVS, or the possible JSON), and a command line wrapper for calling it. Peter From kellrott at gmail.com Mon Oct 19 13:46:20 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Mon, 19 Oct 2009 10:46:20 -0700 Subject: [Biopython-dev] Pfam24/HMMER3 (and GO terms...) In-Reply-To: <320fb6e00910191029k14b1ae56gd7dcd5db93f9c598@mail.gmail.com> References: <320fb6e00910191029k14b1ae56gd7dcd5db93f9c598@mail.gmail.com> Message-ID: I've started as a close re-write the the original PfamScan script to make sure the python script works equivalently to the original. Now that it works (for basic tests), I will begin by putting better data interfaces. The Bio.Pfam.HMM module should as a HMMER3 module work by itself. But it needs some examples, and probably some work on making the interface more clean. We could also move the code to Bio.HMMER, rather than having it as a sub modules of Bio.Pfam. This was primarily motivated by the dependency hell associated with trying to get pfam_scan.pl to work on a cluster. pfam_scan.pl relies on BioPerl and Moose. From the readme: 'Moose itself has quite a few dependencies, so don't worry if it looks like you're installing half of CPAN !'. The code I've produced works within the BioPython framework with no additional dependencies. pfam_scan.pl just does format parsing and table linking. The heavy work is done in HMMER. The dependency cost of pfam_scan.pl is just to great consider it's functionality can be easily replicated in BioPython. > Perhaps I have misunderstood you (and I have not looked at > the code yet), but have you just re-written the PFAM perl script > pfam_scan.pl in python? 
Is so, what is the aim? OK, it might be > a bit faster - but you would be duplicating the work of the PFAM > team and creating a long term maintenance burden. > > I can see the value of having an HMMER3 output parser, and > a command line wrapper for calling it. This will be useful for > things outside of PFAM. > > I can see the value of having a pfam_scan.pl output parser (XML, > CVS, or the possible JSON), and a command line wrapper for > calling it. > > Peter > From biopython at maubp.freeserve.co.uk Mon Oct 19 15:02:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Oct 2009 20:02:56 +0100 Subject: [Biopython-dev] Pfam24/HMMER3 (and GO terms...) In-Reply-To: References: <320fb6e00910191029k14b1ae56gd7dcd5db93f9c598@mail.gmail.com> Message-ID: <320fb6e00910191202g64a74c21tf77da909a0356eb6@mail.gmail.com> On Mon, Oct 19, 2009 at 6:46 PM, Kyle Ellrott wrote: > I've started as a close re-write the the original PfamScan script to > make sure the python script works equivalently to the original. Now > that it works (for basic tests), I will begin by putting better data > interfaces. The Bio.Pfam.HMM module should as a HMMER3 module work by > itself. But it needs some examples, and probably some work on making > the interface more clean. We could also move the code to Bio.HMMER, > rather than having it as a sub modules of Bio.Pfam. That sounds good.
Peter From bugzilla-daemon at portal.open-bio.org Tue Oct 20 06:45:34 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 20 Oct 2009 06:45:34 -0400 Subject: [Biopython-dev] [Bug 2090] Blast.NCBIStandalone BlastParser fails with blastall 2.2.14 In-Reply-To: Message-ID: <200910201045.n9KAjY4e030244@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2090 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #17 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-20 06:45 EST ------- Fixed in github, tested on the two examples here and also output from BLAST 2.2.20 and 2.2.21 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Oct 20 06:47:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Oct 2009 11:47:29 +0100 Subject: [Biopython-dev] Plain text BLAST parser updated Message-ID: <320fb6e00910200347j3857cbbdyc62ea3d39a05b357@mail.gmail.com> Hi all, Just to let you know I've been doing a little work on the NCBI plain text parser, and got it to work on multiquery output from recent versions of BLAST (Bug 2090). http://bugzilla.open-bio.org/show_bug.cgi?id=2090 I would not describe the changes as elegant, but the plain text parser has evolved over time to cope with more and more NCBI variations, so some ugliness is perhaps to be expected. If there are any regressions, please report them and we can extend the test suite. Likewise, if you have an recent plain text BLAST files which didn't work and still don't work - get in touch, it may be easy to fix. 
[I'd still encourage everyone to use the XML output by default, but there are times when the plain text is the only or best option.] Peter From peter at maubp.freeserve.co.uk Tue Oct 20 07:56:51 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Oct 2009 12:56:51 +0100 Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.22 now available In-Reply-To: <33559E80-E78D-4CCB-8E8C-79C36E89C007@ncbi.nlm.nih.gov> References: <33559E80-E78D-4CCB-8E8C-79C36E89C007@ncbi.nlm.nih.gov> Message-ID: <320fb6e00910200456mbac8d28ra1385c102b899c9a@mail.gmail.com> Hi all, The new NCBI BLAST tools are out now, and I'd only just updated my desktop to BLAST 2.2.21 this morning! It looks like the "old style" blastall etc (which are written in C) are much the same, but we will need to add Bio.Blast.Applications wrappers for the new "BLAST+" tools (written in C++). On the bright side, the Biopython tutorial needed updating anyway to switch from Bio.Blast.NCBIStandalone.blastall(...) to using Bio.Blast.Applications and subprocess. Peter ---------- Forwarded message ---------- From: mcginnis Date: Tue, Oct 20, 2009 at 12:42 PM Subject: [blast-announce] BLAST 2.2.22 now available To: blast-announce at ncbi.nlm.nih.gov BLAST 2.2.22 now available This release includes new BLAST+ command-line applications. The BLAST+ applications have a number of advantages over the older applications and users are encouraged to migrate to the new applications. The new applications can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST These applications have been built with the NCBI C++ toolkit. Changes from the last release are listed below. The older C toolkit applications (e.g., blastall) are still available at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.22/ Changes from the last release are listed below.
Please send questions or comments to blast-help at ncbi.nlm.nih.gov Changes for the BLAST+ applications: * Added entrez_query command line option for restricting remote BLAST databases. * Added support for psi-tblastn to the tblastn command line application via the -in_pssm option. * Improved documentation for subject masking feature in user manual. * User interface improvements to windowmasker. * Made the specification of BLAST databases to resolve GIs/accessions configurable. * update_blastdb.pl downloads and checks BLAST database MD5 checksum files. * Allow long words with blastp. * Added support for overriding megablast index when importing search strategy files. * Added support for best-hit algorithm parameters in strategy files. * Bug fixes in blastx and tblastn with genomic sequences, subject masking, blastdbcheck, and the SEG filtering algorithm. Changes for C applications: * Blastall was not able to use BLAST databases with only accessions to format results, this has been fixed. From biopython at maubp.freeserve.co.uk Tue Oct 20 09:24:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Oct 2009 14:24:09 +0100 Subject: [Biopython-dev] Plain text BLAST parser updated In-Reply-To: <320fb6e00910200347j3857cbbdyc62ea3d39a05b357@mail.gmail.com> References: <320fb6e00910200347j3857cbbdyc62ea3d39a05b357@mail.gmail.com> Message-ID: <320fb6e00910200624s4d2662axb08df052f39b9ceb@mail.gmail.com> On Tue, Oct 20, 2009 at 11:47 AM, Peter wrote: > Hi all, > > Just to let you know I've been doing a little work on the NCBI plain > text parser, and got it to work on multiquery output from recent > versions of BLAST (Bug 2090). > > http://bugzilla.open-bio.org/show_bug.cgi?id=2090 > > I would not describe the changes as elegant, but the plain text parser > has evolved over time to cope with more and more NCBI variations, so > some ugliness is perhaps to be expected.
> > If there are any regressions, please report them and we can extend the > test suite. Likewise, if you have an recent plain text BLAST files > which didn't work and still don't work - get in touch, it may be easy > to fix. > > [I'd still encourage everyone to use the XML output by default, but > there are times when the plain text is the only or best option.] > > Peter The irony of this isn't lost on me. An hour after fixing the parser and testing it with BLAST 2.2.21 output, the NCBI made a release which broke it again. The output from the latest "classic" blastall 2.2.22 is fine. The output from the "new C++" blastx 2.2.22+ (and likely the other tools like blastp etc which are all separate executables now) breaks our plain text parser. I'll also be trying out the XML output which *should* be fine. Peter From biopython at maubp.freeserve.co.uk Tue Oct 20 11:08:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Oct 2009 16:08:55 +0100 Subject: [Biopython-dev] Plain text BLAST parser updated In-Reply-To: <320fb6e00910200624s4d2662axb08df052f39b9ceb@mail.gmail.com> References: <320fb6e00910200347j3857cbbdyc62ea3d39a05b357@mail.gmail.com> <320fb6e00910200624s4d2662axb08df052f39b9ceb@mail.gmail.com> Message-ID: <320fb6e00910200808m3fa1a56dp2578d387d318cc5a@mail.gmail.com> On Tue, Oct 20, 2009 at 2:24 PM, Peter wrote: > On Tue, Oct 20, 2009 at 11:47 AM, Peter wrote: >> Hi all, >> >> Just to let you know I've been doing a little work on the NCBI plain >> text parser, and got it to work on multiquery output from recent >> versions of BLAST (Bug 2090). >> >> http://bugzilla.open-bio.org/show_bug.cgi?id=2090 >> >> I would not describe the changes as elegant, but the plain text parser >> has evolved over time to cope with more and more NCBI variations, so >> some ugliness is perhaps to be expected. >> >> If there are any regressions, please report them and we can extend the >> test suite. 
Likewise, if you have an recent plain text BLAST files >> which didn't work and still don't work - get in touch, it may be easy >> to fix. >> >> [I'd still encourage everyone to use the XML output by default, but >> there are times when the plain text is the only or best option.] >> >> Peter > > The irony of this isn't lost on me. An hour after fixing the parser > and testing it with BLAST 2.2.21 output, the NCBI made a release > which broke it again. > > The output from the latest "classic" blastall 2.2.22 is fine. > > The output from the "new C++" blastx 2.2.22+ (and likely the > other tools like blastp etc which are all separate executables > now) breaks our plain text parser. Touch wood, that is now working with the latest code in the public repository. It really needs a few more example files covering more than just the new blastx output... Peter From biopython at maubp.freeserve.co.uk Tue Oct 20 11:59:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Oct 2009 16:59:09 +0100 Subject: [Biopython-dev] Plain text BLAST parser updated In-Reply-To: <320fb6e00910200624s4d2662axb08df052f39b9ceb@mail.gmail.com> References: <320fb6e00910200347j3857cbbdyc62ea3d39a05b357@mail.gmail.com> <320fb6e00910200624s4d2662axb08df052f39b9ceb@mail.gmail.com> Message-ID: <320fb6e00910200859i4c7fa800j10cad1abe10a007a@mail.gmail.com> On Tue, Oct 20, 2009 at 2:24 PM, Peter wrote: > > The output from the "new C++" blastx 2.2.22+ (and likely the > other tools like blastp etc which are all separate executables > now) breaks our plain text parser. > > I'll also be trying out the XML output which *should* be fine. > That seems to be fine - although I did find the BLAST record's database_sequences property wasn't being populated, just the alias num_sequences_in_database - fixed in github. 
Peter From biopython at maubp.freeserve.co.uk Wed Oct 21 07:39:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Oct 2009 12:39:07 +0100 Subject: [Biopython-dev] First "new" contribution via git Message-ID: <320fb6e00910210439k48ee611exfcdf4b10067b7689@mail.gmail.com> Hi all, I'd just like to mention that I have just committed a tiny enhancement from Chris Lasher (username gotgenes) to add a verbose option to run_tests.py (using the git cherry-pick command to grab the single commit for this change). I think this marks the first commit from a non-core developer since we moved to github. I'm sure there will be many more to come (and not just from Chris!) :) Peter From bugzilla-daemon at portal.open-bio.org Wed Oct 21 15:45:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 21 Oct 2009 15:45:14 -0400 Subject: [Biopython-dev] [Bug 2931] New: Error in PDBList() code for get_all_obsolete Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2931 Summary: Error in PDBList() code for get_all_obsolete Product: Biopython Version: 1.52 Platform: PC OS/Version: Windows XP Status: NEW Severity: normal Priority: P2 Component: Other AssignedTo: biopython-dev at biopython.org ReportedBy: TallPaulInJax at yahoo.com

Hi, I believe this code:

    # extract pdb codes
    obsolete = map(lambda x: x[21:25].lower(),
                   filter(lambda x: x[:6] == 'OBSLTE', url.readlines()))

Should instead be

    # extract pdb codes
    obsolete = map(lambda x: x[20:24].lower(),
                   filter(lambda x: x[:6] == 'OBSLTE', url.readlines()))

As-is, it is missing the first character of the PDB code and reading one character past its end upon testing.

Paul

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
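The off-by-one reported in Bug 2931 can be sketched in isolation. This is only an illustration, not the Bio.PDB code: the sample lines are hypothetical OBSLTE records laid out so that the four-character code begins at index 20 (an assumption chosen to match the symptom in the report), and extract_codes is an illustrative helper name.

```python
# Hypothetical sample data: an assumed layout with the PDB code at
# 0-based columns 20-23. The real obsolete listing may differ.
SAMPLE_LINES = [
    "some header line that should be skipped\n",
    "OBSLTE    31-JUL-94 116L     216L\n",
    "OBSLTE    29-JAN-96 125D     1AW6\n",
]


def extract_codes(lines, start, end):
    """Return the lower-cased slice [start:end] of every OBSLTE line."""
    return [line[start:end].lower() for line in lines if line[:6] == "OBSLTE"]


old_slice = extract_codes(SAMPLE_LINES, 21, 25)  # slice quoted in the report
new_slice = extract_codes(SAMPLE_LINES, 20, 24)  # slice proposed in the report

print(old_slice)  # first character lost, trailing space picked up
print(new_slice)  # full four-character codes
```

With this assumed layout the old slice drops the leading character and gains a trailing space, which is the pattern Paul describes.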
From bugzilla-daemon at portal.open-bio.org Wed Oct 21 16:26:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 21 Oct 2009 16:26:40 -0400 Subject: [Biopython-dev] [Bug 2933] New: PDBList() get_status_list bug Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2933 Summary: PDBList() get_status_list bug Product: Biopython Version: 1.52 Platform: PC OS/Version: Windows XP Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: TallPaulInJax at yahoo.com Hi, Upon testing, I believe the following code in the PDBList class get_status_list method is based on an older file format for added.pdb, modified.pdb, and obsolete.pdb: # added by S. Lee list = map(lambda x: x[3:7], \ filter(lambda x: x[-4:] == '.ent', \ map(lambda x: x.split()[-1], file))) I think the file format used to be: -rw-r--r-- 1 rcsb rcsb 330156 Oct 14 2003 pdb1cyq.ent -rw-r--r-- 1 rcsb rcsb 333639 Oct 14 2003 pdb1cz0.ent Now the file format is simply: 1cyq 1cz0 etc Therefore, I believe the correct code to be: list = map(lambda x: x[0:4], file) Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Oct 21 17:53:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 21 Oct 2009 17:53:45 -0400 Subject: [Biopython-dev] [Bug 2931] Error in PDBList() code for get_all_obsolete In-Reply-To: Message-ID: <200910212153.n9LLrjtL006614@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2931 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-21 17:53 EST ------- Have you got an example to demonstrate this issue? e.g. a PDB files and a tiny script which we can turn into a unit test? 
Thanks Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Oct 21 17:55:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 21 Oct 2009 17:55:12 -0400 Subject: [Biopython-dev] [Bug 2933] PDBList() get_status_list bug In-Reply-To: Message-ID: <200910212155.n9LLtC3f006654@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2933 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-21 17:55 EST ------- Can you give us a tiny script to demonstrate the problem? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Oct 21 18:39:04 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 21 Oct 2009 18:39:04 -0400 Subject: [Biopython-dev] [Bug 2931] Error in PDBList() code for get_all_obsolete In-Reply-To: Message-ID: <200910212239.n9LMd4kI007914@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2931 TallPaulInJax at yahoo.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |TallPaulInJax at yahoo.com ------- Comment #2 from TallPaulInJax at yahoo.com 2009-10-21 18:39 EST ------- Here is some test code: from Bio.PDB import PDBList p = PDBList() obsolete = p.get_all_obsolete() print obsolete Printout is below: these should be full four digit codes, but the first digit is missing and the space after the last char is instead at the end.['16l ', '25d ', '4ps ', '51c ', '56b ', '79l ', 'a0v ', 'a0w ', 'a0x ', 'a0y ', 'a10 ', 'a1y ', 'a6o ', 'a9d ', 'a9k ', 'aa8 ', 'aak ', 'abh ', 'abk ', 'abm ', 
'abp ', 'abx ', 'ace ', 'ack ', 'act ', 'ada ', 'adh ', 'adk ', 'adm ', 'afg ', 'afn ', 'ak3 ', 'alo ', 'alp ', 'alr ', 'am3 ', 'amg ', 'amv ', 'anh ', 'ape ', 'app ', 'apr ', 'ar3 ', 'ara ', 'arn ', 'as9 ', 'asi ', 'at7 ', 'atc ', 'atq ', 'aub ', 'axf ', 'ayh ', 'ayq ', 'az9 ', 'aza ', 'b1w ', 'b2n ', 'b3m ', 'b5c ', 'b6n ', 'b6o ', 'b7c ', 'b8b ', 'b91 ', 'baa ', 'baq ', 'bcl ', 'bdp ', 'ber ', 'bfl ', 'bgh ', 'bgr ', 'bjl ', 'bkq ', 'bl2 ', 'blm ', 'blw ', 'bme ', 'bmi ', 'bmy ', 'bn2 ', 'bnh ', 'bqv ', 'br7 ', 'buk ', 'bur ', 'bv0 ', 'bv5 ', 'bv6 ', -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Oct 21 18:48:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 21 Oct 2009 18:48:08 -0400 Subject: [Biopython-dev] [Bug 2933] PDBList() get_status_list bug In-Reply-To: Message-ID: <200910212248.n9LMm8GK008083@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2933 TallPaulInJax at yahoo.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |TallPaulInJax at yahoo.com ------- Comment #2 from TallPaulInJax at yahoo.com 2009-10-21 18:48 EST ------- With the code as-is: list = map(lambda x: x[3:7], \ filter(lambda x: x[-4:] == '.ent', \ map(lambda x: x.split()[-1], file))) This test code: p = PDBList() [added,modified,obsolete] = p.get_recent_changes() print "Added=", added print "Modified=", modified print "Obsolete=", obsolete Results in: Added= [] Modified= [] Obsolete= [] Yet visually these (20091016) weekly files have entries in them. 
Changing the code to list = map(lambda x: x[0:4],file) Results in: Added= ['2k9d', '2k9i', '2kac', '2kap', '2kc5', '2kdx', '2khi', '2khj', '2khs', '2ki0', '2ki2', '2kj4', '2klh', '2klu', '2v57', '2w9y', '2wgb', '2wl1', '2wor', '2wos', '2wri', '2wrj', '2wrk', '2wrl', '2wrn', '2wro', '2wrq', '2wrr', '2wu3', '2wu4', '2wu6', '2wu7', '2wud', '2wue', '2wuf', '2wug', '2wul', '2wuz', '2wv1', '2zuh', '2zui', '2zuj', '3a0n', '3a0r', '3a0s', '3a0t', '3a0u', '3a0v', '3a0w', '3a0x', '3a0y', '3a0z', '3a10', '3a2k', '3a3t', '3a4k', '3a4l', '3a4m', '3a4n', '3eq2', '3es2', '3evh', '3evl', '3ew4', '3ewt', '3ewv', '3f3m', '3f3q', '3f3r', '3f55', '3f68', '3f7e', '3fao', '3fei', '3fej', '3fhp', '3fhu', '3fou', '3ft7', '3g05', '3g7a', '3g9w', '3gea', '3gew', '3gfu', '3ggh', '3gi3', '3glj', '3gne', '3gns', '3gnt', '3gnv', '3gnw', '3gr6', '3gxk', '3gxr', '3h1t', '3h53', '3h54', '3h55', '3h6q', '3h6r', '3h6s', '3h89', '3h8b', '3h8c', '3h8n', '3hi7', '3hig', '3hii', '3hj2', '3hj7', '3hjh', '3hlr', '3hpv', '3hpy', '3hq0', '3hqh', '3hqi', '3hql', '3hqm', '3hqs', '3hqt', '3hrq', '3hrr', '3hsn', '3hso', '3hsp', '3hsv', '3htk', '3htm', '3hu6', '3hvs', '3hvv', '3hvx', '3hx3', '3hy5', '3hyf', '3hzl', '3i43', '3i99', '3i9p', '3igu', '3im0', '3ipj', '3ira', '3is5', '3it8', '3it9', '3ita', '3itb', '3ius', '3iuy', '3ivb', '3ivq', '3ivv', '3jpx', '3jr6', '3jr7', '3jrn', '3jty', '3juh', '3jux', '3jvf', '3jwg', '3jwh', '3jwi', '3jwj', '3jwp', '3jxu', '3jyu', '3jz1', '3jz2', '3k1y', '3k20', '3k2i', '3k2w', '3k31', '3k4s', '3k5e', '3k5h', '3k5i', '3k63', '3k67', '3k6a', '3k6r'] Modified= ['1lug', '1nlf', '1u8d', '1uab', '1vst', '2dxb', '2dxc', '2e52', '2gj5', '2gqg', '2gvl', '2gyt', '2igo', '2jjn', '2jjo', '2jxw', '2kal', '2kdr', '2kis', '2klf', '2klg', '2klv', '2koe', '2kon', '2qrm', '2qrp', '2qrq', '2rfk', '2uzj', '2v0y', '2v1p', '2v4v', '2vn4', '2vn7', '2wio', '2wjg', '2wml', '2wn4', '2wn5', '2wn6', '2wn7', '2wn8', '2wnb', '2wnf', '2wqi', '2wqj', '2wr0', '2wr1', '2wr2', '2wr3', '2wr4', '2wr5', '2wr7', 
'2wrb', '2wrc', '2wrd', '2wre', '2wrf', '2wrg', '2zc9', '2zda', '2zf9', '2zfp', '2zgx', '2zo3', '2zva', '2zzd', '3a21', '3a22', '3a23', '3a25', '3a26', '3a27', '3a5v', '3dge', '3dgf', '3dhk', '3djo', '3djp', '3djq', '3djv', '3djx', '3dux', '3eft', '3eli', '3ess', '3f30', '3f4y', '3f4z', '3f50', '3fan', '3fje', '3fjf', '3fjh', '3fji', '3fjj', '3fjk', '3ftq', '3fwe', '3g5d', '3g6o', '3g6w', '3gfb', '3gfk', '3gjn', '3gwu', '3gwv', '3gww', '3h3g', '3h90', '3h94', '3h9i', '3h9t', '3hg1', '3hhm', '3hhs', '3hif', '3hiz', '3hjf', '3hk2', '3hm9', '3ho1', '3hom', '3hto', '3htp', '3htq', '3htt', '3htx', '3hvr', '3hxm', '3hxt', '3hy2', '3hy3', '3hy4', '3hy6', '3i53', '3i58', '3i5u', '3i64', '3i65', '3i68', '3i6r', '3iai', '3ibg', '3icv', '3icw', '3idq', '3ig3', '3iiw', '3iiy', '3ij0', '3ij1', '3ijc', '3ijt', '3ikw', '3ilr', '3inn', '3ir8', '3jsm', '3jwq', '3jyi', '3k2x', '3k45', '3k47'] Obsolete= ['2b3n', '2b7w', '2frn', '2fuw', '2wem', '3bw5'] -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Oct 22 07:23:34 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Oct 2009 07:23:34 -0400 Subject: [Biopython-dev] [Bug 2918] Entrez parser fails on Jython - XMLParser lacks SetParamEntityParsing In-Reply-To: Message-ID: <200910221123.n9MBNYtT032594@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2918 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-22 07:23 EST ------- In the short term, we'll skip test_Entrez.py under Jython to avoid screenfulls of error messages about SetParamEntityParsing. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Oct 22 08:12:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Oct 2009 08:12:59 -0400 Subject: [Biopython-dev] [Bug 2929] NCBIXML PSI-Blast parser should gather all information from XML blastgpg output In-Reply-To: Message-ID: <200910221212.n9MCCx0p001591@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2929 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-22 08:12 EST ------- What specifically is our parser failing to extract from this example PSI BLAST XML file? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Oct 22 08:42:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Oct 2009 08:42:12 -0400 Subject: [Biopython-dev] [Bug 2931] Error in PDBList() code for get_all_obsolete In-Reply-To: Message-ID: <200910221242.n9MCgC39002369@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2931 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-22 08:42 EST ------- Fixed in git repository. Thank you for your report, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
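For the get_status_list report (Bug 2933), the difference between the two file layouts quoted in the bug can be sketched as follows. This is a standalone illustration rather than the PDBList implementation; the function names are hypothetical, and the blank-line guard is an addition that is not in the code quoted in the report.

```python
# Sample inputs copied from the two formats quoted in the bug report.
OLD_FORMAT = [
    "-rw-r--r-- 1 rcsb rcsb 330156 Oct 14 2003 pdb1cyq.ent\n",
    "-rw-r--r-- 1 rcsb rcsb 333639 Oct 14 2003 pdb1cz0.ent\n",
]
NEW_FORMAT = ["1cyq\n", "1cz0\n"]


def codes_from_listing(lines):
    """Older logic: take the last field, keep '*.ent', slice out the code."""
    # Skipping blank lines is an added guard (split()[-1] fails on them).
    names = [line.split()[-1] for line in lines if line.strip()]
    return [name[3:7] for name in names if name[-4:] == ".ent"]


def codes_from_plain(lines):
    """Proposed logic: each line simply starts with the four-character code."""
    return [line[0:4] for line in lines if line.strip()]


print(codes_from_listing(OLD_FORMAT))  # works on the directory-listing layout
print(codes_from_listing(NEW_FORMAT))  # empty: nothing ends in '.ent'
print(codes_from_plain(NEW_FORMAT))    # works on the plain layout
```

The empty result in the middle case corresponds to the "Added= [] Modified= [] Obsolete= []" output shown in the report.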
From bugzilla-daemon at portal.open-bio.org Thu Oct 22 08:52:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Oct 2009 08:52:45 -0400 Subject: [Biopython-dev] [Bug 2933] PDBList() get_status_list bug In-Reply-To: Message-ID: <200910221252.n9MCqjdR002671@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2933 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-22 08:52 EST ------- Fixed in the git repository. Thank you for your report. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Oct 22 08:55:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Oct 2009 08:55:00 -0400 Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected In-Reply-To: Message-ID: <200910221255.n9MCt0a1002767@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2910 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-22 08:55 EST ------- (In reply to comment #4) > Peter, > > yes, indeed, I had a couple of problematic pdb ids. As soon as I find the time, > I'll take a look at it and post them here. Hi Christian, Did you identify any other problem PDB files? Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Oct 22 09:08:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Oct 2009 09:08:11 -0400 Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected In-Reply-To: Message-ID: <200910221308.n9MD8B4a003222@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2910 ------- Comment #7 from schafer at rostlab.org 2009-10-22 09:08 EST ------- > Did you identify any other problem PDB files? Peter, not yet, sorry. I'm in the middle of publishing a paper. But I'm still on it. I'll let you know. Chris -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Oct 22 09:50:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Oct 2009 14:50:44 +0100 Subject: [Biopython-dev] First "new" contribution via git In-Reply-To: <320fb6e00910210439k48ee611exfcdf4b10067b7689@mail.gmail.com> References: <320fb6e00910210439k48ee611exfcdf4b10067b7689@mail.gmail.com> Message-ID: <320fb6e00910220650y2695cef4qd2ede67280942874@mail.gmail.com> On Wed, Oct 21, 2009 at 12:39 PM, Peter wrote: > Hi all, > > I'd just like to mention that I have just committed a tiny enhancement > from Chris Lasher (username gotgenes) to add a verbose option to > run_tests.py (using the git cherry-pick command to grab the single > commit for this change). > > I think this marks the first commit from a non-core developer since we > moved to github. I'm sure there will be many more to come (and not > just from Chris!) :) I should perhaps clarify this is the first commit from a non-core developer handled via git, and thus appearing under their git username. We have previously manually committed fixes posted on a git branch (e.g. some of Kyle's work for Jython).
Peter From dalloliogm at gmail.com Thu Oct 22 10:54:46 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 22 Oct 2009 16:54:46 +0200 Subject: [Biopython-dev] First "new" contribution via git In-Reply-To: <320fb6e00910220650y2695cef4qd2ede67280942874@mail.gmail.com> References: <320fb6e00910210439k48ee611exfcdf4b10067b7689@mail.gmail.com> <320fb6e00910220650y2695cef4qd2ede67280942874@mail.gmail.com> Message-ID: <5aa3b3570910220754p42cc5d14nddb9e8862bf4bc9d@mail.gmail.com> On Thu, Oct 22, 2009 at 3:50 PM, Peter wrote: > > I should perhaps clarify this is the first commit from a non-core > developer handled via git, and thus appearing under their git > username. We have previously manually committed fixes posted > on a git branch (e.g. some of Kyle's work for Jython). great! :-) -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Oct 22 15:54:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Oct 2009 20:54:28 +0100 Subject: [Biopython-dev] Biopython on 64 bit Windows Message-ID: <320fb6e00910221254s9df2270h335e9b2c15c70993@mail.gmail.com> Hi all, Prompted by Mike Lisanke's query on the main list, we should try and provide installers for 64 bit Windows. Do any of you here on the mailing list have a 64 bit Windows machine you would be willing to try installing Biopython from source on? Ideally of course, assuming it all works, I'd like a volunteer to provide installers for us too. I don't know what difference XP versus Vista versus Win 7 will make (even in 32 bit land). 
Thanks, Peter From bugzilla-daemon at portal.open-bio.org Sun Oct 25 10:01:27 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 25 Oct 2009 10:01:27 -0400 Subject: [Biopython-dev] [Bug 2929] NCBIXML PSI-Blast parser should gather all information from XML blastgpg output In-Reply-To: Message-ID: <200910251401.n9PE1RRQ019544@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2929 ------- Comment #3 from ibdeno at gmail.com 2009-10-25 10:01 EST ------- (In reply to comment #2) > What specifically is our parser failing to extract from this example PSI BLAST > XML file? > (Sorry, I've been away)

Well, currently the code tries to get several pieces of information from the Blast.Record.PSIBlast (brecord):

brecord.converged
brecord.query
brecord.query_letters
brecord.rounds
brecord.rounds.alignments
brecord.rounds.alignments.title
brecord.rounds.alignments.hsps

then in the hsps:

hsp.identities
hsp.positives
hsp.query
hsp.sbjct
hsp.match
hsp.expect
hsp.query_start
hsp.query_end
hsp.sbjct_start
hsp.sbjct_end

With different XML-tag names I think that all this information is present. As I said on the mail-list, it would be ideal if the XML parser for PSI-Blast would work in the same way as the current text-mode PSIBlast parser. Please, let me know if that was not clear or if you need further information. Thanks!

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sbassi at clubdelarazon.org Mon Oct 26 11:53:06 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Mon, 26 Oct 2009 12:53:06 -0300 Subject: [Biopython-dev] sff file Message-ID: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> Where can I get an sff file to make some tests? Biopython doesn't include any sample sff file, I guess this is due to file size limitation.
I would be happy if I can download a small (less than 10Mb) sff file. I went to ENTREZ Sequence Read Archive, and it seems they only provide fastq. From biopython at maubp.freeserve.co.uk Mon Oct 26 12:14:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Oct 2009 16:14:54 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> Message-ID: <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> On Mon, Oct 26, 2009 at 3:53 PM, Sebastian Bassi wrote: > Where can I get an sff file to make some tests? I have some on my SFF branch (under Tests/Roche): http://github.com/peterjc/biopython/tree/index (I used to have a branch called "sff", and another "index" for what ended up on the trunk as the new Bio.SeqIO.index function, but the two were linked for indexing SFF files). If you fancy trying the code, it offers reading, writing and indexing of SFF files, showing the full untrimmed sequence in the SeqRecord. > Biopython doesn't include any sample sff file, I guess this is due to > file size limitation. Size isn't an issue - the Roche tools will let you create reduced SFF files using just some of the records (e.g. a random subset). > I would be happy if I can download a small (less than 10Mb) sff file. > I went to ENTREZ Sequence Read Archive, and it seems they only provide > fastq. I gather the NCBI Short Read Archive used to offer SFF files, but sadly they do not any more.
I am aware of a few projects at Sanger with public SFF files - but these will all be large files, see: http://lists.open-bio.org/pipermail/biopython/2009-August/005443.html Peter From sbassi at clubdelarazon.org Mon Oct 26 12:48:04 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Mon, 26 Oct 2009 13:48:04 -0300 Subject: [Biopython-dev] sff file In-Reply-To: <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> Message-ID: <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> On Mon, Oct 26, 2009 at 1:14 PM, Peter wrote: > I have some on my SFF branch (under Tests/Roche): > http://github.com/peterjc/biopython/tree/index Thank you very much. Just for the record, the URL of the file is: http://github.com/peterjc/biopython/raw/index/Tests/Roche/E3MFGYR02_random_10_reads.sff Best, SB. From biopython at maubp.freeserve.co.uk Mon Oct 26 13:16:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Oct 2009 17:16:01 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> Message-ID: <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> On Mon, Oct 26, 2009 at 4:48 PM, Sebastian Bassi wrote: > On Mon, Oct 26, 2009 at 1:14 PM, Peter wrote: >> I have some on my SFF branch (under Tests/Roche): >> http://github.com/peterjc/biopython/tree/index > > Thank you very much. > Just for the record, the URL of the file is: > http://github.com/peterjc/biopython/raw/index/Tests/Roche/E3MFGYR02_random_10_reads.sff If you want a general example, that is a good choice. It is an unmodified file created by the Roche tools containing just 10 random reads. 
The folder also contains FASTA and QUAL files (with and without trimming), as converted by the Roche tools. Additionally I have some extra SFF files which are a little less typical... Peter From sbassi at clubdelarazon.org Mon Oct 26 13:54:53 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Mon, 26 Oct 2009 14:54:53 -0300 Subject: [Biopython-dev] sff file In-Reply-To: <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> Message-ID: <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> On Mon, Oct 26, 2009 at 2:16 PM, Peter wrote: > If you want a general example, that is a good choice. It is an unmodified file > created by the Roche tools containing just 10 random reads. The folder also > contains FASTA and QUAL files (with and without trimming), as converted > by the Roche tools.

Looks like there is a problem:

>>> from Bio import SeqIO
>>> fh = open('/home/sbassi/E3MFGYR02_random_10_reads.sff')
>>> for rec in SeqIO.parse(fh,'sff'):
        print rec.id

E3MFGYR02JWQ7T
E3MFGYR02JA6IL
(... cut ...)
>>> for rec in SeqIO.parse(fh,'sff'):
        print rec.seq

Traceback (most recent call last):
  File "", line 1, in
    for rec in SeqIO.parse(fh,'sff'):
  File "/usr/local/lib/python2.6/dist-packages/biopython-1.52-py2.6-linux-i686.egg/Bio/SeqIO/SffIO.py", line 354, in SffIterator
    = _sff_file_header(handle)
  File "/usr/local/lib/python2.6/dist-packages/biopython-1.52-py2.6-linux-i686.egg/Bio/SeqIO/SffIO.py", line 56, in _sff_file_header
    raise ValueError("Wrong SFF magic number in header")
ValueError: Wrong SFF magic number in header

I have 1.52 plus SffIO.py and __init__.py from your branch.
From biopython at maubp.freeserve.co.uk Mon Oct 26 14:00:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Oct 2009 18:00:32 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> Message-ID: <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> On Mon, Oct 26, 2009 at 5:54 PM, Sebastian Bassi wrote:
> On Mon, Oct 26, 2009 at 2:16 PM, Peter wrote:
>> If you want a general example, that is a good choice. It is an unmodified file
>> created by the Roche tools containing just 10 random reads. The folder also
>> contains FASTA and QUAL files (with and without trimming), as converted
>> by the Roche tools.
>
> Looks like there is a problem:
>
>>>> from Bio import SeqIO
>>>> fh = open('/home/sbassi/E3MFGYR02_random_10_reads.sff')

Try binary mode - although on my tests it isn't essential on Unix, it is on Windows.

>>>> for rec in SeqIO.parse(fh,'sff'):
>        print rec.id
>
>
> E3MFGYR02JWQ7T
> E3MFGYR02JA6IL
> (... cut ...)

OK, so that looks good.

>>>> for rec in SeqIO.parse(fh,'sff'):
>        print rec.seq
>
> Traceback (most recent call last):
> ...
> ValueError: Wrong SFF magic number in header
>
>
> I have 1.52 plus SffIO.py and __init__.py from your branch.

Are you using the same (finished) handle in the second example? That would be like opening an empty file... I think it is just the error message that is misleading here.
Peter From sbassi at clubdelarazon.org Mon Oct 26 14:09:57 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Mon, 26 Oct 2009 15:09:57 -0300 Subject: [Biopython-dev] sff file In-Reply-To: <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> Message-ID: <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> On Mon, Oct 26, 2009 at 3:00 PM, Peter wrote: > Try binary mode - although on my tests it isn't essential on Unix, > it is on Windows. I am in Linux Ubuntu 9.04 and I followed your advice and now I did: fh = open('/home/sbassi/E3MFGYR02_random_10_reads.sff','rb') And now it works: >>> for rec in SeqIO.parse(fh,'sff'): print rec.seq tcagGGTCTACATGTTGGTTAACCCGTACTGATTTGAATTGGCTCTTTGTCTTTCCAAAGGGAATTCATCTTCTTATGGCACACATAAAGGATAAATACAAGAATCTTCCTATTTACATCACTGAAAATGGCATGGCTGAATCAAGGAATGACTCAATACCAGTCAATGAAGCCCGCAAGGATAGTATAAGGATTAGATACCATGATGGCCATCTTAAATTCCTTCTTCAAGCGATCAAGGAAGGTGTTAATTTGAAGGGGCTTa tcagTTTTTTTTGGAAAGGAAAACGGACGTACTCATAGATGGATCATACTGACGTTAGGAAAATAATTCATAAGACAATAAGGAAACAAAGTGTAAAAAAAAAACCTAAATGCTCAAGGAAAATACATAGCCATCTGAACAGATTTCTGCTGGAAGCCACATTTCTCGTAGAACGCCTTGTTCTCGACGCTGCAATCAAGAATCACCTTGTAGCATCCCATTGAACGCGCATGCTCCGTGAGGAACTTGATGATTCTCTTTCCCAAATGcc (.... cut ...) So looks that binary mode is also needed for Linux. 
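Peter's question earlier in the thread, about re-using a handle that has already been read to the end, can be illustrated without Biopython. The following is a rough sketch using only the standard library: the toy file layout (a four-byte marker followed by length-prefixed names) is an assumption made up for this example and is NOT the real SFF format, and make_toy_file and parse_toy are hypothetical helper names.

```python
import io
import struct

MAGIC = b".sff"  # toy header marker; the record layout below is NOT real SFF


def make_toy_file(names):
    """Build an in-memory toy file: marker, then length-prefixed names."""
    body = b"".join(struct.pack(">B", len(n)) + n for n in names)
    return io.BytesIO(MAGIC + body)


def parse_toy(handle):
    """Yield names, reading from the handle's current position onwards."""
    if handle.read(4) != MAGIC:
        raise ValueError("Wrong magic number in header")
    while True:
        size = handle.read(1)
        if not size:
            return  # end of data
        yield handle.read(size[0])


handle = make_toy_file([b"E3MFGYR02JWQ7T", b"E3MFGYR02JA6IL"])
first_pass = list(parse_toy(handle))  # consumes the handle to the end

try:
    list(parse_toy(handle))  # nothing left to read, so the header check fails
    second_pass_error = None
except ValueError as err:
    second_pass_error = str(err)

handle.seek(0)  # rewind before parsing the same handle again
third_pass = list(parse_toy(handle))

print(first_pass)
print(second_pass_error)
print(third_pass == first_pass)
```

In this sketch the second pass fails on the header check simply because the read position is at the end of the data, independent of text or binary mode; rewinding with seek(0) (or opening a fresh handle) lets the third pass succeed.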
From biopython at maubp.freeserve.co.uk Mon Oct 26 14:43:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Oct 2009 18:43:50 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> Message-ID: <320fb6e00910261143x393d8d49s39b41253dcb02cb7@mail.gmail.com> On Mon, Oct 26, 2009 at 5:54 PM, Sebastian Bassi wrote: > > I have 1.52 plus SffIO.py and __init__.py from your branch. > You'll also want to get Bio/SeqIO/_index.py if you want to test random access to reads in an SFF file via the new Bio.SeqIO.index() function. This will read the Roche style SFF index if present (which is very fast) or just index the file directly (which is still reasonably quick). 
Peter From biopython at maubp.freeserve.co.uk Mon Oct 26 15:17:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Oct 2009 19:17:25 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> Message-ID: <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> On Mon, Oct 26, 2009 at 6:09 PM, Sebastian Bassi wrote: > > On Mon, Oct 26, 2009 at 3:00 PM, Peter wrote: >> Try binary mode - although on my tests it isn't essential on Unix, >> it is on Windows. > > I am in Linux Ubuntu 9.04 and I followed your advice and now I did: > > fh = open('/home/sbassi/E3MFGYR02_random_10_reads.sff','rb') > > And now it works: Opening SFF files in binary mode is good practice (as it is required for Windows), but is unrelated to your problem. It was just a simple "user error" coupled with a very unhelpful error message. I have updated my code so if you try and "re-parse" the same handle (without first doing handle.seek(0) to reset it), you get this: >>> from Bio import SeqIO >>> handle = open("E3MFGYR02_random_10_reads.sff", "rb") >>> for record in SeqIO.parse(handle, "sff") : print record.id ... E3MFGYR02JWQ7T E3MFGYR02JA6IL E3MFGYR02JHD4H E3MFGYR02GFKUC E3MFGYR02FTGED E3MFGYR02FR9G7 E3MFGYR02GAZMS E3MFGYR02HHZ8O E3MFGYR02GPGB1 E3MFGYR02F7Z7G >>> for record in SeqIO.parse(handle, "sff") : print record.id ... 
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/SffIO.py", line 378, in SffIterator = _sff_file_header(handle) File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/SffIO.py", line 77, in _sff_file_header raise ValueError("SFF handle seems to be at index block, not start") ValueError: SFF handle seems to be at index block, not start The code is now here, I wanted to get this semi-ready for merging to the trunk - depending on user feedback of course ;) http://github.com/peterjc/biopython/tree/sff-seqio (I feel the old index branch has served its purpose.) Peter From bioinformed at gmail.com Mon Oct 26 17:32:41 2009 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Mon, 26 Oct 2009 17:32:41 -0400 Subject: [Biopython-dev] sff file In-Reply-To: <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> Message-ID: <2e1434c10910261432l314c0cf5i304a04a9055d72b1@mail.gmail.com> At the risk of asking a dumb question, is this native SFF support better than what is available via BioLib?
~Kevin From biopython at maubp.freeserve.co.uk Mon Oct 26 18:24:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Oct 2009 22:24:21 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <2e1434c10910261432l314c0cf5i304a04a9055d72b1@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> <2e1434c10910261432l314c0cf5i304a04a9055d72b1@mail.gmail.com> Message-ID: <320fb6e00910261524j6255a8bp56ab79c0b436eb72@mail.gmail.com> On Mon, Oct 26, 2009 at 9:32 PM, Kevin Jacobs wrote: > At the risk of asking a dumb question, is this native SFF support > better than what is available via BioLib? > ~Kevin What do you mean by better? From my point of view, the nice thing about this (the Biopython SFF code) is it is integrated into the Bio.SeqIO system using SeqRecord objects, so you can use the same scripts etc. that you may have written for processing FASTA or FASTQ files. Also, it is pure Python which may be important for cross platform use (e.g. Jython, IronPython, ...). According to their webpage, the BioLib SFF support is via the Staden io_lib, which is probably pretty efficient.
http://biolib.open-bio.org/wiki/Main_Page Peter From sbassi at clubdelarazon.org Mon Oct 26 23:35:28 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Tue, 27 Oct 2009 00:35:28 -0300 Subject: [Biopython-dev] sff file In-Reply-To: <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> Message-ID: <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com> On Mon, Oct 26, 2009 at 4:17 PM, Peter wrote: > raise ValueError("SFF handle seems to be at index block, not start") > ValueError: SFF handle seems to be at index block, not start I see, the new error message is better now since gives the user a hint of the user error. I didn't realize my mistake at first because I am used to have an empty result when I make that mistake using a text format like fasta.
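A minimal sketch of the mistake discussed in this thread, using only the standard library rather than Bio.SeqIO. The helpers read_ids and check_magic are hypothetical stand-ins (not the SffIO code): the first mimics a text parser that quietly returns nothing on an exhausted handle, the second mimics a binary header check that fails loudly because an exhausted handle yields no bytes at all. It assumes the SFF header starts with the four bytes ".sff".

```python
import io


def read_ids(handle):
    """Toy FASTA parser: yield record identifiers from a text handle."""
    for line in handle:
        if line.startswith(">"):
            yield line[1:].split()[0]


def check_magic(handle):
    """Toy header check (hypothetical): expect the bytes b'.sff' first."""
    magic = handle.read(4)
    if magic != b".sff":
        raise ValueError("Wrong SFF magic number in header")


# Text format: a second pass over the same handle is silently empty.
text = io.StringIO(">read1\nACGT\n>read2\nGGCC\n")
first_pass = list(read_ids(text))
second_pass = list(read_ids(text))  # handle already at end of file
text.seek(0)                        # reset the handle before re-parsing
third_pass = list(read_ids(text))

# Binary format: the header check fails once the handle is exhausted.
binary = io.BytesIO(b".sff" + b"\x00" * 28)
check_magic(binary)  # fine on a fresh handle
binary.read()        # consume the rest, as a full parse would
try:
    check_magic(binary)
    second_check = "ok"
except ValueError as err:
    second_check = str(err)
```

Calling handle.seek(0) before the second loop, or simply reopening the file, avoids both outcomes.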
From biopython at maubp.freeserve.co.uk Tue Oct 27 06:03:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Oct 2009 10:03:05 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com> Message-ID: <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com> On Tue, Oct 27, 2009 at 3:35 AM, Sebastian Bassi wrote: > > On Mon, Oct 26, 2009 at 4:17 PM, Peter wrote: >> raise ValueError("SFF handle seems to be at index block, not start") >> ValueError: SFF handle seems to be at index block, not start > > I see, the new error message is better now since gives the user a hint > of the user error. > I didn't realize my mistake at first because I am used to have an > empty result when I make that mistake using a text format like fasta. Great - your feedback has made a difference :) I suppose an empty file could be allowed for SFF, but I don't really like this idea.
Peter From biopython at maubp.freeserve.co.uk Tue Oct 27 07:50:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Oct 2009 11:50:38 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com> <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com> Message-ID: <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com> On Tue, Oct 27, 2009 at 10:03 AM, Peter wrote: > > Great - your feedback has made a difference :) > As part of the polishing in anticipation of merging the SFF support into the trunk, I've just made some big additions to the docstring (with doctest examples) on the branch - it would be great if you could read over this at some point. http://github.com/peterjc/biopython/tree/sff-seqio What do you think of the current rather pragmatic way I'm handling trimming the SeqRecord objects? i.e. SeqIO file format "sff" gives the full data and supports reading and writing, while SeqIO format "sff-trim" only supports reading and gives trimmed sequences without the flow data. This is a bit of a hack, and the "sff-trim" format could be left out - but then we would need a nice way to trim the full length SeqRecord objects... 
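One possible way to trim a full length read by hand, sketched on a plain string rather than a SeqRecord. It assumes, as in the Roche style FASTA output, that the trimmed region is the central upper case run with lower case flanks; trim_to_upper is a hypothetical helper and not part of Bio.SeqIO, and it ignores the per-letter qualities and flow data that a real SeqRecord trim would also need to slice.

```python
def trim_to_upper(seq):
    """Return the span from the first to the last upper case letter."""
    upper = [i for i, letter in enumerate(seq) if letter.isupper()]
    if not upper:
        return ""  # nothing survives trimming
    return seq[upper[0]:upper[-1] + 1]


# Short made-up read in the same mixed case style as the SFF examples.
trimmed = trim_to_upper("tcagGGTCTACATGa")
```

Because the result is an ordinary slice, the same start and end positions could in principle be applied to a sequence object that supports slicing.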
Peter From bugzilla-daemon at portal.open-bio.org Tue Oct 27 11:50:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 27 Oct 2009 11:50:03 -0400 Subject: [Biopython-dev] [Bug 2938] New: Bio.Entrez.read() returns empty string for HTML (not an error) Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2938 Summary: Bio.Entrez.read() returns empty string for HTML (not an error) Product: Biopython Version: 1.52 Platform: PC OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk If given HTML instead of XML, Bio.Entrez.read() returns an empty string. I would have expected a helpful error message. e.g. >>> from Bio import Entrez >>> handle = Entrez.efetch(db="pubmed", id="17206916") >>> handle.readline() 'PmFetch response\n' Try parsing this HTML as if it were XML ... >>> handle = Entrez.efetch(db="pubmed", id="17206916") >>> "" == Entrez.read(handle) True i.e. Entrez.read is returning an empty string. Problem spotted based on a mailing list query, see this thread: http://lists.open-bio.org/pipermail/biopython/2009-October/005774.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
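A rough sketch of how non-XML input might be detected with the standard library expat parser. The function looks_like_xml is hypothetical and is not the Bio.Entrez code; it only illustrates two signals: a syntax error from the parser (raised for things like FASTA text) and whether the XML declaration handler was ever called (it is not for plain HTML, which can otherwise parse cleanly).

```python
import xml.parsers.expat


def looks_like_xml(data):
    """Return True if data parses and begins with an XML declaration.

    Raises ValueError if the parser reports a syntax error.
    """
    seen = {"declaration": False}

    def on_xml_decl(version, encoding, standalone):
        # Only called when the document starts with an XML declaration.
        seen["declaration"] = True

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        raise ValueError("Input is not XML: %s" % err)
    return seen["declaration"]


xml_result = looks_like_xml('<?xml version="1.0"?><eSummaryResult/>')
html_result = looks_like_xml("<html><body>PmFetch response</body></html>")
try:
    looks_like_xml(">seq1\nACGT\n")
    fasta_result = "parsed"
except ValueError:
    fasta_result = "rejected"
```

A caller could treat a False return as "well formed but probably not NCBI XML" and raise its own error at that point.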
From bugzilla-daemon at portal.open-bio.org Wed Oct 28 06:16:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 28 Oct 2009 06:16:32 -0400 Subject: [Biopython-dev] [Bug 2938] Bio.Entrez.read() returns empty string for HTML (not an error) In-Reply-To: Message-ID: <200910281016.n9SAGW9p017546@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2938 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2009-10-28 06:16 EST ------- It is relatively easy to check if the file starts with Message-ID: <200910281057.n9SAvgpY018500@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2938 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-28 06:57 EST ------- Good point - and hopefully the NCBI will make all their XML consistent. In the meantime, instead of the white list, how about a blacklist? i.e. If the data starts " Message-ID: <200910281112.n9SBC5Md018851@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2938 ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2009-10-28 07:12 EST ------- (In reply to comment #2) > In the meantime, instead of the white list, how about a blacklist? > i.e. If the data starts " We could also spot things like FASTA and GenBank files etc, and > as all we want to do is spot non-XML, this should be reliable. > One important point is that the initial tag is not handled as a regular XML tag by the parser. There is a separate handler method specific for parsing the tag. This makes it much easier to check if an XML document is really XML: If this special handler is never called, it's not XML. Checking for a FASTA and GenBank file is also relatively easy; the parser raises an xml.parsers.expat.ExpatError syntax error, which we can catch and transform in a more informative message. Checking for HTML is trickier. 
The parser will not raise an error, because except for the missing initial tag, the HTML could in principle be regarded as XML. To check if the input starts with , we'd have to read some data ahead, check for the , and pass the data to the parser if it seems to be OK. So I suggest we add a check for xml.parsers.expat.ExpatError syntax errors now, and add a check for the initial once NCBI has fixed the XML output to always contain this tag, but don't check for . -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Oct 28 07:26:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 28 Oct 2009 07:26:33 -0400 Subject: [Biopython-dev] [Bug 2938] Bio.Entrez.read() returns empty string for HTML (not an error) In-Reply-To: Message-ID: <200910281126.n9SBQXo9019213@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2938 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-28 07:26 EST ------- (In reply to comment #3) > (In reply to comment #2) > > In the meantime, instead of the white list, how about a blacklist? > > i.e. If the data starts " > We could also spot things like FASTA and GenBank files etc, and > > as all we want to do is spot non-XML, this should be reliable. > > > One important point is that the initial tag is not handled as a > regular XML tag by the parser. There is a separate handler method specific for > parsing the tag. This makes it much easier to check if an XML > document is really XML: If this special handler is never called, it's not XML. > > Checking for a FASTA and GenBank file is also relatively easy; the parser > raises an xml.parsers.expat.ExpatError syntax error, which we can catch and > transform in a more informative message. Sounds good. > Checking for HTML is trickier. 
The parser will not raise an error, because > except for the missing initial tag, the HTML could in principle be > regarded as XML. To check if the input starts with , we'd have to read > some data ahead, check for the , and pass the data to the parser if it > seems to be OK. Understood. > So I suggest we add a check for xml.parsers.expat.ExpatError syntax errors > now, and add a check for the initial once NCBI has fixed the XML > output to always contain this tag, but don't check for . +1 on adding the syntax error check now, that will be a worthwhile improvement in itself. Regarding flagging , is it currently a safe assumption that anything starting is NOT an NCBI XML file? If the NCBI will fix all their XML output to always start then great. I suspect it will take a while though. If you want to wait, fine. I'm happy to leave this decision to you - it's your module after all ;) Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Oct 28 08:07:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 28 Oct 2009 12:07:57 +0000 Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features Message-ID: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com> Hi all, I've been following a thread on the BioPerl mailing list about how to get the mature peptide amino acid sequences for mat_peptide features in a GenBank file (given in general these features do not include the translation, nor a GI number of Protein ID which can be looked up online). 
Chris summarised a working approach here: http://lists.open-bio.org/pipermail/bioperl-l/2009-October/031493.html Step one of this process is to be able to take a GenBank feature (here a mat_peptide) and use the location information to extract the relevant part of the parent nucleotide sequence (at the foot of the file). For example, http://www.ncbi.nlm.nih.gov/nuccore/112253723?report=genbank Consider mat_peptide nsp12, whose location is a little complex, join(12332..12358,12358..15117) - in Python terms, we need seq[12331:12358] + seq[12357:15117], although in general there are other concerns like the strand. Step two (in Chris' workflow) is to translate this into amino acids, and as a precaution, verify this is a subsequence of the precursor protein given in the previous CDS entry (protein ID ABI14446.1 in this case). This is quite straightforward. The first operation is tricky, but is actually a very general problem, and has come up before on the Biopython mailing lists, e.g. http://lists.open-bio.org/pipermail/biopython/2009-October/005695.html http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005991.html http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005997.html As noted in the linked threads, I have some (apparently) working code as function get_feature_nuc in the unit test file test_SeqIO_features.py I think this should be part of Biopython proper (with unit tests etc), and would like to discuss where to put it. My ideas include: (1) Method of the SeqFeature object taking the parent sequence (as a string, Seq, ...?) as a required argument. Would return an object of the same type as the parent sequence passed in. (2) Separate function, perhaps in Bio.SeqUtils taking the parent sequence (as a string, Seq, ...?) and a SeqFeature object. Would return an object of the same type as the parent sequence passed in. (3) Method of the Seq object taking a SeqFeature, returning a Seq. 
[A drawback is Bio.Seq currently does not depend on Bio.SeqFeature] (4) Method of the SeqRecord object taking a SeqFeature. Could return a SeqRecord using annotation from the SeqFeature. Complex. Any other ideas? We could even offer more than one of these approaches, but ideally there should be one obvious way for the end user to do this. My question is, which is most intuitive? I quite like idea (1). In terms of code complexity, I expect (1), (2) and (3) to be about the same. Building a SeqRecord in (4) is trickier. Options (1) and (2) are not tied to the sequence object, and could in theory support any Seq like object, plain strings - or in future even a SeqRecord: I have a git branch where the SeqRecord object supports addition and the reverse_complement method, which would work nicely here. See: http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html Peter From chapmanb at 50mail.com Wed Oct 28 08:07:33 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 28 Oct 2009 08:07:33 -0400 Subject: [Biopython-dev] [Biopython] Alignment object In-Reply-To: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> Message-ID: <20091028120733.GB22395@sobchak.mgh.harvard.edu> Peter and Eric; [Moving this over to biopython-dev and changing the subject] > > Here's +1 for Python counting. That would match SeqFeature and the > > ProteinDomain class in Bio.Tree.PhyloXML. > > > > While we're on this topic -- I have some unpublished code for rendering an > > alignment object in HTML, with plans for colorization, conservation > > profiles, etc. I rolled my own alignment class since the one in > > Bio.Align.Generic didn't have the attributes (start, end, selected columns) > > for a particular file format I was parsing. 
It's not urgent, but at some > > point could you publish your plans for the Alignment classes so I (and > > probably others) can stay/become compatible? > > My rough work in progress in on github - at the moment I'm still trying > things out, and don't assume anything is set in stone. If you want to > have a play with this code, feedback is very welcome - probably best > on the dev list rather than here. See: > > http://github.com/peterjc/biopython/tree/seqrecords > > (a lot of the alignment things I want to support, like slicing and adding > are very closely linked to doing the same operations to SeqRecords) > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From chapmanb at 50mail.com Wed Oct 28 08:18:33 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 28 Oct 2009 08:18:33 -0400 Subject: [Biopython-dev] [Biopython] Alignment object In-Reply-To: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> Message-ID: <20091028121833.GC22395@sobchak.mgh.harvard.edu> Peter and Eric; [Moving this over to biopython-dev and changing the subject] > > Here's +1 for Python counting. That would match SeqFeature and the > > ProteinDomain class in Bio.Tree.PhyloXML. Agreed. My opinion on the 0/1 mess is that data objects in code should expose all of the coordinates as 0-based, and that output and display files meant for biologists should be 1-based. > > While we're on this topic -- I have some unpublished code for rendering an > > alignment object in HTML, with plans for colorization, conservation > > profiles, etc. I rolled my own alignment class since the one in > > Bio.Align.Generic didn't have the attributes (start, end, selected columns) > > for a particular file format I was parsing. 
It's not urgent, but at some > > point could you publish your plans for the Alignment classes so I (and > > probably others) can stay/become compatible? > > My rough work in progress in on github - at the moment I'm still trying > things out, and don't assume anything is set in stone. If you want to > have a play with this code, feedback is very welcome - probably best > on the dev list rather than here. See: > > http://github.com/peterjc/biopython/tree/seqrecords > > (a lot of the alignment things I want to support, like slicing and adding > are very closely linked to doing the same operations to SeqRecords) The bx-python alignment object is nice and goes to/from MAF and AXT formats: http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/core.py This supports slicing by alignment coordinates and by reference coordinates for a species in the alignment. Some other useful features are limiting the alignment to specific species and removing all gap columns that can result. The representation is a high level Alignment object containing multiple Components. You can also index the files for quick lookup via range queries: http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py http://bcbio.wordpress.com/2009/07/26/sorting-genomic-alignments-using-python/ It's a nice implementation; it would be good to stay compatible with it and leverage as much as we can from what they've done. 
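The 0-based versus 1-based convention can be sketched on the GenBank location quoted earlier on the list, join(12332..12358,12358..15117), which in Python terms is seq[12331:12358] + seq[12357:15117]. The names to_slice and extract_join are hypothetical, the parent sequence is a plain string, and strand and fuzzy positions are ignored.

```python
def to_slice(start, end):
    """Convert 1-based inclusive coordinates to a 0-based half-open pair."""
    return start - 1, end


def extract_join(parent, parts):
    """Concatenate the sub-sequences for a list of 1-based (start, end) pairs."""
    pieces = []
    for start, end in parts:
        lo, hi = to_slice(start, end)
        pieces.append(parent[lo:hi])
    return "".join(pieces)


first_part = to_slice(12332, 12358)
second_part = to_slice(12358, 15117)

# Tiny made-up parent sequence; the two parts overlap by one letter,
# as in the nsp12 location.
joined = extract_join("ABCDEFGHIJ", [(2, 4), (4, 6)])
```

Keeping the conversion in one small function means display code can add the 1 back when writing coordinates out for biologists.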
Brad From biopython at maubp.freeserve.co.uk Wed Oct 28 08:50:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 28 Oct 2009 12:50:55 +0000 Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features In-Reply-To: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com> References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com> Message-ID: <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com> On Wed, Oct 28, 2009 at 12:07 PM, Peter wrote: > I think this should be part of Biopython proper (with unit tests etc), and > would like to discuss where to put it. My ideas include: > > (1) Method of the SeqFeature object taking the parent sequence (as a > string, Seq, ...?) as a required argument. Would return an object of the > same type as the parent sequence passed in. > > (2) Separate function, perhaps in Bio.SeqUtils taking the parent > sequence (as a string, Seq, ...?) and a SeqFeature object. Would > return an object of the same type as the parent sequence passed in. > > (3) Method of the Seq object taking a SeqFeature, returning a Seq. > [A drawback is Bio.Seq currently does not depend on Bio.SeqFeature] > > (4) Method of the SeqRecord object taking a SeqFeature. Could > return a SeqRecord using annotation from the SeqFeature. Complex. > > Any other ideas? > > We could even offer more than one of these approaches, but ideally > there should be one obvious way for the end user to do this. My > question is, which is most intuitive? I quite like idea (1). > > In terms of code complexity, I expect (1), (2) and (3) to be about the > same. Building a SeqRecord in (4) is trickier. Actually, thinking about this over lunch, for many of the use cases we do want to turn a SeqFeature into a SeqRecord - either for the nucleotides, or in some cases their translation. And if doing this, do something sensible with the SeqFeature annotation (qualifiers) seems generally to be useful. 
This could still be done with approaches (1) and (2) as well as (4). Peter From biopython at maubp.freeserve.co.uk Wed Oct 28 08:52:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 28 Oct 2009 12:52:28 +0000 Subject: [Biopython-dev] [Biopython] Alignment object In-Reply-To: <20091028121833.GC22395@sobchak.mgh.harvard.edu> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> <20091028121833.GC22395@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00910280552x7c5bfa6aw7b02a2e1dd0f8a7e@mail.gmail.com> On Wed, Oct 28, 2009 at 12:18 PM, Brad Chapman wrote: >> >> My rough work in progress in on github - at the moment I'm still trying >> things out, and don't assume anything is set in stone. If you want to >> have a play with this code, feedback is very welcome - probably best >> on the dev list rather than here. See: >> >> http://github.com/peterjc/biopython/tree/seqrecords >> >> (a lot of the alignment things I want to support, like slicing and adding >> are very closely linked to doing the same operations to SeqRecords) > > The bx-python alignment object is nice and goes to/from MAF and AXT > formats: > > http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/core.py > > This supports slicing by alignment coordinates and by reference > coordinates for a species in the alignment. Some other useful > features are limiting the alignment to specific species and removing > all gap columns that can result. The representation is a high level > Alignment object containing multiple Components. > > You can also index the files for quick lookup via range queries: > > http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py > http://bcbio.wordpress.com/2009/07/26/sorting-genomic-alignments-using-python/ > > It's a nice implementation; it would be good to stay compatible with it and leverage > as much as we can from what they've done. 
We also have to try and stay compatible with the existing Biopython alignment object though. But thanks for the bx links, I should take a look. Peter From sbassi at clubdelarazon.org Wed Oct 28 10:24:02 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 28 Oct 2009 11:24:02 -0300 Subject: [Biopython-dev] sff file In-Reply-To: <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com> <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com> <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com> Message-ID: <9e2f512b0910280724g2cb8d98o61fdd9aaae5a8965@mail.gmail.com> On Tue, Oct 27, 2009 at 8:50 AM, Peter wrote: > (with doctest examples) on the branch - it would be great if you > could read over this at some point. > http://github.com/peterjc/biopython/tree/sff-seqio I will take a look at it tonight. Best, SB. 
From sbassi at clubdelarazon.org Thu Oct 29 10:02:39 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Thu, 29 Oct 2009 11:02:39 -0300 Subject: [Biopython-dev] sff file In-Reply-To: <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com> <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com> <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com> Message-ID: <9e2f512b0910290702m76a8b9cev8f5ca89472af4925@mail.gmail.com> On Tue, Oct 27, 2009 at 8:50 AM, Peter wrote: > As part of the polishing in anticipation of merging the SFF support > into the trunk, I've just made some big additions to the docstring > (with doctest examples) on the branch - it would be great if you > could read over this at some point. > http://github.com/peterjc/biopython/tree/sff-seqio I've read it (you mean the code in SffIO.py). Regarding your questions: > What do you think of the current rather pragmatic way I'm > handling trimming the SeqRecord objects? i.e. SeqIO file format > "sff" gives the full data and supports reading and writing, while > SeqIO format "sff-trim" only supports reading and gives trimmed > sequences without the flow data. This is a bit of a hack, and the > "sff-trim" format could be left out - but then we would need a nice > way to trim the full length SeqRecord objects... sff-trim is OK for me but I am not familiar with this format. I see there are some mixed upper and lower case dna sequence, why? Are lower case bases with less quality? 
(like the both extremes in standards read). From biopython at maubp.freeserve.co.uk Thu Oct 29 10:09:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 29 Oct 2009 14:09:07 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <9e2f512b0910290702m76a8b9cev8f5ca89472af4925@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com> <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com> <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com> <9e2f512b0910290702m76a8b9cev8f5ca89472af4925@mail.gmail.com> Message-ID: <320fb6e00910290709r36ac58aeid4728004965481bf@mail.gmail.com> On Thu, Oct 29, 2009 at 2:02 PM, Sebastian Bassi wrote: > > On Tue, Oct 27, 2009 at 8:50 AM, Peter wrote: >> As part of the polishing in anticipation of merging the SFF support >> into the trunk, I've just made some big additions to the docstring >> (with doctest examples) on the branch - it would be great if you >> could read over this at some point. >> http://github.com/peterjc/biopython/tree/sff-seqio > > I've read it (you mean the code in SffIO.py). Regarding your questions: I meant the docstrings in Bio/SeqIO/SffIO.py (i.e. the comments which get exposed as the API help). >> What do you think of the current rather pragmatic way I'm >> handling trimming the SeqRecord objects? i.e. SeqIO file format >> "sff" gives the full data and supports reading and writing, while >> SeqIO format "sff-trim" only supports reading and gives trimmed >> sequences without the flow data. 
This is a bit of a hack, and the >> "sff-trim" format could be left out - but then we would need a nice >> way to trim the full length SeqRecord objects... > > sff-trim is OK for me but I am not familiar with this format. I see > there are some mixed upper and lower case dna sequence, why? > Are lower case bases with less quality? (like the both extremes in > standards read). Yes, they are in mixed case, and this is linked to the quality and adaptor sequences . I tried to explain in the SffIO docstring (near the top of Bio/SeqIO/SffIO.py) with examples and the following text: >> ... Notice that the sequence is given in mixed case, [there] the >> central upper case region corresponds to the trimmed sequence. >> This matches the output of the Roche tools (and the 3rd party tool >> sff_extract) for SFF to FASTA. I think I need to remove the word "there" from that paragraph ;) Peter From biopython at maubp.freeserve.co.uk Thu Oct 29 13:31:14 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 29 Oct 2009 17:31:14 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <320fb6e00910290709r36ac58aeid4728004965481bf@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com> <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com> <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com> <9e2f512b0910290702m76a8b9cev8f5ca89472af4925@mail.gmail.com> <320fb6e00910290709r36ac58aeid4728004965481bf@mail.gmail.com> Message-ID: <320fb6e00910291031s4ba6fdabj8de26b123d1a4126@mail.gmail.com> On Thu, Oct 29, 2009 at 2:09 PM, Peter wrote: > > I meant the docstrings in Bio/SeqIO/SffIO.py (i.e. 
the comments
> which get exposed as the API help).
> ...
> I think I need to remove the word "there" from that paragraph ;)
>

That typo is fixed, and I have also added a docstring/doctest example showing how to do simple primer trimming of an SFF file, giving a new SFF file where the clipping co-ordinates have been updated. Some of these examples will probably get moved into the tutorial if/when I merge this to the trunk.

Peter

From bugzilla-daemon at portal.open-bio.org Fri Oct 30 06:17:44 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 30 Oct 2009 06:17:44 -0400
Subject: [Biopython-dev] [Bug 2924] memory leak in cnexus.c
In-Reply-To:
Message-ID: <200910301017.n9UAHi0G013552@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2924

fkauff at biologie.uni-kl.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1370 is|0                           |1
           obsolete|                            |
         AssignedTo|biopython-dev at biopython.org |fkauff at biologie.uni-kl.de
             Status|NEW                         |ASSIGNED

------- Comment #3 from fkauff at biologie.uni-kl.de 2009-10-30 06:17 EST -------
Created an attachment (id=1380)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1380&action=view)
fixed memory leak of this bug (basically same as attachment from Joseph) and a second one (now line 67)

Fixed above memory leak (basically doing the same as Joseph) and fixed another in line 67 where it should read free(scanned_start) instead of free(scanned)

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From fkauff at biologie.uni-kl.de Fri Oct 30 06:20:57 2009 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Fri, 30 Oct 2009 11:20:57 +0100 Subject: [Biopython-dev] [Bug 2924] memory leak in cnexus.c In-Reply-To: <200910301017.n9UAHi0G013552@portal.open-bio.org> References: <200910301017.n9UAHi0G013552@portal.open-bio.org> Message-ID: <4AEABE09.8000003@biologie.uni-kl.de> ... and once I've learned this git stuff I'll submit the corrected cnexus.c to github... Frank On 10/30/2009 11:17 AM, bugzilla-daemon at portal.open-bio.org wrote: > http://bugzilla.open-bio.org/show_bug.cgi?id=2924 > > > fkauff at biologie.uni-kl.de changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Attachment #1370 is|0 |1 > obsolete| | > AssignedTo|biopython-dev at biopython.org |fkauff at biologie.uni-kl.de > Status|NEW |ASSIGNED > > > > > ------- Comment #3 from fkauff at biologie.uni-kl.de 2009-10-30 06:17 EST ------- > Created an attachment (id=1380) > --> (http://bugzilla.open-bio.org/attachment.cgi?id=1380&action=view) > fixed memory leak of this bug (basicall same as attachmenent from Joseph) and a > second one (now line 67) > > Fixed above memory leak (basically doing the same as Joseph) and fixed another > in line 67 where it should read free(scanned_start) instead of free(scanned) > > > From bugzilla-daemon at portal.open-bio.org Fri Oct 30 07:32:31 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 30 Oct 2009 07:32:31 -0400 Subject: [Biopython-dev] [Bug 2938] Bio.Entrez.read() returns empty string for HTML (not an error) In-Reply-To: Message-ID: <200910301132.n9UBWVRD016588@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2938 ------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2009-10-30 07:32 EST ------- I've added a syntax error check. 
This will raise a more informative error message if the data given to the parser is not in XML format (e.g., plain text). This does not yet check for HTML input though, so I'm leaving this bug open. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Oct 30 07:35:49 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 30 Oct 2009 07:35:49 -0400 Subject: [Biopython-dev] [Bug 2771] Bio.Entrez.read can't parse XML files from dbSNP (snp database) In-Reply-To: Message-ID: <200910301135.n9UBZnkv016735@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2771 ------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2009-10-30 07:35 EST ------- I've modified the parser such that it will raise an informative error message if an XML schema / namespace is encountered. Leaving this bug report open; we still need a parser for XML data using an XML schema. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Oct 1 09:04:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 1 Oct 2009 10:04:17 +0100 Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method? In-Reply-To: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com> References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com> Message-ID: <320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com> On Wed, Sep 30, 2009 at 4:27 PM, Peter wrote: > > This has meant that generally the current status quo isn't > a problem (at least for me). However, what prompted me > to work on this issue was a real world example. 
>
> We have a draft genome where after doing a basic
> annotation, it would make sense to flip the strands. I
> want to be able to load our current GenBank file, apply
> the reverse complement, and have all the annotated
> features recalculated to match. With more and more
> sequencing projects, this isn't such an odd thing to
> want to do.

The github branch has SeqRecord reverse complement working pretty well (with plenty of tests covering the fuzzy locations), and a first attempt at SeqRecord addition too:
http://github.com/peterjc/biopython/commits/seqrecords

This lets me solve my motivating example like so:

from Bio import SeqIO
# SeqIO.read returns the single record in the file
# (SeqIO.parse would return an iterator of records)
old_record = SeqIO.read(open("pBAD30.gb"), "gb")
handle = open("pBAD30_rc.gb", "w")
SeqIO.write([old_record.reverse_complement(...)], handle, "gb")
handle.close()

If I wanted to shift the origin, this would be possible by combining SeqRecord slicing and addition:

from Bio import SeqIO
cut = 3765
old_record = SeqIO.read(open("pBAD30.gb"), "gb")
handle = open("pBAD30_rc.gb", "w")
new_record = old_record[cut:] + old_record[:cut]
SeqIO.write([new_record], handle, "gb")
handle.close()

And of course you can do both (which is probably what I will be using for the real task from work that this example is based on):

from Bio import SeqIO
cut = 3765
old_record = SeqIO.read(open("pBAD30.gb"), "gb")
handle = open("pBAD30_rc.gb", "w")
new_record = (old_record[cut:] + old_record[:cut]).reverse_complement(...)
SeqIO.write([new_record], handle, "gb")
handle.close()

The general scheme is nice and simple I think, but the trouble is in the details. For this particular example, it makes sense for all the annotation to be preserved. For the reverse complement this is possible (although currently on my branch, this is not the default - hence the dot dot dot in the example above where right now this needs to be requested explicitly).
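To make "recalculated to match" concrete, here is a minimal sketch of the coordinate arithmetic involved, using plain tuples instead of SeqRecord and SeqFeature objects. The helper names and the (start, end, strand) layout are assumptions made for this illustration only and are not the API on the branch; fuzzy locations and features that span the cut point are deliberately left out.

```python
# Hypothetical helpers (not Biopython API). A feature is a
# (start, end, strand) tuple using Python-style half-open coordinates
# on a sequence of length n.

def reverse_complement_feature(feature, n):
    """Map a feature onto the reverse complement of a length-n sequence."""
    start, end, strand = feature
    # The old end becomes the new start (counted from the other end),
    # and the strand flips.
    return (n - end, n - start, -strand)


def shift_origin_feature(feature, cut, n):
    """Map a feature onto seq[cut:] + seq[:cut] for a length-n sequence.

    Assumes the feature does not span the cut point.
    """
    start, end, strand = feature
    if start >= cut:
        # Feature lies in seq[cut:], which now comes first.
        return (start - cut, end - cut, strand)
    if end <= cut:
        # Feature lies in seq[:cut], which now follows the n - cut
        # letters of seq[cut:].
        return (start + n - cut, end + n - cut, strand)
    raise ValueError("feature spans the cut point")


seq = "ABCDEFGHIJ"
gene = (2, 5, +1)  # covers "CDE"
rc_gene = reverse_complement_feature(gene, len(seq))
shifted_gene = shift_origin_feature(gene, 6, len(seq))
rotated = seq[6:] + seq[:6]
print(rc_gene, shifted_gene, rotated[shifted_gene[0]:shifted_gene[1]])
```

Doing both operations, as in the last example above, would amount to applying the origin shift first and then the reverse complement mapping to each feature.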
However, currently on SeqRecord slicing we take the cautious approach to the annotation, and the annotation dictionary and dbxrefs list are lost. On reflection, perhaps the more liberal straight forward approach is more useful: copy all the annotation (and leave it to the user to remove anything that becomes inappropriate). Then this code would "work": new_record = old_record[cut:] + old_record[:cut] Right now, based on the current slicing in the trunk, you have to copy these annotations manually: new_record = old_record[cut:] + old_record[:cut] new_record.annotations = old_record.annotations.copy() new_record.dbxrefs = old_record.dbxrefs[:] The question is which is preferable? The current slicing makes the user think about their annotation explicitly. The alternative is to blindly copy it, knowing that in some cases it will not be appropriate to the sub-record. Peter P.S. For those of you interested in multiple sequence alignments, once SeqRecord addition is dealt with, adding alignments becomes practical. i.e. taking two gene alignments for N species, and then concatenating them as discussed on Bug 2552: http://bugzilla.open-bio.org/show_bug.cgi?id=2552 From bugzilla-daemon at portal.open-bio.org Tue Oct 6 06:31:44 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 6 Oct 2009 02:31:44 -0400 Subject: [Biopython-dev] [Bug 2924] New: memory leak in cnexus.c Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2924 Summary: memory leak in cnexus.c Product: Biopython Version: 1.52 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: jheled at gmail.com There seem to be a serious leak in cnexus. The python documentation says, When memory buffers are passed as parameters to supply data to build objects, as for the `s' and `s#' formats, the required data is copied. 
Buffers provided by the caller are never referenced by the objects created by `Py_BuildValue()'. In other words, if your code invokes `malloc()' and passes the allocated memory to `Py_BuildValue()', your code is responsible for calling `free()' for that memory once `Py_BuildValue()' returns. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Oct 6 08:18:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 6 Oct 2009 04:18:45 -0400 Subject: [Biopython-dev] [Bug 2924] memory leak in cnexus.c In-Reply-To: Message-ID: <200910060818.n968Ij2i002732@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2924 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2009-10-06 04:18 EST ------- I think this really is a bug. The problem is in line 91: return Py_BuildValue("s",scanned_start); This should be something like: PyObject* result = Py_BuildValue("s",scanned_start); free(scanned); return result; Frank, if you agree, could you make this change? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Oct 6 08:33:17 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 6 Oct 2009 04:33:17 -0400 Subject: [Biopython-dev] [Bug 2924] memory leak in cnexus.c In-Reply-To: Message-ID: <200910060833.n968XHCf003024@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2924 ------- Comment #2 from jheled at gmail.com 2009-10-06 04:33 EST ------- Created an attachment (id=1370) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1370&action=view) fix the memory leak bug fix -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bioinformed at gmail.com Tue Oct 6 21:08:38 2009 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 6 Oct 2009 17:08:38 -0400 Subject: [Biopython-dev] population genetics, SNP data management and more In-Reply-To: <2e1434c10910061355q1098459crabf2a850a7bcaa1c@mail.gmail.com> References: <2e1434c10910061355q1098459crabf2a850a7bcaa1c@mail.gmail.com> Message-ID: <2e1434c10910061408j64576828m8a34579d628e443c@mail.gmail.com> Hi all, I'm the primary author of a suite of tools called GLU (Genotype Library & Utilities) that seems to have some features that may of of interest to BioPython developers. It is implemented in Python, uses NumPy, SciPy, PyTables (or h5py) and a few other common Python libraries, has the performance critical portions transcribed in C, and is available as open source under a BSD-like license. GLU implements a robust set of data management features for large SNP and general polymorphism data (human/mammalian for now, since we only support diploid and haploid genotypes). We regularly use it to manage datasets with 50 billion of SNP genotypes (>50k samples & > 1M SNPs). 
We define our own on-disk data representations in text, compressed text, and optimized binary formats, plus support PLINK and about a dozen other common formats. Our native binary storage is based on HDF5 and is quite robust and scalable. As a point of reference, the Phase I-III International HapMap data is ~13 GB in their text format, 1.3 GB with gzip compression, and 472 MB in GLU's HDF5-based LBAT format.

GLU includes modules that compute a range of descriptive statistics on genotype data quality, concordance, Mendelian consistency, relationship testing, consistency with Hardy-Weinberg proportions, and more. In addition, GLU includes modules to explore population structure, including estimation of admixture coefficients (like STRUCTURE, but with fixed source populations and frequencies) and principal components based on genetic correlations (like EIGENSTRAT and its ilk).

GLU also supports high-throughput association testing between dichotomous, polytomous, and continuous (Gaussian) variables and genetic effects (numerous models), covariates, and arbitrary interactions. Also supported is the rapid evaluation of pairwise linkage disequilibrium statistics and an advanced pairwise SNP tagging algorithm.

There are many other features in GLU, though it is not yet feature complete and the documentation is currently a bit of a work in progress. Feel free to take a look at: http://code.google.com/p/glu-genetics

From the PopGen wiki, it seems that there is a desire to implement some of these features within BioPython. I'm happy to help, contribute code from GLU where applicable, or at minimum share some of my experiences.
Best regards, -Kevin Jacobs From bugzilla-daemon at portal.open-bio.org Wed Oct 7 03:57:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 6 Oct 2009 23:57:14 -0400 Subject: [Biopython-dev] [Bug 2925] New: false exception in Bio.PDB.NeighborSearch.search Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2925 Summary: false exception in Bio.PDB.NeighborSearch.search Product: Biopython Version: 1.52 Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: 2lizhenhua at gmail.com Bio.PDB.NeighborSearch.search if there is no atom nearby(within a distance smaller than radius) and the level is not "A", an exception will be thrown. it would be better to return an empty list. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Oct 7 09:27:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 7 Oct 2009 05:27:09 -0400 Subject: [Biopython-dev] [Bug 2925] false exception in Bio.PDB.NeighborSearch.search In-Reply-To: Message-ID: <200910070927.n979R9Aa003927@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2925 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-07 05:27 EST ------- Could you show us a short script that demonstrates this problem please? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bartek at rezolwenta.eu.org Mon Oct 12 18:08:26 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 12 Oct 2009 20:08:26 +0200 Subject: [Biopython-dev] [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences. In-Reply-To: <8b34ec180909240551v649769f3keeae64f1ef31a633@mail.gmail.com> References: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com> <8b34ec180909240446u212a6112tde9279eb96f5a70a@mail.gmail.com> <320fb6e00909240509h1efb0a46y222067c77b77aa68@mail.gmail.com> <20090924122730.GL13500@sobchak.mgh.harvard.edu> <8b34ec180909240551v649769f3keeae64f1ef31a633@mail.gmail.com> Message-ID: <8b34ec180910121108l52ed4ec7l1461f6d3e1c00f62@mail.gmail.com> Hi all, On Thu, Sep 24, 2009 at 2:51 PM, Bartek Wilczynski wrote: > On Thu, Sep 24, 2009 at 2:27 PM, Brad Chapman wrote: >> >> A separate news post mentioning the C option speed and showing usage >> examples from both is a great idea. Responsiveness to new methods is >> the fun part of science. >> > I'll try to write that up and send it to the list. This took me, unfortunately more than I thought it would... The reasons are partially non-related (finishing a paper) and partially related to the matter. To put it short, my original plan was to include a wrapper for MOODS as a patch to biopython (if it is in the system -> use it) and include that information in this blog post. However, as I performed more tests of MOODS, I found out, that it might not be such a great idea. While the C module written for biopython by Michiel is working like a breeze, the MOODS package is a bit more moody... I needed to tweak the makefile to compile it on my mac, but it was working (most of the time) afterwards. Then I wanted to try on my linux box where it compiled with no problems, but it was giving me segfaults on my scripts which ran fine on a mac (it did run the simple examples though...). 
In addition to that, I found that the performance of MOODS was not always better than that of the brute force algorithm, which is already in Biopython. At the same time the maintainability of the Michiel's code is incomparable with the complex stuff they have. In conclusion, I don't think it is worth to put too much into the integration efforts now. I will try to contact the MOODS team about the issues I encountered and see whether they are interested in getting it integrated into biopython. If so, I can try to help with this, but it might be that we will just provide a function for getting a properly formatted log-odds matrix from Biopython motif for usage in MOODS. After all I think that not that many applications require the performance gains of MOODS over our current implemmentation. For the purpose of the blog post I've written a short script: http://github.com/barwil/biopython/blob/df0dfa8feeb15ce50d027d1492913f2d8920c9b3/Tests/Motif/moods_motif_benchmark.py which can be run assuming you have at least biopython 1.51+ installed and MOODS python bindings. It uses two different motifs showing possible behavior of Bio.Motif._pwm and MOODS. The output from my machine is as follows: reading the sequence took 0.603 seconds First motif: SRF MOODS calculation took 0.768 seconds on average Bio.Motif fast calculation took 2.407 seconds on average Second motif: Broad complex II MOODS calculation took 5.72 seconds on average Bio.Motif fast calculation took 2.687 seconds on average The averages are calculated from 10 runs, and they do not change substantially across different executions. I've made a biopython branch including this script and the additional function in Bio.Motif (for extracting log-odds in MOODS compatible format). I've also drafted a blog post, but i would greatly appreciate any help from people who are more skilled in writing. There's how it goes: """ In a recent article, Janne Korhonen et al. 
(Bioinformatics, 2009) introduce a new fast software library for finding motif occurrences in DNA sequences. They also compare the performance of their tool with currently available solutions from Bioperl and Biopython. Unfortunately, Biopython is the only tool in the comparison whose performance is measured based on a solution written in an interpreted language, while both MOODS and Bioperl are written in compiled languages (C++ and C, respectively). This, not surprisingly, shows Biopython as by far the slowest of the three. Since the authors made their comparisons, however, we have moved on: thanks to the C code contributed by Michiel de Hoon and included in the 1.51 release, Biopython's motif finding library has improved greatly and now performs comparably to the MOODS package. The results of a quick benchmark script (http://github.com/barwil/biopython/blob/df0dfa8feeb15ce50d027d1492913f2d8920c9b3/Tests/Motif/moods_motif_benchmark.py) indicate that a simple algorithm implemented in C is able to scan a whole chromosome (>23Mb) in less than 3s for a typical DNA motif. Depending on the motif, the advanced linear algorithm from the MOODS package can decrease (or in some cases even increase) this running time by a few seconds.
"""

It sounds quite dull to me, so I would greatly appreciate ideas on improving the text and making it less formal and boring...
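For readers wondering what the "simple algorithm" in the draft amounts to, the following is a minimal pure-Python sketch of a brute-force log-odds scan. It is only an illustration of the idea: it is not the C code in Bio.Motif._pwm, and the function name, the list-of-dicts matrix layout and the toy scores are all assumptions made for this example.

```python
def scan_log_odds(sequence, log_odds, threshold):
    """Return (position, score) for every window scoring >= threshold.

    log_odds is a list with one dict per motif column, mapping a base to
    its log-odds score. Windows containing any other letter (e.g. N) are
    skipped. Every window is scored in full, so the cost is proportional
    to len(sequence) * len(log_odds).
    """
    width = len(log_odds)
    hits = []
    for pos in range(len(sequence) - width + 1):
        score = 0.0
        for offset, column in enumerate(log_odds):
            base = sequence[pos + offset]
            if base not in column:
                break  # unknown letter: abandon this window
            score += column[base]
        else:
            # Only reached when the window had no unknown letters.
            if score >= threshold:
                hits.append((pos, score))
    return hits


# Toy two-column motif that prefers "AC" (scores are made up).
toy_motif = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
hits = scan_log_odds("TACGACN", toy_motif, 2.0)
print(hits)
```

A scan like this does the same amount of work per window regardless of the motif, which is consistent with the two Bio.Motif timings above being so similar, whereas the MOODS timings vary considerably between the two motifs.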
Cheers Bartek From bugzilla-daemon at portal.open-bio.org Tue Oct 13 06:58:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 13 Oct 2009 02:58:37 -0400 Subject: [Biopython-dev] [Bug 2927] New: Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2927 Summary: Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser Product: Biopython Version: 1.52 Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: blocker Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: ibdeno at gmail.com This is a problem with NCBIStandalone.PSIBlastParser, which I need to use instead of NCBIXML since the latter one lacks some record properties that I need. My code used to work until recently (say three months) and now it seems something has changed in the latest biopython (1.52-1, I install it on an intel OSX 10.5.8 via fink). I get the same problem irrespectively of whether I use python 2.5 or 2.6 and also the same for blastpgp 2.2.18 and 2.2.22 Here follows the relevant part of the code: #### blast_out, error_info = NCBIStandalone.blastpgp( blastcmd='/usr/local/blast-2.2.18/bin/blastpgp', database='/opt/BlastDBs/' + db, infile=file, npasses=passes, program='blastpgp', descriptions='500', alignments='1000', align_view='0', matrix_outfile=outbase + '.' + db + '.' 
+ str(passes) + '.pssm') b_parser = NCBIStandalone.PSIBlastParser() b_record = b_parser.parse(blast_out) #### And this is the error that I now get: #### File "/Users/mol/bin/lpbl.py", line 64, in doblast b_record = b_parser.parse(blast_out) File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 777, in parse self._scanner.feed(handle, self._consumer) File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 97, in feed self._scan_rounds(uhandle, consumer) File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 234, in _scan_rounds self._scan_alignments(uhandle, consumer) File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 376, in _scan_alignments self._scan_pairwise_alignments(uhandle, consumer) File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 386, in _scan_pairwise_alignments self._scan_one_pairwise_alignment(uhandle, consumer) File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 398, in _scan_one_pairwise_alignment self._scan_hsp(uhandle, consumer) File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 433, in _scan_hsp self._scan_hsp_alignment(uhandle, consumer) File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 464, in _scan_hsp_alignment read_and_call(uhandle, consumer.query, start='Query') File "/sw/lib/python2.6/site-packages/Bio/ParserSupport.py", line 303, in read_and_call method(line) File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 1138, in query raise ValueError("I could not find the query in line\n%s" % line) ValueError: I could not find the query in line Query: 0 - #### Now, the interesting thing is that if I run blastpgp directly and catch the output to a file, this file never includes such a line as: Query: 0 - Actually, if I modify my code so it reads this output file, the PSIBlastParser processes it without error. 
Not sure if this is relevant, but I have found that something may have changed in NCBIStandalone recently, namely, this bit: _query_re = re.compile(r"Query(:?) \s*(\d+)\s*(.+) (\d+)") def query(self, line): m = self._query_re.search(line) if m is None: raise ValueError("I could not find the query in line\n%s" % line) I will post log files in plain text and xml after submitting this bug report. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Oct 13 09:20:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 13 Oct 2009 05:20:29 -0400 Subject: [Biopython-dev] [Bug 2927] Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser In-Reply-To: Message-ID: <200910130920.n9D9KTVH018797@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2927 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-13 05:20 EST ------- Miguel has sent me his example text and XML output files directly by email (Bugzilla said they were too big). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Oct 13 10:59:11 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 13 Oct 2009 06:59:11 -0400
Subject: [Biopython-dev] [Bug 2927] Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser
In-Reply-To:
Message-ID: <200910131059.n9DAxBR4022148@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2927

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-13 06:59 EST -------
I have tried parsing your sample output, and it seems fine:

from Bio.Blast.NCBIStandalone import PSIBlastParser
b_parser = PSIBlastParser()
handle = open("Q3V4Q0.psiblast.txt")
b_record = b_parser.parse(handle)
handle.close()
for b_round in b_record.rounds :
    print "Round %i has %i alignments" \
          % (b_round.number, len(b_round.alignments))

Gives:

Round 1 has 385 alignments
Round 2 has 1000 alignments
Round 3 has 1000 alignments
Round 4 has 1000 alignments
Round 5 has 1000 alignments

I don't think the real problem is in the parser, but I will reply with more details on the mailing list:
http://lists.open-bio.org/pipermail/biopython/2009-October/005660.html

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Oct 14 12:41:23 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 14 Oct 2009 08:41:23 -0400 Subject: [Biopython-dev] [Bug 2927] Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser In-Reply-To: Message-ID: <200910141241.n9ECfNBP029019@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2927 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WORKSFORME ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-14 08:41 EST ------- As discussed on the mailing list, something about how blastpgp was being called via subprocess could lead to strange unparseable output. There may be some subtle issue with Bio.Blast.NCBIStandalone.blastpgp here, but in the long term that function will be phased out anyway. Miguel is now using Bio.Blast.Applications and subprocess to get BLAST to record its output directly to a file, and the problem has gone away. I'm closing this bug by marking it as "WORKSFORME" rather than "FIXED". -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Oct 15 15:17:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 15 Oct 2009 11:17:29 -0400 Subject: [Biopython-dev] [Bug 2927] Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser In-Reply-To: Message-ID: <200910151517.n9FFHTGn004094@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2927 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|WORKSFORME | ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-15 11:17 EST ------- After further discussion on the mailing list, it is still not clear what triggers these "Query: 0" lines, but the problem affects multiple versions of blastpgp and can apparently be seen at the command line (which means it isn't Biopython's fault in any way). We should update Bio.Blast.NCBIStandalone.PSIBlastParser to cope (probably by ignoring the "Query: 0" lines). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
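One way the "ignore the Query: 0 lines" idea from comment #4 could look is a small pre-filter that drops those lines before the text reaches the parser. This is only a sketch, not the fix that went into Biopython: the exact shape of the stray lines is assumed here (a line whose first token is "Query:" followed by a bare 0), and the helper names are hypothetical.

```python
import io
import re

# Assumption: the stray lines look like "Query: 0" optionally followed by
# whitespace or other text; real alignment lines start at position 1 or above.
_QUERY_ZERO = re.compile(r"^\s*Query:\s+0(\s|$)")


def drop_query_zero_lines(handle):
    """Yield every line from handle except the suspected 'Query: 0' lines."""
    for line in handle:
        if _QUERY_ZERO.match(line):
            continue
        yield line


def filtered_handle(handle):
    """Return a new in-memory, file-like object with the stray lines removed."""
    return io.StringIO("".join(drop_query_zero_lines(handle)))
```

The result of filtered_handle() is file-like, so it could be passed to a parser in place of the original handle; for very large outputs a temporary file would be kinder to memory than an in-memory buffer.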
From bugzilla-daemon at portal.open-bio.org Thu Oct 15 16:19:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 15 Oct 2009 12:19:38 -0400 Subject: [Biopython-dev] [Bug 2929] New: NCBIXML PSI-Blast parser should gather all information from XML blastgpg output Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2929 Summary: NCBIXML PSI-Blast parser should gather all information from XML blastgpg output Product: Biopython Version: 1.52 Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: ibdeno at gmail.com With the problems encountered while parsing plain text output from blastpgp, perhaps an answer would be to use the XML output of this program. The XML output seems to have evolved in recent versions of blastpgp and now all the info gets in a single proper XML file (not several concatenated files) and, in principle, it would seem that all the information in the plain text format can also be found in the XML one. I will attach an XML output for a PSI-Blast search that converges after 3 passes. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Oct 15 16:20:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 15 Oct 2009 12:20:33 -0400 Subject: [Biopython-dev] [Bug 2929] NCBIXML PSI-Blast parser should gather all information from XML blastgpg output In-Reply-To: Message-ID: <200910151620.n9FGKX1j006052@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2929 ------- Comment #1 from ibdeno at gmail.com 2009-10-15 12:20 EST ------- Created an attachment (id=1374) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1374&action=view) XML output from a converged run of blastpgp -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From jblanca at btc.upv.es Fri Oct 16 09:02:33 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Fri, 16 Oct 2009 11:02:33 +0200 Subject: [Biopython-dev] [Biopython] Adaptor trimmer and dimers In-Reply-To: <320fb6e00910150920y6ada0463s60ce0b4e5f788449@mail.gmail.com> References: <355533.31188.qm@web52001.mail.re2.yahoo.com> <320fb6e00910150920y6ada0463s60ce0b4e5f788449@mail.gmail.com> Message-ID: <200910161102.33821.jblanca@btc.upv.es> We also have some code to do that using exonerate. Take a look at the function create_vector_striper_by_alignment in http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/seq_cleaner.py Jose Blanca On Thursday 15 October 2009 18:20:47 Peter wrote: > On Thu, Oct 15, 2009 at 5:00 PM, natassa wrote: > > Hallo Biopythoners, > > I followed a recent thread conversation about adaptor trimming, > > which I intend to do on Illumina runs, and I am not sure I know > > where exactly in github I could find Brad Chapman's code for > > trimming AFTER modifications that he has done based on the > > thread conversation. ... 
> > I guess you mean Brad's August Blog Post: > http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/ > and the following mailing list thread which included some tips on > speeding up the Biopython side of things: > http://lists.open-bio.org/pipermail/biopython/2009-August/005417.html > > For anyone else interested, there are some simple examples in the > tutorial (using SeqRecord slicing - elegant and simple, but a bit slow): > http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-primer > http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-adaptor > > And I did a blog post about low level FASTQ handling for speed > at the cost of flexibility and simplicity (using some of the same > ideas from the August mailing list discussion): > http://news.open-bio.org/news/2009/09/biopython-fast-fastq/ > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From chris.lasher at gmail.com Sun Oct 18 05:22:29 2009 From: chris.lasher at gmail.com (Chris Lasher) Date: Sun, 18 Oct 2009 01:22:29 -0400 Subject: [Biopython-dev] Building Gene Ontology support into Biopython Message-ID: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> I have a need to work with the gene ontology (GO) and gene ontology annotations (GOAs) for my research. It seems Biopython still lacks GO support despite a few threads from several years ago. I'd like to make GO support in Biopython a reality now. I would really appreciate any help and suggestions. Bioperl has solid GO support. 
I don't find their code straightforward at all; I haven't picked out what component is responsible for what task. Nonetheless, it could provide starting points to build support for Biopython. Beyond looking through Bioperl code, though, I have several questions and I really welcome suggestions:

1) First off, does anyone have any gene ontology Python code lying around?

2) What is the Biopython stance on introducing third-party dependencies? The gene ontology is represented as a directed acyclic graph (DAG) and I want to use an existing graph library rather than roll our own. What would be the aversion to requiring either NetworkX or igraph as a dependency for the GO library? (I have experience with NetworkX and would prefer it, though I imagine igraph would be very similar for nearly all the methods we'd need to construct the DAG.)

3) What are parsers written using these days? I checked the tutorial section on them (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc209) but this wasn't explicitly covered. Any pointers to recently written parsers? I seem to recall Biopython has moved away from Martel parsers, correct? Has anything been done with pyparsing or some other parser, or is it strictly manual now? Also, I'm welcoming tips on the architecture of parsers in general.

4) Tying the GO Annotations to a fundamental Biopython data structure. This can't really be a SeqRecord object. SeqRecord.annotations makes sense; however, I can't guarantee a SeqRecord object will exist because the annotations don't come with the sequence itself. (A sequence is required to instantiate a SeqRecord object.) Any suggestions on this?

5) BioSQL support. Not having used BioSQL in the past, I'm a bit wary of adding this feature, but it is implemented in Bioperl. I haven't yet figured out if it's used as the default data store for their parsers or if it is only an optional store.

Comments most welcome. 
Best, Chris From mjldehoon at yahoo.com Sun Oct 18 08:05:10 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 18 Oct 2009 01:05:10 -0700 (PDT) Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> Message-ID: <426498.6116.qm@web62403.mail.re1.yahoo.com> --- On Sun, 10/18/09, Chris Lasher wrote: > I'd like to make GO support in Biopython a reality now. That would be nice. > Bioperl has solid GO support. I don't find their code > straightforward at all; I haven't picked out what component is > responsible for what task. To arrive at a good design of a Biopython module, sometimes it helps to write its documentation first, before writing the actual code. > 2) What is the Biopython stance on introducing third-party > dependencies? I think we should avoid them as much as possible. In addition to the additional hassle for users and developers, unforeseen changes in third-party dependencies may break your module. > What would be the aversion to requiring either NetworkX or igraph > as a dependency for the GO library. Are these Python modules or C software? Do NetworkX or igraph have their own third-party dependencies? Do we need the full NetworkX or igraph or just a part of it? In the latter case, assuming that these are open-source software packages, we may simply include the parts we need into Biopython. Also, how far do you get by using NumPy? > 3) What are parsers written using these days? Current parsers typically work as follows, assuming that a data file contains exactly one record: >>> handle = open("mydatafile") >>> from Bio import SomeModule >>> record = SomeModule.read(handle) # record is now a SomeModule.Record object If one data file typically contains multiple records, use a "parse" function to return an iterator: >>> handle = open("mydatafile") >>> from Bio import SomeModule >>> records = SomeModule.parse(handle) >>> for record in records: ... 
# record is now a SomeModule.Record object > Any pointers to recently written parsers? Bio.SeqIO.read and parse are good examples. Also you can look at Bio.Medline for a simple parser using this approach. > I seem to recall Biopython has moved away from Martel > parsers, correct? Yes. > Has anything been done with pyparsing or some other > parser, or is it strictly manual now? Not as far as I know. > Also, I'm welcoming tips on the > architecture of parsers in general. See above. Also note that few parsers nowadays use Bio.ParserSupport. This was previously used to implement parsers in Biopython (with parsers, scanners, and consumers). I would avoid Bio.ParserSupport and simply write a straightforward parser using the Python standard library. > 4) Tying the GO Annotations to a fundamental Biopython data > structure. > Any suggestions on this? A SeqRecord doesn't seem to be appropriate for gene ontology. How about a Record class specifically for GO? Also, what should such a class contain? Best, --Michiel. From biopython at maubp.freeserve.co.uk Sun Oct 18 10:34:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 18 Oct 2009 11:34:21 +0100 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> Message-ID: <320fb6e00910180334ke404ea3gad7e466e5d76c072@mail.gmail.com> On Sun, Oct 18, 2009 at 6:22 AM, Chris Lasher wrote: > I have a need to work with the gene ontology (GO) and gene ontology > annotations (GOAs) for my research. It seems Biopython still lacks GO > support despite a few threads from several years ago. I'd like to make > GO support in Biopython a reality now. I would really appreciate any > help and suggestions. 
In terms of missing functionality, it would help me greatly if you could describe the kind of things you want to achieve (and therefore how it may or may not need to connect to existing code like the SeqRecord and SeqFeature objects). > Bioperl has solid GO support. I don't find their code straightforward > at all; I haven't picked out what component is responsible for what > task. Nonetheless, it could provide starting points to build support > for Biopython. Yeah - I think Hilmar commented on some of these threads. Doing ontologies properly is hard work. > Beyond looking through Bioperl code, though, I have several questions > and I really welcome suggestions: > > 1) First off, does anyone have any gene ontology Python code > laying around? Not quite what you wanted, but Ed Cannon has an OBO to OWL parser in his github repository, http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006701.html > 2) What is the Biopython stance on introducing third-party > dependencies? The gene ontology is represented a directed acyclic > graph (DAG) and I want to use an existing graph library rather than > roll our own. What would be the aversion to requiring either NetworkX > or igraph as a dependency for the GO library. (I have experience with > NetworkX and would prefer it, though I imagine igraph would be very > similar for nearly all the methods we'd need access to to construct > the DAG) As Michiel said, we prefer to avoid 3rd party dependencies, *especially* build time ones. Wrappers for 3rd party command line tools are fine. Currently we do have a number of optional python dependencies for specific functionality - e.g. ReportLab for graphics, and assorted SQL database backends. The python library NetworkX may fall into this category. Adding another dependency should not be done lightly. > 3) What are parsers written using these days? 
I checked the tutorial > section on them > (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc209) but > this wasn't explicitly covered. Any pointers to recently written > parsers? I seem to recall Biopython has moved away from Martel > parsers, correct? Has anything been done with pyparsing or some other > parser, or is it strictly manual now? Also, I'm welcoming tips on the > architecture of parsers in general. Martel is gone. Everything is done in plain python these days. The coding styles vary - some are scanner/consumer, but using iterators for large files (returning natural chunks of data in steps) is normal. For things like XML, there are (several) parsers in the python standard libraries. > 4) Tying the GO Annotations to a fundamental Biopython data structure. > This can't really be a SeqRecord object. SeqRecord.annotations makes > sense, however, I can't guarantee a SeqRecord object will exist > because the annotations don't come with the sequence itself. (A > sequence is required to instantiate a SeqRecord object). Any > suggestions on this? Background to the task would help. Note you can create a SeqRecord without a sequence, but it may not be sensible. See for example the QUAL file parser which uses the new UnknownSeq object where we just know the sequence length. > 5) BioSQL support. Not having used BioSQL in the past, I'm a bit wary > of adding this feature, but it is implemented in Bioperl. I haven't > yet figured out if it's used as the default data store for their > parsers or if it is only an optional store. I would describe BioSQL as an optional data store, particularly suited to holding GenBank or EMBL files. Biopython has BioSQL support (as do BioJava etc). We follow BioPerl and use a loose ad-hoc ontology, but the BioSQL schema is designed to allow proper ontologies. This is something I have raised on the BioSQL mailing list. 
Related to this, EMBOSS have done a lot of work mapping between the ontologies used in GenBank, EMBL, UniProt and the standard sequence ontology - something I'm hoping we may be able to re-use in our planned support for GFF3 files. Peter From chapmanb at 50mail.com Sun Oct 18 16:34:36 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 18 Oct 2009 12:34:36 -0400 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> Message-ID: <20091018163436.GA66322@kunkel> Hi Chris; > I'd like to make GO support in Biopython a reality now. Awesome. Great to have you working on this. > Bioperl has solid GO support. I don't find their code straightforward > at all; I haven't picked out what component is responsible for what > task. Nonetheless, it could provide starting points to build support > for Biopython. Agreed. I worked on a tiny bit of Gene Ontology stuff a while ago and Chris Mungall was very helpful in explaining some of the high level decisions. > 1) First off, does anyone have any gene ontology Python code laying around? I have a couple of things here: http://bioinformatics.org/cgi-bin/cvsweb.cgi/biopy-pgml/Bio/PGML/GO/ CVS says they haven't been touched in 7 years. Feel free to use it if it's helpful. I took the approach of working directly off an installed database as opposed to flat files. > 2) What is the Biopython stance on introducing third-party > dependencies? I think Michiel and Peter tackled this, but generally the approach has been to keep Biopython as a base library that doesn't require a lot of installs to get going. As far as graph libraries go, networkx is good and Eric did some work with it for the PhyloXML library this summer. 
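To give a feel for how much graph machinery the GO use case needs, here is a minimal sketch of a term DAG with an ancestor query, written with the standard library only. It is an illustration under stated assumptions, not a proposed Biopython API: the class name TermGraph, its methods, and the term identifiers used with it are all hypothetical, and only "is_a"-style child-to-parent edges are modelled. With NetworkX installed, a DiGraph plus its built-in traversal functions would cover the same ground.

```python
class TermGraph:
    """Toy directed acyclic graph of ontology terms (child -> parent edges)."""

    def __init__(self):
        # Maps each term ID to the set of its direct parents.
        self._parents = {}

    def add_is_a(self, child, parent):
        """Record that `child` is a kind of `parent`."""
        self._parents.setdefault(child, set()).add(parent)
        self._parents.setdefault(parent, set())

    def ancestors(self, term):
        """Return all terms reachable from `term` by following parent edges."""
        seen = set()
        stack = list(self._parents.get(term, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self._parents.get(node, ()))
        return seen
```

A term with two parents that share a grandparent, the typical DAG shape that a plain tree cannot represent, is handled because visited nodes are tracked in a set rather than assumed to be unique along each path.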
Thanks again for taking this on, Brad From chris.lasher at gmail.com Mon Oct 19 04:26:48 2009 From: chris.lasher at gmail.com (Chris Lasher) Date: Mon, 19 Oct 2009 00:26:48 -0400 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <20091018163436.GA66322@kunkel> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> <20091018163436.GA66322@kunkel> Message-ID: <128a885f0910182126i74f7712bo2cb6e7d612532e5@mail.gmail.com> On Sun, Oct 18, 2009 at 12:34 PM, Brad Chapman wrote: > > Hi Chris; > > > I'd like to make GO support in Biopython a reality now. > > Awesome. Great to have you working on this. > > > Bioperl has solid GO support. I don't find their code straightforward > > at all; I haven't picked out what component is responsible for what > > task. Nonetheless, it could provide starting points to build support > > for Biopython. > > Agreed. I worked on a tiny bit of Gene Ontology stuff a while ago > and Chris Mungall was very helpful in explaining some of the high > level decisions. > > > 1) First off, does anyone have any gene ontology Python code laying around? > > I have a couple of things here: > > http://bioinformatics.org/cgi-bin/cvsweb.cgi/biopy-pgml/Bio/PGML/GO/ > > CVS says they haven't been touched in 7 years. Feel free to use it > if it's helpful. I took the approach of working directly off an > installed database as opposed to flat files. > > > 2) What is the Biopython stance on introducing third-party > > dependencies? > > I think Michiel and Peter tackled this, but generally the approach > has been to keep Biopython as a base library that doesn't require a > lot of installs to get going. > > As far as graph libraries go, networkx is good and Eric did some > work with it for the PhyloXML library this summer. > > Thanks again for taking this on, > Brad Right, well, first off, thanks for your input so far, guys. 
I don't have time tonight to reply to individual points but I went ahead and started a wiki page to coordinate this. http://biopython.org/wiki/Gene_Ontology It's a wiki, so you know what to do if you have an idea or a question. I'm going to go ahead and make the executive decision to use NetworkX. I think BioPerl's Ontology framework has both third-party dependency-based (Graph.pm) and non-dependency-based solutions. Maybe we can figure out something similar, but NetworkX is such an easy dependency to satisfy that I'm going with it. Looks like this is going to be a busy week. Chris From dalloliogm at gmail.com Mon Oct 19 08:32:31 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 19 Oct 2009 10:32:31 +0200 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> Message-ID: <5aa3b3570910190132p1f8ef258uff95f912f50d9ea5@mail.gmail.com> On Sun, Oct 18, 2009 at 7:22 AM, Chris Lasher wrote: > 2) What is the Biopython stance on introducing third-party > dependencies? The gene ontology is represented a directed acyclic > graph (DAG) and I want to use an existing graph library rather than > roll our own. What would be the aversion to requiring either NetworkX > or igraph as a dependency for the GO library. (I have experience with > NetworkX and would prefer it, though I imagine igraph would be very > similar for nearly all the methods we'd need access to to construct > the DAG) > Introducing NetworkX as a dependency would also open the road to modules for working with pathways and networks in Biopython. For example, I have a partially complete script that parses KEGG's KGML pathway files and puts them into a NetworkX object. The problem is that Biopython is a monolithic package - you have to install all of it or nothing. 
Maybe in the future (it is just a thought) it would be better to have it as a repository of packages, like BioConductor. > 3) What are parsers written using these days? I checked the tutorial > section on them > (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc209) but > this wasn't explicitly covered. Any pointers to recently written > parsers? I seem to recall Biopython has moved away from Martel > parsers, correct? Has anything been done with pyparsing or some other > parser, or is it strictly manual now? Also, I'm welcoming tips on the > architecture of parsers in general. > > 4) Tying the GO Annotations to a fundamental Biopython data structure. > This can't really be a SeqRecord object. SeqRecord.annotations makes > sense, however, I can't guarantee a SeqRecord object will exist > because the annotations don't come with the sequence itself. (A > sequence is required to instantiate a SeqRecord object). Any > suggestions on this? > What about using zope.component and zope.interface? It is an alternative approach to object-oriented programming, based on the experience that the Zope developers gained with Zope 2, which was a mess of similar classes and objects that became too difficult to maintain. - http://wiki.zope.org/zope3/ComponentArchitectureApproach Comments most welcome. 
> > Best, > Chris > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Oct 19 08:55:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Oct 2009 09:55:59 +0100 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <5aa3b3570910190132p1f8ef258uff95f912f50d9ea5@mail.gmail.com> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> <5aa3b3570910190132p1f8ef258uff95f912f50d9ea5@mail.gmail.com> Message-ID: <320fb6e00910190155j33394e43v2577427c6375a077@mail.gmail.com> On Mon, Oct 19, 2009 at 9:32 AM, Giovanni Marco Dall'Olio wrote: > On Sun, Oct 18, 2009 at 7:22 AM, Chris Lasher wrote: > >> 2) What is the Biopython stance on introducing third-party >> dependencies? The gene ontology is represented a directed acyclic >> graph (DAG) and I want to use an existing graph library rather than >> roll our own. What would be the aversion to requiring either NetworkX >> or igraph as a dependency for the GO library. (I have experience with >> NetworkX and would prefer it, though I imagine igraph would be very >> similar for nearly all the methods we'd need access to to construct >> the DAG) > > introducing networkx as a dependency would also open the road to > modules to work with pathways and networkx with biopython. > For example, I have a partially complete script to parse Kegg's KGML > files for pathway and put them into a networkx object. I've not used NetworkX personally, but it looks cool. The only network analysis I've done in Python used NumPy for adjacency matrices, and GraphViz via pydot for graphical output. 
> The problem is that biopython is a monolitic packages - you have > to install it all or nothing. And why is that a problem? This is a serious question. NumPy is a build time dependency (due to the C code), but pure python dependencies like MySQLdb (or potentially NetworkX) can be installed after Biopython, if and when they are needed. > Maybe in a future (it is just a tought) it would be better to > have it as a repository of packages, like BioConductor. If/when PyPI becomes the standard way to deal with Python packages and interdependencies, then that might be workable. But without some system like that in place, you'll only make installation harder. Out of interest, have you ever tried installing BioPerl (via CPAN)? There is a lot to be said for a single, simple-to-install package, as we have now. Peter From dalloliogm at gmail.com Mon Oct 19 09:46:43 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 19 Oct 2009 11:46:43 +0200 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <320fb6e00910190155j33394e43v2577427c6375a077@mail.gmail.com> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> <5aa3b3570910190132p1f8ef258uff95f912f50d9ea5@mail.gmail.com> <320fb6e00910190155j33394e43v2577427c6375a077@mail.gmail.com> Message-ID: <5aa3b3570910190246u7a175626xe9e97781dee460a3@mail.gmail.com> On Mon, Oct 19, 2009 at 10:55 AM, Peter wrote: > On Mon, Oct 19, 2009 at 9:32 AM, Giovanni Marco Dall'Olio > wrote: > > On Sun, Oct 18, 2009 at 7:22 AM, Chris Lasher >wrote: > > > >> 2) What is the Biopython stance on introducing third-party > >> dependencies? > > > > > The problem is that biopython is a monolitic packages - you have > > to install it all or nothing. > > And why is that a problem? This is a serious question. NumPy > is a build time dependency (due to the C code), but pure python > dependencies like MySQLdb (or potentially NetworkX) can be > installed after Biopython, if and when then are needed. 
> I didn't really mean to say it is a problem; I think it has some disadvantages and some advantages, like everything :-) Biopython is now easier to install, because people can just download a package or a module or use easy_install, and some common guidelines (how to write documentation, the SeqRecord system) make it easier to maintain. On the other hand, when you propose a new module you have to take care not to add new dependencies, which is something the BioConductor developers don't have to worry about. Anyway, it is true that without a good system to download and install packages automatically, BioConductor and CRAN would have been different. > > Maybe in a future (it is just a tought) it would be better to > > have it as a repository of packages, like BioConductor. > > If/when PyPI becomes the standard way to deal with Python packages > and interdependencies, then that might be workable. But without some > system like that in place, you'll only make installation harder. I agree with that. By the way, I have heard that there is a lot of discussion on the python-dev mailing list about a package called 'distribute' ( http://packages.python.org/distribute/setuptools.html ) which in the future may replace setuptools while remaining compatible with it. In fact, setuptools and easy_install have not been updated for a long time now; let's see if this improves things soon.... > Out of > interest, have you ever tried installing BioPerl (via CPAN)? There is > a lot to be said for a single simple to install package as now. > No, I haven't.... but I am scared of perl in general :-) > Peter > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From kellrott at gmail.com Mon Oct 19 17:18:03 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Mon, 19 Oct 2009 10:18:03 -0700 Subject: [Biopython-dev] Pfam24/HMMER3 (and GO terms...) 
Message-ID: Pfam24 was published last week ( http://pfam.sanger.ac.uk/ ); it uses HMMER3 to do some rather fast HMM based protein identification (of about 11,912 families). I've got an initial port of the PfamScan perl script (found at ftp://ftp.sanger.ac.uk/pub/rdf/PfamScanBeta/ ) working in Biopython. Currently the layout somewhat mirrors the Perl module layout, but that can be evolved to be more 'pythonesque'. The interface is not yet done (it mainly works just to print out results; the internal data structures aren't very clear). Thoughts and suggestions on how people would use this in their Python scripts would be helpful. And in regard to the current GO conversation, there is a table that connects Pfam families to GO terms ( ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases/Pfam24.0/database_files/gene_ontology.sql.gz ), so connecting this work to the suggested GO modules would probably be beneficial. You can find the work at http://github.com/kellrott/biopython/, under the Bio.Pfam module. Kyle From biopython at maubp.freeserve.co.uk Mon Oct 19 17:29:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Oct 2009 18:29:26 +0100 Subject: [Biopython-dev] Pfam24/HMMER3 (and GO terms...) In-Reply-To: References: Message-ID: <320fb6e00910191029k14b1ae56gd7dcd5db93f9c598@mail.gmail.com> On Mon, Oct 19, 2009 at 6:18 PM, Kyle Ellrott wrote: > Pfam24 was published last week ( http://pfam.sanger.ac.uk/ ) , it > utilizes HMMER3 to do some rather fast HMM based protein > identification (of about 11,912 families). I've gotten an initial > port of the PfamScan perl script found at > ftp://ftp.sanger.ac.uk/pub/rdf/PfamScanBeta/ ported to BioPython. Perhaps I have misunderstood you (and I have not looked at the code yet), but have you just re-written the PFAM perl script pfam_scan.pl in python? If so, what is the aim? 
OK, it might be a bit faster - but you would be duplicating the work of the PFAM team and creating a long term maintenance burden. I can see the value of having an HMMER3 output parser, and a command line wrapper for calling it. This will be useful for things outside of PFAM. I can see the value of having a pfam_scan.pl output parser (XML, CSV, or the possible JSON), and a command line wrapper for calling it. Peter From kellrott at gmail.com Mon Oct 19 17:46:20 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Mon, 19 Oct 2009 10:46:20 -0700 Subject: [Biopython-dev] Pfam24/HMMER3 (and GO terms...) In-Reply-To: <320fb6e00910191029k14b1ae56gd7dcd5db93f9c598@mail.gmail.com> References: <320fb6e00910191029k14b1ae56gd7dcd5db93f9c598@mail.gmail.com> Message-ID: I've started with a close rewrite of the original PfamScan script to make sure the python script works equivalently to the original. Now that it works (for basic tests), I will begin putting in better data interfaces. The Bio.Pfam.HMM module should work by itself as a HMMER3 module. But it needs some examples, and probably some work on making the interface cleaner. We could also move the code to Bio.HMMER, rather than having it as a submodule of Bio.Pfam. This was primarily motivated by the dependency hell associated with trying to get pfam_scan.pl to work on a cluster. pfam_scan.pl relies on BioPerl and Moose. From the readme: 'Moose itself has quite a few dependencies, so don't worry if it looks like you're installing half of CPAN !'. The code I've produced works within the BioPython framework with no additional dependencies. pfam_scan.pl just does format parsing and table linking. The heavy work is done in HMMER. The dependency cost of pfam_scan.pl is just too great considering its functionality can be easily replicated in BioPython. > Perhaps I have misunderstood you (and I have not looked at > the code yet), but have you just re-written the PFAM perl script > pfam_scan.pl in python?
If so, what is the aim? OK, it might be > a bit faster - but you would be duplicating the work of the PFAM > team and creating a long term maintenance burden. > > I can see the value of having an HMMER3 output parser, and > a command line wrapper for calling it. This will be useful for > things outside of PFAM. > > I can see the value of having a pfam_scan.pl output parser (XML, > CSV, or the possible JSON), and a command line wrapper for > calling it. > > Peter > From biopython at maubp.freeserve.co.uk Mon Oct 19 19:02:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Oct 2009 20:02:56 +0100 Subject: [Biopython-dev] Pfam24/HMMER3 (and GO terms...) In-Reply-To: References: <320fb6e00910191029k14b1ae56gd7dcd5db93f9c598@mail.gmail.com> Message-ID: <320fb6e00910191202g64a74c21tf77da909a0356eb6@mail.gmail.com> On Mon, Oct 19, 2009 at 6:46 PM, Kyle Ellrott wrote: > I've started with a close rewrite of the original PfamScan script to > make sure the python script works equivalently to the original. Now > that it works (for basic tests), I will begin putting in better data > interfaces. The Bio.Pfam.HMM module should work by itself as a HMMER3 > module. But it needs some examples, and probably some work on making > the interface cleaner. We could also move the code to Bio.HMMER, > rather than having it as a submodule of Bio.Pfam. That sounds good.
Peter From bugzilla-daemon at portal.open-bio.org Tue Oct 20 10:45:34 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 20 Oct 2009 06:45:34 -0400 Subject: [Biopython-dev] [Bug 2090] Blast.NCBIStandalone BlastParser fails with blastall 2.2.14 In-Reply-To: Message-ID: <200910201045.n9KAjY4e030244@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2090 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #17 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-20 06:45 EST ------- Fixed in github, tested on the two examples here and also output from BLAST 2.2.20 and 2.2.21 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Oct 20 10:47:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Oct 2009 11:47:29 +0100 Subject: [Biopython-dev] Plain text BLAST parser updated Message-ID: <320fb6e00910200347j3857cbbdyc62ea3d39a05b357@mail.gmail.com> Hi all, Just to let you know I've been doing a little work on the NCBI plain text parser, and got it to work on multiquery output from recent versions of BLAST (Bug 2090). http://bugzilla.open-bio.org/show_bug.cgi?id=2090 I would not describe the changes as elegant, but the plain text parser has evolved over time to cope with more and more NCBI variations, so some ugliness is perhaps to be expected. If there are any regressions, please report them and we can extend the test suite. Likewise, if you have any recent plain text BLAST files which didn't work and still don't work - get in touch, it may be easy to fix.
[I'd still encourage everyone to use the XML output by default, but there are times when the plain text is the only or best option.] Peter From peter at maubp.freeserve.co.uk Tue Oct 20 11:56:51 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Oct 2009 12:56:51 +0100 Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.22 now available In-Reply-To: <33559E80-E78D-4CCB-8E8C-79C36E89C007@ncbi.nlm.nih.gov> References: <33559E80-E78D-4CCB-8E8C-79C36E89C007@ncbi.nlm.nih.gov> Message-ID: <320fb6e00910200456mbac8d28ra1385c102b899c9a@mail.gmail.com> Hi all, The new NCBI BLAST tools are out now, and I'd only just updated my desktop to BLAST 2.2.21 this morning! It looks like the "old style" blastall etc (which are written in C) are much the same, but we will need to add Bio.Blast.Applications wrappers for the new "BLAST+" tools (written in C++). On the bright side, the Biopython tutorial needed updating anyway to switch from Bio.Blast.NCBIStandalone.blastall(...) to using Bio.Blast.Applications and subprocess. Peter ---------- Forwarded message ---------- From: mcginnis Date: Tue, Oct 20, 2009 at 12:42 PM Subject: [blast-announce] BLAST 2.2.22 now available To: blast-announce at ncbi.nlm.nih.gov BLAST 2.2.22 now available This release includes new BLAST+ command-line applications. The BLAST+ applications have a number of advantages over the older applications and users are encouraged to migrate to the new applications. The new applications can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST These applications have been built with the NCBI C++ toolkit. Changes from the last release are listed below. The older C toolkit applications (e.g., blastall) are still available at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.22/ Changes from the last release are listed below.
Please send questions or comments to blast-help at ncbi.nlm.nih.gov Changes for the BLAST+ applications: * Added entrez_query command line option for restricting remote BLAST databases. * Added support for psi-tblastn to the tblastn command line application via the -in_pssm option. * Improved documentation for subject masking feature in user manual. * User interface improvements to windowmasker. * Made the specification of BLAST databases to resolve GIs/accessions configurable. * update_blastdb.pl downloads and checks BLAST database MD5 checksum files. * Allow long words with blastp. * Added support for overriding megablast index when importing search strategy files. * Added support for best-hit algorithm parameters in strategy files. * Bug fixes in blastx and tblastn with genomic sequences, subject masking, blastdbcheck, and the SEG filtering algorithm. Changes for C applications: * Blastall was not able to use BLAST databases with only accessions to format results, this has been fixed. From biopython at maubp.freeserve.co.uk Tue Oct 20 13:24:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Oct 2009 14:24:09 +0100 Subject: [Biopython-dev] Plain text BLAST parser updated In-Reply-To: <320fb6e00910200347j3857cbbdyc62ea3d39a05b357@mail.gmail.com> References: <320fb6e00910200347j3857cbbdyc62ea3d39a05b357@mail.gmail.com> Message-ID: <320fb6e00910200624s4d2662axb08df052f39b9ceb@mail.gmail.com> On Tue, Oct 20, 2009 at 11:47 AM, Peter wrote: > Hi all, > > Just to let you know I've been doing a little work on the NCBI plain > text parser, and got it to work on multiquery output from recent > versions of BLAST (Bug 2090). > > http://bugzilla.open-bio.org/show_bug.cgi?id=2090 > > I would not describe the changes as elegant, but the plain text parser > has evolved over time to cope with more and more NCBI variations, so > some ugliness is perhaps to be expected.
> > If there are any regressions, please report them and we can extend the > test suite. Likewise, if you have any recent plain text BLAST files > which didn't work and still don't work - get in touch, it may be easy > to fix. > > [I'd still encourage everyone to use the XML output by default, but > there are times when the plain text is the only or best option.] > > Peter The irony of this isn't lost on me. An hour after fixing the parser and testing it with BLAST 2.2.21 output, the NCBI made a release which broke it again. The output from the latest "classic" blastall 2.2.22 is fine. The output from the "new C++" blastx 2.2.22+ (and likely the other tools like blastp etc which are all separate executables now) breaks our plain text parser. I'll also be trying out the XML output which *should* be fine. Peter From biopython at maubp.freeserve.co.uk Tue Oct 20 15:08:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Oct 2009 16:08:55 +0100 Subject: [Biopython-dev] Plain text BLAST parser updated In-Reply-To: <320fb6e00910200624s4d2662axb08df052f39b9ceb@mail.gmail.com> References: <320fb6e00910200347j3857cbbdyc62ea3d39a05b357@mail.gmail.com> <320fb6e00910200624s4d2662axb08df052f39b9ceb@mail.gmail.com> Message-ID: <320fb6e00910200808m3fa1a56dp2578d387d318cc5a@mail.gmail.com> On Tue, Oct 20, 2009 at 2:24 PM, Peter wrote: > On Tue, Oct 20, 2009 at 11:47 AM, Peter wrote: >> Hi all, >> >> Just to let you know I've been doing a little work on the NCBI plain >> text parser, and got it to work on multiquery output from recent >> versions of BLAST (Bug 2090). >> >> http://bugzilla.open-bio.org/show_bug.cgi?id=2090 >> >> I would not describe the changes as elegant, but the plain text parser >> has evolved over time to cope with more and more NCBI variations, so >> some ugliness is perhaps to be expected. >> >> If there are any regressions, please report them and we can extend the >> test suite.
Likewise, if you have any recent plain text BLAST files >> which didn't work and still don't work - get in touch, it may be easy >> to fix. >> >> [I'd still encourage everyone to use the XML output by default, but >> there are times when the plain text is the only or best option.] >> >> Peter > > The irony of this isn't lost on me. An hour after fixing the parser > and testing it with BLAST 2.2.21 output, the NCBI made a release > which broke it again. > > The output from the latest "classic" blastall 2.2.22 is fine. > > The output from the "new C++" blastx 2.2.22+ (and likely the > other tools like blastp etc which are all separate executables > now) breaks our plain text parser. Touch wood, that is now working with the latest code in the public repository. It really needs a few more example files covering more than just the new blastx output... Peter From biopython at maubp.freeserve.co.uk Tue Oct 20 15:59:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Oct 2009 16:59:09 +0100 Subject: [Biopython-dev] Plain text BLAST parser updated In-Reply-To: <320fb6e00910200624s4d2662axb08df052f39b9ceb@mail.gmail.com> References: <320fb6e00910200347j3857cbbdyc62ea3d39a05b357@mail.gmail.com> <320fb6e00910200624s4d2662axb08df052f39b9ceb@mail.gmail.com> Message-ID: <320fb6e00910200859i4c7fa800j10cad1abe10a007a@mail.gmail.com> On Tue, Oct 20, 2009 at 2:24 PM, Peter wrote: > > The output from the "new C++" blastx 2.2.22+ (and likely the > other tools like blastp etc which are all separate executables > now) breaks our plain text parser. > > I'll also be trying out the XML output which *should* be fine. > That seems to be fine - although I did find the BLAST record's database_sequences property wasn't being populated, just the alias num_sequences_in_database - fixed in github.
Peter From biopython at maubp.freeserve.co.uk Wed Oct 21 11:39:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Oct 2009 12:39:07 +0100 Subject: [Biopython-dev] First "new" contribution via git Message-ID: <320fb6e00910210439k48ee611exfcdf4b10067b7689@mail.gmail.com> Hi all, I'd just like to mention that I have just committed a tiny enhancement from Chris Lasher (username gotgenes) to add a verbose option to run_tests.py (using the git cherry-pick command to grab the single commit for this change). I think this marks the first commit from a non-core developer since we moved to github. I'm sure there will be many more to come (and not just from Chris!) :) Peter From bugzilla-daemon at portal.open-bio.org Wed Oct 21 19:45:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 21 Oct 2009 15:45:14 -0400 Subject: [Biopython-dev] [Bug 2931] New: Error in PDBList() code for get_all_obsolete Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2931 Summary: Error in PDBList() code for get_all_obsolete Product: Biopython Version: 1.52 Platform: PC OS/Version: Windows XP Status: NEW Severity: normal Priority: P2 Component: Other AssignedTo: biopython-dev at biopython.org ReportedBy: TallPaulInJax at yahoo.com Hi, I believe this code: # extract pdb codes obsolete = map(lambda x: x[21:25].lower(), filter(lambda x: x[:6] == 'OBSLTE', url.readlines())) Should instead be # extract pdb codes obsolete = map(lambda x: x[20:24].lower(), filter(lambda x: x[:6] == 'OBSLTE', url.readlines())) As-is, upon testing, it is missing the first character of the PDB code and reading one character past its end. Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
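[Editorial sketch, not part of the original thread.] The off-by-one described in Bug 2931 can be reproduced without network access. This is only a minimal illustration under stated assumptions: the sample line is made up and laid out the way the report implies (the four-character code starting at index 20), and the helper names `code_by_slice` and `code_by_split` are hypothetical, not Bio.PDB functions.

```python
# Made-up OBSLTE line, built piece by piece so the column layout is explicit.
# The layout is an assumption inferred from the bug report, not a format spec.
SAMPLE = "OBSLTE" + " " * 4 + "31-JUL-94" + " " + "116L" + " " * 5 + "216L"


def code_by_slice(line, start, end):
    """Return the lower-cased text in the half-open slice [start:end]."""
    return line[start:end].lower()


def code_by_split(line):
    """Return the third whitespace-separated field, lower-cased.

    Splitting on whitespace does not depend on exact column positions,
    so it tolerates small layout changes better than fixed slices do.
    """
    return line.split()[2].lower()


if __name__ == "__main__":
    print(repr(code_by_slice(SAMPLE, 21, 25)))  # slice from the original code
    print(repr(code_by_slice(SAMPLE, 20, 24)))  # slice proposed in the report
    print(repr(code_by_split(SAMPLE)))          # column-independent variant
```

On this sample the first slice drops the leading character and picks up a trailing space, which matches the symptom reported; whether whitespace splitting is preferable to fixed columns in the real file is a judgment call.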
From bugzilla-daemon at portal.open-bio.org Wed Oct 21 20:26:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 21 Oct 2009 16:26:40 -0400 Subject: [Biopython-dev] [Bug 2933] New: PDBList() get_status_list bug Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2933 Summary: PDBList() get_status_list bug Product: Biopython Version: 1.52 Platform: PC OS/Version: Windows XP Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: TallPaulInJax at yahoo.com Hi, Upon testing, I believe the following code in the PDBList class get_status_list method is based on an older file format for added.pdb, modified.pdb, and obsolete.pdb: # added by S. Lee list = map(lambda x: x[3:7], \ filter(lambda x: x[-4:] == '.ent', \ map(lambda x: x.split()[-1], file))) I think the file format used to be: -rw-r--r-- 1 rcsb rcsb 330156 Oct 14 2003 pdb1cyq.ent -rw-r--r-- 1 rcsb rcsb 333639 Oct 14 2003 pdb1cz0.ent Now the file format is simply: 1cyq 1cz0 etc Therefore, I believe the correct code to be: list = map(lambda x: x[0:4], file) Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Oct 21 21:53:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 21 Oct 2009 17:53:45 -0400 Subject: [Biopython-dev] [Bug 2931] Error in PDBList() code for get_all_obsolete In-Reply-To: Message-ID: <200910212153.n9LLrjtL006614@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2931 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-21 17:53 EST ------- Have you got an example to demonstrate this issue? e.g. a PDB files and a tiny script which we can turn into a unit test? 
Thanks Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Oct 21 21:55:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 21 Oct 2009 17:55:12 -0400 Subject: [Biopython-dev] [Bug 2933] PDBList() get_status_list bug In-Reply-To: Message-ID: <200910212155.n9LLtC3f006654@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2933 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-21 17:55 EST ------- Can you give us a tiny script to demonstrate the problem? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Oct 21 22:39:04 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 21 Oct 2009 18:39:04 -0400 Subject: [Biopython-dev] [Bug 2931] Error in PDBList() code for get_all_obsolete In-Reply-To: Message-ID: <200910212239.n9LMd4kI007914@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2931 TallPaulInJax at yahoo.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |TallPaulInJax at yahoo.com ------- Comment #2 from TallPaulInJax at yahoo.com 2009-10-21 18:39 EST ------- Here is some test code: from Bio.PDB import PDBList p = PDBList() obsolete = p.get_all_obsolete() print obsolete Printout is below: these should be full four digit codes, but the first digit is missing and the space after the last char is instead at the end.['16l ', '25d ', '4ps ', '51c ', '56b ', '79l ', 'a0v ', 'a0w ', 'a0x ', 'a0y ', 'a10 ', 'a1y ', 'a6o ', 'a9d ', 'a9k ', 'aa8 ', 'aak ', 'abh ', 'abk ', 'abm ', 
'abp ', 'abx ', 'ace ', 'ack ', 'act ', 'ada ', 'adh ', 'adk ', 'adm ', 'afg ', 'afn ', 'ak3 ', 'alo ', 'alp ', 'alr ', 'am3 ', 'amg ', 'amv ', 'anh ', 'ape ', 'app ', 'apr ', 'ar3 ', 'ara ', 'arn ', 'as9 ', 'asi ', 'at7 ', 'atc ', 'atq ', 'aub ', 'axf ', 'ayh ', 'ayq ', 'az9 ', 'aza ', 'b1w ', 'b2n ', 'b3m ', 'b5c ', 'b6n ', 'b6o ', 'b7c ', 'b8b ', 'b91 ', 'baa ', 'baq ', 'bcl ', 'bdp ', 'ber ', 'bfl ', 'bgh ', 'bgr ', 'bjl ', 'bkq ', 'bl2 ', 'blm ', 'blw ', 'bme ', 'bmi ', 'bmy ', 'bn2 ', 'bnh ', 'bqv ', 'br7 ', 'buk ', 'bur ', 'bv0 ', 'bv5 ', 'bv6 ', -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Oct 21 22:48:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 21 Oct 2009 18:48:08 -0400 Subject: [Biopython-dev] [Bug 2933] PDBList() get_status_list bug In-Reply-To: Message-ID: <200910212248.n9LMm8GK008083@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2933 TallPaulInJax at yahoo.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |TallPaulInJax at yahoo.com ------- Comment #2 from TallPaulInJax at yahoo.com 2009-10-21 18:48 EST ------- With the code as-is: list = map(lambda x: x[3:7], \ filter(lambda x: x[-4:] == '.ent', \ map(lambda x: x.split()[-1], file))) This test code: p = PDBList() [added,modified,obsolete] = p.get_recent_changes() print "Added=", added print "Modified=", modified print "Obsolete=", obsolete Results in: Added= [] Modified= [] Obsolete= [] Yet visually these (20091016) weekly files have entries in them. 
Changing the code to list = map(lambda x: x[0:4],file) Results in: Added= ['2k9d', '2k9i', '2kac', '2kap', '2kc5', '2kdx', '2khi', '2khj', '2khs', '2ki0', '2ki2', '2kj4', '2klh', '2klu', '2v57', '2w9y', '2wgb', '2wl1', '2wor', '2wos', '2wri', '2wrj', '2wrk', '2wrl', '2wrn', '2wro', '2wrq', '2wrr', '2wu3', '2wu4', '2wu6', '2wu7', '2wud', '2wue', '2wuf', '2wug', '2wul', '2wuz', '2wv1', '2zuh', '2zui', '2zuj', '3a0n', '3a0r', '3a0s', '3a0t', '3a0u', '3a0v', '3a0w', '3a0x', '3a0y', '3a0z', '3a10', '3a2k', '3a3t', '3a4k', '3a4l', '3a4m', '3a4n', '3eq2', '3es2', '3evh', '3evl', '3ew4', '3ewt', '3ewv', '3f3m', '3f3q', '3f3r', '3f55', '3f68', '3f7e', '3fao', '3fei', '3fej', '3fhp', '3fhu', '3fou', '3ft7', '3g05', '3g7a', '3g9w', '3gea', '3gew', '3gfu', '3ggh', '3gi3', '3glj', '3gne', '3gns', '3gnt', '3gnv', '3gnw', '3gr6', '3gxk', '3gxr', '3h1t', '3h53', '3h54', '3h55', '3h6q', '3h6r', '3h6s', '3h89', '3h8b', '3h8c', '3h8n', '3hi7', '3hig', '3hii', '3hj2', '3hj7', '3hjh', '3hlr', '3hpv', '3hpy', '3hq0', '3hqh', '3hqi', '3hql', '3hqm', '3hqs', '3hqt', '3hrq', '3hrr', '3hsn', '3hso', '3hsp', '3hsv', '3htk', '3htm', '3hu6', '3hvs', '3hvv', '3hvx', '3hx3', '3hy5', '3hyf', '3hzl', '3i43', '3i99', '3i9p', '3igu', '3im0', '3ipj', '3ira', '3is5', '3it8', '3it9', '3ita', '3itb', '3ius', '3iuy', '3ivb', '3ivq', '3ivv', '3jpx', '3jr6', '3jr7', '3jrn', '3jty', '3juh', '3jux', '3jvf', '3jwg', '3jwh', '3jwi', '3jwj', '3jwp', '3jxu', '3jyu', '3jz1', '3jz2', '3k1y', '3k20', '3k2i', '3k2w', '3k31', '3k4s', '3k5e', '3k5h', '3k5i', '3k63', '3k67', '3k6a', '3k6r'] Modified= ['1lug', '1nlf', '1u8d', '1uab', '1vst', '2dxb', '2dxc', '2e52', '2gj5', '2gqg', '2gvl', '2gyt', '2igo', '2jjn', '2jjo', '2jxw', '2kal', '2kdr', '2kis', '2klf', '2klg', '2klv', '2koe', '2kon', '2qrm', '2qrp', '2qrq', '2rfk', '2uzj', '2v0y', '2v1p', '2v4v', '2vn4', '2vn7', '2wio', '2wjg', '2wml', '2wn4', '2wn5', '2wn6', '2wn7', '2wn8', '2wnb', '2wnf', '2wqi', '2wqj', '2wr0', '2wr1', '2wr2', '2wr3', '2wr4', '2wr5', '2wr7', 
'2wrb', '2wrc', '2wrd', '2wre', '2wrf', '2wrg', '2zc9', '2zda', '2zf9', '2zfp', '2zgx', '2zo3', '2zva', '2zzd', '3a21', '3a22', '3a23', '3a25', '3a26', '3a27', '3a5v', '3dge', '3dgf', '3dhk', '3djo', '3djp', '3djq', '3djv', '3djx', '3dux', '3eft', '3eli', '3ess', '3f30', '3f4y', '3f4z', '3f50', '3fan', '3fje', '3fjf', '3fjh', '3fji', '3fjj', '3fjk', '3ftq', '3fwe', '3g5d', '3g6o', '3g6w', '3gfb', '3gfk', '3gjn', '3gwu', '3gwv', '3gww', '3h3g', '3h90', '3h94', '3h9i', '3h9t', '3hg1', '3hhm', '3hhs', '3hif', '3hiz', '3hjf', '3hk2', '3hm9', '3ho1', '3hom', '3hto', '3htp', '3htq', '3htt', '3htx', '3hvr', '3hxm', '3hxt', '3hy2', '3hy3', '3hy4', '3hy6', '3i53', '3i58', '3i5u', '3i64', '3i65', '3i68', '3i6r', '3iai', '3ibg', '3icv', '3icw', '3idq', '3ig3', '3iiw', '3iiy', '3ij0', '3ij1', '3ijc', '3ijt', '3ikw', '3ilr', '3inn', '3ir8', '3jsm', '3jwq', '3jyi', '3k2x', '3k45', '3k47'] Obsolete= ['2b3n', '2b7w', '2frn', '2fuw', '2wem', '3bw5'] -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Oct 22 11:23:34 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Oct 2009 07:23:34 -0400 Subject: [Biopython-dev] [Bug 2918] Entrez parser fails on Jython - XMLParser lacks SetParamEntityParsing In-Reply-To: Message-ID: <200910221123.n9MBNYtT032594@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2918 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-22 07:23 EST ------- In the short term, we'll skip test_Entrez.py under Jython to avoid screenfulls of error messages about SetParamEntityParsing. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
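[Editorial sketch, not part of the original thread.] The change proposed in Bug 2933 above amounts to reading the PDB code from the start of each line instead of from a `pdbXXXX.ent` file name. A rough illustration that accepts both layouts quoted in the report follows; the function name `extract_pdb_codes` is hypothetical and this is not the fix that was committed to Biopython.

```python
def extract_pdb_codes(lines):
    """Collect four-character PDB codes from status-list lines.

    Handles the two layouts quoted in the bug report:
      * older directory-listing style, ending in 'pdb1cyq.ent'
      * newer style, one bare code per line, e.g. '1cyq'
    """
    codes = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[-1]
        if name.startswith("pdb") and name.endswith(".ent"):
            codes.append(name[3:7])  # strip the 'pdb' prefix and '.ent' suffix
        else:
            codes.append(name[:4])   # bare code at the start of the line
    return codes


if __name__ == "__main__":
    old_style = [
        "-rw-r--r-- 1 rcsb rcsb 330156 Oct 14 2003 pdb1cyq.ent\n",
        "-rw-r--r-- 1 rcsb rcsb 333639 Oct 14 2003 pdb1cz0.ent\n",
    ]
    new_style = ["1cyq\n", "1cz0\n"]
    print(extract_pdb_codes(old_style))
    print(extract_pdb_codes(new_style))
```

Accepting both layouts is a design choice made here for illustration; the report itself suggests simply taking `x[0:4]` from each line.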
From bugzilla-daemon at portal.open-bio.org Thu Oct 22 12:12:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Oct 2009 08:12:59 -0400 Subject: [Biopython-dev] [Bug 2929] NCBIXML PSI-Blast parser should gather all information from XML blastgpg output In-Reply-To: Message-ID: <200910221212.n9MCCx0p001591@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2929 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-22 08:12 EST ------- What specifically is our parser failing to extract from this example PSI BLAST XML file? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Oct 22 12:42:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Oct 2009 08:42:12 -0400 Subject: [Biopython-dev] [Bug 2931] Error in PDBList() code for get_all_obsolete In-Reply-To: Message-ID: <200910221242.n9MCgC39002369@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2931 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-22 08:42 EST ------- Fixed in git repository. Thank you for your report, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Oct 22 12:52:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Oct 2009 08:52:45 -0400 Subject: [Biopython-dev] [Bug 2933] PDBList() get_status_list bug In-Reply-To: Message-ID: <200910221252.n9MCqjdR002671@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2933 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-22 08:52 EST ------- Fixed in the git repository. Thank you for your report. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Oct 22 12:55:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Oct 2009 08:55:00 -0400 Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected In-Reply-To: Message-ID: <200910221255.n9MCt0a1002767@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2910 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-22 08:55 EST ------- (In reply to comment #4) > Peter, > > yes, indeed, I had a couple of problematic pdb ids. As soon as I find the time, > I'll take a look at it and post them here. Hi Christian, Did you identify any other problem PDB files? Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Oct 22 13:08:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Oct 2009 09:08:11 -0400 Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected In-Reply-To: Message-ID: <200910221308.n9MD8B4a003222@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2910 ------- Comment #7 from schafer at rostlab.org 2009-10-22 09:08 EST ------- > Did you identify any other problem PDB files? Peter, not yet, sorry. I'm in the middle of publishing a paper. But I'm still on it. I'll let you know. Chris -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Oct 22 13:50:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Oct 2009 14:50:44 +0100 Subject: [Biopython-dev] First "new" contribution via git In-Reply-To: <320fb6e00910210439k48ee611exfcdf4b10067b7689@mail.gmail.com> References: <320fb6e00910210439k48ee611exfcdf4b10067b7689@mail.gmail.com> Message-ID: <320fb6e00910220650y2695cef4qd2ede67280942874@mail.gmail.com> On Wed, Oct 21, 2009 at 12:39 PM, Peter wrote: > Hi all, > > I'd just like to mention that I have just committed a tiny enhancement > from Chris Lasher (username gotgenes) to add a verbose option to > run_tests.py (using the git cherry-pick command to grab the single > commit for this change). > > I think this marks the first commit from a non-core developer since we > moved to github. I'm sure there will be many more to come (and not > just from Chris!) :) I should perhaps clarify this is the first commit from a non-core developer handled via git, and thus appearing under their git username. We have previously manually committed fixes posted on a git branch (e.g. some of Kyle's work for Jython).
Peter From dalloliogm at gmail.com Thu Oct 22 14:54:46 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 22 Oct 2009 16:54:46 +0200 Subject: [Biopython-dev] First "new" contribution via git In-Reply-To: <320fb6e00910220650y2695cef4qd2ede67280942874@mail.gmail.com> References: <320fb6e00910210439k48ee611exfcdf4b10067b7689@mail.gmail.com> <320fb6e00910220650y2695cef4qd2ede67280942874@mail.gmail.com> Message-ID: <5aa3b3570910220754p42cc5d14nddb9e8862bf4bc9d@mail.gmail.com> On Thu, Oct 22, 2009 at 3:50 PM, Peter wrote: > > I should perhaps clarify this is the first commit from a non-core > developer handled via git, and thus appearing under their git > username. We have previously manually committed fixes posted > on a git branch (e.g. some of Kyle's work for Jython). great! :-) -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Oct 22 19:54:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Oct 2009 20:54:28 +0100 Subject: [Biopython-dev] Biopython on 64 bit Windows Message-ID: <320fb6e00910221254s9df2270h335e9b2c15c70993@mail.gmail.com> Hi all, Prompted by Mike Lisanke's query on the main list, we should try and provide installers for 64 bit Windows. Do any of you here on the mailing list have a 64 bit Windows machine you would be willing to try installing Biopython from source on? Ideally of course, assuming it all works, I'd like a volunteer to provide installers for us too. I don't know what difference XP versus Vista versus Win 7 will make (even in 32 bit land). 
Thanks, Peter From bugzilla-daemon at portal.open-bio.org Sun Oct 25 14:01:27 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 25 Oct 2009 10:01:27 -0400 Subject: [Biopython-dev] [Bug 2929] NCBIXML PSI-Blast parser should gather all information from XML blastgpg output In-Reply-To: Message-ID: <200910251401.n9PE1RRQ019544@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2929 ------- Comment #3 from ibdeno at gmail.com 2009-10-25 10:01 EST ------- (In reply to comment #2) > What specifically is our parser failing to extract from this example PSI BLAST > XML file? > (Sorry, I've been away) Well, currently the code tries to get several pieces of information from the Blast.Record.PSIBlast (brecord): brecord.converged brecord.query brecord.query_letters brecord.rounds brecord.rounds.alignments brecord.rounds.alignments.title brecord.rounds.alignments.hsps then in the hsps: hsp.identities hsp.positives hsp.query hsp.sbjct hsp.match hsp.expect hsp.query_start hsp.query_end hsp.sbjct_start hsp.sbjct_end With different XML-tag names I think that all this information is present. As I said on the mail-list, it would be ideal if the XML parser for PSI-Blast would work in the same way as the current text-mode PSIBlast parser. Please, let me know if that was not clear or if you need further information. Thanks! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sbassi at clubdelarazon.org Mon Oct 26 15:53:06 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Mon, 26 Oct 2009 12:53:06 -0300 Subject: [Biopython-dev] sff file Message-ID: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> Where can I get an sff file to run some tests? Biopython doesn't include any sample sff file, I guess this is due to file size limitation.
I would be happy if I can download a small (less than 10Mb) sff file. I went to ENTREZ Sequence Read Archive, and seems they only provide fastq. From biopython at maubp.freeserve.co.uk Mon Oct 26 16:14:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Oct 2009 16:14:54 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> Message-ID: <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> On Mon, Oct 26, 2009 at 3:53 PM, Sebastian Bassi wrote: > Where I can get an sff to make some tests? I have some on my SFF branch (under Tests/Roche): http://github.com/peterjc/biopython/tree/index (I used to have a branch called "sff", and another "index" for what ended up on the trunk as the new Bio.SeqIO.index function, but the two were linked for indexing SFF files). If you fancy trying the code, it offers reading, writing and indexing of SFF files, showing the full untrimmed sequence in the SeqRecord. > Biopython doens't include any sample sff file, I guess this is due to > file size limitation. Size isn't an issue - the Roche tools will let you create reduced SFF files using just some of the records (e.g. a random subset). > I would be happy if I can download a small (less than 10Mb) sff file. > I went to ENTREZ Sequence Read Archive, and seems they only provide > fastq. I gather the NCBI Short Read Archive used to offer SFF files, but sadly that do not any more. 
I am aware of a few projects at Sanger with public SFF files - but these will all be large files, see: http://lists.open-bio.org/pipermail/biopython/2009-August/005443.html Peter From sbassi at clubdelarazon.org Mon Oct 26 16:48:04 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Mon, 26 Oct 2009 13:48:04 -0300 Subject: [Biopython-dev] sff file In-Reply-To: <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> Message-ID: <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> On Mon, Oct 26, 2009 at 1:14 PM, Peter wrote: > I have some on my SFF branch (under Tests/Roche): > http://github.com/peterjc/biopython/tree/index Thank you very much. Just for the record, the URL of the file is: http://github.com/peterjc/biopython/raw/index/Tests/Roche/E3MFGYR02_random_10_reads.sff Best, SB. From biopython at maubp.freeserve.co.uk Mon Oct 26 17:16:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Oct 2009 17:16:01 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> Message-ID: <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> On Mon, Oct 26, 2009 at 4:48 PM, Sebastian Bassi wrote: > On Mon, Oct 26, 2009 at 1:14 PM, Peter wrote: >> I have some on my SFF branch (under Tests/Roche): >> http://github.com/peterjc/biopython/tree/index > > Thank you very much. > Just for the record, the URL of the file is: > http://github.com/peterjc/biopython/raw/index/Tests/Roche/E3MFGYR02_random_10_reads.sff If you want a general example, that is a good choice. It is an unmodified file created by the Roche tools containing just 10 random reads. 
The folder also contains FASTA and QUAL files (with and without trimming),
as converted by the Roche tools.

Additionally I have some extra SFF files which are a little less typical...

Peter

From sbassi at clubdelarazon.org  Mon Oct 26 17:54:53 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Mon, 26 Oct 2009 14:54:53 -0300
Subject: [Biopython-dev] sff file
In-Reply-To: <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com>
References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com>
Message-ID: <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com>

On Mon, Oct 26, 2009 at 2:16 PM, Peter wrote:
> If you want a general example, that is a good choice. It is an unmodified file
> created by the Roche tools containing just 10 random reads. The folder also
> contains FASTA and QUAL files (with and without trimming), as converted
> by the Roche tools.

Looks like there is a problem:

>>> from Bio import SeqIO
>>> fh = open('/home/sbassi/E3MFGYR02_random_10_reads.sff')
>>> for rec in SeqIO.parse(fh,'sff'):
        print rec.id


E3MFGYR02JWQ7T
E3MFGYR02JA6IL
(... cut ...)
>>> for rec in SeqIO.parse(fh,'sff'):
        print rec.seq

Traceback (most recent call last):
  File "", line 1, in
    for rec in SeqIO.parse(fh,'sff'):
  File "/usr/local/lib/python2.6/dist-packages/biopython-1.52-py2.6-linux-i686.egg/Bio/SeqIO/SffIO.py", line 354, in SffIterator
    = _sff_file_header(handle)
  File "/usr/local/lib/python2.6/dist-packages/biopython-1.52-py2.6-linux-i686.egg/Bio/SeqIO/SffIO.py", line 56, in _sff_file_header
    raise ValueError("Wrong SFF magic number in header")
ValueError: Wrong SFF magic number in header


I have 1.52 plus SffIO.py and __init__.py from your branch.

From biopython at maubp.freeserve.co.uk  Mon Oct 26 18:00:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Oct 2009 18:00:32 +0000
Subject: [Biopython-dev] sff file
In-Reply-To: <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com>
References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com>
Message-ID: <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com>

On Mon, Oct 26, 2009 at 5:54 PM, Sebastian Bassi wrote:
> On Mon, Oct 26, 2009 at 2:16 PM, Peter wrote:
>> If you want a general example, that is a good choice. It is an unmodified file
>> created by the Roche tools containing just 10 random reads. The folder also
>> contains FASTA and QUAL files (with and without trimming), as converted
>> by the Roche tools.
>
> Looks like there is a problem:
>
>>>> from Bio import SeqIO
>>>> fh = open('/home/sbassi/E3MFGYR02_random_10_reads.sff')

Try binary mode - although on my tests it isn't essential on Unix,
it is on Windows.

>>>> for rec in SeqIO.parse(fh,'sff'):
>        print rec.id
>
>
> E3MFGYR02JWQ7T
> E3MFGYR02JA6IL
> (... cut ...)

OK, so that looks good.

>>>> for rec in SeqIO.parse(fh,'sff'):
>        print rec.seq
>
> Traceback (most recent call last):
> ...
> ValueError: Wrong SFF magic number in header
>
>
> I have 1.52 plus SffIO.py and __init__.py from your branch.

Are you using the same (finished) handle in the second example?
That would be like opening an empty file... I think it is just the
error message that is misleading here.

Peter

From sbassi at clubdelarazon.org  Mon Oct 26 18:09:57 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Mon, 26 Oct 2009 15:09:57 -0300
Subject: [Biopython-dev] sff file
In-Reply-To: <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com>
References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com>
Message-ID: <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com>

On Mon, Oct 26, 2009 at 3:00 PM, Peter wrote:
> Try binary mode - although on my tests it isn't essential on Unix,
> it is on Windows.

I am on Linux (Ubuntu 9.04). I followed your advice and now I did:

fh = open('/home/sbassi/E3MFGYR02_random_10_reads.sff','rb')

And now it works:

>>> for rec in SeqIO.parse(fh,'sff'):
        print rec.seq


tcagGGTCTACATGTTGGTTAACCCGTACTGATTTGAATTGGCTCTTTGTCTTTCCAAAGGGAATTCATCTTCTTATGGCACACATAAAGGATAAATACAAGAATCTTCCTATTTACATCACTGAAAATGGCATGGCTGAATCAAGGAATGACTCAATACCAGTCAATGAAGCCCGCAAGGATAGTATAAGGATTAGATACCATGATGGCCATCTTAAATTCCTTCTTCAAGCGATCAAGGAAGGTGTTAATTTGAAGGGGCTTa
tcagTTTTTTTTGGAAAGGAAAACGGACGTACTCATAGATGGATCATACTGACGTTAGGAAAATAATTCATAAGACAATAAGGAAACAAAGTGTAAAAAAAAAACCTAAATGCTCAAGGAAAATACATAGCCATCTGAACAGATTTCTGCTGGAAGCCACATTTCTCGTAGAACGCCTTGTTCTCGACGCTGCAATCAAGAATCACCTTGTAGCATCCCATTGAACGCGCATGCTCCGTGAGGAACTTGATGATTCTCTTTCCCAAATGcc
(.... cut ...)

So it looks like binary mode is also needed on Linux.
From biopython at maubp.freeserve.co.uk Mon Oct 26 18:43:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Oct 2009 18:43:50 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> Message-ID: <320fb6e00910261143x393d8d49s39b41253dcb02cb7@mail.gmail.com> On Mon, Oct 26, 2009 at 5:54 PM, Sebastian Bassi wrote: > > I have 1.52 plus SffIO.py and __init__.py from your branch. > You'll also want to get Bio/SeqIO/_index.py if you want to test random access to reads in an SFF file via the new Bio.SeqIO.index() function. This will read the Roche style SFF index if present (which is very fast) or just index the file directly (which is still reasonably quick). 
Peter

From biopython at maubp.freeserve.co.uk  Mon Oct 26 19:17:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Oct 2009 19:17:25 +0000
Subject: [Biopython-dev] sff file
In-Reply-To: <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com>
References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com>
Message-ID: <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com>

On Mon, Oct 26, 2009 at 6:09 PM, Sebastian Bassi wrote:
>
> On Mon, Oct 26, 2009 at 3:00 PM, Peter wrote:
>> Try binary mode - although on my tests it isn't essential on Unix,
>> it is on Windows.
>
> I am in Linux Ubuntu 9.04 and I followed your advice and now I did:
>
> fh = open('/home/sbassi/E3MFGYR02_random_10_reads.sff','rb')
>
> And now it works:

Opening SFF files in binary mode is good practice (as it is required
for Windows), but is unrelated to your problem. It was just a simple
"user error" coupled with a very unhelpful error message.

I have updated my code so if you try and "re-parse" the same handle
(without first doing handle.seek(0) to reset it), you get this:

>>> from Bio import SeqIO
>>> handle = open("E3MFGYR02_random_10_reads.sff", "rb")
>>> for record in SeqIO.parse(handle, "sff") : print record.id
...
E3MFGYR02JWQ7T
E3MFGYR02JA6IL
E3MFGYR02JHD4H
E3MFGYR02GFKUC
E3MFGYR02FTGED
E3MFGYR02FR9G7
E3MFGYR02GAZMS
E3MFGYR02HHZ8O
E3MFGYR02GPGB1
E3MFGYR02F7Z7G
>>> for record in SeqIO.parse(handle, "sff") : print record.id
...
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/SffIO.py", line 378, in SffIterator
    = _sff_file_header(handle)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/SffIO.py", line 77, in _sff_file_header
    raise ValueError("SFF handle seems to be at index block, not start")
ValueError: SFF handle seems to be at index block, not start

The code is now here, I wanted to get this semi-ready for merging to
the trunk - depending on user feedback of course ;)
http://github.com/peterjc/biopython/tree/sff-seqio

(I feel the old index branch has served its purpose.)

Peter

From bioinformed at gmail.com  Mon Oct 26 21:32:41 2009
From: bioinformed at gmail.com (Kevin Jacobs)
Date: Mon, 26 Oct 2009 17:32:41 -0400
Subject: [Biopython-dev] sff file
In-Reply-To: <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com>
References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com>
Message-ID: <2e1434c10910261432l314c0cf5i304a04a9055d72b1@mail.gmail.com>

At the risk of asking a dumb question, is this native SFF support
better than what is available via BioLib?

~Kevin

From biopython at maubp.freeserve.co.uk  Mon Oct 26 22:24:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Oct 2009 22:24:21 +0000
Subject: [Biopython-dev] sff file
In-Reply-To: <2e1434c10910261432l314c0cf5i304a04a9055d72b1@mail.gmail.com>
References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> <2e1434c10910261432l314c0cf5i304a04a9055d72b1@mail.gmail.com>
Message-ID: <320fb6e00910261524j6255a8bp56ab79c0b436eb72@mail.gmail.com>

On Mon, Oct 26, 2009 at 9:32 PM, Kevin Jacobs wrote:
> At the risk of asking a dumb question, is this native SFF support
> better than what is available via BioLib?
> ~Kevin

What do you mean by better?

From my point of view, the nice thing about this (the Biopython
SFF code) is it is integrated into the Bio.SeqIO system using
SeqRecord objects, so you can use the same scripts etc that
you may have written for processing FASTA or FASTQ files.
Also, it is pure Python which may be important for cross
platform use (e.g. Jython, IronPython, ...).

According to their webpage, the BioLib SFF support is via the
Staden io_lib, which is probably pretty efficient.
http://biolib.open-bio.org/wiki/Main_Page

Peter

From sbassi at clubdelarazon.org  Tue Oct 27 03:35:28 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Tue, 27 Oct 2009 00:35:28 -0300
Subject: [Biopython-dev] sff file
In-Reply-To: <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com>
References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com>
Message-ID: <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com>

On Mon, Oct 26, 2009 at 4:17 PM, Peter wrote:
>     raise ValueError("SFF handle seems to be at index block, not start")
> ValueError: SFF handle seems to be at index block, not start

I see, the new error message is better now since it gives the user a hint
of the user error.
I didn't realize my mistake at first because I am used to getting an
empty result when I make that mistake using a text format like fasta.

From biopython at maubp.freeserve.co.uk  Tue Oct 27 10:03:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Oct 2009 10:03:05 +0000
Subject: [Biopython-dev] sff file
In-Reply-To: <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com>
References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com>
Message-ID: <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com>

On Tue, Oct 27, 2009 at 3:35 AM, Sebastian Bassi wrote:
>
> On Mon, Oct 26, 2009 at 4:17 PM, Peter wrote:
>>     raise ValueError("SFF handle seems to be at index block, not start")
>> ValueError: SFF handle seems to be at index block, not start
>
> I see, the new error message is better now since it gives the user a hint
> of the user error.
> I didn't realize my mistake at first because I am used to getting an
> empty result when I make that mistake using a text format like fasta.

Great - your feedback has made a difference :)

I suppose an empty file could be allowed for SFF, but I don't
really like this idea.
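[Editorial sketch, not part of the original message.] The mistake discussed
in this thread can be reproduced without Biopython. The block below is a
minimal sketch in Python 3 (the thread itself uses Python 2), using only the
standard library. The names toy_sff_ids and toy_fasta_ids are hypothetical
stand-ins, not the Bio.SeqIO parsers, and the ".sff" magic bytes are an
assumption about the file header. It only illustrates why re-parsing a
finished handle raises for a header-based binary format yet quietly yields
nothing for a line-based text format, and that handle.seek(0) resets it.

```python
import io

SFF_MAGIC = b".sff"  # assumption: the first four bytes of an SFF file


def toy_sff_ids(handle):
    """Hypothetical binary parser: insists on a header at the current position."""
    if handle.read(4) != SFF_MAGIC:
        raise ValueError("Wrong magic number: handle is not at the start of the file")
    for line in handle.read().split(b"\n"):
        if line:
            yield line.decode("ascii")


def toy_fasta_ids(handle):
    """Hypothetical text parser: just scans whatever lines are left."""
    for line in handle:
        if line.startswith(">"):
            yield line[1:].strip()


binary = io.BytesIO(SFF_MAGIC + b"read1\nread2\n")
first_pass = list(toy_sff_ids(binary))       # consumes the handle to the end
try:
    list(toy_sff_ids(binary))                # same finished handle: no header left
    second_pass = "no error"
except ValueError:
    second_pass = "ValueError"
binary.seek(0)                               # reset, as suggested in the thread
third_pass = list(toy_sff_ids(binary))

text = io.StringIO(">a\nACGT\n>b\nGGCC\n")
fasta_first = list(toy_fasta_ids(text))
fasta_second = list(toy_fasta_ids(text))     # finished handle: silently empty
```

With a real file, the equivalent of the reset step would be calling
handle.seek(0) (or reopening the file in "rb" mode) before the second loop.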
Peter From biopython at maubp.freeserve.co.uk Tue Oct 27 11:50:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Oct 2009 11:50:38 +0000 Subject: [Biopython-dev] sff file In-Reply-To: <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com> References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com> <320fb6e00910260914p5cf6b7f3gfe90a0432eb23b17@mail.gmail.com> <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com> <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com> <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com> <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com> <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com> <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com> <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com> <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com> Message-ID: <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com> On Tue, Oct 27, 2009 at 10:03 AM, Peter wrote: > > Great - your feedback has made a difference :) > As part of the polishing in anticipation of merging the SFF support into the trunk, I've just made some big additions to the docstring (with doctest examples) on the branch - it would be great if you could read over this at some point. http://github.com/peterjc/biopython/tree/sff-seqio What do you think of the current rather pragmatic way I'm handling trimming the SeqRecord objects? i.e. SeqIO file format "sff" gives the full data and supports reading and writing, while SeqIO format "sff-trim" only supports reading and gives trimmed sequences without the flow data. This is a bit of a hack, and the "sff-trim" format could be left out - but then we would need a nice way to trim the full length SeqRecord objects... 
Peter

From bugzilla-daemon at portal.open-bio.org  Tue Oct 27 15:50:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 27 Oct 2009 11:50:03 -0400
Subject: [Biopython-dev] [Bug 2938] New: Bio.Entrez.read() returns empty string for HTML (not an error)
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2938

           Summary: Bio.Entrez.read() returns empty string for HTML (not an error)
           Product: Biopython
           Version: 1.52
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

If given HTML instead of XML, Bio.Entrez.read() returns an empty string. I
would have expected a helpful error message.

e.g.

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="pubmed", id="17206916")
>>> handle.readline()
'PmFetch response\n'

Try parsing this HTML as if it were XML ...

>>> handle = Entrez.efetch(db="pubmed", id="17206916")
>>> "" == Entrez.read(handle)
True

i.e. Entrez.read is returning an empty string.

Problem spotted based on a mailing list query, see this thread:
http://lists.open-bio.org/pipermail/biopython/2009-October/005774.html

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
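[Editorial sketch, not part of the original message.] One way to picture the
kind of check this report asks for is with the standard library's
xml.parsers.expat alone. This is a minimal Python 3 sketch under stated
assumptions, not the Bio.Entrez implementation: classify_payload is a
hypothetical helper, and the three return labels are invented for
illustration. It relies on two documented expat behaviours: XmlDeclHandler is
called only when the input carries an XML declaration, and input that is not
well-formed raises xml.parsers.expat.ExpatError.

```python
import xml.parsers.expat


def classify_payload(data):
    """Hypothetical helper: report how expat sees a chunk of downloaded text.

    Returns "xml" if an XML declaration was seen, "no-declaration" if the
    text parsed cleanly but had none (e.g. well-formed HTML), and "not-xml"
    if expat rejected it outright (e.g. FASTA or GenBank text).
    """
    declarations = []
    parser = xml.parsers.expat.ParserCreate()

    def on_xml_declaration(version, encoding, standalone):
        # Only called for a leading XML declaration in the input.
        declarations.append(version)

    parser.XmlDeclHandler = on_xml_declaration
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return "not-xml"
    return "xml" if declarations else "no-declaration"


xml_result = classify_payload('<?xml version="1.0"?><PubmedArticleSet/>')
html_result = classify_payload("<html><body>PmFetch response</body></html>")
fasta_result = classify_payload(">seq1\nACGT\n")
```

A caller could turn "no-declaration" and "not-xml" into a descriptive
exception instead of returning an empty result. Real-world HTML is often not
well-formed, in which case it would fall into the "not-xml" branch instead.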
From bugzilla-daemon at portal.open-bio.org Wed Oct 28 10:16:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 28 Oct 2009 06:16:32 -0400 Subject: [Biopython-dev] [Bug 2938] Bio.Entrez.read() returns empty string for HTML (not an error) In-Reply-To: Message-ID: <200910281016.n9SAGW9p017546@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2938 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2009-10-28 06:16 EST ------- It is relatively easy to check if the file starts with Message-ID: <200910281057.n9SAvgpY018500@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2938 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-28 06:57 EST ------- Good point - and hopefully the NCBI will make all their XML consistent. In the meantime, instead of the white list, how about a blacklist? i.e. If the data starts " Message-ID: <200910281112.n9SBC5Md018851@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2938 ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2009-10-28 07:12 EST ------- (In reply to comment #2) > In the meantime, instead of the white list, how about a blacklist? > i.e. If the data starts " We could also spot things like FASTA and GenBank files etc, and > as all we want to do is spot non-XML, this should be reliable. > One important point is that the initial tag is not handled as a regular XML tag by the parser. There is a separate handler method specific for parsing the tag. This makes it much easier to check if an XML document is really XML: If this special handler is never called, it's not XML. Checking for a FASTA and GenBank file is also relatively easy; the parser raises an xml.parsers.expat.ExpatError syntax error, which we can catch and transform in a more informative message. Checking for HTML is trickier. 
The parser will not raise an error, because except for the missing initial tag, the HTML could in principle be regarded as XML. To check if the input starts with , we'd have to read some data ahead, check for the , and pass the data to the parser if it seems to be OK. So I suggest we add a check for xml.parsers.expat.ExpatError syntax errors now, and add a check for the initial once NCBI has fixed the XML output to always contain this tag, but don't check for . -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Oct 28 11:26:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 28 Oct 2009 07:26:33 -0400 Subject: [Biopython-dev] [Bug 2938] Bio.Entrez.read() returns empty string for HTML (not an error) In-Reply-To: Message-ID: <200910281126.n9SBQXo9019213@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2938 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-10-28 07:26 EST ------- (In reply to comment #3) > (In reply to comment #2) > > In the meantime, instead of the white list, how about a blacklist? > > i.e. If the data starts " > We could also spot things like FASTA and GenBank files etc, and > > as all we want to do is spot non-XML, this should be reliable. > > > One important point is that the initial tag is not handled as a > regular XML tag by the parser. There is a separate handler method specific for > parsing the tag. This makes it much easier to check if an XML > document is really XML: If this special handler is never called, it's not XML. > > Checking for a FASTA and GenBank file is also relatively easy; the parser > raises an xml.parsers.expat.ExpatError syntax error, which we can catch and > transform in a more informative message. Sounds good. > Checking for HTML is trickier. 
The parser will not raise an error, because > except for the missing initial tag, the HTML could in principle be > regarded as XML. To check if the input starts with , we'd have to read > some data ahead, check for the , and pass the data to the parser if it > seems to be OK. Understood. > So I suggest we add a check for xml.parsers.expat.ExpatError syntax errors > now, and add a check for the initial once NCBI has fixed the XML > output to always contain this tag, but don't check for . +1 on adding the syntax error check now, that will be a worthwhile improvement in itself. Regarding flagging , is it currently a safe assumption that anything starting is NOT an NCBI XML file? If the NCBI will fix all their XML output to always start then great. I suspect it will take a while though. If you want to wait, fine. I'm happy to leave this decision to you - it's your module after all ;) Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Oct 28 12:07:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 28 Oct 2009 12:07:57 +0000 Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features Message-ID: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com> Hi all, I've been following a thread on the BioPerl mailing list about how to get the mature peptide amino acid sequences for mat_peptide features in a GenBank file (given in general these features do not include the translation, nor a GI number of Protein ID which can be looked up online). 
Chris summarised a working approach here: http://lists.open-bio.org/pipermail/bioperl-l/2009-October/031493.html Step one of this process is to be able to take a GenBank feature (here a mat_peptide) and use the location information to extract the relevant part of the parent nucleotide sequence (at the foot of the file). For example, http://www.ncbi.nlm.nih.gov/nuccore/112253723?report=genbank Consider mat_peptide nsp12, whose location is a little complex, join(12332..12358,12358..15117) - in Python terms, we need seq[12331:12358] + seq[12357:15117], although in general there are other concerns like the strand. Step two (in Chris' workflow) is to translate this into amino acids, and as a precaution, verify this is a subsequence of the precursor protein given in the previous CDS entry (protein ID ABI14446.1 in this case). This is quite straightforward. The first operation is tricky, but is actually a very general problem, and has come up before on the Biopython mailing lists, e.g. http://lists.open-bio.org/pipermail/biopython/2009-October/005695.html http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005991.html http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005997.html As noted in the linked threads, I have some (apparently) working code as function get_feature_nuc in the unit test file test_SeqIO_features.py I think this should be part of Biopython proper (with unit tests etc), and would like to discuss where to put it. My ideas include: (1) Method of the SeqFeature object taking the parent sequence (as a string, Seq, ...?) as a required argument. Would return an object of the same type as the parent sequence passed in. (2) Separate function, perhaps in Bio.SeqUtils taking the parent sequence (as a string, Seq, ...?) and a SeqFeature object. Would return an object of the same type as the parent sequence passed in. (3) Method of the Seq object taking a SeqFeature, returning a Seq. 
[A drawback is Bio.Seq currently does not depend on Bio.SeqFeature] (4) Method of the SeqRecord object taking a SeqFeature. Could return a SeqRecord using annotation from the SeqFeature. Complex. Any other ideas? We could even offer more than one of these approaches, but ideally there should be one obvious way for the end user to do this. My question is, which is most intuitive? I quite like idea (1). In terms of code complexity, I expect (1), (2) and (3) to be about the same. Building a SeqRecord in (4) is trickier. Options (1) and (2) are not tied to the sequence object, and could in theory support any Seq like object, plain strings - or in future even a SeqRecord: I have a git branch where the SeqRecord object supports addition and the reverse_complement method, which would work nicely here. See: http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html Peter From chapmanb at 50mail.com Wed Oct 28 12:07:33 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 28 Oct 2009 08:07:33 -0400 Subject: [Biopython-dev] [Biopython] Alignment object In-Reply-To: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> Message-ID: <20091028120733.GB22395@sobchak.mgh.harvard.edu> Peter and Eric; [Moving this over to biopython-dev and changing the subject] > > Here's +1 for Python counting. That would match SeqFeature and the > > ProteinDomain class in Bio.Tree.PhyloXML. > > > > While we're on this topic -- I have some unpublished code for rendering an > > alignment object in HTML, with plans for colorization, conservation > > profiles, etc. I rolled my own alignment class since the one in > > Bio.Align.Generic didn't have the attributes (start, end, selected columns) > > for a particular file format I was parsing. 
It's not urgent, but at some > > point could you publish your plans for the Alignment classes so I (and > > probably others) can stay/become compatible? > > My rough work in progress in on github - at the moment I'm still trying > things out, and don't assume anything is set in stone. If you want to > have a play with this code, feedback is very welcome - probably best > on the dev list rather than here. See: > > http://github.com/peterjc/biopython/tree/seqrecords > > (a lot of the alignment things I want to support, like slicing and adding > are very closely linked to doing the same operations to SeqRecords) > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From chapmanb at 50mail.com Wed Oct 28 12:18:33 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 28 Oct 2009 08:18:33 -0400 Subject: [Biopython-dev] [Biopython] Alignment object In-Reply-To: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> Message-ID: <20091028121833.GC22395@sobchak.mgh.harvard.edu> Peter and Eric; [Moving this over to biopython-dev and changing the subject] > > Here's +1 for Python counting. That would match SeqFeature and the > > ProteinDomain class in Bio.Tree.PhyloXML. Agreed. My opinion on the 0/1 mess is that data objects in code should expose all of the coordinates as 0-based, and that output and display files meant for biologists should be 1-based. > > While we're on this topic -- I have some unpublished code for rendering an > > alignment object in HTML, with plans for colorization, conservation > > profiles, etc. I rolled my own alignment class since the one in > > Bio.Align.Generic didn't have the attributes (start, end, selected columns) > > for a particular file format I was parsing. 
> > It's not urgent, but at some
> > point could you publish your plans for the Alignment classes so I (and
> > probably others) can stay/become compatible?
>
> My rough work in progress is on github - at the moment I'm still trying
> things out, and don't assume anything is set in stone. If you want to
> have a play with this code, feedback is very welcome - probably best
> on the dev list rather than here. See:
>
> http://github.com/peterjc/biopython/tree/seqrecords
>
> (a lot of the alignment things I want to support, like slicing and adding
> are very closely linked to doing the same operations to SeqRecords)

The bx-python alignment object is nice and goes to/from MAF and AXT
formats:

http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/core.py

This supports slicing by alignment coordinates and by reference
coordinates for a species in the alignment. Some other useful
features are limiting the alignment to specific species and removing
all gap columns that can result. The representation is a high level
Alignment object containing multiple Components.

You can also index the files for quick lookup via range queries:

http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py
http://bcbio.wordpress.com/2009/07/26/sorting-genomic-alignments-using-python/

It's a nice implementation; it would be good to stay compatible with it and leverage
as much as we can from what they've done.
Brad

From biopython at maubp.freeserve.co.uk  Wed Oct 28 12:50:55 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Oct 2009 12:50:55 +0000
Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features
In-Reply-To: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com>
References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com>
Message-ID: <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com>

On Wed, Oct 28, 2009 at 12:07 PM, Peter wrote:
> I think this should be part of Biopython proper (with unit tests etc), and
> would like to discuss where to put it. My ideas include:
>
> (1) Method of the SeqFeature object taking the parent sequence (as a
> string, Seq, ...?) as a required argument. Would return an object of the
> same type as the parent sequence passed in.
>
> (2) Separate function, perhaps in Bio.SeqUtils taking the parent
> sequence (as a string, Seq, ...?) and a SeqFeature object. Would
> return an object of the same type as the parent sequence passed in.
>
> (3) Method of the Seq object taking a SeqFeature, returning a Seq.
> [A drawback is Bio.Seq currently does not depend on Bio.SeqFeature]
>
> (4) Method of the SeqRecord object taking a SeqFeature. Could
> return a SeqRecord using annotation from the SeqFeature. Complex.
>
> Any other ideas?
>
> We could even offer more than one of these approaches, but ideally
> there should be one obvious way for the end user to do this. My
> question is, which is most intuitive? I quite like idea (1).
>
> In terms of code complexity, I expect (1), (2) and (3) to be about the
> same. Building a SeqRecord in (4) is trickier.

Actually, thinking about this over lunch, for many of the use cases
we do want to turn a SeqFeature into a SeqRecord - either for the
nucleotides, or in some cases their translation. And if doing this,
doing something sensible with the SeqFeature annotation (qualifiers)
seems generally to be useful.
This could still be done with approaches
(1) and (2) as well as (4).

Peter

From biopython at maubp.freeserve.co.uk  Wed Oct 28 12:52:28 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Oct 2009 12:52:28 +0000
Subject: [Biopython-dev] [Biopython] Alignment object
In-Reply-To: <20091028121833.GC22395@sobchak.mgh.harvard.edu>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
 <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>
 <20091028121833.GC22395@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00910280552x7c5bfa6aw7b02a2e1dd0f8a7e@mail.gmail.com>

On Wed, Oct 28, 2009 at 12:18 PM, Brad Chapman wrote:
>>
>> My rough work in progress is on github - at the moment I'm still trying
>> things out, and don't assume anything is set in stone. If you want to
>> have a play with this code, feedback is very welcome - probably best
>> on the dev list rather than here. See:
>>
>> http://github.com/peterjc/biopython/tree/seqrecords
>>
>> (a lot of the alignment things I want to support, like slicing and adding
>> are very closely linked to doing the same operations to SeqRecords)
>
> The bx-python alignment object is nice and goes to/from MAF and AXT
> formats:
>
> http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/core.py
>
> This supports slicing by alignment coordinates and by reference
> coordinates for a species in the alignment. Some other useful
> features are limiting the alignment to specific species and removing
> all gap columns that can result. The representation is a high level
> Alignment object containing multiple Components.
>
> You can also index the files for quick lookup via range queries:
>
> http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py
> http://bcbio.wordpress.com/2009/07/26/sorting-genomic-alignments-using-python/
>
> It's a nice implementation; it would be good to stay compatible with it and leverage
> as much as we can from what they've done.
We also have to try and stay compatible with the existing Biopython
alignment object though. But thanks for the bx links, I should take a
look.

Peter

From sbassi at clubdelarazon.org  Wed Oct 28 14:24:02 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Wed, 28 Oct 2009 11:24:02 -0300
Subject: [Biopython-dev] sff file
In-Reply-To: <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com>
References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com>
 <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com>
 <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com>
 <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com>
 <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com>
 <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com>
 <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com>
 <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com>
 <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com>
 <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com>
Message-ID: <9e2f512b0910280724g2cb8d98o61fdd9aaae5a8965@mail.gmail.com>

On Tue, Oct 27, 2009 at 8:50 AM, Peter wrote:
> (with doctest examples) on the branch - it would be great if you
> could read over this at some point.
> http://github.com/peterjc/biopython/tree/sff-seqio

I will take a look at it tonight.
Best,
SB.
From sbassi at clubdelarazon.org  Thu Oct 29 14:02:39 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Thu, 29 Oct 2009 11:02:39 -0300
Subject: [Biopython-dev] sff file
In-Reply-To: <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com>
References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com>
 <9e2f512b0910260948l13a66e01p962f3b4a85e6a24@mail.gmail.com>
 <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com>
 <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com>
 <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com>
 <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com>
 <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com>
 <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com>
 <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com>
 <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com>
Message-ID: <9e2f512b0910290702m76a8b9cev8f5ca89472af4925@mail.gmail.com>

On Tue, Oct 27, 2009 at 8:50 AM, Peter wrote:
> As part of the polishing in anticipation of merging the SFF support
> into the trunk, I've just made some big additions to the docstring
> (with doctest examples) on the branch - it would be great if you
> could read over this at some point.
> http://github.com/peterjc/biopython/tree/sff-seqio

I've read it (you mean the code in SffIO.py). Regarding your questions:

> What do you think of the current rather pragmatic way I'm
> handling trimming the SeqRecord objects? i.e. SeqIO file format
> "sff" gives the full data and supports reading and writing, while
> SeqIO format "sff-trim" only supports reading and gives trimmed
> sequences without the flow data. This is a bit of a hack, and the
> "sff-trim" format could be left out - but then we would need a nice
> way to trim the full length SeqRecord objects...

sff-trim is OK for me but I am not familiar with this format. I see
there are some mixed upper and lower case DNA sequences, why?
Are lower case bases of lower quality?
(like both extremes of a standard read).

From biopython at maubp.freeserve.co.uk  Thu Oct 29 14:09:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 14:09:07 +0000
Subject: [Biopython-dev] sff file
In-Reply-To: <9e2f512b0910290702m76a8b9cev8f5ca89472af4925@mail.gmail.com>
References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com>
 <320fb6e00910261016o1bb1d0c1jac7bfa47cb1d3023@mail.gmail.com>
 <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com>
 <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com>
 <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com>
 <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com>
 <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com>
 <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com>
 <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com>
 <9e2f512b0910290702m76a8b9cev8f5ca89472af4925@mail.gmail.com>
Message-ID: <320fb6e00910290709r36ac58aeid4728004965481bf@mail.gmail.com>

On Thu, Oct 29, 2009 at 2:02 PM, Sebastian Bassi wrote:
>
> On Tue, Oct 27, 2009 at 8:50 AM, Peter wrote:
>> As part of the polishing in anticipation of merging the SFF support
>> into the trunk, I've just made some big additions to the docstring
>> (with doctest examples) on the branch - it would be great if you
>> could read over this at some point.
>> http://github.com/peterjc/biopython/tree/sff-seqio
>
> I've read it (you mean the code in SffIO.py). Regarding your questions:

I meant the docstrings in Bio/SeqIO/SffIO.py (i.e. the comments
which get exposed as the API help).

>> What do you think of the current rather pragmatic way I'm
>> handling trimming the SeqRecord objects? i.e. SeqIO file format
>> "sff" gives the full data and supports reading and writing, while
>> SeqIO format "sff-trim" only supports reading and gives trimmed
>> sequences without the flow data.
>> This is a bit of a hack, and the
>> "sff-trim" format could be left out - but then we would need a nice
>> way to trim the full length SeqRecord objects...
>
> sff-trim is OK for me but I am not familiar with this format. I see
> there are some mixed upper and lower case DNA sequences, why?
> Are lower case bases of lower quality? (like both extremes of a
> standard read).

Yes, they are in mixed case, and this is linked to the quality and
adaptor sequences. I tried to explain in the SffIO docstring (near the
top of Bio/SeqIO/SffIO.py) with examples and the following text:

>> ... Notice that the sequence is given in mixed case, [there] the
>> central upper case region corresponds to the trimmed sequence.
>> This matches the output of the Roche tools (and the 3rd party tool
>> sff_extract) for SFF to FASTA.

I think I need to remove the word "there" from that paragraph ;)

Peter

From biopython at maubp.freeserve.co.uk  Thu Oct 29 17:31:14 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 17:31:14 +0000
Subject: [Biopython-dev] sff file
In-Reply-To: <320fb6e00910290709r36ac58aeid4728004965481bf@mail.gmail.com>
References: <9e2f512b0910260853r79f54dc8m99a565b10c225c60@mail.gmail.com>
 <9e2f512b0910261054r48b73bdamacb1d3e3ef165dea@mail.gmail.com>
 <320fb6e00910261100m2931324ei451d3719babe441c@mail.gmail.com>
 <9e2f512b0910261109j21825f7cqed182b3b0b600a79@mail.gmail.com>
 <320fb6e00910261217m77d8e73at51db502fcb3510f2@mail.gmail.com>
 <9e2f512b0910262035r14e9d9fcra3c9192f3024524e@mail.gmail.com>
 <320fb6e00910270303m1c33bb48u44672876e08a59f2@mail.gmail.com>
 <320fb6e00910270450i7d299a63gd11f46629c2bf30c@mail.gmail.com>
 <9e2f512b0910290702m76a8b9cev8f5ca89472af4925@mail.gmail.com>
 <320fb6e00910290709r36ac58aeid4728004965481bf@mail.gmail.com>
Message-ID: <320fb6e00910291031s4ba6fdabj8de26b123d1a4126@mail.gmail.com>

On Thu, Oct 29, 2009 at 2:09 PM, Peter wrote:
>
> I meant the docstrings in Bio/SeqIO/SffIO.py (i.e.
> the comments
> which get exposed as the API help).
> ...
> I think I need to remove the word "there" from that paragraph ;)
>

That typo is fixed, and I have also added a docstring/doctest example
showing how to do simple primer trimming of an SFF file giving a new
SFF file where the clipping co-ordinates have been updated. Some of
these examples will probably get moved into the tutorial if/when I
merge this to the trunk.

Peter

From bugzilla-daemon at portal.open-bio.org  Fri Oct 30 10:17:44 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 30 Oct 2009 06:17:44 -0400
Subject: [Biopython-dev] [Bug 2924] memory leak in cnexus.c
In-Reply-To: 
Message-ID: <200910301017.n9UAHi0G013552@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2924


fkauff at biologie.uni-kl.de changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
Attachment #1370 is|0                             |1
           obsolete|                              |
         AssignedTo|biopython-dev at biopython.org|fkauff at biologie.uni-kl.de
             Status|NEW                           |ASSIGNED




------- Comment #3 from fkauff at biologie.uni-kl.de  2009-10-30 06:17 EST -------
Created an attachment (id=1380)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1380&action=view)
fixed memory leak of this bug (basically same as attachment from Joseph) and a
second one (now line 67)

Fixed above memory leak (basically doing the same as Joseph) and fixed another
in line 67 where it should read free(scanned_start) instead of free(scanned)


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From fkauff at biologie.uni-kl.de  Fri Oct 30 10:20:57 2009
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Fri, 30 Oct 2009 11:20:57 +0100
Subject: [Biopython-dev] [Bug 2924] memory leak in cnexus.c
In-Reply-To: <200910301017.n9UAHi0G013552@portal.open-bio.org>
References: <200910301017.n9UAHi0G013552@portal.open-bio.org>
Message-ID: <4AEABE09.8000003@biologie.uni-kl.de>

... and once I've learned this git stuff I'll submit the corrected
cnexus.c to github...

Frank

On 10/30/2009 11:17 AM, bugzilla-daemon at portal.open-bio.org wrote:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2924
>
>
> fkauff at biologie.uni-kl.de changed:
>
>            What    |Removed                       |Added
> ----------------------------------------------------------------------------
> Attachment #1370 is|0                             |1
>            obsolete|                              |
>          AssignedTo|biopython-dev at biopython.org|fkauff at biologie.uni-kl.de
>              Status|NEW                           |ASSIGNED
>
>
>
>
> ------- Comment #3 from fkauff at biologie.uni-kl.de  2009-10-30 06:17 EST -------
> Created an attachment (id=1380)
>  --> (http://bugzilla.open-bio.org/attachment.cgi?id=1380&action=view)
> fixed memory leak of this bug (basically same as attachment from Joseph) and a
> second one (now line 67)
>
> Fixed above memory leak (basically doing the same as Joseph) and fixed another
> in line 67 where it should read free(scanned_start) instead of free(scanned)
>
>
>

From bugzilla-daemon at portal.open-bio.org  Fri Oct 30 11:32:31 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 30 Oct 2009 07:32:31 -0400
Subject: [Biopython-dev] [Bug 2938] Bio.Entrez.read() returns empty string
 for HTML (not an error)
In-Reply-To: 
Message-ID: <200910301132.n9UBWVRD016588@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2938





------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp  2009-10-30 07:32 EST -------
I've added a syntax error check.
This will raise a more informative error
message if the data given to the parser is not in XML format (e.g., plain
text). This does not yet check for HTML input though, so I'm leaving this bug
open.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Fri Oct 30 11:35:49 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 30 Oct 2009 07:35:49 -0400
Subject: [Biopython-dev] [Bug 2771] Bio.Entrez.read can't parse XML files
 from dbSNP (snp database)
In-Reply-To: 
Message-ID: <200910301135.n9UBZnkv016735@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2771





------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp  2009-10-30 07:35 EST -------
I've modified the parser such that it will raise an informative error message
if an XML schema / namespace is encountered. Leaving this bug report open; we
still need a parser for XML data using an XML schema.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
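
[Illustrative note on the two Bio.Entrez comments above (bugs 2938 and 2771).
This is not the code that was committed to Bio.Entrez. It is only a minimal
standalone sketch, assuming the standard library's xml.parsers.expat, of how
"not XML at all" and "uses an XML namespace / schema" could each be turned
into an informative error. The names check_xml, NotXMLError and
SchemaNotSupportedError are hypothetical and do not come from Biopython.]

```python
from xml.parsers import expat


class NotXMLError(ValueError):
    """Hypothetical error: the input is not well-formed XML."""


class SchemaNotSupportedError(ValueError):
    """Hypothetical error: the input declares an XML namespace."""


def check_xml(data):
    """Return True if data is well-formed XML with no namespace declarations.

    Raises NotXMLError for input expat cannot parse (e.g. plain text) and
    SchemaNotSupportedError as soon as a namespace declaration is seen.
    """
    # Namespace processing must be enabled, otherwise expat never calls
    # StartNamespaceDeclHandler.
    parser = expat.ParserCreate(namespace_separator=" ")

    def on_namespace(prefix, uri):
        # An exception raised inside a handler propagates out of Parse().
        raise SchemaNotSupportedError(
            "XML namespace %r found; schema-based XML is not handled" % uri
        )

    parser.StartNamespaceDeclHandler = on_namespace
    try:
        parser.Parse(data, True)
    except expat.ExpatError as err:
        raise NotXMLError("input does not look like XML: %s" % err) from err
    return True
```

[As in comment #5, a check like this says nothing about HTML: a page that
happens to be well-formed markup would pass, so detecting HTML would need a
separate test.]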