From jblanca at btc.upv.es Tue May 4 09:31:27 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 4 May 2010 15:31:27 +0200
Subject: [Biopython-dev] [Biopython] ngs_backbone
In-Reply-To:
References: <201005031237.54249.jblanca@btc.upv.es>
Message-ID: <201005041531.27857.jblanca@btc.upv.es>

Hi Peter:

On Tuesday 04 May 2010 11:13:05 Peter wrote:
> On Mon, May 3, 2010 at 11:37 AM, Jose Blanca wrote:
> > Hi:
> >
> > As in many other labs we are working with NGS sequences. We work mostly
> > in non model plants and we were repeating the same analyses for different
> > projects: sequence cleaning, mapping to a reference, annotation and SNV
> > calling and filtering. To solve the problem we have developed a software
> > named ngs_backbone. We use this software and we think that it might be of
> > some use to the biopython community. To take a look at it you can go to
> > http://bioinf.comav.upv.es/ngs_backbone/index.html
> >
> > This software is built on top of biopython.
> >
> > If the biopython developers think that some part of this software could
> > be added to biopython we would be glad to do it. We are aware of the
> > different licences used by both projects, but we could relicence the
> > required parts to solve that.
> >
> > Best regards,
>
> Hi Jose,
>
> This sounds very interesting. Are there any bits of low level functionality
> you think would be particularly suitable for including in Biopython?

That I don't know. Most of the package is of higher level, but maybe there's
something.

> I've just had a quick look at your function _seqs_in_file_with_bio in
> http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/readers.py
> Would it be simpler to do FASTA+QUAL parsing using
> Bio.SeqIO.PairedFastaQualIterator?

I am going to look into that, it seems a good tip, thanks Peter.
> I see you have a copy of our (private) function Bio.Seq._maketrans() here:
> http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/seqs.py
> Would it be useful to have this as a public API in Biopython?

I don't think so. We copied the function because of the self.__class__ problem
that we discussed some time ago. The complement method of our Seq should return
our class and not the Biopython one, that's why we have duplicated this method.

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From bugzilla-daemon at portal.open-bio.org Wed May 5 08:29:30 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 5 May 2010 08:29:30 -0400
Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM
In-Reply-To:
Message-ID: <201005051229.o45CTUaJ019366@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2905

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         OS/Version|Mac OS                      |All
            Version|1.51                        |Not Applicable

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-05 08:29 EST -------
I've recently started looking at parsing SAM and BAM files. These files just
contain the reads - they do not include the reference sequence, that is usually
kept in a separate FASTA file. I therefore think it would make sense to parse
each read as a SeqRecord in Bio.SeqIO.

The SAM format is basically tab separated plain text. Parsing it is
straightforward, the complication is turning this into a suitable SeqRecord
object.
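As a rough illustration of why the plain-text side is simple, the sketch below pulls the read name, sequence and PHRED scores out of one SAM line using only the standard library. It is not the attached Biopython code: the `Read` tuple and the `parse_sam_line` function are hypothetical names chosen for this example, and a real parser would build a SeqRecord and deal with the remaining columns and optional tags.

```python
from collections import namedtuple

# Hypothetical container for illustration only; the proposal in the bug
# report would build a Bio.SeqRecord.SeqRecord instead.
Read = namedtuple("Read", ["name", "seq", "phred"])


def parse_sam_line(line):
    """Return a Read for one SAM alignment line, or None for a header line."""
    if line.startswith("@"):
        return None  # header lines (@HD, @SQ, ...) carry no read
    fields = line.rstrip("\n").split("\t")
    # SAM has 11 mandatory tab separated columns: column 1 is the read
    # name (QNAME), column 10 the sequence (SEQ), column 11 the quality
    # string (QUAL).
    name, seq, qual = fields[0], fields[9], fields[10]
    # QUAL is stored as PHRED+33 characters; "*" means no qualities.
    phred = None if qual == "*" else [ord(char) - 33 for char in qual]
    return Read(name, seq, phred)


# A made-up alignment line, not taken from ex1.sam.
example = "read1\t0\tchr1\t100\t30\t4M\t*\t0\t0\tACGT\tII5!\n"
read = parse_sam_line(example)
print(read.name, read.seq, read.phred)
```

The three extracted values are exactly what a FASTQ, FASTA or QUAL writer needs, which is why the open question in the comment is about the other columns rather than about the parsing itself.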
The BAM format can be decompressed in Python using the gzip library (built in),
and decoded with the struct library (also built in - we already use this for
parsing the binary SFF file format). i.e. This is fairly straightforward to do
in pure Python - without any dependence on the samtools C library (the
alternative approach, which is how pysam works). See
http://code.google.com/p/pysam/

Extracting just the read name, sequence, and PHRED quality scores when building
the SeqRecord objects is sufficient to implement SAM/BAM to FASTQ/FASTA/QUAL
conversion with Bio.SeqIO. The harder part will be deciding how to represent
all the other annotation information for each read...

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Wed May 5 08:57:11 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 5 May 2010 08:57:11 -0400
Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM
In-Reply-To:
Message-ID: <201005051257.o45CvBtP020884@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2905

------- Comment #3 from chapmanb at 50mail.com 2010-05-05 08:57 EST -------
I'd really like to see our support for this re-use the work in the pysam
project. Agreed that both a pure Python implementation of BAM parsing and
Biopython-interoperable objects are useful, and we should either contribute it
as part of pysam or consider discussing a closer collaboration with the pysam
authors.

Biopython should be taking the lead on encouraging better interoperability
with other projects. pysam is useful to me in my work right now, and we
should support that effort.
From bugzilla-daemon at portal.open-bio.org Wed May 5 09:37:54 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 5 May 2010 09:37:54 -0400
Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM
In-Reply-To:
Message-ID: <201005051337.o45Dbs0P022560@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2905

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-05 09:37 EST -------
Created an attachment (id=1498)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1498&action=view)
Basic SAM/BAM parser for Bio.SeqIO

This file would go in Bio/SeqIO/SamBamIO.py with the usual additions to file
Bio/SeqIO/__init__.py to define the "sam" and "bam" format names plus that
"bam" is a binary file format. There are docstring unit tests using ex1.sam
and ex1.bam borrowed from the pysam project.

(In reply to comment #3)
> I'd really like to see our support for this re-use the work in the pysam
> project. Agreed that both a pure Python implementation of BAM parsing and
> Biopython-interoperable objects are useful, and we should either contribute it
> as part of pysam or consider discussing a closer collaboration with the pysam
> authors.
>
> Biopython should be taking the lead on encouraging better interoperability
> with other projects. pysam is useful to me in my work right now, and we
> should support that effort.

Hi Brad,

What I was (for now) focussing on was SAM/BAM parser support in Bio.SeqIO,
which is really quite narrow in scope. It is also quite simple - I have
attached a proof of principle implementation to this bug. The gzip/struct code
to interpret the BAM fields is pretty straightforward (having done a lot of
similar work on the SFF support helped).
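To give a feel for the gzip/struct approach mentioned here, the following sketch reads only the BAM header block: the magic bytes, the embedded SAM header text and the reference sequence list, all stored as little endian 32 bit integers and byte strings. It is an illustration written for this archive, not the code from attachment 1498. The function name `read_bam_header` is hypothetical, and the tiny in-memory file is made up so the example runs without ex1.bam.

```python
import gzip
import io
import struct


def read_bam_header(handle):
    """Read the BAM header from a binary handle of decompressed BAM data.

    Returns (header_text, [(reference_name, reference_length), ...]).
    """
    magic = handle.read(4)
    if magic != b"BAM\x01":
        raise ValueError("Not a BAM file, magic bytes were %r" % magic)
    # All integers in BAM are little endian; "<i" is a signed 32 bit int.
    (text_length,) = struct.unpack("<i", handle.read(4))
    text = handle.read(text_length).rstrip(b"\x00").decode("ascii")
    (ref_count,) = struct.unpack("<i", handle.read(4))
    references = []
    for _ in range(ref_count):
        (name_length,) = struct.unpack("<i", handle.read(4))
        # Reference names are stored NUL terminated.
        name = handle.read(name_length).rstrip(b"\x00").decode("ascii")
        (ref_length,) = struct.unpack("<i", handle.read(4))
        references.append((name, ref_length))
    return text, references


# Build a tiny in-memory BAM header (no reads) purely for demonstration.
sam_text = b"@HD\tVN:1.0\n"
raw = b"BAM\x01" + struct.pack("<i", len(sam_text)) + sam_text
raw += struct.pack("<i", 1)                  # one reference sequence
raw += struct.pack("<i", 5) + b"chr1\x00"    # name length includes the NUL
raw += struct.pack("<i", 1575)               # reference length
compressed = gzip.compress(raw)

with gzip.open(io.BytesIO(compressed), "rb") as handle:
    header_text, refs = read_bam_header(handle)
print(refs)
```

Real BAM files are written as a series of gzip-compatible blocks (BGZF), which the gzip module reads as one continuous stream. The alignment records that follow the header use the same struct-style decoding, with more fields per record.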
The only challenging bit is turning the data into a SeqRecord (and this part
seems irrelevant to pysam).

Going beyond basic access to the reads, the next step up is working on the
alignment data structure - e.g. extracting columns to look at SNPs. Here there
are a lot of neat things like indexing schemes etc where the SAMtools API (and
thus pysam) is probably a sensible choice. You'll notice in the draft module
docstring I've suggested this (and this wasn't prompted by your comment
either - grin).

On the licence side, pysam and SAMtools both use the MIT licence, so no
problems there.

Regarding dependencies and cross platform support, pysam is a lightweight
wrapper of the samtools C-API, using pyrex. If we want to use pysam in
Biopython that means build time dependencies on samtools and pyrex. This
won't work under Jython, and at the time of writing pysam doesn't appear
to support Windows either. So I'm not so comfortable about this.

It would be interesting to see if pysam could have a pure Python back end as
an alternative to calling the SAMtools C API (and I'm happy for any of my code
to be used for that - but this would have to cover far more than just
parsing). That would allow pysam under Jython, and might help on Windows too.

So in the short term, I don't see any overlap between SAM/BAM support in
Bio.SeqIO and the pysam project. In the medium/long term, working with the
cigar strings and of course the alignments rather than just the reads, then
yes absolutely - some level of discussion or collaboration would be sensible
and desirable.

Peter
From bugzilla-daemon at portal.open-bio.org Wed May 5 11:13:43 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 5 May 2010 11:13:43 -0400
Subject: [Biopython-dev] [Bug 3071] EMBL parser does not parse RP lines correctly.
In-Reply-To:
Message-ID: <201005051513.o45FDhhe026936@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3071

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-05 11:13 EST -------
Fixed and unit tests updated to cover this:
http://github.com/biopython/biopython/commit/c53eb60956f52ada0116c6b0045e0a1d16cb1de8

Thanks!

From bugzilla-daemon at portal.open-bio.org Wed May 5 12:22:09 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 5 May 2010 12:22:09 -0400
Subject: [Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records
In-Reply-To:
Message-ID: <201005051622.o45GM9GP030954@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3069

------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-05 12:22 EST -------
(In reply to comment #8)
> (In reply to comment #7)
>
> > However, if the only out-of-specification thing in the IMGT EMBL files is
> > the feature indentation and long feature keys, maybe your original request
> > to make the EMBL parser more tolerant is the best route.
>
> I think it will actually be a headache to do so. Unless you want to rewrite
> the EMBL parser the way that I wrote the IMGT parser. The only thing that
> needed changing was handling the header lines.
Once it finds an FH line, it
> uses the position of the "Location..." string to determine how indented the
> qualifiers are.

Hi Uri,

Could you retest as "embl" format with the trunk? I would expect some warnings
from these over indented features in IMGT, and we can certainly remove the
warning if we decide not to introduce a separate IMGT format variant.
http://github.com/biopython/biopython/commit/e6ba962dd60fe585baa1237445d33f67d47dd57f

This change takes a slightly different approach to your work on github, but is
quite similar to your two line patch - but this should still work with another
odd form:

FH   Key             Location/Qualifiers
FT   L-V-D-J-C-SEQUEN1..1151
FT                   /db_xref="taxon:32630"
FT                   /organism="synthetic construct"
FT   5'UTR           1..37
...

In the above example (generated by Biopython itself), the strict EMBL column
limits have been obeyed but the feature key has been truncated to just
L-V-D-J-C-SEQUEN rather than L-V-D-J-C-SEQUENCE.

This is a related query - when asked to output such a feature as EMBL or
GenBank format, should we raise an exception here? We could add a warning
instead, and either leave the code as is, or output this:

FH   Key             Location/Qualifiers
FT   L-V-D-J-C-SEQUE 1..1151
FT                   /db_xref="taxon:32630"
FT                   /organism="synthetic construct"
FT   5'UTR           1..37
...

> > Thinking ahead would you also want to be able to write out IMGT variant
> > EMBL files?
>
> I personally don't need this functionality, but I am willing to write it to
> complement the IMGT parser that I wrote.

If we go down the route of formalising IMGT as an EMBL variant with a
different feature indent, it should just be a trivial subclass of the existing
EMBL writer object but with the indentation constant changed.

Note there are other problems in the IMGT data, including locations like
"1..428>" and "<1..328>" where the greater than should be BEFORE the location
(but we could probably cope with this all the same), and just "1."
where half the location is missing (which we can't really do much with other
than treat it as simply "1" instead?).

Peter

From updates at feedmyinbox.com Thu May 6 02:08:34 2010
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Thu, 6 May 2010 02:08:34 -0400
Subject: [Biopython-dev] 5/6 BioStar - Biopython Questions
Message-ID:

==================================================
1. Simple Fasta Parsing Is Not Simple.
==================================================
May 5, 2010 at 6:25 PM

When trying out the examples from chapter 2.3 of the biopython 1.54b tutorial
I keep running into this very annoying problem: When I use:

    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)

The Python interpreter tells me I should use a handle instead of filenames. I
use biopython 1.53 and not 1.54b for which this tutorial was meant. I can't
find an older archived version of the tutorial I could use to help me learn
how to use a handle instead of the simple filename method (ls_orchid.fasta in
this case). I know that in older versions of biopython you have to do
everything with handles but not in the newer versions. The new tutorial now
on-line has an obscure last chapter on how to use handles but that's hardly
helpful.

I think the tutorial is great, I just haven't got the right version installed.
I use Ubuntu and downloaded biopython via ubuntu software center. Can anyone
help me use a handle and get the above parsing example working in 1.53?

Thank you,

http://biostar.stackexchange.com/questions/969/simple-fasta-parsing-is-not-simple
--------------------------------------------------

==================================================
2. How do I create a SeqRecord in biopython?
<!-- -->
==================================================
May 5, 2010 at 6:25 PM

id1  ="HWI-EAS380:8:1:16:830/1"

seq1 ="AGGGCGTTCAGCAGCCAGCTTGCGGCAAAACTGCGTAACCGTCTTCTCGTTCTCT
      AAAAACCATTTTTCGTCCCCTTCGGGGCGGTGGTCTATAGTGTTATTAATATCAA
      GTTGGGGGAGCACATTGTAGCATTG"

qual1="abbbbbaab`abaabbaabaaaab^E^``^aaabaa_\_abaaaaaaaa`aaaa`
      Z^`^^aaaaaaa`aa^aaa``_aa_aaaaaaaaaaa`aaaa`aaaaaabaaabba
      aaaaaaaaaaa_baaaabbbbbaba"

Assume these are fastq-illumina quality scores and the sequence is unambiguous
DNA.

I want to create a SeqRecord object from this. Thanks.

http://biostar.stackexchange.com/questions/967/how-do-i-create-a-seqrecord-in-biopython
--------------------------------------------------

===========================================================
Source: http://biostar.stackexchange.com/questions/tagged/biopython

From dalloliogm at gmail.com Thu May 6 04:36:51 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 6 May 2010 10:36:51 +0200
Subject: [Biopython-dev] 5/6 BioStar - Biopython Questions
In-Reply-To:
References:
Message-ID:

On Thu, May 6, 2010 at 8:08 AM, Feed My Inbox wrote:
> ==================================================
> 1. Simple Fasta Parsing Is Not Simple.
> ==================================================
> May 5, 2010 at 6:25 PM
>
> When trying out the examples from chapter 2.3 of the biopython 1.54b tutorial I keep running into this very annoying problem: When I use:

This user is complaining that the current biopython's tutorial describes a
feature introduced in biopython 1.54, so if you try it with an earlier
version, it doesn't work.
In particular, it is the feature that allows using a string/filename as
argument for SeqIO.parse, which is not available in earlier versions.
Maybe it would be useful to add a note to the tutorial, explaining to users
that they should update biopython or that in earlier versions they have to use
a file handle.

>
> from Bio import SeqIO
>
> for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
>
>     print seq_record.id
>     print repr(seq_record.seq)
>     print len(seq_record)
>
>
> ==================================================
> 2. How do I create a SeqRecord in biopython?
> ==================================================
> May 5, 2010 at 6:25 PM
>
> id1  ="HWI-EAS380:8:1:16:830/1"
>
> seq1 ="AGGGCGTTCAGCAGCCAGCTTGCGGCAAAACTGCGTAACCGTCTTCTCGTTCTCT
>       AAAAACCATTTTTCGTCCCCTTCGGGGCGGTGGTCTATAGTGTTATTAATATCAA
>       GTTGGGGGAGCACATTGTAGCATTG"
>
> qual1="abbbbbaab`abaabbaabaaaab^E^``^aaabaa_\_abaaaaaaaa`aaaa`
>       Z^`^^aaaaaaa`aa^aaa``_aa_aaaaaaaaaaa`aaaa`aaaaaabaaabba
>       aaaaaaaaaaa_baaaabbbbbaba"

Have a look at this question, too. I don't know how to answer properly.
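One possible answer to the second question, offered as a sketch rather than the definitive recipe: decode the quality string by hand (the "fastq-illumina" variant stores PHRED scores with an ASCII offset of 64) and attach the resulting list to a SeqRecord as its per-letter "phred_quality" annotation. The short read below is made up for brevity; with the data from the question, the wrapped seq1 and qual1 strings would first need joining with the line breaks and indentation removed. The helper name `illumina_to_phred` is hypothetical, and the Biopython import is guarded only so that the decoding step can be tried without Biopython installed.

```python
def illumina_to_phred(quality_string):
    """Decode an Illumina 1.3+ FASTQ quality string (ASCII offset 64)."""
    return [ord(char) - 64 for char in quality_string]


# A short made-up read, standing in for id1/seq1/qual1 from the question.
read_id = "example_read/1"
sequence = "ACGTA"
quality = "abb`^"

phred = illumina_to_phred(quality)
print(phred)

try:
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
except ImportError:
    # Biopython not installed; the decoding step above still works.
    record = None
else:
    record = SeqRecord(Seq(sequence), id=read_id, description="")
    # Per-letter annotations must have the same length as the sequence.
    record.letter_annotations["phred_quality"] = phred
    print(record.format("fastq-illumina"))
```

Storing the scores as plain integers means the record can later be written as "fastq", "fastq-illumina" or "qual" without re-encoding by hand.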
--
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Thu May 6 06:57:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 May 2010 11:57:20 +0100
Subject: [Biopython-dev] 5/6 BioStar - Biopython Questions
In-Reply-To:
References:
Message-ID:

On Thu, May 6, 2010 at 9:36 AM, Giovanni Marco Dall'Olio wrote:
> On Thu, May 6, 2010 at 8:08 AM, Feed My Inbox wrote:
>> ==================================================
>> 1. Simple Fasta Parsing Is Not Simple.
>> ==================================================
>> May 5, 2010 at 6:25 PM
>>
>> When trying out the examples from chapter 2.3 of the biopython
>> 1.54b tutorial I keep running into this very annoying problem: When I use:
>
> This user is complaining that the current biopython's tutorial
> describes a feature introduced in biopython 1.54, so if you try it
> with an earlier version, it doesn't work.
> In particular, it is the feature that allows using a string/filename
> as argument for SeqIO.parse, which is not available in earlier
> versions.
> Maybe it would be useful to add a note to the tutorial, explaining
> to users that they should update biopython or that in earlier versions
> they have to use a file handle.

It did, there was an FAQ on this. However I've added some examples to the
appendix section on handles and made the FAQ entry reference that for more
details.
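For readers stuck on a release older than 1.54, the handle-based form under discussion looks roughly like the sketch below. It is an illustration, not a copy of the tutorial appendix: the two-record FASTA file is made up so that the example does not depend on ls_orchid.fasta, and the guarded import exists only so the snippet can be read and run without Biopython installed.

```python
import os
import tempfile

try:
    from Bio import SeqIO
except ImportError:  # lets the sketch run without Biopython installed
    SeqIO = None

# Write a small FASTA file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as out:
    out.write(">seq1\nACGT\n>seq2\nGGCCTT\n")

lengths = {}
if SeqIO is not None:
    handle = open(path)  # open the file yourself ...
    for seq_record in SeqIO.parse(handle, "fasta"):  # ... and pass the handle
        lengths[seq_record.id] = len(seq_record)
    handle.close()
print(lengths)
```

The only change from the tutorial example is that the first argument to SeqIO.parse is an open file object rather than the filename string, and that the caller closes it afterwards. Passing a handle keeps working in later releases too.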
Peter

From bugzilla-daemon at portal.open-bio.org Thu May 6 13:08:58 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 6 May 2010 13:08:58 -0400
Subject: [Biopython-dev] [Bug 3042] test_Mafft_tool fails
In-Reply-To:
Message-ID: <201005061708.o46H8wRQ018125@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3042

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-06 13:08 EST -------
(In reply to comment #0)
> This is the error message I get:
>
> ======================================================================
> FAIL: Simple round-trip through app with infile.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_Mafft_tool.py", line 56, in test_Mafft_simple
>     self.assert_("STEP 2 / 2 d" in stderr_string)
> AssertionError

I have changed that to look for "Progressive alignment ..." instead which is
present in both this MAFFT 5.x output and in MAFFT 6.x output.

> ======================================================================
> FAIL: Round-trip with complex command line.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_Mafft_tool.py", line 126, in test_Mafft_with_complex_command_line
>     self.assertEqual(return_code, 0)
> AssertionError: 1 != 0

I've changed this to give the command line used to help debug when MAFFT
returns an error code. Could you retest and report what MAFFT does for this
particular command?

Also what is the output of "mafft --help" from MAFFT 5.732? That would be
useful if we do have to make running the test conditional on the version of
MAFFT.

Thanks!
From bugzilla-daemon at portal.open-bio.org Thu May 6 14:38:33 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 6 May 2010 14:38:33 -0400
Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM
In-Reply-To:
Message-ID: <201005061838.o46IcXgw021287@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2905

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1498 is|0                           |1
           obsolete|                            |

------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-06 14:38 EST -------
(From update of attachment 1498)
This code is now on one of my github branches:
http://github.com/peterjc/biopython/tree/seqio-sam-bam

This includes basic indexing support via Bio.SeqIO.index(), currently for SAM
only. BAM should be easy enough. Note that this is *much* simpler than the
indexing by mapping location offered by samtools (and pysam).

From bugzilla-daemon at portal.open-bio.org Thu May 6 21:07:36 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 6 May 2010 21:07:36 -0400
Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM
In-Reply-To:
Message-ID: <201005070107.o4717a4E002248@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2905

------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2010-05-06 21:07 EST -------
(In reply to comment #4)
> Regarding dependencies and cross platform support, pysam is a lightweight
> wrapper of the samtools C-API, using pyrex. If we want to use pysam in
> Biopython that means build time dependencies on samtools and pyrex.
This
> won't work under Jython, and at the time of writing pysam doesn't appear
> to support Windows either. So I'm not so comfortable about this.

Since the samtools C library is not so large, we could consider writing a
plain C wrapper instead of pyrex to at least get rid of this dependency.

The samtools dependency is more difficult. There are two reasonable options.
One option is to help out the pysam/samtools developers to create a non-pyrex
C wrapper and have it included into the samtools distribution. The other
option is to have both samtools itself and a Python wrapper included in
Biopython -- note that pysam itself includes the samtools source files.
However, this would mean keeping the samtools in Biopython up-to-date with the
standalone samtools, which I think will cause us headaches in the long run.

From bugzilla-daemon at portal.open-bio.org Fri May 7 02:46:35 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 7 May 2010 02:46:35 -0400
Subject: [Biopython-dev] [Bug 3042] test_Mafft_tool fails
In-Reply-To:
Message-ID: <201005070646.o476kZOx016899@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3042

------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2010-05-07 02:46 EST -------
(In reply to comment #2)
> (In reply to comment #0)
> > This is the error message I get:
> >
> > ======================================================================
> > FAIL: Simple round-trip through app with infile.
> > ----------------------------------------------------------------------
> > Traceback (most recent call last):
> >   File "test_Mafft_tool.py", line 56, in test_Mafft_simple
> >     self.assert_("STEP 2 / 2 d" in stderr_string)
> > AssertionError
>
> I have changed that to look for "Progressive alignment ..."
instead > which is present in both this MAFFT 5.x output and in MAFFT 6.x output. This error has disappeared -- thanks! > > > ====================================================================== > > FAIL: Round-trip with complex command line. > > ---------------------------------------------------------------------- > > Traceback (most recent call last): > > File "test_Mafft_tool.py", line 126, in test_Mafft_with_complex_command_line > > self.assertEqual(return_code, 0) > > AssertionError: 1 != 0 > > I've changed this to give the command line used to help debug when MAFFT > returns an error code. Could you retest and report what MAFFT does for > this particular command? This is the output I am now getting: ====================================================================== FAIL: Round-trip with complex command line. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Mafft_tool.py", line 144, in test_Mafft_with_complex_command_line % (return_code, cmdline)) AssertionError: Got error code 1 back from: mafft --localpair --weighti 4.2 --retree 5 --maxiterate 200 --nofft --op 2.04 -- ep 0.51 --lop 0.233 --lep 0.2 --reorder --treeout --nuc Fasta/f002 If I just run this mafft command directly, I get: $ mafft --localpair --weighti 4.2 --retree 5 --maxiterate 200 --nofft --op 2.04 --ep 0.51 --lop 0.233 --lep 0.2 --reorder --treeout --nuc Fasta/f002 /usr/local/bin/mafft: line 184: [: --treeout: integer expression expected Unknown option: --treeout MAFFT version 5.732 (2005/09/14) References: Katoh et al., 2002, NAR 30: 3059-3066 Katoh et al., 2005, NAR 33: 511-518 http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft Options: --localpair : All pairwise local alignment information is included to the objective function. default: off --globalpair : All pairwise global alignment information is included to the objective function. default: off --op # : Gap opening penalty (>0). 
default: 1.53 --ep # : Offset (>0, works like gap extension penalty). default: 0.123 --bl #, --jtt # : Scoring matrix. default: BLOSUM62 Alternatives are BLOSUM (--bl) 30, 45, 62, 80, or JTT (--jtt) # PAM. --nuc or --amino : Sequence type. default: auto --retree # : The number of tree building in progressive method (see the paper for detail). default: 2 --maxiterate # : Maximum number of iterative refinement. default: 0 --fft or --nofft: FFT is enabled or disabled. default: enabled --memsave: Memory saving mode (beta). default: off --clustalout: Output: clustal format (not tested). default: fasta --reorder: Outorder: aligned (not tested). default: input order --quiet : Do not report progress. Input format: fasta format Typical usages: % mafft --maxiterate 1000 --localpair input > output L-INS-i (most accurate in many cases; assumes there is only one alignable domain) % mafft --maxiterate 1000 --genafpair input > output E-INS-i (works even if there are many unalignable residues between alignable domains) % mafft --maxiterate 1000 --globalpair input > output G-INS-i (suitable for globally alignable sequences) % mafft --maxiterate 1000 input > output FFT-NS-i (accurate and slow, iterative refinement method) % mafft --retree 2 input > output (DEFAULT; same as mafft input > output) FFT-NS-2 (rough and fast; progressive method) % mafft --retree 1 input > output FFT-NS-1 (very rough and very fast, applicable to >5,000 sequences; progressive method with a rough guide tree) > Also what is the output of "mafft --help" from MAFFT 5.732? That would be > useful if we do have to make running the test conditional on the version of > MAFFT. > This is the output of "mafft --help": $ mafft --help Cannot open --help. 
MAFFT version 5.732 (2005/09/14) References: Katoh et al., 2002, NAR 30: 3059-3066 Katoh et al., 2005, NAR 33: 511-518 http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft Options: --localpair : All pairwise local alignment information is included to the objective function. default: off --globalpair : All pairwise global alignment information is included to the objective function. default: off --op # : Gap opening penalty (>0). default: 1.53 --ep # : Offset (>0, works like gap extension penalty). default: 0.123 --bl #, --jtt # : Scoring matrix. default: BLOSUM62 Alternatives are BLOSUM (--bl) 30, 45, 62, 80, or JTT (--jtt) # PAM. --nuc or --amino : Sequence type. default: auto --retree # : The number of tree building in progressive method (see the paper for detail). default: 2 --maxiterate # : Maximum number of iterative refinement. default: 0 --fft or --nofft: FFT is enabled or disabled. default: enabled --memsave: Memory saving mode (beta). default: off --clustalout: Output: clustal format (not tested). default: fasta --reorder: Outorder: aligned (not tested). default: input order --quiet : Do not report progress. Input format: fasta format Typical usages: % mafft --maxiterate 1000 --localpair input > output L-INS-i (most accurate in many cases; assumes there is only one alignable domain) % mafft --maxiterate 1000 --genafpair input > output E-INS-i (works even if there are many unalignable residues between alignable domains) % mafft --maxiterate 1000 --globalpair input > output G-INS-i (suitable for globally alignable sequences) % mafft --maxiterate 1000 input > output FFT-NS-i (accurate and slow, iterative refinement method) % mafft --retree 2 input > output (DEFAULT; same as mafft input > output) FFT-NS-2 (rough and fast; progressive method) % mafft --retree 1 input > output FFT-NS-1 (very rough and very fast, applicable to >5,000 sequences; progressive method with a rough guide tree) Thanks, --Michiel. 
From bugzilla-daemon at portal.open-bio.org Fri May 7 04:49:11 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 7 May 2010 04:49:11 -0400
Subject: [Biopython-dev] [Bug 3042] test_Mafft_tool fails
In-Reply-To:
Message-ID: <201005070849.o478nBF0022623@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3042

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-07 04:49 EST -------
(In reply to comment #3)
>
> This is the output I am now getting:
>
> ======================================================================
> FAIL: Round-trip with complex command line.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_Mafft_tool.py", line 144, in test_Mafft_with_complex_command_line
>     % (return_code, cmdline))
> AssertionError: Got error code 1 back from:
> mafft --localpair --weighti 4.2 --retree 5 --maxiterate 200 --nofft --op 2.04
> --ep 0.51 --lop 0.233 --lep 0.2 --reorder --treeout --nuc Fasta/f002
>
> If I just run this mafft command directly, I get:
>
> $ mafft --localpair --weighti 4.2 --retree 5 --maxiterate 200 --nofft --op 2.04
> --ep 0.51 --lop 0.233 --lep 0.2 --reorder --treeout --nuc Fasta/f002
> /usr/local/bin/mafft: line 184: [: --treeout: integer expression expected
> Unknown option: --treeout
>
> MAFFT version 5.732 (2005/09/14)
> ...

It looks to me like some of the other arguments the test is trying to use are
also not supported on this version of MAFFT.

> > Also what is the output of "mafft --help" from MAFFT 5.732? That would be
> > useful if we do have to make running the test conditional on the version of
> > MAFFT.
> > > > This is the output of "mafft --help": > > $ mafft --help > Cannot open --help. > > MAFFT version 5.732 (2005/09/14) > ... Great. That's enough to be able to detect the version number. Note that MAFFT v6 doesn't support the --help argument; the point is that it will abort with help text rather than sit waiting on stdin. I've updated the unit test to require MAFFT v6 or later, which should resolve this bug. http://github.com/biopython/biopython/commit/e2219beb156e80b55da3efb4a8efe2c2347ec877 Thanks for your help, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 7 05:21:08 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 7 May 2010 05:21:08 -0400 Subject: [Biopython-dev] [Bug 3042] test_Mafft_tool fails In-Reply-To: Message-ID: <201005070921.o479L8bl023566@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3042 ------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2010-05-07 05:21 EST ------- (In reply to comment #4) > I've updated the unit test to require MAFFT v6 or later, which should resolve > this bug. Thanks. This works fine now. --Michiel. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
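The version check discussed in Bug 3042 can be sketched roughly as follows. This is only an illustration and not the code from the commit linked above: the function names are hypothetical, and the "MAFFT v6.x" banner style accepted by the pattern is an assumption (only the "MAFFT version 5.732 (2005/09/14)" banner appears in this thread).

```python
import re

# Matches the banner quoted in the thread, "MAFFT version 5.732 (2005/09/14)".
# The shorter "MAFFT v6.717b" style is an ASSUMPTION about newer releases.
_BANNER = re.compile(r"MAFFT v(?:ersion )?(\d+)\.(\d+)")


def mafft_version(banner_text):
    """Return (major, minor) parsed from MAFFT's usage text, or None."""
    match = _BANNER.search(banner_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported(banner_text, minimum_major=6):
    """True if the banner reports at least the required major version."""
    version = mafft_version(banner_text)
    return version is not None and version[0] >= minimum_major


if __name__ == "__main__":
    # Text as reported in comment #3 for the old installation.
    old_banner = "Cannot open --help.\n\nMAFFT version 5.732 (2005/09/14)\n"
    print(mafft_version(old_banner))
    print(is_supported(old_banner))
```

A test module could call something like is_supported() on the captured output and skip the MAFFT tests when it returns False, rather than failing on options the old binary does not know.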
From bugzilla-daemon at portal.open-bio.org Fri May 7 05:42:45 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 7 May 2010 05:42:45 -0400 Subject: [Biopython-dev] [Bug 3042] test_Mafft_tool fails In-Reply-To: Message-ID: <201005070942.o479gj3E024144@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3042 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-07 05:42 EST ------- (In reply to comment #5) > > Thanks. This works fine now. > Great - marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 7 07:15:28 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 7 May 2010 07:15:28 -0400 Subject: [Biopython-dev] [Bug 3043] test_NCBI_BLAST_tools fails In-Reply-To: Message-ID: <201005071115.o47BFSuG029162@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3043 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-07 07:15 EST ------- Michiel, Could you run this test using the latest code? I've added a hack to ignore the three "extra" arguments, -remote_verbose, -use_test_remote_service, and -verbose so it should work. We can probably then comment this out for general usage because... I've also added a few real tests using the pairwise search functionality in BLAST+ where you can search a FASTA file of queries against a FASTA file of subjects -- without having to setup a BLAST database first. This is rather nice. 
However the tool will not output XML in this mode, and it seems right now we can't parse the plain text output. Tabular output should be fine. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 7 07:39:45 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 7 May 2010 07:39:45 -0400 Subject: [Biopython-dev] [Bug 3043] test_NCBI_BLAST_tools fails In-Reply-To: Message-ID: <201005071139.o47BdjRT030107@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3043 ------- Comment #7 from mdehoon at ims.u-tokyo.ac.jp 2010-05-07 07:39 EST ------- (In reply to comment #6) > However the tool will not output XML in this mode, and it seems right > now we can't parse the plain text output. Tabular output should be fine. I'll put together a BLAST output parser after Biopython 1.54 final is out. > Could you run this test using the latest code? This works fine now. Thanks! $ python test_NCBI_BLAST_tools.py Check all blastn arguments are supported ... ok Check all blastp arguments are supported ... ok Check all blastx arguments are supported ... ok Check all psiblast arguments are supported ... ok Check all rpsblast arguments are supported ... ok Check all rpstblastn arguments are supported ... ok Check all tblastn arguments are supported ... ok Check all tblastx arguments are supported ... ok Pairwise BLASTP search ... ok Pairwise BLASTN search ... ok ---------------------------------------------------------------------- Ran 10 tests in 0.389s OK -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From krother at rubor.de Fri May 7 09:23:56 2010 From: krother at rubor.de (Kristian Rother) Date: Fri, 7 May 2010 15:23:56 +0200 Subject: [Biopython-dev] Added Loop Closure algorithm to rna branch. Message-ID: <75a2442e49a53379b4cf707eee2a1f05-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheQ1pWUA5cWQ==-webmailer2@server03.webmailer.hosteurope.de> Hi, We've prepared some code that we would like to propose for inclusion to the community: Bio.PDB.CoordBuilder Constructs 3D coordinates based on 3 atoms + distance + angle + dihedral. Uses the NeRF algorithm also used in the ROSETTA protein prediction program. Bio.PDB.FCCDLoopCloser Iteratively optimizes a dangling chain of atoms until they reach a defined target site. Refactored version of Wouter Boomsma's and Thomas Hamelryck's algorithm. Code + tests have been committed to krother/biopython, branch *rna* on github. We hope this is useful. Cheers, Kristian, Magdalena, Tomek From biopython at maubp.freeserve.co.uk Fri May 7 09:42:02 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 7 May 2010 14:42:02 +0100 Subject: [Biopython-dev] Added Loop Closure algorithm to rna branch. In-Reply-To: <75a2442e49a53379b4cf707eee2a1f05-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheQ1pWUA5cWQ==-webmailer2@server03.webmailer.hosteurope.de> References: <75a2442e49a53379b4cf707eee2a1f05-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheQ1pWUA5cWQ==-webmailer2@server03.webmailer.hosteurope.de> Message-ID: On Fri, May 7, 2010 at 2:23 PM, Kristian Rother wrote: > > Hi, > > We've prepared some code that we would like to propose for inclusion to > the community: > > Bio.PDB.CoordBuilder > Constructs 3D coordinates based on 3 atoms + distance + angle + > dihedral. > Uses the NeRF algorithm also used in the ROSETTA protein prediction > program. > > Bio.PDB.FCCDLoopCloser > Iteratively optimizes a dangling chain of atoms until they >
reach a defined target site. > Refactored version of Wouter Boomsma's and Thomas Hamelryck's algorithm. > > > Code + tests have been committed to krother/biopython, branch *rna* on github. > > We hope this is useful. > > Cheers, > Kristian, Magdalena, Tomek > Sounds very interesting. Eric - can you take a look at this? We could potentially merge this after Biopython 1.54 is out the door... Peter From bugzilla-daemon at portal.open-bio.org Fri May 7 10:47:02 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 7 May 2010 10:47:02 -0400 Subject: [Biopython-dev] [Bug 3074] New: Please support additional fields in the SeqIO embl parser Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3074 Summary: Please support additional fields in the SeqIO embl parser Product: Biopython Version: 1.53 Platform: PC OS/Version: Linux Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: Wim.DeSmet at UGent.be Sequences returned from the Bio.SeqIO parser for 'embl' files don't contain a parsed version of at least the following fields: DT (date) DR (database cross references) Possibly also missing: KW the keywords field dataclass field in the ID field It would be useful to me, and I imagine to others, to have access to these additional fields that are in the original embl files. Without them, parsing an embl file, manipulating the sequence and writing out the result means either losing this data or manually adding the original data back into the file if you wish to hold on to it. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
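To make the request in Bug 3074 concrete, here is a minimal sketch of pulling the DT, DR and KW lines out of an EMBL flat-file record in plain Python. It is not the Bio.SeqIO implementation; the function name is hypothetical and the sample record is invented purely for illustration. It assumes the usual EMBL layout of a two-letter line code followed by three spaces.

```python
from collections import defaultdict


def collect_line_codes(record_text, wanted=("DT", "DR", "KW")):
    """Group the content of selected EMBL line types by their two-letter code.

    Assumes each line starts with a two-letter code and that the content
    begins at the sixth character (code plus three spaces).
    """
    fields = defaultdict(list)
    for line in record_text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code in wanted and content:
            fields[code].append(content)
    return dict(fields)


# Invented sample record, NOT a real database entry.
sample = """\
ID   XX000001; SV 1; linear; genomic DNA; STD; PRO; 60 BP.
DT   01-JAN-2000 (Rel. 1, Created)
DT   02-FEB-2001 (Rel. 2, Last updated, Version 2)
DR   ExampleDB; EX0001.
KW   example keyword; another keyword.
//
"""

if __name__ == "__main__":
    for code, values in sorted(collect_line_codes(sample).items()):
        print(code, values)
```

Values collected this way could be kept in a dictionary alongside the parsed record and written back out later, which is the round trip the report asks for.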
From updates at feedmyinbox.com Sun May 9 02:08:19 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Sun, 9 May 2010 02:08:19 -0400 Subject: [Biopython-dev] 5/9 BioStar - Biopython Questions Message-ID: ================================================== 1. I wonder why it is so important to use Seq objects in stead of plain ol' strings in Biopyton? ================================================== May 8, 2010 at 5:10 PM This may seem like a superfluous question, and perhaps it is, but it's important to get the basic raison d'etres of the programming habits that are encouraged in the tutorial straight. (Wow that's a strange and awkward sentence but it's good to write in English and show off literal non-skills.) In short: Why use: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna) >>> messenger_rna Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA()) >>> messenger_rna.translate() Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*')) When you can simply use: >>> from Bio.Seq import translate >>> my_string_messenger_rna = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG" >>> translate(my_string_messenger_rna) 'MAIVMGR*KGAR*' http://biostar.stackexchange.com/questions/1001/i-wonder-why-it-is-so-important-to-use-seq-objects-in-stead-of-plain-ol-strings -------------------------------------------------- =========================================================== Source: http://biostar.stackexchange.com/questions/tagged/biopython This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com. 
230 Franklin Road Suite 814 Franklin, TN 37064 From bugzilla-daemon at portal.open-bio.org Wed May 12 19:41:03 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 12 May 2010 19:41:03 -0400 Subject: [Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records In-Reply-To: Message-ID: <201005122341.o4CNf3Ps001897@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #11 from laserson at mit.edu 2010-05-12 19:41 EST ------- Hi Peter, Sorry for my short hiatus...see responses below. (In reply to comment #10) > Could you retest as "embl" format with the trunk? I would expect some warnings > from these over indented features in IMGT, and we can certainly remove the > warning if we decide not to introduce a separate IMGT format variant. I still get the LocationParserErrors for many records. Also note that the SeqIO.index function doesn't treat the IMGT headers correctly, so it's not possible to access any of the records from the index it creates (this was also addressed in my patch where I subclassed an independent IMGT parser). > > http://github.com/biopython/biopython/commit/e6ba962dd60fe585baa1237445d33f67d47dd57f > > This change takes a slightly different approach to your work on github, but > is quite similar to your two line patch - but this should still work with > another odd form: > > FH Key Location/Qualifiers > FT L-V-D-J-C-SEQUEN1..1151 > FT /db_xref="taxon:32630" > FT /organism="synthetic construct" > FT 5'UTR 1..37 > ... I still couldn't get the current master branch 'embl' format to work. But hardcoding the alternate indentation did work, even in the cases where the feature key is right up against the location qualifier. > > In the above example (generated by Biopython itself), the strict EMBL column > limits have been obeyed but the feature key has been truncated to just > L-V-D-J-C-SEQUEN rather than L-V-D-J-C-SEQUENCE. 
This is a related query - > when asked to output such a feature as EMBL or GenBank format, should we raise > an exception here? We could add a warning instead, and either leave the code > as is, or output this: > > FH Key Location/Qualifiers > FT L-V-D-J-C-SEQUE 1..1151 > FT /db_xref="taxon:32630" > FT /organism="synthetic construct" > FT 5'UTR 1..37 > ... > I think we should probably output all IMGT records using the increased indentation. This way there will be no ambiguity and no information loss. If you want to manually "convert" to standard EMBL format, I think the truncation makes sense as you proposed it, and we could issue a warning about lost information. > > > Thinking ahead would you also want to be able to write out IMGT variant > > > EMBL files? > > > > I personally don't need this functionality, but I am willing to write it to > > complement the IMGT parser that I wrote. > > If we go down the route of formalising IMGT as an EMBL variant with a different > feature indent, it should just be a trivial subclass of the existing EMBL > writer object but with the indentation constant changed. > Agreed. > Note there are other problems in the IMGT data, including locations like > "1..428>" and "<1..328>" where the greater than should be BEFORE the location > (but we could probably cope with this all the same), and just "1." where half > the location is missing (which we can't really do much with other than treat > it as simply "1" instead?). I have already notified IMGT regarding the ">" problem, though they seem like they will be slow to change it. It's a very simple fix to the flatfile, and I did it manually with regular expressions. My preference is that we do NOT support the backwards notation, as it's clearly wrong. We'll have them fix it. In the meanwhile, I can post my python script that corrects it somewhere (maybe as a gist on github) and we can just point people to it in a warning if they are using the IMGT parser. Regarding the 1. 
problem, I have not yet told the IMGT people, but I will do so shortly. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From updates at feedmyinbox.com Thu May 13 02:12:15 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Thu, 13 May 2010 02:12:15 -0400 Subject: [Biopython-dev] 5/13 BioStar - Biopython Questions Message-ID: <1c49e04af2a3e80dd942fa741b65a7b2@74.63.51.88> ================================================== 1. Clustalw alignment problem ================================================== May 12, 2010 at 4:54 PM Hi everyone, I tried these lines ................................................ import os from Bio.Clustalw import MultipleAlignCL cline = MultipleAlignCL(os.path.join(os.curdir, "opuntia.fasta")) cline.set_output("test.aln") alignment = Clustalw.do_alignment(cline) ............................................. But couldn't proceed with these errors ............................................. Traceback (most recent call last): File "", line 244, in run_nodebug File "C:\Python24\align.py", line 5, in ? alignment = Clustalw.do_alignment(cline) File "C:\Python24\lib\site-packages\Bio\Clustalw\__init__.py", line 95, in do_alignment shell=(sys.platform!="win32") File "C:\Python24\lib\subprocess.py", line 534, in __init__ (p2cread, p2cwrite, File "C:\Python24\lib\subprocess.py", line 594, in _get_handles p2cread = self._make_inheritable(p2cread) File "C:\Python24\lib\subprocess.py", line 635, in _make_inheritable DUPLICATE_SAME_ACCESS) TypeError: an integer is required ................................................. test.aln is not generated too .................................................. 
Thanks http://biostar.stackexchange.com/questions/1041/clustalw-alignment-problem -------------------------------------------------- =========================================================== Source: http://biostar.stackexchange.com/questions/tagged/biopython This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com. 230 Franklin Road Suite 814 Franklin, TN 37064 From biopython at maubp.freeserve.co.uk Thu May 13 07:37:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 May 2010 12:37:36 +0100 Subject: [Biopython-dev] Ready for Biopython 1.54 release? Message-ID: Hello all, Are there any outstanding issues we should address before making the Biopython 1.54 release? Eric has made a good start on covering Bio.Phylo in the tutorial, which can be easily proof read online: http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf One thing I am wondering about is making column extraction in the new alignment object return a string rather than a Seq object. I'll start another thread on this issue... Peter From biopython at maubp.freeserve.co.uk Thu May 13 07:47:48 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 May 2010 12:47:48 +0100 Subject: [Biopython-dev] Alignment columns as strings or Seq objects? Message-ID: Peter wrote: > Hello all, > > Are there any outstanding issues we should address before making > the Biopython 1.54 release? > > ... > > One thing I am wondering about is making column extraction in > the new alignment object return a string rather than a Seq object. > I'll start another thread on this issue... 
I remember we debated this a bit before but can't find the thread right now. See also Bug 3066 where I am proposing to add methods to iterate over the rows or columns as strings. http://bugzilla.open-bio.org/show_bug.cgi?id=3066 The main benefit of using a plain string when extracting the alignment columns is speed. Because the data is stored by row, each time we extract a column we would have to build a new instance of the Seq object. For large alignments (and thinking ahead to next-gen alignment objects) this could be a painful overhead. Because the whole alignment has an alphabet, we can use this to assign an alphabet to a column sequence. Note that the rows of the alignments could have slightly different alphabets. So it is possible (and the current code does this) to generate a Seq object with a meaningful alphabet from a column. Why is this useful? Other than the alphabet, the main benefit of using a Seq object is consistency. On a practical level, the Seq object's biological translate method isn't appropriate at all for an alignment column. On the other hand, one might possibly want to use (back)transcribe to flip between DNA and RNA, and maybe even take the complement. Are there any strong views here on how alignment slicing to get a column should behave? i.e. should align[:,9] return the column as a string or as a Seq? Peter From mjldehoon at yahoo.com Thu May 13 20:29:59 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 13 May 2010 17:29:59 -0700 (PDT) Subject: [Biopython-dev] Alignment columns as strings or Seq objects? In-Reply-To: Message-ID: <948611.90987.qm@web62403.mail.re1.yahoo.com> I would definitely use a plain string. A Seq object suggests that we're dealing with a real biological sequence, which a column in the alignment matrix is not. The only advantage of having a Seq object is that it has an alphabet associated with it. But alphabets are very rarely used in practice, if at all. 
Reverse complementing or (back-)transcribing are available in the Bio.Seq module as functions that can operate on plain strings, so we don't need a Seq object for that. --Michiel. --- On Thu, 5/13/10, Peter wrote: > From: Peter > Subject: [Biopython-dev] Alignment columns as strings or Seq objects? > To: "Biopython-Dev Mailing List" > Date: Thursday, May 13, 2010, 7:47 AM > Peter wrote: > > Hello all, > > > > Are there any outstanding issues we should address > before making > > the Biopython 1.54 release? > > > > ... > > > > One thing I am wondering about is making column > extraction in > > the new alignment object return a string rather than a > Seq object. > > I'll start another thread on this issue... > > I remember we debated this a bit before but can't find the > thread right now. See also Bug 3066 where I am proposing > to add methods to iterate over the rows or columns as > strings. > http://bugzilla.open-bio.org/show_bug.cgi?id=3066 > > The main benefit of using a plain string when extracting > the > alignment columns is speed. Because the data is stored by > row, each time we extract a column we would have to build > a new instance of the Seq object. For large alignments > (and > thinking ahead to next-gen alignment objects) this could > be > a painful overhead. > > Because the whole alignment has an alphabet, we can use > this > to assign an alphabet to a column sequence. Note that the > rows > of the alignments could have slightly different alphabets. > So it > is possible (and the current code does this) to generate a > Seq > object with a meaningful alphabet from a column. > > Why is this useful? Other than the alphabet, the main > benefit > of using a Seq object is consistency. On a practical level, > the > Seq object's biological translate method isn't appropriate > at all > for an alignment column. On the other hand, one might > possibly > want to use (back)transcribe to flip between DNA and RNA, > and maybe even take the complement. 
> > Are there any strong views here on how alignment slicing > to > get a column should behave? i.e. should align[:,9] return > the > column as a string or as a Seq? > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From eric.talevich at gmail.com Thu May 13 22:28:13 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 13 May 2010 19:28:13 -0700 Subject: [Biopython-dev] Alignment columns as strings or Seq objects? In-Reply-To: <948611.90987.qm@web62403.mail.re1.yahoo.com> References: <948611.90987.qm@web62403.mail.re1.yahoo.com> Message-ID: Here's another +1 for plain strings. I agree with Michiel, and if the user really needs to rebuild a Seq with the original alphabet, it's not too difficult to fetch that information from the original alignment object. -Eric On Thu, May 13, 2010 at 5:29 PM, Michiel de Hoon wrote: > I would definitely use a plain string. A Seq object suggests that we're > dealing with a real biological sequence, which a column in the alignment > matrix is not. The only advantage of having a Seq object is that it has an > alphabet associated with it. But alphabets are very rarely used in practice, > if at all. Reverse complementing or (back-)transcribing are available in the > Bio.Seq module as functions that can operate on plain strings, so we don't > need a Seq object for that. > > --Michiel. > > --- On Thu, 5/13/10, Peter wrote: > > > From: Peter > > Subject: [Biopython-dev] Alignment columns as strings or Seq objects? > > To: "Biopython-Dev Mailing List" > > Date: Thursday, May 13, 2010, 7:47 AM > > Peter wrote: > > > Hello all, > > > > > > Are there any outstanding issues we should address > > before making > > > the Biopython 1.54 release? > > > > > > ... 
> > > > > > One thing I am wondering about is making column > > extraction in > > > the new alignment object return a string rather than a > > Seq object. > > > I'll start another thread on this issue... > > > > I remember we debated this a bit before but can't find the > > thread right now. See also Bug 3066 where I am proposing > > to add methods to iterate over the rows or columns as > > strings. > > http://bugzilla.open-bio.org/show_bug.cgi?id=3066 > > > > The main benefit of using a plain string when extracting > > the > > alignment columns is speed. Because the data is stored by > > row, each time we extract a column we would have to build > > a new instance of the Seq object. For large alignments > > (and > > thinking ahead to next-gen alignment objects) this could > > be > > a painful overhead. > > > > Because the whole alignment has an alphabet, we can use > > this > > to assign an alphabet to a column sequence. Note that the > > rows > > of the alignments could have slightly different alphabets. > > So it > > is possible (and the current code does this) to generate a > > Seq > > object with a meaningful alphabet from a column. > > > > Why is this useful? Other than the alphabet, the main > > benefit > > of using a Seq object is consistency. On a practical level, > > the > > Seq object's biological translate method isn't appropriate > > at all > > for an alignment column. On the other hand, one might > > possibly > > want to use (back)transcribe to flip between DNA and RNA, > > and maybe even take the complement. > > > > Are there any strong views here on how alignment slicing > > to > > get a column should behave? i.e. should align[:,9] return > > the > > column as a string or as a Seq? 
> > > > Peter > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From eric.talevich at gmail.com Thu May 13 23:08:52 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 13 May 2010 20:08:52 -0700 Subject: [Biopython-dev] Ready for Biopython 1.54 release? In-Reply-To: References: Message-ID: On Thu, May 13, 2010 at 4:37 AM, Peter wrote: > Hello all, > > Are there any outstanding issues we should address before making > the Biopython 1.54 release? > > Eric has made a good start on covering Bio.Phylo in the tutorial, > which can be easily proof read online: > http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html > http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf > > I wrote some more, but something very inconvenient happened to my laptop just after I pushed the commit to my own branch on GitHub: http://github.com/etal/biopython/commit/960449b1ea9b713ca8111c40d930fc404340fdc2 This is a description of every TreeMixin method, ripped from the docstrings / epydoc output and cleaned up a little. The main thing I still need to explain is how the find_* arguments work. However, if we want to get the release out quickly, then all of this can be moved to the wiki instead, if you'd prefer. -Eric From biopython at maubp.freeserve.co.uk Fri May 14 05:17:06 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 May 2010 10:17:06 +0100 Subject: [Biopython-dev] Ready for Biopython 1.54 release? 
In-Reply-To: References: Message-ID: On Fri, May 14, 2010 at 4:08 AM, Eric Talevich wrote: > On Thu, May 13, 2010 at 4:37 AM, Peter > wrote: >> >> Hello all, >> >> Are there any outstanding issues we should address before making >> the Biopython 1.54 release? >> >> Eric has made a good start on covering Bio.Phylo in the tutorial, >> which can be easily proof read online: >> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html >> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf >> > > I wrote some more, but something very inconvenient happened to my laptop > just after I pushed the commit to my own branch on GitHub: > http://github.com/etal/biopython/commit/960449b1ea9b713ca8111c40d930fc404340fdc2 > > This is a description of every TreeMixin method, ripped from the docstrings > / epydoc output and cleaned up a little. The main thing I still need to > explain is how the find_* arguments work. > > However, if we want to get the release out quickly, then all of this can be > moved to the wiki instead, if you'd prefer. > > -Eric Hi Eric, It sounds like giving you a little more time will make the Phylo chapter much more useful. I'm not going to have time today, and while I could do the release from home my Windows development machine is at work. Shall we aim for a release early next week then? Say Monday or Tuesday? Peter From biopython at maubp.freeserve.co.uk Fri May 14 06:23:30 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 May 2010 11:23:30 +0100 Subject: [Biopython-dev] Alignment columns as strings or Seq objects? In-Reply-To: References: <948611.90987.qm@web62403.mail.re1.yahoo.com> Message-ID: On Fri, May 14, 2010 at 3:28 AM, Eric Talevich wrote: > Here's another +1 for plain strings. I agree with Michiel, and if the user > really needs to rebuild a Seq with the original alphabet, it's not too > difficult to fetch that information from the original alignment object. > > -Eric OK, done. 
Peter From bugzilla-daemon at portal.open-bio.org Fri May 14 09:14:59 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 May 2010 09:14:59 -0400 Subject: [Biopython-dev] [Bug 3066] Iterating/looping over colums/rows of a MultipleSeqAlignment In-Reply-To: Message-ID: <201005141314.o4EDExXE021190@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3066 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-14 09:14 EST ------- (In reply to comment #0) > A related question here is should the columns be returned as strings or as Seq > objects? Possible implementation to follow as a patch... The main __getitem__ method has just been changed to return strings as of Biopython 1.54 (while the beta returned columns as Seq objects): http://github.com/biopython/biopython/commit/dbf72e19d65d1edd6777bd498306fe34eb4e371e Therefore for consistency, any column iterator method should now also return strings. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 14 09:33:48 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 May 2010 09:33:48 -0400 Subject: [Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records In-Reply-To: Message-ID: <201005141333.o4EDXm7Q021824@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #12 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-14 09:33 EST ------- (In reply to comment #11) > > I think we should probably output all IMGT records using the increased > indentation. This way there will be no ambiguity and no information loss. 
If > you want to manually "convert" to standard EMBL format, I think the truncation > makes sense as you proposed it, and we could issue a warning about lost > information. I've found a page describing the IMGT file format, and it does say their feature indent should be 26 (while EMBL files use 21): http://www.ebi.ac.uk/imgt/hla/docs/manual.html > > I have already notified IMGT regarding the ">" problem, though they seem like > they will be slow to change it. It's a very simple fix to the flatfile, and I > did it manually with regular expressions. My preference is that we do NOT > support the backwards notation, as it's clearly wrong. We'll have them fix > it. In the meanwhile, I can post my python script that corrects it somewhere > (maybe as a gist on github) and we can just point people to it in a warning if > they are using the IMGT parser. > > Regarding the 1. problem, I have not yet told the IMGT people, but I will do > so shortly. > The document I found does not discuss the details of the location, so I would expect it to follow the same rules as EMBL (and GenBank and the DDBJ), see: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html I now agree with you it makes sense to treat this as a new format in SeqIO (i.e. "imgt" rather than "embl"). The actual new code should be minimal too. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
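The two IMGT points in this thread (a feature indent of 26 rather than EMBL's 21, and the misplaced ">" in locations) can be illustrated with a small standalone sketch. This is not the Biopython parser or the script mentioned in comment #11; all function names are hypothetical, and the regular expression is only one assumed way of moving a trailing ">" in front of the end position.

```python
import re

EMBL_INDENT = 21  # column where the location starts in standard EMBL files
IMGT_INDENT = 26  # wider indent reported for IMGT files in this thread


def make_feature_line(key, location, indent):
    """Build an 'FT' line, padding the feature key out to the given indent."""
    return ("FT   " + key).ljust(indent) + location


def split_feature_line(line, indent):
    """Split an 'FT' line into (key, rest) at a fixed column."""
    if not line.startswith("FT"):
        raise ValueError("not a feature table line: %r" % line)
    return line[2:indent].strip(), line[indent:].strip()


def fix_trailing_gt(location):
    """Rewrite e.g. '1..428>' as '1..>428' (assumed form of the manual fix)."""
    return re.sub(r"(\d+)>$", r">\1", location)


if __name__ == "__main__":
    line = make_feature_line("L-V-D-J-C-SEQUENCE", "1..1151", IMGT_INDENT)
    # Reading with the IMGT indent keeps the full key...
    print(split_feature_line(line, IMGT_INDENT))
    # ...while the EMBL indent cuts the key short and pollutes the location.
    print(split_feature_line(line, EMBL_INDENT))
    print(fix_trailing_gt("<1..328>"))
```

Keeping the indent as a single constant is what would make an "imgt" reader or writer a near-trivial variant of the "embl" one, as suggested in the comments above.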
From bugzilla-daemon at portal.open-bio.org Fri May 14 09:44:46 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 May 2010 09:44:46 -0400 Subject: [Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records In-Reply-To: Message-ID: <201005141344.o4EDikeW022139@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-14 09:44 EST ------- (In reply to comment #12) > I've found a page describing the IMGT file format, and it does say their > feature indent should be 26 (while EMBL files use 21): > http://www.ebi.ac.uk/imgt/hla/docs/manual.html See also: http://imgt.cines.fr/download/LIGM-DB/userman_doc.html and: http://imgt.cines.fr/download/LIGM-DB/ftable_doc.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Fri May 14 12:14:15 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 14 May 2010 09:14:15 -0700 Subject: [Biopython-dev] Ready for Biopython 1.54 release? In-Reply-To: References: Message-ID: On Fri, May 14, 2010 at 2:17 AM, Peter wrote: > On Fri, May 14, 2010 at 4:08 AM, Eric Talevich > wrote: > > On Thu, May 13, 2010 at 4:37 AM, Peter > > wrote: > >> > >> Hello all, > >> > >> Are there any outstanding issues we should address before making > >> the Biopython 1.54 release? 
> >> > >> Eric has made a good start on covering Bio.Phylo in the tutorial, > >> which can be easily proof read online: > >> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html > >> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf > >> > > > > I wrote some more, but something very inconvenient happened to my laptop > > just after I pushed the commit to my own branch on GitHub: > > > http://github.com/etal/biopython/commit/960449b1ea9b713ca8111c40d930fc404340fdc2 > > > > This is a description of every TreeMixin method, ripped from the > docstrings > > / epydoc output and cleaned up a little. The main thing I still need to > > explain is how the find_* arguments work. > > > > However, if we want to get the release out quickly, then all of this can > be > > moved to the wiki instead, if you'd prefer. > > > > -Eric > > Hi Eric, > > It sounds like giving you a little more time will make the Phylo > chapter much more useful. > > I'm not going to have time today, and while I could do the release > from home my Windows development machine is at work. > > Shall we aim for a release early next week then? Say Monday or > Tuesday? > > Peter > Sure. I'll aim for pushing my documentation to GitHub on Saturday or Sunday. 
-Eric From bugzilla-daemon at portal.open-bio.org Fri May 14 13:05:06 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 May 2010 13:05:06 -0400 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: Message-ID: <201005141705.o4EH56ok028481@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2905 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-14 13:05 EST ------- The code on my branch has been updated, and now supports SAM and BAM parsing (currently it only extracts the read name, sequence and quality scores), indexing by name with Bio.SeqIO.index(), and fast conversion to FASTA or Sanger FASTQ with Bio.SeqIO.convert() which is handy for redoing a mapping: http://github.com/peterjc/biopython/tree/seqio-sam-bam Note that suffixes of "/1" or "/2" are added to forward or reverse read names to make them unique. This matches the Illumina pipeline convention and is handled by most tools which take paired end data. I'm actually using this code at the moment: I've started with BAM files of paired end Illumina transcriptome reads mapped onto a draft assembly. I then used the convert code to convert these to FASTQ files, then split them into a pair of FASTQ files (forward and reverse) and used BWA to remap them to a different reference (giving new SAM files). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sat May 15 06:32:39 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 15 May 2010 11:32:39 +0100 Subject: [Biopython-dev] Simpler SeqRecord creation Message-ID: Hi all, Since several of the changes coming in Biopython 1.54 are "syntactic sugar" like accepting filenames or handles in SeqIO, I was wondering about other ways to make life a little easier. 
One is creation of a SeqRecord with the default alphabet: from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord rec = SeqRecord(Seq("ACGT"), id="Test") If the SeqRecord __init__ checked for a plain string as the sequence, it could automatically upgrade it into a Seq object with the default alphabet, thus: from Bio.SeqRecord import SeqRecord rec = SeqRecord("ACGT", id="Test") I'm a little concerned that this will impose a small but noticeable overhead when working on very large files though... What are people's thoughts on this idea? Peter From mjldehoon at yahoo.com Sat May 15 12:19:05 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 15 May 2010 09:19:05 -0700 (PDT) Subject: [Biopython-dev] Simpler SeqRecord creation In-Reply-To: Message-ID: <140115.14021.qm@web62402.mail.re1.yahoo.com> Simpler SeqRecord creation is good in itself, but I wouldn't spend too much time on it. If hopefully we some day deprecate alphabets, then a Seq object reduces to a string anyway. --Michiel. --- On Sat, 5/15/10, Peter wrote: > From: Peter > Subject: [Biopython-dev] Simpler SeqRecord creation > To: "Biopython-Dev Mailing List" > Date: Saturday, May 15, 2010, 6:32 AM > Hi all, > > Since several of the changes coming in Biopython 1.54 > are "syntactic sugar" like accepting filenames or handles > in SeqIO, I was wondering about other ways to make life > a little easier. One is creation of a SeqRecord with the > default alphabet: > > from Bio.Seq import Seq > from Bio.SeqRecord import SeqRecord > rec = SeqRecord(Seq("ACGT"), id="Test") > > If the SeqRecord __init__ checked for a plain string as > the sequence, it could automatically upgrade it into a > Seq object with the default alphabet, thus: > > from Bio.SeqRecord import SeqRecord > rec = SeqRecord("ACGT", id="Test") > > I'm a little concerned that this will impose a small > but noticeable overhead when working on very > large files though... > > What are people's thoughts on this idea?
> > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From chapmanb at 50mail.com Sat May 15 13:53:20 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 15 May 2010 13:53:20 -0400 Subject: [Biopython-dev] Simpler SeqRecord creation In-Reply-To: <140115.14021.qm@web62402.mail.re1.yahoo.com> References: <140115.14021.qm@web62402.mail.re1.yahoo.com> Message-ID: <20100515175320.GA2432@kunkel> Peter and Michiel; > > If the SeqRecord __init__ checked for a plain string as > > the sequence, it could automatically upgrade it into a > > Seq object with the default alphabet, thus: > > > > from Bio.SeqRecord import SeqRecord > > rec = SeqRecord("ACGT", id="Test") > Simpler SeqRecord creation is good in itself, but I wouldn't spend too > much time on it. If hopefully we some day deprecate alphabets, then a > Seq object reduces to a string anyway. Accepting strings seems like a good way to start a transition from Seq objects to standard strings. +1 for this. It would also be useful if the defaults for id, name and description were empty strings instead of "." These don't seem especially useful, and when generating SeqRecords and writing them to FASTA, this helps avoid having to explicitly set descriptions to an empty string.
Brad From biopython at maubp.freeserve.co.uk Sun May 16 07:48:49 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 16 May 2010 12:48:49 +0100 Subject: [Biopython-dev] Simpler SeqRecord creation In-Reply-To: <20100515175320.GA2432@kunkel> References: <140115.14021.qm@web62402.mail.re1.yahoo.com> <20100515175320.GA2432@kunkel> Message-ID: On Sat, May 15, 2010 at 6:53 PM, Brad Chapman wrote: > Peter and Michiel; > >> > If the SeqRecord __init__ checked for a plain string as >> > the sequence, it could automatically upgrade it into a >> > Seq object with the default alphabet, thus: >> > >> > from Bio.SeqRecord import SeqRecord >> > rec = SeqRecord("ACGT", id="Test") > >> Simpler SeqRecord creation is good in itself, but I wouldn't spend too >> much time on it. If hopefully we some day deprecate alphabets, then a >> Seq object reduces to a string anyway. > > Accepting strings seems like a good way to start a transition from > Seq objects to standard strings. +1 for this. I'm not convinced about moving from Seq objects to plain strings. I *like* having the biological methods as part of the Seq object. I can also think of several useful Seq subclass objects such as 2bit encoded unambiguous DNA or RNA (BioJava has this) or 4bit encoded ambiguous DNA or RNA (the BAM format uses this). These would be a trade-off, using less memory at the expense of being a bit slower for many operations - they could be very useful in dealing with next generation sequence data. > It would also be useful if the defaults for id, name and description > were empty strings instead of "." These don't seem > especially useful, and when generating SeqRecords and writing them > to Fasta, this helps avoid having to explicitly set descriptions to > an empty string. Yes, I like that idea for name and description. I'm not 100% sure about having a default ID - I'd prefer that was mandatory since so much depends on it (e.g.
SeqIO and AlignIO), and a default of the empty string may have side effects. Changing these defaults won't hurt performance which is good. Something to change after we release Biopython 1.54 this coming week? Peter From eric.talevich at gmail.com Sun May 16 12:20:15 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 16 May 2010 12:20:15 -0400 Subject: [Biopython-dev] Ready for Biopython 1.54 release? In-Reply-To: References: Message-ID: On Fri, May 14, 2010 at 12:14 PM, Eric Talevich wrote: > On Fri, May 14, 2010 at 2:17 AM, Peter wrote: > >> On Fri, May 14, 2010 at 4:08 AM, Eric Talevich >> wrote: >> > On Thu, May 13, 2010 at 4:37 AM, Peter > > >> > wrote: >> >> >> >> Hello all, >> >> >> >> Are there any outstanding issues we should address before making >> >> the Biopython 1.54 release? >> >> >> >> Eric has made a good start on covering Bio.Phylo in the tutorial, >> >> which can be easily proof read online: >> >> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html >> >> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf >> >> >> > >> > I wrote some more, but something very inconvenient happened to my laptop >> > just after I pushed the commit to my own branch on GitHub: >> > >> http://github.com/etal/biopython/commit/960449b1ea9b713ca8111c40d930fc404340fdc2 >> > >> > This is a description of every TreeMixin method, ripped from the >> docstrings >> > / epydoc output and cleaned up a little. The main thing I still need to >> > explain is how the find_* arguments work. >> > >> > However, if we want to get the release out quickly, then all of this can >> be >> > moved to the wiki instead, if you'd prefer. >> > >> > -Eric >> >> Hi Eric, >> >> It sounds like giving you a little more time will make the Phylo >> chapter much more useful. >> >> I'm not going to have time today, and while I could do the release >> from home my Windows development machine is at work. >> >> Shall we aim for a release early next week then? Say Monday or >> Tuesday? 
>> >> Peter >> > > Sure. I'll aim for pushing my documentation to GitHub on Saturday or > Sunday. > > -Eric > I've pushed the latest docs to GitHub. Does the chapter look all right now? Feel free to modify the text as you see fit; I'm going to be traveling again today and tomorrow and won't be able to respond quickly. (My netbook's still misbehaving, so I had to do some things I'm not proud of -- hence "root" as the committer on the last merge.) -Eric From biopython at maubp.freeserve.co.uk Mon May 17 03:37:48 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 May 2010 08:37:48 +0100 Subject: [Biopython-dev] Test failures on Jython 2.5.1 Message-ID: Hi all (especially Eric), I've just run the test suite with Jython 2.5.1 and this found some new problems. Most of these are XML related from Bio.Phylo ERROR: Round-trip parsing and serialization of apaf.xml. ExpatError: The element type "phy:clade" must be terminated by the matching end-tag "". ERROR: Round-trip parsing and serialization of bcl_2.xml. ExpatError: The element type "phy:branch_length" must be terminated by the matching end-tag "". ERROR: Round-trip parsing and serialization of o_tol_332_d_dollo.xml. ExpatError: XML document structures must start and end within the same entity. ERROR: Round-trip parsing and serialization of made_up.xml. ExpatError: Premature end of file. ERROR: Round-trip parsing and serialization of phyloxml_examples.xml. ExpatError: XML document structures must start and end within the same entity. It would probably be instructive to look at the serialisation output in an XML validator - if there is a problem it may be the Jython parser is stricter than the C Python XML parser. 
There are also a couple of SeqIO related problems with large files: ERROR: test_SeqIO OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space ERROR: Write and read back Human_contigs.embl OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space Example Tests/EMBL/Human_contigs.embl is causing the problem. This is a sequence of length 958952, but the file doesn't actually hold the sequence so we use an UnknownSeq object. The two out-of-heap-space errors are both from trying to create a string of 958952 "N" characters. I'll have a look at this - we can probably avoid it in the test. Peter From biopython at maubp.freeserve.co.uk Mon May 17 03:57:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 May 2010 08:57:15 +0100 Subject: [Biopython-dev] Test failures on Jython 2.5.1 In-Reply-To: References: Message-ID: On Mon, May 17, 2010 at 8:37 AM, Peter wrote: > Hi all (especially Eric), > > I've just run the test suite with Jython 2.5.1 and this found some new > problems. Most of these are XML related from Bio.Phylo > > ... > > There are also a couple of SeqIO related problems with large files: > > ERROR: test_SeqIO > OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space > ERROR: Write and read back Human_contigs.embl > OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space > > Example Tests/EMBL/Human_contigs.embl is causing the problem. > This is a sequence of length 958952, but the file doesn't actually > hold the sequence so we use an UnknownSeq object. The two > out-of-heap-space errors are both from trying to create a string of > 958952 "N" characters. I'll have a look at this - we can probably > avoid it in the test. Fixed on the trunk. Once we have moved to using string equality for Seq objects, comparing two UnknownSeq objects can be handled much more cleanly (without the memory overhead of the naive approach of building the strings).
Peter From bugzilla-daemon at portal.open-bio.org Mon May 17 19:44:07 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 17 May 2010 19:44:07 -0400 Subject: [Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records In-Reply-To: Message-ID: <201005172344.o4HNi7Ma008083@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #14 from laserson at mit.edu 2010-05-17 19:44 EST ------- (In reply to comment #12) > I now agree with you it makes sense to treat this as a new format in SeqIO > (i.e. "imgt" rather than "embl"). The actual new code should be minimal too. Great, so how do you want to implement this? I believe the patch I posted does define an 'imgt' format with all the necessary stuff other than writing. But if you'd like to make it more concise, let me know what to do. (The patch also doesn't incorporate the latest changes you made to the EMBL parser. Speaking of which, I was finally able to use the SeqIO.index function successfully using the parser. However, when there are feature keys flush against the location qualifiers, it still raises an exception.) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From updates at feedmyinbox.com Tue May 18 02:11:31 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Tue, 18 May 2010 02:11:31 -0400 Subject: [Biopython-dev] 5/18 BioStar - Biopython Questions Message-ID: <9255f60053f6ccfb752076b4c86c2c62@74.63.51.88> ================================================== 1. When should we develop biopython that support python 3.X'?? ================================================== May 17, 2010 at 9:17 PM As python 3.X becomming more and more popular,Can we developers take developing biopython that support python 3.X into consideration? 
I am a newer to biopython and find that biopython doesn't support python 3.X. It's really frustrated. Thank you ! http://biostar.stackexchange.com/questions/1083/when-should-we-develop-biopython-that-support-python-3-x -------------------------------------------------- =========================================================== Source: http://biostar.stackexchange.com/questions/tagged/biopython This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com. 230 Franklin Road Suite 814 Franklin, TN 37064 From eric.talevich at gmail.com Tue May 18 03:29:53 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 18 May 2010 00:29:53 -0700 Subject: [Biopython-dev] Test failures on Jython 2.5.1 In-Reply-To: References: Message-ID: On Mon, May 17, 2010 at 12:37 AM, Peter wrote: > Hi all (especially Eric), > > I've just run the test suite with Jython 2.5.1 and this found some new > problems. Most of these are XML related from Bio.Phylo > > ERROR: Round-trip parsing and serialization of apaf.xml. > ExpatError: The element type "phy:clade" must be terminated by the > matching end-tag "". > > ERROR: Round-trip parsing and serialization of bcl_2.xml. > ExpatError: The element type "phy:branch_length" must be terminated by > the matching end-tag "". > > ERROR: Round-trip parsing and serialization of o_tol_332_d_dollo.xml. > ExpatError: XML document structures must start and end within the same > entity. > > ERROR: Round-trip parsing and serialization of made_up.xml. > ExpatError: Premature end of file. > > ERROR: Round-trip parsing and serialization of phyloxml_examples.xml. 
> ExpatError: XML document structures must start and end within the same > entity. > > It would probably be instructive to look at the serialisation output in > an XML validator - if there is a problem it may be the Jython parser > is stricter than the C Python XML parser. > > (If you're gonna poke around in the bushes, be ready to stir up a few snakes...) Doing a round-trip parsing, rewriting and re-parsing of the test files manually works in Jython, and the XML output looks the same as it does from CPython. I don't immediately see why the test is failing, although I faintly recall reading that Jython's xml.etree implementation is/was a little short of fully baked -- maybe its parser is stopping early for some reason. I'm sorely tempted to just update the documentation to say Jython support is beta, since I hadn't tried it myself until you pointed this out. But now that we know about this bug, I suppose it warrants another day or so of fussing around with Jython internals. I'll report back after I've done that. -Eric From bugzilla-daemon at portal.open-bio.org Tue May 18 07:48:38 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 07:48:38 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-line files from IMGT In-Reply-To: Message-ID: <201005181148.o4IBmcof030759@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|More robust feature parser |Support for EMBL-line files |for GenBank/EMBL records |from IMGT ------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-18 07:48 EST ------- Retitling bug as "Support for EMBL-line files from IMGT in Bio.SeqIO".
Here is a start at parsing IMGT files based on subclassing the INSDC code with slightly more flexible feature handling: http://github.com/peterjc/biopython/tree/seqio-imgt I've been testing using http://imgt.cines.fr/download/LIGM-DB/imgt.dat.Z There are some interesting cases like AB114296 with: FT TRANSMEMBRANE-REGION2163..2240 We can if necessary work around some of the bad locations strings (see the above branch). Note that there are still other problems in the IMGT data like mismatched lengths. Uri - Could you explain what your code was trying to do with the record header parsing? An example or two would be great. Thanks! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue May 18 11:45:10 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 11:45:10 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-line files from IMGT In-Reply-To: Message-ID: <201005181545.o4IFjAhe007264@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #16 from laserson at mit.edu 2010-05-18 11:45 EST ------- (In reply to comment #15) > Uri - Could you explain what your code was trying to do with the record header > parsing? An example or two would be great. Thanks! So the approach I used was to keep the feature parser the exact same as it was in the EMBL parser. In the parse_header function, I would determine for each record what the indentation was, and then changed FEATURE_QUALIFIER_INDENT and FEATURE_QUALIFIER_SPACER for each record. This way, the standard EMBL parser would work fine, and there would never be any problems if the feature key was adjacent to the location qualifier. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue May 18 11:53:03 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 11:53:03 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-line files from IMGT In-Reply-To: Message-ID: <201005181553.o4IFr3Rr007470@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #17 from laserson at mit.edu 2010-05-18 11:53 EST ------- Also, here is a script that will fix the location errors with the '>' symbols. Run as: python fix_ligm_locations.py imgt.dat imgt.fixed.dat http://gist.github.com/405146 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue May 18 12:10:29 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 12:10:29 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-like files from IMGT In-Reply-To: Message-ID: <201005181610.o4IGATZb008256@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Support for EMBL-line files |Support for EMBL-like files |from IMGT |from IMGT ------- Comment #18 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-18 12:10 EST ------- (In reply to comment #16) > (In reply to comment #15) > > Uri - Could you explain what your code was trying to do with the record > > header parsing? An example or two would be great. Thanks! 
> > So the approach I used was to keep the feature parser the exact same as it was > in the EMBL parser. In the parse_header function, I would determine for each > record what the indentation was, and then changed FEATURE_QUALIFIER_INDENT and > FEATURE_QUALIFIER_SPACER for each record. This way, the standard EMBL parser > would work fine, and there would never be any problems if the feature key was > adjacent to the location qualifier. > I see now. If the IMGT have consistent FH and FT lines we can trust, that would be quite elegant... on the other hand, to fix the nasty locations we are forced to subclass parse_features anyway. (In reply to comment #17) > Also, here is a script that will fix the location errors with the '>' > symbols. > > Run as: > > python fix_ligm_locations.py imgt.dat imgt.fixed.dat > > http://gist.github.com/405146 > I've used your regular expression solution in my branch now, http://github.com/biopython/biopython Remind me to add your name as a contributor once this gets merged to the trunk. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
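[Editorial sketch, not part of the original thread.] The per-record approach described in comment #16, measuring the indentation from the FH header line before parsing the FT lines, might look roughly like this. It is a sketch under assumptions: the function name feature_indent_from_header is hypothetical, and it assumes every FH header line contains the literal text "Location/Qualifiers" (as the IMGT example quoted elsewhere in this thread does). It is not the code from the patch or from the seqio-imgt branch.

```python
def feature_indent_from_header(fh_line, marker="Location/Qualifiers"):
    """Return the 0-based column where the location column starts.

    fh_line is an "FH   Key ... Location/Qualifiers" header line.
    Hypothetical helper for illustration only.
    """
    if not fh_line.startswith("FH"):
        raise ValueError("not a feature header line: %r" % fh_line)
    column = fh_line.find(marker)
    if column == -1:
        raise ValueError("no %r column in header: %r" % (marker, fh_line))
    return column


# Example headers built with explicit padding (layouts assumed, not copied
# from a real file): EMBL style and the wider IMGT style.
embl_header = "FH   " + "Key".ljust(16) + "Location/Qualifiers"
imgt_header = "FH   " + "Key".ljust(21) + "Location/Qualifiers (from EMBL)"

print(feature_indent_from_header(embl_header))
print(feature_indent_from_header(imgt_header))
```

The measured value could then be used in place of a hard-coded indent for that record, which is the idea behind adjusting FEATURE_QUALIFIER_INDENT per record.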
From bugzilla-daemon at portal.open-bio.org Tue May 18 13:36:44 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 13:36:44 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-like files from IMGT In-Reply-To: Message-ID: <201005181736.o4IHahUL011171@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #19 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-18 13:36 EST ------- (In reply to comment #18) > > I've used your regular expression solution in my branch now, > http://github.com/biopython/biopython > Sorry - I pasted the wrong URL, I mean here: http://github.com/peterjc/biopython/tree/seqio-imgt I've found an even worse example of partial location example from http://imgt.cines.fr/download/LIGM-DB/imgt.dat.Z ID M97158 IMGT/LIGM annotation : keyword level; RNA; ROD; 1093 BP. XX ... XX FH Key Location/Qualifiers (from EMBL) FH FT source 1. FT /organism="Mus musculus" FT mRNA join(523. FT intron 1. FT exon 523. FT intron 541. FT exon 638. FT intron 745. XX ... You can see the original at EMBL, http://www.ebi.ac.uk/cgi-bin/emblfetch?style=html&id=M97158&Submit=Go Or in GenBank format at the NCBI, http://www.ncbi.nlm.nih.gov/nuccore/200865 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
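[Editorial sketch, not part of the original thread.] Truncated locations like those in the M97158 record quoted in comment #19 ("1.", "join(523.") could be flagged before parsing with a simple heuristic. This is only a sketch: the function name looks_truncated is hypothetical, the two rules (a trailing dot, or unbalanced parentheses) are assumptions chosen to match the examples above, and the complete locations used for contrast are made-up coordinates.

```python
def looks_truncated(location):
    """Heuristically flag a feature location string as cut short.

    Hypothetical helper: a location is treated as truncated if it is
    empty, has unbalanced parentheses, or ends with a dot (as in "1."
    where "1..1093" or similar was presumably intended).
    """
    location = location.strip()
    if not location:
        return True
    if location.count("(") != location.count(")"):
        return True
    return location.endswith(".")


# The first two strings come from the record quoted above; the last two
# are invented examples of well-formed locations.
for text in ["1.", "join(523.", "1..1093", "join(523..540,638..744)"]:
    print(text, looks_truncated(text))
```

A check like this only detects the problem; as noted in the thread, there is no way to recover the missing coordinates from the IMGT file itself.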
From bugzilla-daemon at portal.open-bio.org Tue May 18 14:37:17 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 14:37:17 -0400 Subject: [Biopython-dev] [Bug 3074] Please support additional fields in the SeqIO embl parser In-Reply-To: Message-ID: <201005181837.o4IIbHO6013307@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3074 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-18 14:37 EST ------- The database and primary accessions from DR lines are now recorded in the SeqRecord's dbxrefs list: http://github.com/biopython/biopython/commit/d96ab570b196b1b92f65aa945ae6816a60ddb54e The best way of dealing with secondary accessions in a backwards compatible way isn't clear to me - probably as another colon separated entry. See: http://lists.open-bio.org/pipermail/biopython/2010-May/006495.html Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue May 18 14:38:31 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 14:38:31 -0400 Subject: [Biopython-dev] [Bug 3043] test_NCBI_BLAST_tools fails In-Reply-To: Message-ID: <201005181838.o4IIcVlC013353@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3043 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-18 14:38 EST ------- Marking this as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
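[Editorial sketch, not part of the original thread.] The DR-line handling mentioned for bug 3074 above (database and primary accession recorded in the dbxrefs list) can be pictured as follows. This is a sketch under assumptions, not the code from the linked commit: the function name dr_line_to_dbxref is hypothetical, the "database:accession" string form is assumed from the remark about a further "colon separated entry", the DR line is assumed to hold semicolon-separated fields ending in a full stop, and the example accession is invented.

```python
def dr_line_to_dbxref(line):
    """Turn an EMBL "DR" line into a "database:primary" string.

    Hypothetical helper. Any secondary identifier on the line is
    ignored here, since the thread leaves its handling undecided.
    """
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError("DR line lacks database or accession: %r" % line)
    return "%s:%s" % (fields[0], fields[1])


# Invented example line; the identifiers are placeholders.
example = "DR   ExampleDB; ABC12345; SECONDARY_ID."
print(dr_line_to_dbxref(example))
```

Appending the secondary identifier as a third colon-separated part would be one reading of the suggestion above, but that is left out here because the thread does not settle it.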
From bugzilla-daemon at portal.open-bio.org Tue May 18 17:46:24 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 17:46:24 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-like files from IMGT In-Reply-To: Message-ID: <201005182146.o4ILkOWD018452@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #20 from laserson at mit.edu 2010-05-18 17:46 EST ------- (In reply to comment #18) > I see now. If the IGMT have consistent FH and FT lines we can trust, that would > be quite elegant... on the other hand to fix the nasty locations we are forced > to subclass parse_features anyway. My impression so far is that we can trust their feature indentations. At least all their FH lines are one of two indentations, and we can measure the indentation on all the FT lines. Once we get a parser that generally works, I'm going to make a list of all the accessions that have actual errors and submit to IMGT. In the meanwhile, I'll personally settle for catching those exceptions and skipping those records. > Remind me to add your name as a contributor once this gets merged to the trunk. That's very kind. I'm glad I can contribute to a worthy and incredibly useful project. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue May 18 17:53:35 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 17:53:35 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-like files from IMGT In-Reply-To: Message-ID: <201005182153.o4ILrZKm018617@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #21 from laserson at mit.edu 2010-05-18 17:53 EST ------- (In reply to comment #19) > Sorry - I pasted the wrong URL, I mean here: > http://github.com/peterjc/biopython/tree/seqio-imgt I'm still not sure where you integrated the regular expression. > I've found an even worse example of partial location example from > http://imgt.cines.fr/download/LIGM-DB/imgt.dat.Z > > ID M97158 IMGT/LIGM annotation : keyword level; RNA; ROD; 1093 BP. > XX > > ... > > You can see the original at EMBL, > http://www.ebi.ac.uk/cgi-bin/emblfetch?style=html&id=M97158&Submit=Go > > Or in GenBank format at the NCBI, > http://www.ncbi.nlm.nih.gov/nuccore/200865 That is awful! And there is no excuse for it either, as they should've just taken the coords from EMBL. I feel as though we should leave these problems as errors, and have IMGT fix them. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue May 18 18:26:25 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 18:26:25 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-like files from IMGT In-Reply-To: Message-ID: <201005182226.o4IMQPtB019429@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #22 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-18 18:26 EST ------- (In reply to comment #21) > (In reply to comment #19) > > > Sorry - I pasted the wrong URL, I mean here: > > http://github.com/peterjc/biopython/tree/seqio-imgt > > I'm still not sure where you integrated the regular expression. File Bio/GenBank/Scanner, this commit: http://github.com/peterjc/biopython/commit/a41db092a40542944158278f2cc26517cd464b60 > > I've found an even worse example of partial location example from > > http://imgt.cines.fr/download/LIGM-DB/imgt.dat.Z > > > > ID M97158 IMGT/LIGM annotation : keyword level; RNA; ROD; 1093 BP. > > XX > > > > ... > > > > You can see the original at EMBL, > > http://www.ebi.ac.uk/cgi-bin/emblfetch?style=html&id=M97158&Submit=Go > > > > Or in GenBank format at the NCBI, > > http://www.ncbi.nlm.nih.gov/nuccore/200865 > > That is awful! And there is no excuse for it either, as they should've just > taken the coords from EMBL. I feel as though we should leave these problems > as errors, and have IMGT fix them. In this case (and the other locations with missing text) there is no good work around so I would agree - get the IMGT to fix them. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue May 18 19:06:21 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 19:06:21 -0400 Subject: [Biopython-dev] [Bug 3016] Change WriterTests in test_PhyloXML.py to use StringIO or temp files In-Reply-To: Message-ID: <201005182306.o4IN6LCj020428@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3016 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from eric.talevich at gmail.com 2010-05-18 19:06 EST ------- Fixed in GitHub: http://github.com/biopython/biopython/commit/ad1a618def838d98432e9623367cffb595eadecd The patch uses one named temp file for everything, closes file handles diligently, and deletes the temp file at the end of the script. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Tue May 18 19:26:33 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 18 May 2010 16:26:33 -0700 Subject: [Biopython-dev] Test failures on Jython 2.5.1 In-Reply-To: References: Message-ID: On Mon, May 17, 2010 at 12:37 AM, Peter wrote: > Hi all (especially Eric), > > I've just run the test suite with Jython 2.5.1 and this found some new > problems. Most of these are XML related from Bio.Phylo > > ERROR: Round-trip parsing and serialization of apaf.xml. > ExpatError: The element type "phy:clade" must be terminated by the > matching end-tag "". > > ... 
> Fixed in GitHub: http://github.com/biopython/biopython/commit/ad1a618def838d98432e9623367cffb595eadecd I couldn't replicate this crash doing anything remotely normal in the Jython interpreter, but the rewriting scheme in the PhyloXML unit test suite crashed with various confusing tracebacks. The rewritten files were valid, though, and could be read by Jython outside the test suite. I know Jython doesn't clean up file handles as diligently as CPython does, so my best guess is that some file handles remained open or were reused/resurrected while parsing the rewritten files -- i.e. during the second parse, Jython's XML parser either started someplace other than the start of the file, or terminated early/late, expecting the rewritten file to have the same size as the original (which it doesn't because of collapsed whitespace). My patch reworks the file rewriting scheme and manages file handles obsessively; the PhyloXML parser itself stays the same. -Eric From bugzilla-daemon at portal.open-bio.org Tue May 18 20:30:53 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 20:30:53 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-like files from IMGT In-Reply-To: Message-ID: <201005190030.o4J0UrSW022165@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #23 from laserson at mit.edu 2010-05-18 20:30 EST ------- (In reply to comment #22) So I tried parsing the whole imgt.dat file, and we do pretty well. The only two problems I see are the broken location qualifiers, and a few records where the lengths annotated in their ID strings don't match the actual lengths of the sequences. > In this case (and the other locations with missing text) there is no good work > around so I would agree - get the IMGT to fix them. So let's go ahead and change the warnings back to errors.
In the meantime, we can parse properly using the SeqIO.index function and just catch and ignore all the bad records. And I will compile a list of bad records and give them to the curators at IMGT. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From updates at feedmyinbox.com Wed May 19 02:13:07 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Wed, 19 May 2010 02:13:07 -0400 Subject: [Biopython-dev] 5/19 Stack Overflow - Biopython questions Message-ID: <2d537c12e961d821ff59bb55f3a10502@74.63.51.88> ================================================== 1. Can anyone tell me why these lines are not working? ================================================== May 18, 2010 at 7:21 AM I am trying to generate a tree from a FASTA input file, with the alignment done by MuscleCommandline
    import sys, os, subprocess
    from Bio import AlignIO
    from Bio.Align.Applications import MuscleCommandline
    cline = MuscleCommandline(input="c:\Python26\opuntia.fasta")
    child = subprocess.Popen(str(cline), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=(sys.platform!="win32"))
    align = AlignIO.read(child.stdout, "fasta")
    outfile = open('c:\Python26\opuntia.phy', 'w')
    AlignIO.write([align], outfile, 'phylip')
    outfile.close()
I always encounter these problems
    Traceback (most recent call last):
      File "", line 244, in run_nodebug
      File "C:\Python26\muscleIO.py", line 11, in
        align=AlignIO.read(child.stdout,"fasta")
      File "C:\Python26\Lib\site-packages\Bio\AlignIO\__init__.py", line 423, in read
        raise ValueError("No records found in handle")
    ValueError: No records found in handle
http://stackoverflow.com/questions/2856697/can-anyone-tell-me-why-these-lines-are-not-working -------------------------------------------------- ================================================== 2.
Subprocess fails to catch the standard output ================================================== May 18, 2010 at 7:21 AM I am trying to generate a tree from a FASTA input file, with the alignment done by MuscleCommandline
    import sys, os, subprocess
    from Bio import AlignIO
    from Bio.Align.Applications import MuscleCommandline
    cline = MuscleCommandline(input="c:\Python26\opuntia.fasta")
    child = subprocess.Popen(str(cline), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=(sys.platform!="win32"))
    align = AlignIO.read(child.stdout, "fasta")
    outfile = open('c:\Python26\opuntia.phy', 'w')
    AlignIO.write([align], outfile, 'phylip')
    outfile.close()
I always encounter these problems
    Traceback (most recent call last):
      File "", line 244, in run_nodebug
      File "C:\Python26\muscleIO.py", line 11, in
        align=AlignIO.read(child.stdout,"fasta")
      File "C:\Python26\Lib\site-packages\Bio\AlignIO\__init__.py", line 423, in read
        raise ValueError("No records found in handle")
    ValueError: No records found in handle
http://stackoverflow.com/questions/2856697/subprocess-fails-to-catch-the-standard-output -------------------------------------------------- =========================================================== Source: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311789/3e0b2a02a42e76a71e4f14abbbfad2f294f545ce/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com.
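One way to narrow down a "No records found in handle" error like the one in the feed item above is to check what the child process actually wrote before handing its output to a parser. The sketch below is an illustration under assumptions, not a diagnosis of the poster's setup: run_and_capture is a hypothetical helper, and a trivial Python child process stands in for the alignment tool so that the example runs without MUSCLE installed.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command, return its stdout as text, and fail loudly on problems."""
    child = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() reads both pipes to the end and waits for the process,
    # which avoids blocking on a full stderr pipe.
    out, err = child.communicate()
    if child.returncode != 0:
        raise RuntimeError("command failed (%d): %s"
                           % (child.returncode, err.decode(errors="replace")))
    if not out.strip():
        raise RuntimeError("command wrote nothing to stdout; stderr was: %s"
                           % err.decode(errors="replace"))
    return out.decode()


# Stand-in child process that prints a one-record FASTA file.
fasta = run_and_capture([sys.executable, "-c", "print('>seq1'); print('ACGT')"])
print(fasta)
```

If the command is missing or exits with an error, the message from stderr surfaces immediately instead of showing up later as an empty handle. The returned text could then be wrapped in a StringIO handle and passed on to a parser.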
230 Franklin Road Suite 814 Franklin, TN 37064 From biopython at maubp.freeserve.co.uk Wed May 19 03:52:30 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 19 May 2010 08:52:30 +0100 Subject: [Biopython-dev] Test failures on Jython 2.5.1 In-Reply-To: References: Message-ID: On Wed, May 19, 2010 at 12:26 AM, Eric Talevich wrote: > On Mon, May 17, 2010 at 12:37 AM, Peter wrote: > >> Hi all (especially Eric), >> >> I've just run the test suite with Jython 2.5.1 and this found some new >> problems. Most of these are XML related from Bio.Phylo >> >> ERROR: Round-trip parsing and serialization of apaf.xml. >> ExpatError: The element type "phy:clade" must be terminated by the >> matching end-tag "". >> >> ... >> > > Fixed in GitHub: > http://github.com/biopython/biopython/commit/ad1a618def838d98432e9623367cffb595eadecd > > I couldn't replicate this crash doing anything remotely normal in the Jython > interpreter, but the rewriting scheme in the PhyloXML unit test suite > crashed with various confusing tracebacks. The rewritten files were valid, > though, and could be read by Jython outside the test suite. > > I know Jython doesn't clean up file handles as diligently as CPython does, > so my best guess is that some file handles remained open or were > reused/resurrected while parsing the rewritten files -- i.e. during the > second parse, Jython's XML parser either started someplace other than the > start of the file, or terminated early/late, expecting the rewritten file to > have the same size as the original (which it doesn't because of collapsed > whitespace). My patch reworks the file rewriting scheme and manages file > handles obsessively; the PhyloXML parser itself stays the same.
> > -Eric Good work :) Peter From bugzilla-daemon at portal.open-bio.org Wed May 19 07:35:44 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 May 2010 07:35:44 -0400 Subject: [Biopython-dev] [Bug 3016] Change WriterTests in test_PhyloXML.py to use StringIO or temp files In-Reply-To: Message-ID: <201005191135.o4JBZilB013157@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3016 ------- Comment #2 from chapmanb at 50mail.com 2010-05-19 07:35 EST ------- Eric; Just a quick tip on mkstemp. When you do: DUMMY = tempfile.mkstemp()[1] you leave an open handle as the first argument of this tuple. It won't cause you any issues here, but is a problem if you have a long running server process. You will leak open file handles and eventually get an error about too many open files. See: http://www.logilab.org/blogentry/17873 http://vocamus.net/dave/?p=997 No problems here, but rather a heads up on a tricky bit of python I've run into too many times to count, Brad -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed May 19 11:11:31 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 May 2010 11:11:31 -0400 Subject: [Biopython-dev] [Bug 2964] placing x-axis of graph track at the bottom or top of the track in GenomeDiagram In-Reply-To: Message-ID: <201005191511.o4JFBVG2019800@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2964 ------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-19 11:11 EST ------- I've been looking at this myself recently (I'm drawing transcriptome read coverage data which means I have no negative values), and tried a few things. 
(In reply to comment #7 and #8) > > By allowing the position of the axis to take any value within the data range, > this still allows 'top', 'middle' and 'bottom' to be defined as functions of > the data with, e.g. > > x_axis_pos = min(data) # bottom > x_axis_pos = max(data) # top > x_axis_pos = median(data) # middle > > and also allows for explicit placing of the axis at specified points on the > y-axis, or as other points that depend on the data (e.g. mean, quartiles, > etc.) There is a problem with this - the x-axis and other bits like the scale are drawn by the Track object. This can contain multiple datasets, which can all be using their own coordinate systems. In specifying the x-axis position we can't therefore talk about max(data), min(data) or median(data) for the track as a whole. What we can do is talk about bottom/middle/top (or even a float between 0 and 1 to be more precise). This is quite easy but doesn't address the plotting side of things... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed May 19 11:21:48 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 May 2010 11:21:48 -0400 Subject: [Biopython-dev] [Bug 2964] placing x-axis of graph track at the bottom or top of the track in GenomeDiagram In-Reply-To: Message-ID: <201005191521.o4JFLmZQ020098@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2964 ------- Comment #14 from lpritc at scri.sari.ac.uk 2010-05-19 11:21 EST ------- (In reply to comment #13) > [...] the x-axis and other bits like the scale are > drawn by the Track object. This can contain multiple datasets, which can all > be using their own coordinate systems. 
In specifying the x-axis position we > can't therefore talk about max(data), min(data) or median(data) for the track > as a whole. What we can do is talk about bottom/middle/top (or even a float > between 0 and 1 to be more precise). This is quite easy but doesn't address > the plotting side of things... Fair point - something to look at in GD2. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed May 19 22:21:06 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 May 2010 22:21:06 -0400 Subject: [Biopython-dev] [Bug 2489] KDTree NN search without specifying radius In-Reply-To: Message-ID: <201005200221.o4K2L6gq006217@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2489 ------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2010-05-19 22:21 EST ------- I have trouble understanding the submitted code. Could you provide a patch instead of an updated complete file? Also, don't combine multiple issues in a patch. Your patch should only be to do KDTree NN searches without specifying radius. As far as I can tell from the description, this is not a major change to the code, so if you provide a patch to the C++ version we make the corresponding changes to the current C code. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu May 20 03:54:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 May 2010 08:54:38 +0100 Subject: [Biopython-dev] Ready for Biopython 1.54 release? 
In-Reply-To: References: Message-ID: Hi all, I've got an urgent bit of work to finish this week (a poster using GenomeDiagram - hence the minor bug fix recently committed), but hope to be able to do the release tomorrow after it has been printed. If I run out of time, I'm away all next week at a conference. I don't mind doing the release when I get back, but if anyone else wanted to volunteer that would be great. Thanks, Peter From biopython at maubp.freeserve.co.uk Thu May 20 12:28:25 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 May 2010 17:28:25 +0100 Subject: [Biopython-dev] Git "freeze" on master branch during 1.54 release Message-ID: Hi all, I'm going to start doing the Biopython 1.54 release now, so please don't check anything onto the trunk until further notice. [Working on other branches should be fine of course ;)] Thanks, Peter From bugzilla-daemon at portal.open-bio.org Thu May 20 12:39:23 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 20 May 2010 12:39:23 -0400 Subject: [Biopython-dev] [Bug 3016] Change WriterTests in test_PhyloXML.py to use StringIO or temp files In-Reply-To: Message-ID: <201005201639.o4KGdNeN002328@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3016 ------- Comment #3 from eric.talevich at gmail.com 2010-05-20 12:39 EST ------- (In reply to comment #2) > Eric; > Just a quick tip on mkstemp. When you do: > > DUMMY = tempfile.mkstemp()[1] > > you leave an open handle as the first argument of this tuple. It won't cause > you any issues here, but is a problem if you have a long running server > process. You will leak open file handles and eventually get an error about too > many open files. See: > > http://www.logilab.org/blogentry/17873 > http://vocamus.net/dave/?p=997 > > No problems here, but rather a heads up on a tricky bit of python I've run into > too many times to count, > Brad > Thanks! 
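The handle leak described in the quoted tip can be avoided by keeping both values that tempfile.mkstemp returns and closing the descriptor explicitly. A minimal sketch, assuming only the standard library; make_temp_path is a hypothetical helper name, not something taken from the test suite:

```python
import os
import tempfile


def make_temp_path(suffix=""):
    """Create an empty temp file and return its path with no handle left open."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # mkstemp returns an already-open OS-level descriptor
    return path


path = make_temp_path(".xml")
try:
    with open(path, "w") as handle:  # reopen by name whenever the file is needed
        handle.write("<phyloxml/>")
    with open(path) as handle:
        content = handle.read()
finally:
    os.remove(path)  # delete the temp file once the work is finished
print(content)
```

Unlike mktemp, this keeps the guarantee that the file was created atomically under a name nobody else could have claimed first.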
Instead of closing the stray file handle mkstemp generates, I used mktemp. As I understand it, the security issue mentioned in mktemp's docstring is if an attacker creates a symlink to an important, protected file using the same name mktemp chose for this test script. Then if this script is run as root, it would clobber that file even if the attacker didn't originally have permissions to modify that file. http://mail.python.org/pipermail/python-dev/2001-March/013507.html But the Biopython test suite isn't normally run as root, and in any case all of the test scripts reuse file names that aren't protected, which means everything has the same vulnerability. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu May 20 13:05:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 May 2010 18:05:53 +0100 Subject: [Biopython-dev] Git "freeze" on master branch during 1.54 release In-Reply-To: References: Message-ID: On Thu, May 20, 2010 at 5:28 PM, Peter wrote: > Hi all, > > I'm going to start doing the Biopython 1.54 release now, so please > don't check anything onto the trunk until further notice. > > [Working on other branches should be fine of course ;)] The archive and windows installers are done and are online. Please feel free to have a quick sanity test now - I'll be back in an hour or so to update the downloads page, send the release announcement etc. 
[Please consider the trunk still frozen for now] Thanks, Peter From biopython at maubp.freeserve.co.uk Thu May 20 15:07:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 May 2010 20:07:17 +0100 Subject: [Biopython-dev] Git "freeze" on master branch during 1.54 release In-Reply-To: References: Message-ID: On Thu, May 20, 2010 at 6:05 PM, Peter wrote: > On Thu, May 20, 2010 at 5:28 PM, Peter wrote: >> Hi all, >> >> I'm going to start doing the Biopython 1.54 release now, so please >> don't check anything onto the trunk until further notice. >> >> [Working on other branches should be fine of course ;)] > > The archive and windows installers are done and are online. > Please feel free to have a quick sanity test now - I'll be back > in an hour or so to update the downloads page, send the > release announcement etc. > > [Please consider the trunk still frozen for now] OK, news article from David updated and posted, downloads page updated. The email and API docs update are still to be done, and Brad - could you do the python package index update please? Ta, Peter From chapmanb at 50mail.com Thu May 20 15:33:13 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 20 May 2010 15:33:13 -0400 Subject: [Biopython-dev] Git "freeze" on master branch during 1.54 release In-Reply-To: References: Message-ID: <20100520193313.GF1054@sobchak.mgh.harvard.edu> Peter; > OK, news article from David updated and posted, downloads page updated. > The email and API docs update are still to be done, and Brad - could you do > the python package index update please? Done. Congrats and thanks for getting this done so quickly. 
Have fun at your meeting next week, Brad From biopython at maubp.freeserve.co.uk Thu May 20 17:59:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 May 2010 22:59:43 +0100 Subject: [Biopython-dev] Biopython 1.54 Message-ID: Dear Biopythoneers, Earlier today we released Biopython 1.54 (a little later than originally planned) which addresses a few bugs found in the beta release, has some changes to the new Bio.Phylo module, adds a whole chapter to the tutorial. Thank you to everyone who contributed code, reported bugs, etc. For more details please see this announcement (kindly drafted by David Winter): http://news.open-bio.org/news/2010/05/biopython-release-154/ Regards, Peter From updates at feedmyinbox.com Fri May 21 02:13:22 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Fri, 21 May 2010 02:13:22 -0400 Subject: [Biopython-dev] 5/21 BioStar - Biopython Questions Message-ID: ================================================== 1. PHYLIP (->prodist) Command line wrapper in biopython ================================================== May 20, 2010 at 5:57 AM Phylip has different applications for different phylogency purposes. Can anyone suggest me how to operate PHYLIP suppose(consense, dnaml,protdist) through commandline in biopython. Each applications has got its own different parameters, how can i handle them? http://biostar.stackexchange.com/questions/1123/phylip-prodist-command-line-wrapper-in-biopython -------------------------------------------------- =========================================================== Source: http://biostar.stackexchange.com/questions/tagged/biopython This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? 
Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com. 230 Franklin Road Suite 814 Franklin, TN 37064 From biopython at maubp.freeserve.co.uk Fri May 21 07:44:58 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 21 May 2010 12:44:58 +0100 Subject: [Biopython-dev] Git "freeze" on master branch during 1.54 release In-Reply-To: <20100520193313.GF1054@sobchak.mgh.harvard.edu> References: <20100520193313.GF1054@sobchak.mgh.harvard.edu> Message-ID: On Thu, May 20, 2010 at 8:33 PM, Brad Chapman wrote: > > Peter; > >> OK, news article from David updated and posted, downloads page updated. >> The email and API docs update are still to be done, and Brad - could you do >> the python package index update please? > > Done. Congrats and thanks for getting this done so quickly. Have fun > at your meeting next week, > > Brad Cheers Brad - we've been running the tests and just making small tweaks recently, so the release went very smoothly (the beta helped in that sense). Having the release notice already done by David was also a big help, so thanks David :) I sent the email announcement out last night, and I've just now updated the API docs with epydoc. And I bumped the version number to a plus on the trunk. I think that's it now - all done :) Everyone with commit rights - consider the trunk re-open for small changes. Ideally new features should be implemented on a branch before merging. Regards, Peter From updates at feedmyinbox.com Sun May 23 02:10:29 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Sun, 23 May 2010 02:10:29 -0400 Subject: [Biopython-dev] 5/23 Stack Overflow - Biopython questions Message-ID: <802d82a45525bfbe5a8834d227bd623c@74.63.51.88> ================================================== 1. 
50 sequences in one line ================================================== May 22, 2010 at 9:27 AM I have a multiple sequence alignment (Clustal) file and I want to read it and arrange the sequences so that the output is clearer and more orderly. I am doing this in Biopython using AlignIO. My code goes like this:
    from Bio import AlignIO

    alignment = AlignIO.read("opuntia.aln", "clustal")
    print("Number of rows: %i" % len(alignment))
    for record in alignment:
        print("%s - %s" % (record.id, record.seq))
My output -- http://i48.tinypic.com/ae48ew.jpg -- looks messy and needs a lot of scrolling. What I want to do is print only 50 characters of each sequence per line and continue until the end of the alignment file. I wish to have output like this http://i45.tinypic.com/4vh5rc.jpg from http://www.ebi.ac.uk/Tools/clustalw2/. Any suggestions, algorithms and sample code are appreciated. Thanks in advance. Br, http://stackoverflow.com/questions/2888257/50-sequences-in-one-line -------------------------------------------------- =========================================================== Source: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311789/3e0b2a02a42e76a71e4f14abbbfad2f294f545ce/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com.
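The layout asked for in the question above, blocks of 50 alignment columns with one row per sequence, can be sketched in plain Python. The block below is an illustration under assumptions: format_blocks is a hypothetical helper, the (identifier, sequence) pairs are made-up data standing in for the records of a parsed alignment, and the output only loosely imitates the ClustalW layout.

```python
def format_blocks(rows, width=50):
    """Return alignment rows wrapped into blocks of `width` columns."""
    length = max(len(seq) for _, seq in rows)
    name_width = max(len(name) for name, _ in rows)
    lines = []
    for start in range(0, length, width):
        for name, seq in rows:
            # Pad identifiers so the sequence chunks line up in columns.
            lines.append("%s  %s" % (name.ljust(name_width), seq[start:start + width]))
        lines.append("")  # blank line between blocks
    return "\n".join(lines).rstrip("\n")


# Invented example data: two sequences of 120 columns each.
rows = [("seq1", "ACGT" * 30), ("sample2", "TTGA" * 30)]
wrapped = format_blocks(rows)
print(wrapped)
```

With Biopython, the rows could presumably be built from the parsed alignment as `[(record.id, str(record.seq)) for record in alignment]`.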
230 Franklin Road Suite 814 Franklin, TN 37064 From jblanca at btc.upv.es Tue May 25 01:53:14 2010 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 25 May 2010 07:53:14 +0200 Subject: [Biopython-dev] [Biopython] SciPy paper: documenting statistical data structure design issues In-Reply-To: References: <809294.48600.qm@web62407.mail.re1.yahoo.com> Message-ID: <201005250753.14186.jblanca@btc.upv.es> Hi: On Tuesday 25 May 2010 07:03:19 Vincent Davis wrote: > Discussions on the pystatsmodels mailing list are I am sure relevant but it > might be more beneficial to discuss first on the biopython list as > sometimes the discussions get long and tend to be about economic type data. > The google group/mailing list is > http://groups.google.ca/group/pystatsmodels > I think a few good examples of a "typical" biopy data set and/or some of > the typical difficulties would be good to have on the wiki. This might help > start collaboration between statsmodels and biopython on this subject. I > think there are few people that cross over between economics and > bioinformatics. My main concern with the current tools is the memory issue. For instance when I try to create a distribution of sequence lengths or qualities using NGS data I end up with millions of numbers. That is too much for any reasonable computer. I've solved the problem by using disk caches that work as iterators. I'm sure that this is not the most performant solution. It's just a hack and I would like to use better tools for sure. If you want to take a look at my current solution go to: http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/itertools_.py http://github.com/JoseBlanca/franklin/blob/master/franklin/statistics.py Best regards, Jose Blanca > Also If you know of other groups that would be interested please share this > link/information. > > > Thanks, > > > > --Michiel.
> >
> > --- On Mon, 5/24/10, Vincent Davis wrote:
> > > From: Vincent Davis
> > > Subject: [Biopython] SciPy paper: documenting statistical data structure design issues
> > > To: "biopython"
> > > Date: Monday, May 24, 2010, 3:45 PM
> > >
> > > "see the message below, cross posted from pystatsmodels"
> > >
> > > We have been having some discussion on the pystatsmodels mailing list about data objects, numpy arrays... I think it would be valuable for some biopython users to contribute some comments, examples or ideas to the scipy wiki that has been set up for this. I think at the heart of this is that although almost anything can be done with a numpy array we run into many problems that are difficult to solve with the current tools for numpy arrays. Because of this I think some nice examples of the data design problems that you have faced in biopython and how they have been solved would be valuable.
> > >
> > > Thanks
> > > Vincent
> > >
> > > On Sat, May 22, 2010 at 7:22 PM, Wes McKinney wrote:
> > > > For my SciPy talk and paper in a little over a month, I was hoping to render a somewhat coherent discussion of the design needs of statistical data structures, based on my experience developing pandas for quant finance research. I think these broadly fall into a few categories: implementation ease, usability (for the non-developer IPython-based console user), performance, and flexibility. Hopefully this will be useful information that will help guide future development efforts. What do you folks think?
> > > >
> > > > As part of this, I was thinking maybe we should start a wiki page (or pages) somewhere to start listing out the various design issues (big and small) where people can write their opinions and we can have a structured discussion (e-mail is a bit hard for this sort of thing). I'd also like to spend some time reading through other people's code (e.g. all of the larry code) and writing down what I think about their design choices in a constructive way.
> > > >
> > > > Part of what prompted my idea for a wiki was reading some of the larry code and wanting to share my thoughts on various parts of it. Of course I'm also prepared for other people to attack (and for me to have to defend) my own code. For most of these things there isn't a "right" and "wrong" and I am only interested in having constructive discussions and hearing people's perspectives. Here's an example: in pandas when adding two different-labeled 2d arrays, the result has the *union* of all the labels. In la you get the intersection. There are certainly pros and cons for either approach (in my case I don't want to lose information, even if it's nulled out).
> > > >
> > > > We should also have a place where we document differences in performance for various operations. I spent a lot of time even before pandas was open-source obsessing over speed-- I'd like to think I learned a few things but I was operating in a bubble so I might have missed really obvious speedups. I also learned lots of odd things about NumPy (did you know fancy indexing is a LOT slower than ndarray.take?). We should probably establish some apples-to-apples performance benchmarks to help people decide what to use for their applications if speed matters.
> > > >
> > > > Best,
> > > > Wes
> > >
> > > *Vincent Davis
> > > 720-301-3003 *
> > > vincent at vincentdavis.net
> > > my blog | LinkedIn
> > > _______________________________________________
> > > Biopython mailing list - Biopython at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biopython
>
> *Vincent Davis
> 720-301-3003 *
> vincent at vincentdavis.net
> my blog | LinkedIn
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
-- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473)
From mailinglist.honeypot at gmail.com Tue May 25 09:52:26 2010 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Tue, 25 May 2010 09:52:26 -0400 Subject: [Biopython-dev] [Biopython] SciPy paper: documenting statistical data structure design issues In-Reply-To: <201005250753.14186.jblanca@btc.upv.es> References: <809294.48600.qm@web62407.mail.re1.yahoo.com> <201005250753.14186.jblanca@btc.upv.es> Message-ID: Hi, > My main concern with the current tools is the memory issue. For instance when > I try to create a distribution of sequence lengths or qualities using NGS > data I end up with millions of numbers. That is too much for any reasonable > computer. Several million numbers aren't all that much, though, right?
To simulate your example, I created a vector of 100,000,000 elements (which, depending on what type of NGS data you have, should be considered a large number of reads) representing faux read-lengths. It only takes up ~382 MB[1], and gathering basic statistics on it (variance, mean, histograms, etc.) isn't painful at all. Once you start adding more metadata to the 100,000,000 elements, I can see where you start running into problems, though.

> I've solved the problem by using disk caches that work as iterators. I'm sure that this is not the most performant solution. It's just a hack and I would like to use better tools for sure.

Have you tried looking at something like PyTables? Might be something to consider ...

Just a thought,
-steve

[1] I'm using R, which only uses 32-bit integers, but the language itself isn't really the point since we're all going to be running into a wall with respect to NGS-sized datasets.

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

From vincent at vincentdavis.net Tue May 25 15:19:35 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Tue, 25 May 2010 13:19:35 -0600
Subject: [Biopython-dev] [Biopython] SciPy paper: documenting statistical data structure design issues
In-Reply-To: <201005250753.14186.jblanca@btc.upv.es>
References: <809294.48600.qm@web62407.mail.re1.yahoo.com> <201005250753.14186.jblanca@btc.upv.es>
Message-ID:

On Mon, May 24, 2010 at 11:53 PM, Jose Blanca wrote:
> Hi:
>
> My main concern with the current tools is the memory issue. For instance when I try to create a distribution of sequence lengths or qualities using NGS data I end up with millions of numbers. That is too much for any reasonable computer. I've solved the problem by using disk caches that work as iterators.
> I'm sure that this is not the most performant solution.
> It's just a hack and I would like to use better tools for sure.
> If you want to take a look at my current solution go to:
>
> http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/itertools_.py
> http://github.com/JoseBlanca/franklin/blob/master/franklin/statistics.py

Please feel free to add some of the comments to the wiki. I also cross posted this to the StatsModels list as I thought it might be of interest to the list. Although I believe Steve Lianoglou's comments are correct, data set size is an issue in bio and only getting bigger.

> Best regards,
>
> Jose Blanca
>
> > Also If you know of other groups that would be interested please share this link/information.
> >
> > > Thanks,
> > > --Michiel.

*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
my blog | LinkedIn

From mjldehoon at yahoo.com Fri May 28 23:23:21 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 28 May 2010 20:23:21 -0700 (PDT)
Subject: [Biopython-dev] Blast parsers and records
Message-ID: <901919.44402.qm@web62402.mail.re1.yahoo.com>

Hi everybody,

With Biopython 1.54 out (thanks Peter!), and NCBI encouraging the use of its new Blast+ suite of Blast programs, maybe this is a good time to tackle some older bugs related to Blast output parsing in Biopython:

http://bugzilla.open-bio.org/show_bug.cgi?id=2176
(inconsistencies in the output of different Blast parsers)

http://bugzilla.open-bio.org/show_bug.cgi?id=2929
(inconsistencies between Psi-blast parsers)

http://bugzilla.open-bio.org/show_bug.cgi?id=2319
(parsing Blast table output)

and more generally think about the design of the Blast record class and Blast
parsing. In my opinion, these are the major issues:

1) Blast parsers are located in several modules (Bio.Blast.NCBIXML, Bio.Blast.NCBIStandalone, Bio.Blast.ParseBlastTable). I think we should have one read() function and one parse() function under Bio.Blast, with arguments specifying which format the Blast output is in.

2) Blast records produced by any of the parsers should be consistent with each other. As the XML output of blast and psi-blast follows the same DTD, we should be able to represent both by a single Record class.

3) Different parsers should store information in this Record class in the same way.

4) The current Blast record stores its information in attributes. If you use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the necessary DTDs to do so), the information is stored in dictionaries. This has some advantages. For example, it allows you to use record.keys() to find out what the record contains. Ideally, I think that a Blast Record class should inherit from a dictionary.

5) We should be able to print a Blast record object to generate output that is close to the plain-text output generated by blast. This would allow us to generate and store Blast output as XML, and to convert the output to plain-text to make it more human-readable.

6) The current Blast record inherits from Bio.Blast.Record.Header, Bio.Blast.Record.DatabaseReport, and Bio.Blast.Record.Parameters. I don't see the rationale for this inheritance, and I think we should remove it.

Any comments or suggestions (in particular about my proposal to have a Blast Record class that inherits from a dictionary)? Btw, to avoid breaking scripts, I propose that any changes to the Blast record and parser are implemented separately from the existing parsers and record, and to leave those untouched.

--Michiel.
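
As a rough illustration of points 1) and 4) above, the following sketch shows what a single parse()/read() pair with a format argument and a dictionary-based record might look like. It is not Biopython code: Record, parse, read, _PARSERS and the "tabular" format name are hypothetical, and the only format handled is the 12-column tab-separated BLAST table, assuming its default column order.

```python
import io


class Record(dict):
    """Hypothetical BLAST record that is also a dictionary (point 4)."""


# Assumed default column order of tab-separated BLAST output.
TABULAR_FIELDS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def _parse_tabular(handle):
    """Yield one Record per query, grouping consecutive hit lines."""
    current = None
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(TABULAR_FIELDS, line.split("\t")))
        if current is None or current["query"] != hit["qseqid"]:
            if current is not None:
                yield current
            current = Record(query=hit["qseqid"], hits=[])
        current["hits"].append(hit)
    if current is not None:
        yield current


# Hypothetical registry: one entry per supported output format.
_PARSERS = {"tabular": _parse_tabular}


def parse(handle, format):
    """Return an iterator over Records for output holding many queries."""
    try:
        parser = _PARSERS[format]
    except KeyError:
        raise ValueError("Unknown format %r" % format) from None
    return parser(handle)


def read(handle, format):
    """Return the single Record in the handle, or raise ValueError."""
    iterator = parse(handle, format)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found") from None
    try:
        next(iterator)
    except StopIteration:
        return first
    raise ValueError("More than one record found")


example = (
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-50\t370\n"
    "q1\ts2\t90.0\t180\t18\t0\t1\t180\t7\t186\t1e-30\t250\n"
)
record = read(io.StringIO(example), "tabular")
print(sorted(record.keys()))
```

Because the record is a dictionary, record.keys() lists what it contains, which is the advantage point 4) mentions; adding another format would only mean registering another parser function.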

From bugzilla-daemon at portal.open-bio.org Sat May 29 13:53:35 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 29 May 2010 13:53:35 -0400
Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id
In-Reply-To:
Message-ID: <201005291753.o4THrY4v013325@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2950

------- Comment #10 from eric.talevich at gmail.com 2010-05-29 13:53 EST -------
I've applied Konstanin's patch to a branch on GitHub:
http://github.com/etal/biopython/tree/pdbfixes

I'm going to apply some more small patches for the various PDB bugs here, so testers are welcome to try out/monitor this branch.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From lgautier at gmail.com Sat May 29 14:29:00 2010
From: lgautier at gmail.com (Laurent Gautier)
Date: Sat, 29 May 2010 20:29:00 +0200
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To:
References:
Message-ID: <4C015CEC.30908@gmail.com>

Hi,

Few thoughts below:

On 5/29/10 6:00 PM, biopython-dev-request at lists.open-bio.org wrote:
> Hi everybody,
>
> With Biopython 1.54 out (thanks Peter!), and NCBI encouraging the use of its new Blast+ suite of Blast programs, maybe this is a good time to tackle some older bugs related to Blast output parsing in Biopython:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 (inconsistencies in the output of different Blast parsers)
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2929 (inconsistencies between Psi-blast parsers)
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2319 (parsing Blast table output)
>
> and more generally think about the design of the Blast record class and Blast parsing. In my opinion, these are the major issues:
>
> 1) Blast parsers are located in several modules (Bio.Blast.NCBIXML, Bio.Blast.NCBIStandalone, Bio.Blast.ParseBlastTable). I think we should have one read() function and one parse() function under Bio.Blast, with arguments specifying which format the Blast output is in.

Having a factory function would be handy, but since the file formats differ having different classes to model them can be nice. Modularity is good, and what is known as duck-typing makes for an intuitive API.

What would you think of a design such as:

- module/package 'Blast'
- an abstract class 'Output' is defined in that module/package.
- classes '; each one of those classes defines a method 'read()' and 'parse()' (read() and parse() would formally be declared by an interface, and 'Output' would require their implementation).

> 2) Blast records produced by any of the parsers should be consistent with each other. As XML output by blast and psi-blast follow the same DTD, we should be able to represent both by a single Record class.

Definitely the case for XML - blast/psi-blast... however, the various formats (XML, others) may contain different levels of detail (I do not know for sure, just considering the possibility here).

> 3) Different parsers should store information in this Record class in the same way.

I'd see two options:

- either the same Record class is returned by all parsers, or
- a hierarchy of classes with common accessors and methods whenever possible (e.g., an abstract parent class (or interface) 'Blast.Record' with child classes 'Blast.XMLRecord', blahblahblah...)

> 4) The current Blast record stores its information in attributes. If you use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the necessary DTDs to do so), the information is stored in dictionaries. This has some advantages. For example, it allows you to use record.keys() to find out what the record contains. Ideally, I think that a Blast Record class should inherit from a dictionary.

Indeed. Attributes also have constraints regarding valid names that dictionaries do not have.

Still, there is no need to require strict inheritance from Python's dict; requiring the implementation of the interface (methods such as __getitem__(), __iter__(), iteritems(), keys(), etc...) might as well do. I am thinking of the cost of conversion here: there might be times when the only purpose is to loop through records and access only limited information (and in that case a custom class performing lazy access to information would be neat). Keeping it as an interface rather than expecting direct inheritance will give more freedom to implement it, while keeping compatibility with the rest of the code base.

> 5) We should be able to print a Blast record object to generate output that is close to the plain-text output generated by blast. This would allow us to generate and store Blast output as XML, and to convert the output to plain-text to make it more human-readable.
>
> 6) The current Blast record inherits from Bio.Blast.Record.Header, Bio.Blast.Record.DatabaseReport, and Bio.Blast.Record.Parameters. I don't see the rationale for this inheritance, and I think we should remove it.
>
> Any comments or suggestions (in particular about my proposal to have a Blast Record class that inherits from a dictionary)? Btw, to avoid breaking scripts, I propose that any changes to the Blast record and parser are implemented separately from the existing parsers and record, and to leave those untouched.
>
> --Michiel.
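
The interface-based alternative to dict inheritance raised in the reply above could be sketched as follows. This is only an illustration under stated assumptions: LazyRecord and its field names are hypothetical, the "record" is a single tab-separated line, and collections.abc.Mapping stands in for the interface that the reply leaves unspecified.

```python
from collections.abc import Mapping


class LazyRecord(Mapping):
    """Hypothetical record offering the dictionary interface without
    inheriting from dict; the raw text is split only on first access."""

    def __init__(self, raw_line, fields):
        self._raw = raw_line
        self._fields = tuple(fields)
        self._values = None  # filled in lazily by _load()

    def _load(self):
        # Pay the parsing cost once, and only if a value is requested.
        if self._values is None:
            parts = self._raw.rstrip("\n").split("\t")
            self._values = dict(zip(self._fields, parts))
        return self._values

    def __getitem__(self, key):
        return self._load()[key]

    def __iter__(self):
        # The key names are known up front, so iterating does not parse.
        return iter(self._fields)

    def __len__(self):
        return len(self._fields)


record = LazyRecord("q1\ts1\t1e-50\n", ("query", "subject", "evalue"))
print(list(record.keys()))   # keys() is supplied by the Mapping mixin
print(record["evalue"])      # first value access triggers the parsing
```

Mapping supplies keys(), items(), values(), get() and membership tests from the three methods defined here, so code written against the dictionary interface keeps working while the class stays free to defer the parsing.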

From bugzilla-daemon at portal.open-bio.org Sat May 29 15:31:56 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 29 May 2010 15:31:56 -0400
Subject: [Biopython-dev] [Bug 2879] missing __delitem__ in Bio.PDB.Entity.Entity
In-Reply-To:
Message-ID: <201005291931.o4TJVuIi015483@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2879

------- Comment #4 from eric.talevich at gmail.com 2010-05-29 15:31 EST -------
grep tells me that the only place "__delitem__" appears in Bio/PDB/ is in Chain.py: the definition of Chain.__delitem__, and the call to the nonexistent Entity.__delitem__.

The method would be required for this to work:

    from Bio.PDB import PDBParser

    struct = PDBParser().get_structure('asdf', 'ASDF.pdb')
    del struct[0]

This seems useful, since __getitem__ is already implemented, but I can't imagine it functioning any differently than detach_child.

Solution 1: Comment out Chain.__delitem__, since there's no way this ever worked for anybody.
( http://github.com/etal/biopython/commit/835b444df6b7f2c63b427535bc1c796c26ccce60 )

Solution 2: Implement Entity.__delitem__, essentially identical to Entity.detach_child. Look at the implementations of __getitem__ in the other subclasses of Entity to see if anything fancy needs to be done to support __delitem__ in each of them. (to do)

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Sat May 29 15:52:13 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 29 May 2010 15:52:13 -0400
Subject: [Biopython-dev] [Bug 2879] missing __delitem__ in Bio.PDB.Entity.Entity
In-Reply-To:
Message-ID: <201005291952.o4TJqDHP015930@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2879

------- Comment #5 from eric.talevich at gmail.com 2010-05-29 15:52 EST -------
(In reply to comment #4)
> Solution 2: Implement Entity.__delitem__, essentially identical to Entity.detach_child. Look at the implementations of __getitem__ in the other subclasses of Entity to see if anything fancy needs to be done to support __delitem__ in each of them.

Done on the same branch. Nothing fancy was needed in the other Entity subclasses.

NB: It looks like Entity supports some methods that are handled just as well by the appropriate magic methods and Python syntax:

    get_list => __iter__
    detach_child => __delitem__
    has_id => __contains__

We should eventually deprecate those methods, make some others non-public, and promote the use of properties and magic syntax instead, I think.
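
The mapping from explicit methods to magic methods described in comment #5 can be sketched with a small stand-in class. This is a simplified illustration, not the real Bio.PDB.Entity.Entity and not the code on the pdbfixes branch; only the method names get_list, detach_child and has_id are taken from the comment, and the storage layout is an assumption made for the example.

```python
class Entity:
    """Simplified stand-in for a container of child entities."""

    def __init__(self, id):
        self.id = id
        self.child_list = []   # children in insertion order
        self.child_dict = {}   # children keyed by their id

    def add(self, child):
        if child.id in self.child_dict:
            raise ValueError("%r defined twice" % (child.id,))
        self.child_list.append(child)
        self.child_dict[child.id] = child

    # Explicit methods named in the comment above.
    def get_list(self):
        return list(self.child_list)

    def has_id(self, id):
        return id in self.child_dict

    def detach_child(self, id):
        child = self.child_dict.pop(id)
        self.child_list.remove(child)

    # Magic-method equivalents, each delegating to the same storage.
    def __getitem__(self, id):
        return self.child_dict[id]

    def __delitem__(self, id):
        self.detach_child(id)

    def __iter__(self):
        return iter(self.child_list)

    def __contains__(self, id):
        return id in self.child_dict

    def __len__(self):
        return len(self.child_list)


structure = Entity("asdf")
structure.add(Entity(0))
structure.add(Entity(1))

del structure[0]                           # same effect as detach_child(0)
print(0 in structure)                      # same answer as has_id(0)
print([child.id for child in structure])   # same children as get_list()
```

With __delitem__ delegating to detach_child, the two spellings cannot drift apart, which matches the "essentially identical" wording of Solution 2.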

From bugzilla-daemon at portal.open-bio.org Sat May 29 17:03:31 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 29 May 2010 17:03:31 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling
In-Reply-To:
Message-ID: <201005292103.o4TL3Vij017627@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2948

------- Comment #7 from eric.talevich at gmail.com 2010-05-29 17:03 EST -------
I've applied the patch to my pdbfixes branch on GitHub:
http://github.com/etal/biopython/commit/cc9da03002ae90a3b8eedae69a8adae7216506b8

I also added a couple of unit tests so we know when further modifications change the way header info is parsed (which will be desirable at some point).

From sandford at ufl.edu Sat May 29 22:35:18 2010
From: sandford at ufl.edu (Michael Sandford)
Date: Sat, 29 May 2010 22:35:18 -0400
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <901919.44402.qm@web62402.mail.re1.yahoo.com>
References: <901919.44402.qm@web62402.mail.re1.yahoo.com>
Message-ID: <4C01CEE6.8030603@ufl.edu>

I've got a few comments as well:

> 4) The current Blast record stores its information in attributes. If you use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the necessary DTDs to do so), the information is stored in dictionaries. This has some advantages. For example, it allows you to use record.keys() to find out what the record contains. Ideally, I think that a Blast Record class should inherit from a dictionary.

The disadvantage I can immediately think of with this approach is that you lose the ability to have a heavyweight IDE give you intellisense on what fields are available. Many may say that intellisense is evil and/or a crutch and I won't really argue that. But Eclipse is pretty good at giving you options if you type in "variablename." and then it'll bring up a whole list of attributes and functions, and I find that handy. Moving to a dictionary-based approach will stop that.

Calling dir(variablename) will enable you to see not only the attributes available, but the functions as well. That may not be as elegant as iterating over keys in a dictionary but it is some measure of an alternative.

It seems to me that there is a fair amount of XML parsing that gets done in bioinformatics these days. I know that one of the goals of the project is minimal dependence on external libraries; however, I think that lxml ( http://codespeak.net/lxml/ ) might provide some rather substantial gains in terms of parsing code complexity reduction. I also think that the lxml/etree representation of parsed data is fairly reasonable.

Mike

From biopython at maubp.freeserve.co.uk Mon May 31 05:10:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 May 2010 10:10:43 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <901919.44402.qm@web62402.mail.re1.yahoo.com>
References: <901919.44402.qm@web62402.mail.re1.yahoo.com>
Message-ID:

On Sat, May 29, 2010 at 4:23 AM, Michiel de Hoon wrote:
> Hi everybody,
>
> With Biopython 1.54 out (thanks Peter!), and NCBI encouraging the use of its new Blast+ suite of Blast programs, maybe this is a good time to tackle some older bugs related to Blast output parsing in Biopython:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2176
> (inconsistencies in the output of different Blast parsers)
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2929
> (inconsistencies between Psi-blast parsers)
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2319
> (parsing Blast table output)
>
> and more generally think about the design of the Blast record class and Blast parsing. In my opinion, these are the major issues:
>
> 1) Blast parsers are located in several modules (Bio.Blast.NCBIXML, Bio.Blast.NCBIStandalone, Bio.Blast.ParseBlastTable). I think we should have one read() function and one parse() function under Bio.Blast, with arguments specifying which format the Blast output is in.

I see the point, but some of these parsers give very different output (your points 2 and 3).

> 2) Blast records produced by any of the parsers should be consistent with each other.

See also (3) below.

> As XML output by blast and psi-blast follow the same DTD, we should be able to represent both by a single Record class.

I think this was a short term hack by the NCBI - and rules out having a single XML file hold multiple PSI queries and their iterations.

> 3) Different parsers should store information in this Record class in the same way.

Where possible, yes, but different BLAST output formats contain different information - e.g. some contain the hit sequences while others do not.

> 4) The current Blast record stores its information in attributes. If you use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the necessary DTDs to do so), the information is stored in dictionaries. This has some advantages. For example, it allows you to use record.keys() to find out what the record contains. Ideally, I think that a Blast Record class should inherit from a dictionary.

As already pointed out, it has disadvantages too. With traditional attributes or properties you can use dir(record) and also set up docstrings for properties etc. I think they are clearer than dictionary keys.

I would look at a base BLAST record (covering the core information found in all formats including tabular) with subclasses for the richer output formats (default plain text and XML).
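
The attribute-based alternative in the reply above, a base record with subclasses for richer formats, might look roughly like this. The class and attribute names (BlastRecord, XMLBlastRecord, query, hits, best_hit, hit_sequences) are invented for illustration and do not correspond to any existing Biopython classes.

```python
class BlastRecord:
    """Hypothetical base record: information assumed to be present in
    every output format, including the tabular one."""

    def __init__(self, query, hits):
        self.query = query
        self.hits = list(hits)

    @property
    def best_hit(self):
        """The first hit listed, or None when there are no hits."""
        return self.hits[0] if self.hits else None


class XMLBlastRecord(BlastRecord):
    """Hypothetical richer record for formats that also carry the
    aligned hit sequences."""

    def __init__(self, query, hits, hit_sequences):
        super().__init__(query, hits)
        self.hit_sequences = dict(hit_sequences)


record = XMLBlastRecord("q1", ["s1", "s2"], {"s1": "ACGT", "s2": "ACGA"})

# Attributes and properties show up in dir() and can carry docstrings,
# the two advantages over dictionary keys named in the reply.
print([name for name in dir(record) if not name.startswith("_")])
print(BlastRecord.best_hit.__doc__)
```

Code that only needs the core information can accept any BlastRecord, while format-specific extras stay on the subclasses.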

> 5) We should be able to print a Blast record object to generate output that is close to the plain-text output generated by blast. This would allow us to generate and store Blast output as XML, and to convert the output to plain-text to make it more human-readable.

Nice - but that could make the str(record) output very long.

> 6) The current Blast record inherits from Bio.Blast.Record.Header, Bio.Blast.Record.DatabaseReport, and Bio.Blast.Record.Parameters. I don't see the rationale for this inheritance, and I think we should remove it.

I agree this is a rather odd design choice (even if the three sections did map onto three parts of the plain text output). We can probably do this without changing the exposed Blast record behaviour.

> Any comments or suggestions (in particular about my proposal to have a Blast Record class that inherits from a dictionary)? Btw, to avoid breaking scripts, I propose that any changes to the Blast record and parser are implemented separately from the existing parsers and record, and to leave those untouched.

Some of these suggestions like (5) and (6) could be done to the existing BLAST parsers and objects, and would seem a good idea.

Regarding the main proposal (1), I would be more interested in a more ambitious proposal along the lines of BioPerl's SearchIO covering not just BLAST but also FASTA, BLAT, HMMER and any other "pairwise searches" (and potentially we could share code for this with AlignIO for pairwise alignment formats). This is more work of course, and could come later.

http://www.bioperl.org/wiki/HOWTO:SearchIO

Peter

From sbassi at clubdelarazon.org Mon May 31 09:52:45 2010
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Mon, 31 May 2010 10:52:45 -0300
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <901919.44402.qm@web62402.mail.re1.yahoo.com>
References: <901919.44402.qm@web62402.mail.re1.yahoo.com>
Message-ID:

On Sat, May 29, 2010 at 12:23 AM, Michiel de Hoon wrote:
> 1) Blast parsers are located in several modules (Bio.Blast.NCBIXML, Bio.Blast.NCBIStandalone, Bio.Blast.ParseBlastTable). I think we should have one read() function and one parse() function under Bio.Blast, with arguments specifying which format the Blast output is in.
> ....

I would add another issue (7). The interface to run the BLAST search is different. The "classic" one will execute the search while the blast+ one "just" generates the command line, and it is up to the programmer to actually run it (and get the result back to the program).
From my POV, it is adding 3 lines in my code, but you didn't have to use subprocess before blast+, so I find this a little inconsistent.

From biopython at maubp.freeserve.co.uk Mon May 31 10:08:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 May 2010 15:08:01 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To:
References: <901919.44402.qm@web62402.mail.re1.yahoo.com>
Message-ID:

On Mon, May 31, 2010 at 2:52 PM, Sebastian Bassi wrote:
>
> I would add another issue (7). The interface to run the BLAST search is different. The "classic" one will execute the search while the blast+ one "just" generates the command line, and it is up to the programmer to actually run it (and get the result back to the program).
> From my POV, it is adding 3 lines in my code, but you didn't have to use subprocess before blast+, so I find this a little inconsistent.
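
The extra lines mentioned in the quoted message are not shown in the thread. As a rough sketch in present-day Python (subprocess.run with capture_output needs Python 3.7 or later), running a prepared command line and collecting its exit code and output could look like the following. run_command is a hypothetical helper, and the command is a stand-in that calls the Python interpreter rather than a BLAST+ program.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments and return
    (return code, stdout text, stderr text)."""
    # Passing a list avoids the platform-dependent shell=True behaviour;
    # capture_output collects both streams without risking a deadlock
    # on large output.
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode, completed.stdout, completed.stderr


# Stand-in command: the Python interpreter printing one line.
command = [sys.executable, "-c", "print('1 hit found')"]
returncode, stdout, stderr = run_command(command)
print(returncode, stdout.strip())
```

A command-line wrapper object would typically be turned into such an argument list, or into a command string, before being handed to a helper like this.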

This is a separate issue to BLAST parsing, but yes, I've been meaning to post a new thread regarding making this easier for the typical use cases. I have a cunning plan...

Peter

From biopython at maubp.freeserve.co.uk Mon May 31 10:53:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 May 2010 15:53:45 +0100
Subject: [Biopython-dev] More SeqRecord methods
Message-ID:

Hi all,

What do people think of adding upper and lower methods to the SeqRecord?
http://bugzilla.open-bio.org/show_bug.cgi?id=3054

If that is well received, how about adding another Seq method to the SeqRecord, the newish ungap method?
http://bugzilla.open-bio.org/show_bug.cgi?id=3060

Peter

From biopython at maubp.freeserve.co.uk Mon May 31 10:50:36 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 May 2010 15:50:36 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
Message-ID:

Hi all,

With the new command line wrappers and the tutorial pushing users towards using subprocess we've had more queries about how to use it. The subprocess module itself is rather scary I guess, and things could be made a lot easier. I think the most typical use cases are:

(1) Run the command, return the error code (integer)
(2) Run the command, return stdout, stderr and error code

In theory the function subprocess.call() would take care of the first example, but there is a cross platform annoyance here with the shell parameter. Also, if you want the output too things get even more tricky. It hasn't helped that there are a few platform specific quirks/bugs in subprocess itself (the different behaviour of the shell option on Windows, bug http://bugs.python.org/issue1124861 in old Pythons, the risk of deadlocks with large output files, etc).

A while ago while doing the Bio.Motif application wrappers Bartek suggested adding a run or execute method to the application wrapper. I wasn't so receptive at the time, but the utility of this has grown on me. However, adding methods could potentially clash with arbitrary parameter names. We could instead make the wrapper objects callable (define the magic method __call__) to offer this kind of functionality. This seems quite elegant to me.

I've just posted a possible such implementation for comment on this new branch,
http://github.com/peterjc/biopython/tree/app-exec

Thus far there is just one commit,
http://github.com/peterjc/biopython/commit/b53fb443e2153576509e159a1eac9da55124e41b

Is this a nice approach?

Assuming this is a well received idea, there are several details to discuss. First of all, if we are going to return the stdout and stderr would it be best as strings or wrapped as handles (using StringIO) which can be passed to a parser?

Peter

From eric.talevich at gmail.com Mon May 31 11:38:51 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 31 May 2010 11:38:51 -0400
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements
Message-ID:

Hi all,

This summer our GSoC student João Rodrigues will be implementing a number of enhancements to Biopython's structural biology modules. Since Bio.PDB is one of the most widely used parts of Biopython, I'd like to find a way to let João add major new features without breaking existing code and documentation.

There are a few issues I'd like to address:

1. The I/O conventions of parse/read/write/convert seem to work very well in SeqIO, AlignIO, Phylo, and other Biopython sub-packages. Bio.PDB supports I/O in several formats, but the API is lower-level and isn't unified in the same way (yet).

2. PDB headers seem to have become better structured in recent years, in both the wwPDB spec and submitted files. But header info isn't well integrated with the PDB Structure object, and parse_pdb_header needs some attention as well.

3. Kristian asked on this list a while ago about the proper location for his new code that works with RNA structures. While RCSB's PDB contains some RNA structures, the RNA world doesn't revolve around it. Similarly, João needs a place to put code for structure prediction/validation servers, command-line wrappers, secondary structures, etc.

I propose a new sub-package called Bio.Struct for these enhancements:

    from Bio import Struct

    mystruct = Struct.read("1MOT.pdb", "pdb")
    # Or, letting the format argument default to "pdb":
    mystruct = Struct.read("1MOT.pdb")
    # Eventually this will work too:
    Struct.convert("1MOT.pdb", "pdb", "1MOT.xml", "pdbxml")

    from Bio.Struct.Applications import DSSP
    # Like the other command-line wrappers
    # (I'm curious about Peter's cunning new scheme...)

    from Bio.Struct import WHATIF, Jpred
    # Servers each get their own module

    from Bio.Struct import RNA
    # Would this work for you, Kristian?

Alternatively, we could do all of this within the PDB module -- so picture the above examples with "PDB" in place of "Struct". This raises the chance of naming collisions, though, and doesn't solve issue #3 above.

We'll leave the existing PDB module layout alone, in general. I think it will be necessary to add a few more attributes to the Bio.PDB.Structure.Structure class, but we can do this without breaking compatibility. Since fewer people depend on the exact formatting of the Structure.header data (we believe), it's safer to change this dictionary, moving the more essential entries to a separate attribute, or whatever seems reasonable when we dig into it.

Comments?

Thanks,
Eric

From biopython at maubp.freeserve.co.uk Mon May 31 11:53:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 May 2010 16:53:31 +0100
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements
In-Reply-To:
References:
Message-ID:

On Mon, May 31, 2010 at 4:38 PM, Eric Talevich wrote:
> Hi all,
>
> This summer our GSoC student João Rodrigues will be implementing a number of enhancements to Biopython's structural biology modules.
Since Bio.PDB is one > of the most widely used parts of Biopython, I'd like to find a way to > let Jo?o add major new features without breaking existing code and > documentation. > > There are a few issues I'd like to address: > > 1. The I/O conventions of parse/read/write/convert seem to work very well in > SeqIO, AlignIO, Phylo, and other Biopython sub-packages. Bio.PDB supports > I/O in several formats, but the API is lower-level and isn't unified in the > same way (yet). Currently Bio.PDB supports the plain text PDB format, and has partial support for mmCIF. It lacks support for the XML PDB format, PDBML - Protein Data Bank Markup Language. Under this proposed scheme, what would you see as the basic record type (analogous to a SeqRecord, alignment or tree in Bio.SeqIO, Bio.AlignIO and Bio.Phylo)? It would be nice to say a protein chain, but there is the issue of multiple models (e.g. from NMR). I presume you'd go with the model as the basic unit (where each model may contain multiple chains). > 2. PDB headers seem to have become better structured in recent years, in > both the wwPDB spec and submitted files. But header info isn't well > integrated with PDB Structure object, and parse_pdb_header needs some > attention as well. Agreed. > 3. Kristian asked on this list awhile ago about the proper location for his > new code that works with RNA structures. While RCSB's PDB contains some RNA > structures, the RNA world doesn't revolve around it. Similarly, Jo?o needs a > place to put code for structure prediction/validation servers, command-line > wrappers, secondary structures, etc. 
> > > I propose a new sub-package called Bio.Struct for these enhancements: > > from Bio import Struct > mystruct = Struct.read("1MOT.pdb", "pdb") > # Or, letting the format argument default to "pdb": > mystruct = Struct.read("1MOT.pdb") > # Eventually this will work too: > Struct.convert("1MOT.pdb", "pdb", "1MOT.xml", "pdbxml") I'd probably go with "pdbml" rather than "pdbxml" since that seems to be what the PDB themselves call it: http://www.pdb.org/pdb/static.do?p=file_formats/index.jsp > from Bio.Struct.Applications import DSSP > # Like the other command-line wrappers > # (I'm curious about Peter's cunning new scheme...) See: http://lists.open-bio.org/pipermail/biopython-dev/2010-May/007773.html > from Bio.Struct import WHATIF, Jpred > # Servers each get their own module Hmm - perhaps we may need have another level here, Bio.Struct.Servers or Bio.Struct.WWW or something. How many of these do you expect? > from Bio.Struct import RNA > # Would this work for you, Kristian? > > > Alternatively, we could do all of this within the PDB module -- so picture > the above examples with "PDB" in place of "Struct". This raises the chance > of naming collisions, though, and doesn't solve issue #3 above. > > > We'll leave the existing PDB module layout alone, in general. I think it > will be necessary to add a few more attributes to the > Bio.PDB.Structure.Structure class, but we can do this without breaking > compatibility. Since fewer people depend on the exact formatting of the > Structure.header data (we believe), it's safer to change this dictionary, > moving the more essential entries to a separate attribute, or whatever seems > reasonable when we dig into it. > > Comments? I don't want us to break backwards compatibility in Bio.PDB (given how widely used it seems to be based on citations at least), but would like us to continue making small fixes or enhancements to it. Therefore a new Bio.Struct module may be the safer option. 
Peter

From rodrigo_faccioli at uol.com.br Mon May 31 13:51:10 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Mon, 31 May 2010 14:51:10 -0300
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements
In-Reply-To:
References:
Message-ID:

Hi,

I would like to comment on some of these ideas.

Firstly, I suggest keeping the getStructure command. Its goal is to load the whole structure (models, chains, ATOM, HETATM, etc.). So the getStructure command is executed first:

structure = getStructure(id)

Afterwards, users can work with the result as they need. Below I try to show some specific examples.

The structure object contains the whole loaded structure, including its errors. The command could look like:

structure.get_StructureErrors().getStructureErrors()

This command returns a dictionary containing all the errors of the structure. A complete example is in [1]. One idea: this dictionary could be created by the WHATIF module.

Another example concerns the convert command. It could take more options, such as model and chain, so it could be called as:

convert(structure, SelectedModels, SelectedChains, "1MOT.xml", "pdbxml")

When SelectedModels and SelectedChains are None, all the models and all the chains of the protein, respectively, are considered.

Along these lines we have developed a new Bio.PDB.Parser methodology. Please read the loadStructureFromFile function in [2]. This new methodology is an alternative developed by my research group. With it we have worked with PDB files and with our database using a single parser. The example only shows how to work with a PDB file.

I hope this mail contributes something. Sorry for my English mistakes.
[1] http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/scripts/check_structure.py [2] http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/fcfrp/PDBParser.py Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 2010/5/31 Peter > On Mon, May 31, 2010 at 4:38 PM, Eric Talevich > wrote: > > Hi all, > > > > This summer our GSoC student Jo?o Rodrigues will be implementing a number > of > > enhancements to Biopython's structural biology modules. Since Bio.PDB is > one > > of the most widely used parts of Biopython, I'd like to find a way to > > let Jo?o add major new features without breaking existing code and > > documentation. > > > > There are a few issues I'd like to address: > > > > 1. The I/O conventions of parse/read/write/convert seem to work very well > in > > SeqIO, AlignIO, Phylo, and other Biopython sub-packages. Bio.PDB supports > > I/O in several formats, but the API is lower-level and isn't unified in > the > > same way (yet). > > Currently Bio.PDB supports the plain text PDB format, and has partial > support for mmCIF. It lacks support for the XML PDB format, PDBML - > Protein Data Bank Markup Language. > > Under this proposed scheme, what would you see as the basic record type > (analogous to a SeqRecord, alignment or tree in Bio.SeqIO, Bio.AlignIO and > Bio.Phylo)? It would be nice to say a protein chain, but there is the issue > of > multiple models (e.g. from NMR). I presume you'd go with the model as the > basic unit (where each model may contain multiple chains). > > > 2. 
PDB headers seem to have become better structured in recent years, in > > both the wwPDB spec and submitted files. But header info isn't well > > integrated with PDB Structure object, and parse_pdb_header needs some > > attention as well. > > Agreed. > > > 3. Kristian asked on this list awhile ago about the proper location for > his > > new code that works with RNA structures. While RCSB's PDB contains some > RNA > > structures, the RNA world doesn't revolve around it. Similarly, Jo?o > needs a > > place to put code for structure prediction/validation servers, > command-line > > wrappers, secondary structures, etc. > > > > > > I propose a new sub-package called Bio.Struct for these enhancements: > > > > from Bio import Struct > > mystruct = Struct.read("1MOT.pdb", "pdb") > > # Or, letting the format argument default to "pdb": > > mystruct = Struct.read("1MOT.pdb") > > # Eventually this will work too: > > Struct.convert("1MOT.pdb", "pdb", "1MOT.xml", "pdbxml") > > I'd probably go with "pdbml" rather than "pdbxml" since that seems to be > what the PDB themselves call it: > http://www.pdb.org/pdb/static.do?p=file_formats/index.jsp > > > from Bio.Struct.Applications import DSSP > > # Like the other command-line wrappers > > # (I'm curious about Peter's cunning new scheme...) > > See: > http://lists.open-bio.org/pipermail/biopython-dev/2010-May/007773.html > > > from Bio.Struct import WHATIF, Jpred > > # Servers each get their own module > > Hmm - perhaps we may need have another level here, Bio.Struct.Servers > or Bio.Struct.WWW or something. How many of these do you expect? > > > from Bio.Struct import RNA > > # Would this work for you, Kristian? > > > > > > Alternatively, we could do all of this within the PDB module -- so > picture > > the above examples with "PDB" in place of "Struct". This raises the > chance > > of naming collisions, though, and doesn't solve issue #3 above. > > > > > > We'll leave the existing PDB module layout alone, in general. 
I think it > > will be necessary to add a few more attributes to the > > Bio.PDB.Structure.Structure class, but we can do this without breaking > > compatibility. Since fewer people depend on the exact formatting of the > > Structure.header data (we believe), it's safer to change this dictionary, > > moving the more essential entries to a separate attribute, or whatever > seems > > reasonable when we dig into it. > > > > Comments? > > I don't want us to break backwards compatibility in Bio.PDB (given how > widely used it seems to be based on citations at least), but would like > us to continue making small fixes or enhancements to it. Therefore a > new Bio.Struct module may be the safer option. > > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From jblanca at btc.upv.es Mon May 31 14:56:46 2010 From: jblanca at btc.upv.es (Blanca Postigo Jose Miguel) Date: Mon, 31 May 2010 20:56:46 +0200 Subject: [Biopython-dev] Blast parsers and records Message-ID: <1275332206.4c04066ed4ec5@webmail.upv.es> Mensaje citado por Michael Sandford : > I've got a few comments as well: > > 4) The current Blast record stores its information in attributes. If you > use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the > necessary DTDs to do so), the information is stored in dictionaries. This has > some advantages. For example, it allows you to use record.keys() to find out > what the record contains. Ideally, I think that a Blast Record class should > inherit from a dictionary. I've developed for my own use a dict structure that represents a blast result. This structure also can represent many other results, like exonerate, SSAHA or any other number of aligners. Having a common representations for all of them allows you to create common filters that work with the same interface. 
I don't know if it is very efficient, but it has proven to be very convenient for us. You can take a look at:
http://github.com/JoseBlanca/franklin/blob/master/franklin/alignment_search_result.py

Best regards,

Jose Blanca

----- End of forwarded message -----

--

From eric.talevich at gmail.com Mon May 31 23:44:11 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 31 May 2010 23:44:11 -0400
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements
In-Reply-To:
References:
Message-ID:

On Mon, May 31, 2010 at 11:53 AM, Peter wrote:
> On Mon, May 31, 2010 at 4:38 PM, Eric Talevich wrote:
> > Hi all,
> >
> > This summer our GSoC student João Rodrigues will be implementing a number of
> > enhancements to Biopython's structural biology modules. Since Bio.PDB is one
> > of the most widely used parts of Biopython, I'd like to find a way to
> > let João add major new features without breaking existing code and
> > documentation.
> >
> > There are a few issues I'd like to address:
> >
> > 1. The I/O conventions of parse/read/write/convert seem to work very well in
> > SeqIO, AlignIO, Phylo, and other Biopython sub-packages. Bio.PDB supports
> > I/O in several formats, but the API is lower-level and isn't unified in the
> > same way (yet).
>
> Currently Bio.PDB supports the plain text PDB format, and has partial
> support for mmCIF. It lacks support for the XML PDB format, PDBML -
> Protein Data Bank Markup Language.
>

Yeah, it would be good to implement that at some point. For now, I'd be happy to be able to read and write PDB files with a single function call each, and design the I/O wrapper for easy extension to mmCIF and PDBML.

> Under this proposed scheme, what would you see as the basic record type
> (analogous to a SeqRecord, alignment or tree in Bio.SeqIO, Bio.AlignIO and
> Bio.Phylo)? It would be nice to say a protein chain, but there is the issue of
> multiple models (e.g. from NMR).
I presume you'd go with the model as the > basic unit (where each model may contain multiple chains). > I'd consider a structure to be the basic unit of I/O. If we're going to make better use of header info, that's generally associated with the whole structure and not individual models -- we'd have to duplicate the header info in each Model object emitted, which would be weird. Are there any formats that store more than one structure in a file? If not, then there's probably no need for a parse() function in Bio.Struct. > > from Bio.Struct import WHATIF, Jpred > > # Servers each get their own module > > Hmm - perhaps we may need have another level here, Bio.Struct.Servers > or Bio.Struct.WWW or something. How many of these do you expect? > Jo?o's project plan includes Dali and WHATIF: http://biopython.org/wiki/GSOC2010_Joao These servers do different things so I wouldn't expect any similarity in the code between them. There are lots of servers that we *could* support... Aesthetically, a Servers or WWW subdirectory would match Bio.Struct.Applications and make the whole package a little more self-documenting. Here's one more idea: Fetching a single PDB file from RCSB requires a separate import and a couple of calls. Should we make this even easier by mimicking the efetch function in Bio.Entrez, something like >>> handle = Bio.PDB.fetch("1MOT") or >>> from Bio.Struct.WWW import RCSB >>> handle = RCSB.fetch("1MOT", "pdb") ? -Eric From jblanca at btc.upv.es Tue May 4 13:31:27 2010 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 4 May 2010 15:31:27 +0200 Subject: [Biopython-dev] [Biopython] ngs_backbone In-Reply-To: References: <201005031237.54249.jblanca@btc.upv.es> Message-ID: <201005041531.27857.jblanca@btc.upv.es> Hi Peter: On Tuesday 04 May 2010 11:13:05 Peter wrote: > On Mon, May 3, 2010 at 11:37 AM, Jose Blanca wrote: > > Hi: > > > > As in many other labs we are working with NGS sequences. 
We work mostly > > in non model plants and we were repeating the same analyses for different > > projects: sequence cleaning, mapping to a reference, annotation and SNV > > calling and filtering. To solve the problem we have developed a software > > named ngs_backbone. We use this software and we think that it might be of > > some use to the biopython community. To take a look at it you can go to > > http://bioinf.comav.upv.es/ngs_backbone/index.html > > > > This software is build on top of biopython. > > > > If the biopython developers think that some part of this software could > > be added to biopython we would be glad to do it. We are aware of the > > different licences used by both projects, but we could relicence the > > required parts to solve that. > > > > Best regards, > > Hi Jose, > > This sounds very interesting. Are there any bits of low level functionality > you think would be particularly suitable for including in Biopython? That I don't know. Most of the package is of higher level, but maybe there's something. > I've just had a quick look at your function _seqs_in_file_with_bio in > http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/readers.py > Would be it be simpler to do FASTA+QUAL parsing using > Bio.SeqIO.PairedFastaQualIterator? I going to look into that, it seems a good tip, thanks Peter. > I see you have a copy of our (private) function Bio.Seq._maketrans() here: > http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/seqs.py > Would it be useful to have this as a public API in Biopython? I dont' think so. We copied the function because of the self.__class__ problem that we discussed some time ago. The complement method or our Seq should return our class and not the Biopython one, that's why we have duplicated this method. -- Jose M. 
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From bugzilla-daemon at portal.open-bio.org Wed May 5 12:29:30 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 May 2010 08:29:30 -0400 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: Message-ID: <201005051229.o45CTUaJ019366@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2905 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- OS/Version|Mac OS |All Version|1.51 |Not Applicable ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-05 08:29 EST ------- I've recently started looking at parsing SAM and BAM files. These files just contain the reads - they do not include the reference sequence, that is usually kept in a separate FASTA file. I therefore think it would make sense to parse each read as a SeqRecord in Bio.SeqIO. The SAM format is basically tab separated plain text. Parsing it is straight forward, the complication is turning this into a suitable SeqRecord object. The BAM format can be decompressed in Python using the gzip library (built in), and decoded with the struct library (also built in - we already use this for parsing the binary SFF file format). i.e. This is fiarly straightforward to do in pure python - without any dependence on the samtools C library, an alternative approach which is how pysam works. See http://code.google.com/p/pysam/ Extracting just the read name, sequence, and PHRED quality scores when building the SeqRecord objects is sufficient to implement SAM/BAM to FASTQ/FASTA/QUAL conversion with Bio.SeqIO. 
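As a rough illustration of the point that SAM is tab separated plain text, the sketch below pulls the read name, sequence and PHRED scores out of alignment lines using only the standard library. The function names and the example line are made up for illustration and are not the attached parser. The column positions (QNAME first, SEQ tenth, QUAL eleventh) and the offset of 33 for QUAL follow the SAM specification. The FLAG field is ignored, so reads mapped to the reverse strand come back as stored rather than reverse complemented.

```python
import io


def parse_sam_read(line):
    """Return (name, sequence, phred_scores) for one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("Expected at least 11 tab separated SAM fields")
    name, seq, qual = fields[0], fields[9], fields[10]
    if qual == "*":
        # The SAM specification uses "*" when no qualities are stored.
        scores = None
    else:
        scores = [ord(letter) - 33 for letter in qual]
    return name, seq, scores


def iter_sam_reads(handle):
    """Yield (name, sequence, phred_scores) for each read in a SAM handle."""
    for line in handle:
        # Header lines start with "@"; skip them and any blank lines.
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_read(line)


if __name__ == "__main__":
    example = "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n"
    handle = io.StringIO("@HD\tVN:1.0\n" + example)
    for name, seq, scores in iter_sam_reads(handle):
        print(name, seq, scores)  # read1 ACGT [40, 40, 40, 40]
```

These three values are what a SAM/BAM to FASTQ/FASTA/QUAL conversion needs; everything else on the line is left untouched here.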
The harder part will be deciding how to represent all the other annotation information for each read... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed May 5 12:57:11 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 May 2010 08:57:11 -0400 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: Message-ID: <201005051257.o45CvBtP020884@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2905 ------- Comment #3 from chapmanb at 50mail.com 2010-05-05 08:57 EST ------- I'd really like to see our support for this re-use the work in the pysam project. Agreed that both a pure Python implementation of BAM parsing and Biopython-interoperable objects are useful, and we should either contribute it as part of pysam or consider discussing a closer collaboration with the pysam authors. Biopython should be taking the lead on encouraging better interoperability with other projects. pysam is useful to me in my work right now, and we should support that effort. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed May 5 13:37:54 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 May 2010 09:37:54 -0400 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: Message-ID: <201005051337.o45Dbs0P022560@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2905 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-05 09:37 EST ------- Created an attachment (id=1498) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1498&action=view) Basic SAM/BAM parser for Bio.SeqIO This file would go in Bio/SeqIO/SamBamIO.py with the usual additions to file Bio/SeqIO/__init__.py to define the "sam" and "bam" format names plus that "bam" is a binary file format. There are docstring unit tests using ex1.sam and ex1.bam borrowed from the pysam project. (In reply to comment #3) > I'd really like to see our support for this re-use the work in the pysam > project. Agreed that both a pure Python implementation of BAM parsing and > Biopython-interoperable objects are useful, and we should either contribute it > as part of pysam or consider discussing a closer collaboration with the pysam > authors. > > Biopython should be taking the lead on encouraging better interoperability > with other projects. pysam is useful to me in my work right now, and we > should support that effort. Hi Brad, What I was (for now) focussing on was SAM/BAM parser support in Bio.SeqIO, which is really quite narrow in scope. It is also quite simple - I have attached a proof of principle implementation to this bug. The gzip/struct code to interpret the BAM fields is pretty straight forward (having done a lot of similar work on the SFF support helped). The only challenging bit is turning the data into a SeqRecord (and this part seems irrelevant to pysam). Going beyond basic access to the reads, the next step up is working on the alignment data structure - e.g. 
extracting columns to look at SNPs. Here there are a lot of neat things like indexing schemes etc where the SAMtools API (and thus pysam) is probably a sensible choice. You'll notice in the draft module docstring I've suggested this (and this wasn't prompted by your comment either - grin). On the licence side, pysam and SAMtools both use the MIT licence, so no problems there. Regarding dependencies and cross platform support, pysam is a lightweight wrapper of the samtools C-API, using pyrex. If we want to use pysam in Biopython that means build time dependencies on samtools and pyrex. This won't work under Jython, and at the time of writing pysam doesn't appear to support Windows either. So I'm not so comfortable about this. It would be interesting to see if pysam could have a pure python back end as an alternative to calling the SAMtools C API (and I'm happy for any of my code to be used for that - but this would have to cover far more than just parsing). That would allow pysam under Jython, and might help on Windows too. So in the short term, I don't seem any overlap between SAM/BAM support in Bio.SeqIO and the pysam project. In the medium/long term, working with the cigar strings and of course the alignments rather than just the reads, then yes absolutely - some level of discussion or collaboration would be sensible and desirable. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed May 5 15:13:43 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 May 2010 11:13:43 -0400 Subject: [Biopython-dev] [Bug 3071] EMBL parser does not parse RP lines correctly. 
In-Reply-To: Message-ID: <201005051513.o45FDhhe026936@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3071 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-05 11:13 EST ------- Fixed and unit tests updated to cover this: http://github.com/biopython/biopython/commit/c53eb60956f52ada0116c6b0045e0a1d16cb1de8 Thanks! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed May 5 16:22:09 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 May 2010 12:22:09 -0400 Subject: [Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records In-Reply-To: Message-ID: <201005051622.o45GM9GP030954@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-05 12:22 EST ------- (In reply to comment #8) > (In reply to comment #7) > > > However, if the only out-of-specification thing in the IMGT EMBL files is > > the feature indentation and long feature keys, many your original request > > to make the EMBL parser more tolerant is the best route. > > I think it will actually be a headache to do so. Unless you want to rewrite > the EMBL parser the way that I wrote the IMGT parser. The only thing that > needed changing was handling the header lines. Once it finds an FH line, it > uses the position of the "Location..." string to determine how indented the > qualifiers are. Hi Uri, Could you retest as "embl" format with the trunk? 
I would expect some warnings from these over indented features in IMGT, and we can certainly remove the warning if we decide not to introduce a separate IMGT format variant. http://github.com/biopython/biopython/commit/e6ba962dd60fe585baa1237445d33f67d47dd57f This change takes a slightly different approach to your work on github, but is quite similar to your two line patch - but this should still work with another odd form: FH Key Location/Qualifiers FT L-V-D-J-C-SEQUEN1..1151 FT /db_xref="taxon:32630" FT /organism="synthetic construct" FT 5'UTR 1..37 ... In the above example (generated by Biopython itself), the strict EMBL column limits have been obeyed but the feature key has been truncated to just L-V-D-J-C-SEQUEN rather than L-V-D-J-C-SEQUENCE. This is a related query - when asked to output such a feature as EMBL or GenBank format, should we raise an exception here? We could add a warning instead, and either leave the code as is, or output this: FH Key Location/Qualifiers FT L-V-D-J-C-SEQUE 1..1151 FT /db_xref="taxon:32630" FT /organism="synthetic construct" FT 5'UTR 1..37 ... > > Thinking ahead would you also want to be able to write out IMGT variant > > EMBL files? > > I personally don't need this functionality, but I am willing to write it to > complement the IMGT parser that I wrote. If we go done the route of formalising IMGT as an EMBL variant with a different feature indent, it should just be a trivial subclass of the existing EMBL writer object but with the indentation constant changed. Note there are other problem in the IMGT data, including locations like "1..428>" and "<1..328>" where the greater than should be BEFORE the location (but we could probably cope with this all the same), and just "1." where half the location is missing (which we can't really do much with other than treat it as simply "1" instead?). 
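The warn-and-truncate option for over-long feature keys raised above can be pictured with the small sketch below. It is only an illustration, not the Biopython writer: the function name embl_feature_line is hypothetical, and the layout ("FT", three spaces, a key field of 15 characters, one space, then the location) is an assumption based on the fixed column EMBL feature table.

```python
import warnings

MAX_KEY_LENGTH = 15  # assumed width of the EMBL feature key field


def embl_feature_line(key, location):
    """Format the first line of an EMBL feature, truncating long keys."""
    if len(key) > MAX_KEY_LENGTH:
        short_key = key[:MAX_KEY_LENGTH]
        # Warn rather than raise, so that writing can continue.
        warnings.warn(
            "Feature key {0!r} truncated to {1!r}".format(key, short_key)
        )
        key = short_key
    # Pad the key to the full field width so the location stays aligned.
    return "FT   {0} {1}".format(key.ljust(MAX_KEY_LENGTH), location)


if __name__ == "__main__":
    print(embl_feature_line("5'UTR", "1..37"))
    print(embl_feature_line("L-V-D-J-C-SEQUENCE", "1..1151"))
```

Raising an exception instead would be a one line change at the warnings.warn call; which behaviour is preferable is exactly the open question in the comment.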
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From updates at feedmyinbox.com Thu May 6 06:08:34 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Thu, 6 May 2010 02:08:34 -0400 Subject: [Biopython-dev] 5/6 BioStar - Biopython Questions Message-ID: ================================================== 1. Simple Fasta Parsing Is Not Simple. ================================================== May 5, 2010 at 6:25 PM When trying out the examples from chapter 2.3 of the biopython 1.54b tutorial I keep running into this very annoying problem: When I use: from Bio import SeqIO for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"): print seq_record.id print repr(seq_record.seq) print len(seq_record) The Python interpreter tells me I should use a handle in stead of filenames. I use biopython 1.53 and not 1.54b for which this tutorial was meant. I can't find an older archived version of the tutorial I could use to help me learn how to use a handle in stead of the simple filename method (ls_orchid.fasta in this case.). I know that in older versions of biopython you have to do everything in handlers but not in the newer versions. The new tutorial now on-line has an obscure last chapter on how to use handlers but that's hardly helpfull. I think the tutorial is great, I just havent got the right version installed. I use Ubuntu and downloaded biopython via ubuntu software center. Can anyone help me use a handler and get the above parsing example working in 1.53? Thank you, http://biostar.stackexchange.com/questions/969/simple-fasta-parsing-is-not-simple -------------------------------------------------- ================================================== 2. How do I create a SeqRecord in biopython? 
================================================== May 5, 2010 at 6:25 PM id1 ="HWI-EAS380:8:1:16:830/1" seq1 ="AGGGCGTTCAGCAGCCAGCTTGCGGCAAAACTGCGTAACCGTCTTCTCGTTCTCT AAAAACCATTTTTCGTCCCCTTCGGGGCGGTGGTCTATAGTGTTATTAATATCAA GTTGGGGGAGCACATTGTAGCATTG" qual1="abbbbbaab`abaabbaabaaaab^E^``^aaabaa_\_abaaaaaaaa`aaaa` Z^`^^aaaaaaa`aa^aaa``_aa_aaaaaaaaaaa`aaaa`aaaaaabaaabba aaaaaaaaaaa_baaaabbbbbaba" assume these are fastq-illumina quality scores and the sequence is unambiguous dna I want to create a SeqRecord object from this. Thanks. http://biostar.stackexchange.com/questions/967/how-do-i-create-a-seqrecord-in-biopython -------------------------------------------------- =========================================================== Source: http://biostar.stackexchange.com/questions/tagged/biopython This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com. 230 Franklin Road Suite 814 Franklin, TN 37064 From dalloliogm at gmail.com Thu May 6 08:36:51 2010 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 6 May 2010 10:36:51 +0200 Subject: [Biopython-dev] 5/6 BioStar - Biopython Questions In-Reply-To: References: Message-ID: On Thu, May 6, 2010 at 8:08 AM, Feed My Inbox wrote: > ================================================== > 1. Simple Fasta Parsing Is Not Simple. 
> ==================================================
> May 5, 2010 at 6:25 PM
>
> When trying out the examples from chapter 2.3 of the biopython 1.54b tutorial I keep running into this very annoying problem: When I use:

This user is complaining that the current Biopython tutorial
describes a feature introduced in Biopython 1.54, so if you try it
with an earlier version, it doesn't work.
In particular, it is the feature that allows a string/filename to be used
as the argument for SeqIO.parse, which is not available in earlier
versions.
Maybe it would be useful to add a note to the tutorial, explaining to
users that they should update Biopython or that in earlier versions
they have to use a file handle.

>
> from Bio import SeqIO
>
> for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
>
>     print seq_record.id
>     print repr(seq_record.seq)
>     print len(seq_record)
>
>
> ==================================================
> 2. How do I create a SeqRecord in biopython?
> ==================================================
> May 5, 2010 at 6:25 PM
>
> id1  ="HWI-EAS380:8:1:16:830/1"
>
> seq1 ="AGGGCGTTCAGCAGCCAGCTTGCGGCAAAACTGCGTAACCGTCTTCTCGTTCTCT
>        AAAAACCATTTTTCGTCCCCTTCGGGGCGGTGGTCTATAGTGTTATTAATATCAA
>        GTTGGGGGAGCACATTGTAGCATTG"
>
> qual1="abbbbbaab`abaabbaabaaaab^E^``^aaabaa_\_abaaaaaaaa`aaaa`
>        Z^`^^aaaaaaa`aa^aaa``_aa_aaaaaaaaaaa`aaaa`aaaaaabaaabba
>        aaaaaaaaaaa_baaaabbbbbaba"

Have a look at this question, too. I don't know how to answer properly.
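
For the second question, the quality string first has to be turned into integer scores. A minimal sketch in plain Python, assuming (as the poster states) "fastq-illumina" style qualities, i.e. PHRED scores stored with an ASCII offset of 64; the function name is made up for illustration. In Biopython the resulting list would normally be stored in the record's `letter_annotations["phred_quality"]`, and it must have the same length as the sequence.

```python
ILLUMINA_OFFSET = 64  # ASCII offset assumed for "fastq-illumina" qualities


def decode_illumina_quality(quality_string, offset=ILLUMINA_OFFSET):
    """Convert an encoded quality string into a list of integer PHRED scores."""
    return [ord(char) - offset for char in quality_string]


# First 12 symbols of qual1 from the question above.
scores = decode_illumina_quality("abbbbbaab`ab")
print(scores)
```

The three-line strings in the question would need to be joined into one string (without the line breaks) before decoding, so that there is exactly one score per base.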
--
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Thu May 6 10:57:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 May 2010 11:57:20 +0100
Subject: [Biopython-dev] 5/6 BioStar - Biopython Questions
In-Reply-To:
References:
Message-ID:

On Thu, May 6, 2010 at 9:36 AM, Giovanni Marco Dall'Olio wrote:
> On Thu, May 6, 2010 at 8:08 AM, Feed My Inbox wrote:
>> ==================================================
>> 1. Simple Fasta Parsing Is Not Simple.
>> ==================================================
>> May 5, 2010 at 6:25 PM
>>
>> When trying out the examples from chapter 2.3 of the biopython
>> 1.54b tutorial I keep running into this very annoying problem: When I use:
>
> This user is complaining that the current Biopython tutorial
> describes a feature introduced in Biopython 1.54, so if you try it
> with an earlier version, it doesn't work.
> In particular, it is the feature that allows a string/filename to be used
> as the argument for SeqIO.parse, which is not available in earlier
> versions.
> Maybe it would be useful to add a note to the tutorial, explaining to
> users that they should update Biopython or that in earlier versions
> they have to use a file handle.

It did, there was an FAQ on this. However I've added some examples to
the appendix section on handles and made the FAQ entry reference that
for more details.
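
For readers on Biopython 1.53 or older, the workaround is to open the file yourself and pass the handle, e.g. `SeqIO.parse(open("ls_orchid.fasta"), "fasta")`. The sketch below shows the general filename-or-handle pattern in plain Python. It is an illustration only, not Biopython's implementation, and `as_handle` and `read_fasta_ids` are hypothetical names.

```python
import io
from contextlib import contextmanager


@contextmanager
def as_handle(source, mode="r"):
    """Yield a file handle whether `source` is a filename or an open handle."""
    if isinstance(source, str):
        # A filename: open it here and close it when the caller is done.
        with open(source, mode) as handle:
            yield handle
    else:
        # Already a handle: pass it through and leave closing to the caller.
        yield source


def read_fasta_ids(source):
    """Return the identifiers from the '>' title lines of a FASTA file."""
    ids = []
    with as_handle(source) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                # An empty title line yields an empty identifier.
                ids.append(fields[0] if fields else "")
    return ids


# Works with an in-memory handle; a filename string would work the same way.
demo = io.StringIO(">seq1 first record\nACGT\n>seq2\nGGCC\n")
print(read_fasta_ids(demo))
```

The design point is that the caller never needs to know which kind of argument was given, which is the convenience the tutorial example relies on.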
Peter From bugzilla-daemon at portal.open-bio.org Thu May 6 17:08:58 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 6 May 2010 13:08:58 -0400 Subject: [Biopython-dev] [Bug 3042] test_Mafft_tool fails In-Reply-To: Message-ID: <201005061708.o46H8wRQ018125@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3042 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-06 13:08 EST ------- (In reply to comment #0) > This is the error message I get: > > ====================================================================== > FAIL: Simple round-trip through app with infile. > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_Mafft_tool.py", line 56, in test_Mafft_simple > self.assert_("STEP 2 / 2 d" in stderr_string) > AssertionError I have changed that to look for "Progressive alignment ..." instead which is present in both this MAFFT 5.x output and in MAFFT 6.x output. > ====================================================================== > FAIL: Round-trip with complex command line. > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_Mafft_tool.py", line 126, in test_Mafft_with_complex_command_line > self.assertEqual(return_code, 0) > AssertionError: 1 != 0 I've changed this to give the command line used to help debug when MAFFT returns an error code. Could you retest and report what MAFFT does for this particular command? Also what is the output of "mafft --help" from MAFFT 5.732? That would be useful if we do have to make running the test conditional on the version of MAFFT. Thanks! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu May 6 18:38:33 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 6 May 2010 14:38:33 -0400 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: Message-ID: <201005061838.o46IcXgw021287@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2905 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1498 is|0 |1 obsolete| | ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-06 14:38 EST ------- (From update of attachment 1498) This code is now on one of my github branches: http://github.com/peterjc/biopython/tree/seqio-sam-bam This includes basic indexing support via Bio.SeqIO.index(), currently for SAM only. BAM should be easy enough. Note that this is *much* simpler than the indexing by mapping location offered by samtools (and pysam). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 7 01:07:36 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 6 May 2010 21:07:36 -0400 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: Message-ID: <201005070107.o4717a4E002248@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2905 ------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2010-05-06 21:07 EST ------- (In reply to comment #4) > Regarding dependencies and cross platform support, pysam is a lightweight > wrapper of the samtools C-API, using pyrex. If we want to use pysam in > Biopython that means build time dependencies on samtools and pyrex. 
This > won't work under Jython, and at the time of writing pysam doesn't appear > to support Windows either. So I'm not so comfortable about this. Since the samtools C library is not so large, we could consider writing a plain C wrapper instead of pyrex to get at least rid of this dependency. The samtools dependency is more difficult. There are two reasonably options. One option is to help out the pysam/samtools developers to create a non-pyrex C wrapper and have it included into the samtools distribution. The other option is to have both samtools itself and a Python wrapper included in Biopython -- note that pysam itself includes the samtools source files. However, this would mean keeping the samtools in Biopython up-to-date with the standalone samtools, which I think will cause us headaches in the long run. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 7 06:46:35 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 7 May 2010 02:46:35 -0400 Subject: [Biopython-dev] [Bug 3042] test_Mafft_tool fails In-Reply-To: Message-ID: <201005070646.o476kZOx016899@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3042 ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2010-05-07 02:46 EST ------- (In reply to comment #2) > (In reply to comment #0) > > This is the error message I get: > > > > ====================================================================== > > FAIL: Simple round-trip through app with infile. > > ---------------------------------------------------------------------- > > Traceback (most recent call last): > > File "test_Mafft_tool.py", line 56, in test_Mafft_simple > > self.assert_("STEP 2 / 2 d" in stderr_string) > > AssertionError > > I have changed that to look for "Progressive alignment ..." 
instead > which is present in both this MAFFT 5.x output and in MAFFT 6.x output. This error has disappeared -- thanks! > > > ====================================================================== > > FAIL: Round-trip with complex command line. > > ---------------------------------------------------------------------- > > Traceback (most recent call last): > > File "test_Mafft_tool.py", line 126, in test_Mafft_with_complex_command_line > > self.assertEqual(return_code, 0) > > AssertionError: 1 != 0 > > I've changed this to give the command line used to help debug when MAFFT > returns an error code. Could you retest and report what MAFFT does for > this particular command? This is the output I am now getting: ====================================================================== FAIL: Round-trip with complex command line. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Mafft_tool.py", line 144, in test_Mafft_with_complex_command_line % (return_code, cmdline)) AssertionError: Got error code 1 back from: mafft --localpair --weighti 4.2 --retree 5 --maxiterate 200 --nofft --op 2.04 -- ep 0.51 --lop 0.233 --lep 0.2 --reorder --treeout --nuc Fasta/f002 If I just run this mafft command directly, I get: $ mafft --localpair --weighti 4.2 --retree 5 --maxiterate 200 --nofft --op 2.04 --ep 0.51 --lop 0.233 --lep 0.2 --reorder --treeout --nuc Fasta/f002 /usr/local/bin/mafft: line 184: [: --treeout: integer expression expected Unknown option: --treeout MAFFT version 5.732 (2005/09/14) References: Katoh et al., 2002, NAR 30: 3059-3066 Katoh et al., 2005, NAR 33: 511-518 http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft Options: --localpair : All pairwise local alignment information is included to the objective function. default: off --globalpair : All pairwise global alignment information is included to the objective function. default: off --op # : Gap opening penalty (>0). 
default: 1.53 --ep # : Offset (>0, works like gap extension penalty). default: 0.123 --bl #, --jtt # : Scoring matrix. default: BLOSUM62 Alternatives are BLOSUM (--bl) 30, 45, 62, 80, or JTT (--jtt) # PAM. --nuc or --amino : Sequence type. default: auto --retree # : The number of tree building in progressive method (see the paper for detail). default: 2 --maxiterate # : Maximum number of iterative refinement. default: 0 --fft or --nofft: FFT is enabled or disabled. default: enabled --memsave: Memory saving mode (beta). default: off --clustalout: Output: clustal format (not tested). default: fasta --reorder: Outorder: aligned (not tested). default: input order --quiet : Do not report progress. Input format: fasta format Typical usages: % mafft --maxiterate 1000 --localpair input > output L-INS-i (most accurate in many cases; assumes there is only one alignable domain) % mafft --maxiterate 1000 --genafpair input > output E-INS-i (works even if there are many unalignable residues between alignable domains) % mafft --maxiterate 1000 --globalpair input > output G-INS-i (suitable for globally alignable sequences) % mafft --maxiterate 1000 input > output FFT-NS-i (accurate and slow, iterative refinement method) % mafft --retree 2 input > output (DEFAULT; same as mafft input > output) FFT-NS-2 (rough and fast; progressive method) % mafft --retree 1 input > output FFT-NS-1 (very rough and very fast, applicable to >5,000 sequences; progressive method with a rough guide tree) > Also what is the output of "mafft --help" from MAFFT 5.732? That would be > useful if we do have to make running the test conditional on the version of > MAFFT. > This is the output of "mafft --help": $ mafft --help Cannot open --help. 
MAFFT version 5.732 (2005/09/14) References: Katoh et al., 2002, NAR 30: 3059-3066 Katoh et al., 2005, NAR 33: 511-518 http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft Options: --localpair : All pairwise local alignment information is included to the objective function. default: off --globalpair : All pairwise global alignment information is included to the objective function. default: off --op # : Gap opening penalty (>0). default: 1.53 --ep # : Offset (>0, works like gap extension penalty). default: 0.123 --bl #, --jtt # : Scoring matrix. default: BLOSUM62 Alternatives are BLOSUM (--bl) 30, 45, 62, 80, or JTT (--jtt) # PAM. --nuc or --amino : Sequence type. default: auto --retree # : The number of tree building in progressive method (see the paper for detail). default: 2 --maxiterate # : Maximum number of iterative refinement. default: 0 --fft or --nofft: FFT is enabled or disabled. default: enabled --memsave: Memory saving mode (beta). default: off --clustalout: Output: clustal format (not tested). default: fasta --reorder: Outorder: aligned (not tested). default: input order --quiet : Do not report progress. Input format: fasta format Typical usages: % mafft --maxiterate 1000 --localpair input > output L-INS-i (most accurate in many cases; assumes there is only one alignable domain) % mafft --maxiterate 1000 --genafpair input > output E-INS-i (works even if there are many unalignable residues between alignable domains) % mafft --maxiterate 1000 --globalpair input > output G-INS-i (suitable for globally alignable sequences) % mafft --maxiterate 1000 input > output FFT-NS-i (accurate and slow, iterative refinement method) % mafft --retree 2 input > output (DEFAULT; same as mafft input > output) FFT-NS-2 (rough and fast; progressive method) % mafft --retree 1 input > output FFT-NS-1 (very rough and very fast, applicable to >5,000 sequences; progressive method with a rough guide tree) Thanks, --Michiel. 
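
The banner in the output above is enough to tell MAFFT releases apart, which is what a version-dependent test would need. A minimal sketch, assuming the banner always has the form `MAFFT version X.Y` as in the 5.732 output quoted here; other releases may word it differently, and `parse_mafft_version` is a made-up helper, not code from the Biopython test suite.

```python
import re

# Matches e.g. "MAFFT version 5.732 (2005/09/14)"; the format is assumed
# from the output quoted in this thread and may differ in other releases.
_BANNER = re.compile(r"MAFFT version (\d+)\.(\d+)")


def parse_mafft_version(help_text):
    """Return (major, minor) from MAFFT's help banner, or None if not found."""
    match = _BANNER.search(help_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


banner = "Cannot open --help.\n\nMAFFT version 5.732 (2005/09/14)\n"
version = parse_mafft_version(banner)
print(version)
# A test could then be skipped when the major version is too old.
print(version is not None and version[0] >= 6)
```

Returning None when no banner is found lets the caller decide whether an unrecognised tool should skip the test or fail it.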
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 7 08:49:11 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 7 May 2010 04:49:11 -0400 Subject: [Biopython-dev] [Bug 3042] test_Mafft_tool fails In-Reply-To: Message-ID: <201005070849.o478nBF0022623@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3042 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-07 04:49 EST ------- (In reply to comment #3) > > This is the output I am now getting: > > ====================================================================== > FAIL: Round-trip with complex command line. > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_Mafft_tool.py", line 144, in test_Mafft_with_complex_command_line > % (return_code, cmdline)) > AssertionError: Got error code 1 back from: > mafft --localpair --weighti 4.2 --retree 5 --maxiterate 200 --nofft --op 2.04 > -- > ep 0.51 --lop 0.233 --lep 0.2 --reorder --treeout --nuc Fasta/f002 > > If I just run this mafft command directly, I get: > > $ mafft --localpair --weighti 4.2 --retree 5 --maxiterate 200 --nofft --op 2.04 > --ep 0.51 --lop 0.233 --lep 0.2 --reorder --treeout --nuc Fasta/f002 > /usr/local/bin/mafft: line 184: [: --treeout: integer expression expected > Unknown option: --treeout > > MAFFT version 5.732 (2005/09/14) > ... It looks to me like some of the other arguments the test is trying to use are also not supported on this version of MAFFT. > > Also what is the output of "mafft --help" from MAFFT 5.732? That would be > > useful if we do have to make running the test conditional on the version of > > MAFFT. 
> > > > This is the output of "mafft --help": > > $ mafft --help > Cannot open --help. > > MAFFT version 5.732 (2005/09/14) > ... Great. That's enough to be able to detect the version number. Note that MAFFT v6 doesn't support the --help argument, the point is it will abort with help text and not sit waiting on stdin. I've update the unit test to require MAFFT v6 or later, which should resolve this bug. http://github.com/biopython/biopython/commit/e2219beb156e80b55da3efb4a8efe2c2347ec877 Thanks for your help, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 7 09:21:08 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 7 May 2010 05:21:08 -0400 Subject: [Biopython-dev] [Bug 3042] test_Mafft_tool fails In-Reply-To: Message-ID: <201005070921.o479L8bl023566@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3042 ------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2010-05-07 05:21 EST ------- (In reply to comment #4) > I've update the unit test to require MAFFT v6 or later, which should resolve > this bug. Thanks. This works fine now. --Michiel. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri May 7 09:42:45 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 7 May 2010 05:42:45 -0400 Subject: [Biopython-dev] [Bug 3042] test_Mafft_tool fails In-Reply-To: Message-ID: <201005070942.o479gj3E024144@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3042 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-07 05:42 EST ------- (In reply to comment #5) > > Thanks. This works fine now. > Great - marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 7 11:15:28 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 7 May 2010 07:15:28 -0400 Subject: [Biopython-dev] [Bug 3043] test_NCBI_BLAST_tools fails In-Reply-To: Message-ID: <201005071115.o47BFSuG029162@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3043 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-07 07:15 EST ------- Michiel, Could you run this test using the latest code? I've added a hack to ignore the three "extra" arguments, -remote_verbose, -use_test_remote_service, and -verbose so it should work. We can probably then comment this out for general usage because... I've also added a few real tests using the pairwise search functionality in BLAST+ where you can search a FASTA file of queries against a FASTA file of subjects -- without having to setup a BLAST database first. This is rather nice. 
However the tool will not output XML in this mode, and it seems right
now we can't parse the plain text output. Tabular output should be fine.

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri May 7 11:39:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 7 May 2010 07:39:45 -0400
Subject: [Biopython-dev] [Bug 3043] test_NCBI_BLAST_tools fails
In-Reply-To:
Message-ID: <201005071139.o47BdjRT030107@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3043

------- Comment #7 from mdehoon at ims.u-tokyo.ac.jp 2010-05-07 07:39 EST -------
(In reply to comment #6)
> However the tool will not output XML in this mode, and it seems right
> now we can't parse the plain text output. Tabular output should be fine.

I'll put together a Blast output parser after Biopython 1.54 final is out.

> Could you run this test using the latest code?

This works fine now. Thanks!

$ python test_NCBI_BLAST_tools.py
Check all blastn arguments are supported ... ok
Check all blastp arguments are supported ... ok
Check all blastx arguments are supported ... ok
Check all psiblast arguments are supported ... ok
Check all rpsblast arguments are supported ... ok
Check all rpstblastn arguments are supported ... ok
Check all tblastn arguments are supported ... ok
Check all tblastx arguments are supported ... ok
Pairwise BLASTP search ... ok
Pairwise BLASTN search ... ok

----------------------------------------------------------------------
Ran 10 tests in 0.389s

OK

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
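
Since tabular output is the format that works in this pairwise mode, here is a minimal sketch of reading it without any Biopython parser. It assumes the default 12-column layout that BLAST+ writes with `-outfmt 6`; the function name and the sample line are invented for illustration and are not real search results.

```python
# Default column order of BLAST+ tabular output ("-outfmt 6").
COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)
_INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
_FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_tabular_line(line):
    """Turn one tab-separated hit line into a dict with typed values."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError("expected %i columns, got %i" % (len(COLUMNS), len(values)))
    hit = dict(zip(COLUMNS, values))
    for name in _INT_FIELDS:
        hit[name] = int(hit[name])
    for name in _FLOAT_FIELDS:
        hit[name] = float(hit[name])
    return hit


# Invented example line, for illustration only.
sample = "query1\tsubject1\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t360\n"
hit = parse_tabular_line(sample)
print(hit["sseqid"], hit["pident"], hit["evalue"])
```

If a custom column list is requested on the command line, `COLUMNS` would have to be changed to match it.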
From krother at rubor.de Fri May 7 13:23:56 2010
From: krother at rubor.de (Kristian Rother)
Date: Fri, 7 May 2010 15:23:56 +0200
Subject: [Biopython-dev] Added Loop Closure algorithm to rna branch.
Message-ID: <75a2442e49a53379b4cf707eee2a1f05-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheQ1pWUA5cWQ==-webmailer2@server03.webmailer.hosteurope.de>

Hi,

We've prepared some code that we would like to propose to the community
for inclusion:

Bio.PDB.CoordBuilder
    Constructs 3D coordinates based on 3 atoms + distance + angle +
    dihedral.
    Uses the NeRF algorithm also used in the ROSETTA protein prediction
    program.

Bio.PDB.FCCDLoopCloser
    Iteratively optimizes a dangling chain of atoms until they
    reach a defined target site.
    Refactored version of Wouter Boomsma's and Thomas Hamelryck's algorithm.

Code + tests have been committed to krother/biopython, branch *rna* on github.

We hope this is useful.

Cheers,
    Kristian, Magdalena, Tomek

From biopython at maubp.freeserve.co.uk Fri May 7 13:42:02 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 7 May 2010 14:42:02 +0100
Subject: [Biopython-dev] Added Loop Closure algorithm to rna branch.
In-Reply-To: <75a2442e49a53379b4cf707eee2a1f05-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheQ1pWUA5cWQ==-webmailer2@server03.webmailer.hosteurope.de>
References: <75a2442e49a53379b4cf707eee2a1f05-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheQ1pWUA5cWQ==-webmailer2@server03.webmailer.hosteurope.de>
Message-ID:

On Fri, May 7, 2010 at 2:23 PM, Kristian Rother wrote:
>
> Hi,
>
> We've prepared some code that we would like to propose to the community
> for inclusion:
>
> Bio.PDB.CoordBuilder
>     Constructs 3D coordinates based on 3 atoms + distance + angle +
>     dihedral.
>     Uses the NeRF algorithm also used in the ROSETTA protein prediction
>     program.
>
> Bio.PDB.FCCDLoopCloser
>     Iteratively optimizes a dangling chain of atoms until they
>     reach a defined target site.
>     Refactored version of Wouter Boomsma's and Thomas Hamelryck's algorithm.
>
>
> Code + tests have been committed to krother/biopython, branch *rna* on github.
>
> We hope this is useful.
>
> Cheers,
>     Kristian, Magdalena, Tomek
>

Sounds very interesting.

Eric - can you take a look at this? We could potentially merge
this after Biopython 1.54 is out the door...

Peter

From bugzilla-daemon at portal.open-bio.org Fri May 7 14:47:02 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 7 May 2010 10:47:02 -0400
Subject: [Biopython-dev] [Bug 3074] New: Please support additional fields in the SeqIO embl parser
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3074

           Summary: Please support additional fields in the SeqIO embl parser
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: Wim.DeSmet at UGent.be

Sequences returned from the Bio.SeqIO parser for 'embl' files don't contain a
parsed version of at least the following fields:

DT (date)
DR (database cross references)

Possibly also missing:
KW the keywords field
dataclass field in the ID field

It would be useful to me, and I imagine to others, to have access to these
additional fields that are in the original EMBL files. Not having them means
that if you parse EMBL files, manipulate the sequence and write out the result,
you either lose data or, if you wish to hold on to it, have to manually add the
original data back into the file.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
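
Until the parser exposes these fields, they can be recovered from the raw record text, because EMBL header lines start with a two-letter line code. A minimal sketch in plain Python; `collect_embl_lines` is a hypothetical helper and the record fragment is invented for illustration, so it only shows the line layout, not a real entry.

```python
def collect_embl_lines(record_text, codes=("DT", "DR", "KW")):
    """Group the content of selected EMBL line types by their two-letter code."""
    collected = dict((code, []) for code in codes)
    for line in record_text.splitlines():
        code = line[:2]
        if code in collected:
            # The content starts after the code and its padding.
            collected[code].append(line[2:].strip())
    return collected


# Invented fragment, laid out like an EMBL flat file header.
fragment = """\
ID   X00001; SV 1; linear; genomic DNA; STD; PLN; 100 BP.
DT   01-JAN-2000 (Rel. 60, Created)
KW   example keyword; another keyword.
DR   ExampleDB; ABC123; .
"""
fields = collect_embl_lines(fragment)
print(fields["DT"])
print(fields["KW"])
print(fields["DR"])
```

The collected strings could be kept alongside the parsed record and written back by hand, which is the manual step the report would like the parser to make unnecessary.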
From updates at feedmyinbox.com Sun May 9 06:08:19 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Sun, 9 May 2010 02:08:19 -0400 Subject: [Biopython-dev] 5/9 BioStar - Biopython Questions Message-ID: ================================================== 1. I wonder why it is so important to use Seq objects in stead of plain ol' strings in Biopyton? ================================================== May 8, 2010 at 5:10 PM This may seem like a superfluous question, and perhaps it is, but it's important to get the basic raison d'etres of the programming habits that are encouraged in the tutorial straight. (Wow that's a strange and awkward sentence but it's good to write in English and show off literal non-skills.) In short: Why use: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna) >>> messenger_rna Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA()) >>> messenger_rna.translate() Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*')) When you can simply use: >>> from Bio.Seq import translate >>> my_string_messenger_rna = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG" >>> translate(my_string_messenger_rna) 'MAIVMGR*KGAR*' http://biostar.stackexchange.com/questions/1001/i-wonder-why-it-is-so-important-to-use-seq-objects-in-stead-of-plain-ol-strings -------------------------------------------------- =========================================================== Source: http://biostar.stackexchange.com/questions/tagged/biopython This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com. 
230 Franklin Road Suite 814 Franklin, TN 37064 From bugzilla-daemon at portal.open-bio.org Wed May 12 23:41:03 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 12 May 2010 19:41:03 -0400 Subject: [Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records In-Reply-To: Message-ID: <201005122341.o4CNf3Ps001897@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #11 from laserson at mit.edu 2010-05-12 19:41 EST ------- Hi Peter, Sorry for my short hiatus...see responses below. (In reply to comment #10) > Could you retest as "embl" format with the trunk? I would expect some warnings > from these over indented features in IMGT, and we can certainly remove the > warning if we decide not to introduce a separate IMGT format variant. I still get the LocationParserErrors for many records. Also note that the SeqIO.index function doesn't treat the IMGT headers correctly, so it's not possible to access any of the records from the index it creates (this was also addressed in my patch where I subclassed an independent IMGT parser). > > http://github.com/biopython/biopython/commit/e6ba962dd60fe585baa1237445d33f67d47dd57f > > This change takes a slightly different approach to your work on github, but > is quite similar to your two line patch - but this should still work with > another odd form: > > FH Key Location/Qualifiers > FT L-V-D-J-C-SEQUEN1..1151 > FT /db_xref="taxon:32630" > FT /organism="synthetic construct" > FT 5'UTR 1..37 > ... I still couldn't get the current master branch 'embl' format to work. But hardcoding the alternate indentation did work, even in the cases where the feature key is right up against the location qualifier. > > In the above example (generated by Biopython itself), the strict EMBL column > limits have been obeyed but the feature key has been truncated to just > L-V-D-J-C-SEQUEN rather than L-V-D-J-C-SEQUENCE. 
This is a related query - > when asked to output such a feature as EMBL or GenBank format, should we raise > an exception here? We could add a warning instead, and either leave the code > as is, or output this: > > FH Key Location/Qualifiers > FT L-V-D-J-C-SEQUE 1..1151 > FT /db_xref="taxon:32630" > FT /organism="synthetic construct" > FT 5'UTR 1..37 > ... > I think we should probably output all IMGT records using the increased indentation. This way there will be no ambiguity and no information loss. If you want to manually "convert" to standard EMBL format, I think the truncation makes sense as you proposed it, and we could issue a warning about lost information. > > > Thinking ahead would you also want to be able to write out IMGT variant > > > EMBL files? > > > > I personally don't need this functionality, but I am willing to write it to > > complement the IMGT parser that I wrote. > > If we go done the route of formalising IMGT as an EMBL variant with a different > feature indent, it should just be a trivial subclass of the existing EMBL > writer object but with the indentation constant changed. > Agreed. > Note there are other problem in the IMGT data, including locations like > "1..428>" and "<1..328>" where the greater than should be BEFORE the location > (but we could probably cope with this all the same), and just "1." where half > the location is missing (which we can't really do much with other than treat > it as simply "1" instead?). I have already notified IMGT regarding the ">" problem, though they seem like they will be slow to change it. It's a very simple fix to the flatfile, and I did it manually with regular expressions. My preference is that we do NOT support the backwards notation, as it's clearly wrong. We'll have them fix it. In the meanwhile, I can post my python script that corrects it somewhere (maybe as a gist on github) and we can just point people to it in a warning if they are using the IMGT parser. Regarding the 1. 
problem, I have not yet told the IMGT people, but I will do so shortly. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From updates at feedmyinbox.com Thu May 13 06:12:15 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Thu, 13 May 2010 02:12:15 -0400 Subject: [Biopython-dev] 5/13 BioStar - Biopython Questions Message-ID: <1c49e04af2a3e80dd942fa741b65a7b2@74.63.51.88> ================================================== 1. Clustalw alignment problem ================================================== May 12, 2010 at 4:54 PM Hi everyone, I tried these lines ................................................ import os from Bio.Clustalw import MultipleAlignCL cline = MultipleAlignCL(os.path.join(os.curdir, "opuntia.fasta")) cline.set_output("test.aln") alignment = Clustalw.do_alignment(cline) ............................................. But couldn't proceed with these errors ............................................. Traceback (most recent call last): File "", line 244, in run_nodebug File "C:\Python24\align.py", line 5, in ? alignment = Clustalw.do_alignment(cline) File "C:\Python24\lib\site-packages\Bio\Clustalw\__init__.py", line 95, in do_alignment shell=(sys.platform!="win32") File "C:\Python24\lib\subprocess.py", line 534, in __init__ (p2cread, p2cwrite, File "C:\Python24\lib\subprocess.py", line 594, in _get_handles p2cread = self._make_inheritable(p2cread) File "C:\Python24\lib\subprocess.py", line 635, in _make_inheritable DUPLICATE_SAME_ACCESS) TypeError: an integer is required ................................................. test.aln is not generated too .................................................. 
Thanks http://biostar.stackexchange.com/questions/1041/clustalw-alignment-problem -------------------------------------------------- =========================================================== Source: http://biostar.stackexchange.com/questions/tagged/biopython From biopython at maubp.freeserve.co.uk Thu May 13 11:37:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 May 2010 12:37:36 +0100 Subject: [Biopython-dev] Ready for Biopython 1.54 release? Message-ID: Hello all, Are there any outstanding issues we should address before making the Biopython 1.54 release? Eric has made a good start on covering Bio.Phylo in the tutorial, which can be easily proof read online: http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf One thing I am wondering about is making column extraction in the new alignment object return a string rather than a Seq object. I'll start another thread on this issue... Peter From biopython at maubp.freeserve.co.uk Thu May 13 11:47:48 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 May 2010 12:47:48 +0100 Subject: [Biopython-dev] Alignment columns as strings or Seq objects? Message-ID: Peter wrote: > Hello all, > > Are there any outstanding issues we should address before making > the Biopython 1.54 release? > > ... > > One thing I am wondering about is making column extraction in > the new alignment object return a string rather than a Seq object. > I'll start another thread on this issue...
I remember we debated this a bit before but can't find the thread right now. See also Bug 3066 where I am proposing to add methods to iterate over the rows or columns as strings. http://bugzilla.open-bio.org/show_bug.cgi?id=3066 The main benefit of using a plain string when extracting the alignment columns is speed. Because the data is stored by row, each time we extract a column we would have to build a new instance of the Seq object. For large alignments (and thinking ahead to next-gen alignment objects) this could be a painful overhead. Because the whole alignment has an alphabet, we can use this to assign an alphabet to a column sequence. Note that the rows of the alignments could have slightly different alphabets. So it is possible (and the current code does this) to generate a Seq object with a meaningful alphabet from a column. Why is this useful? Other than the alphabet, the main benefit of using a Seq object is consistency. On a practical level, the Seq object's biological translate method isn't appropriate at all for an alignment column. On the other hand, one might possibly want to use (back)transcribe to flip between DNA and RNA, and maybe even take the complement. Are there any strong views here on how alignment slicing to get a column should behave? i.e. should align[:,9] return the column as a string or as a Seq? Peter From mjldehoon at yahoo.com Fri May 14 00:29:59 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 13 May 2010 17:29:59 -0700 (PDT) Subject: [Biopython-dev] Alignment columns as strings or Seq objects? In-Reply-To: Message-ID: <948611.90987.qm@web62403.mail.re1.yahoo.com> I would definitely use a plain string. A Seq object suggests that we're dealing with a real biological sequence, which a column in the alignment matrix is not. The only advantage of having a Seq object is that it has an alphabet associated with it. But alphabets are very rarely used in practice, if at all. 
Reverse complementing or (back-)transcribing are available in the Bio.Seq module as functions that can operate on plain strings, so we don't need a Seq object for that. --Michiel. From eric.talevich at gmail.com Fri May 14 02:28:13 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 13 May 2010 19:28:13 -0700 Subject: [Biopython-dev] Alignment columns as strings or Seq objects? In-Reply-To: <948611.90987.qm@web62403.mail.re1.yahoo.com> References: <948611.90987.qm@web62403.mail.re1.yahoo.com> Message-ID: Here's another +1 for plain strings. I agree with Michiel, and if the user really needs to rebuild a Seq with the original alphabet, it's not too difficult to fetch that information from the original alignment object. -Eric From eric.talevich at gmail.com Fri May 14 03:08:52 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 13 May 2010 20:08:52 -0700 Subject: [Biopython-dev] Ready for Biopython 1.54 release? In-Reply-To: References: Message-ID: On Thu, May 13, 2010 at 4:37 AM, Peter wrote: > Hello all, > > Are there any outstanding issues we should address before making > the Biopython 1.54 release? > > Eric has made a good start on covering Bio.Phylo in the tutorial, > which can be easily proof read online: > http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html > http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf > > I wrote some more, but something very inconvenient happened to my laptop just after I pushed the commit to my own branch on GitHub: http://github.com/etal/biopython/commit/960449b1ea9b713ca8111c40d930fc404340fdc2 This is a description of every TreeMixin method, ripped from the docstrings / epydoc output and cleaned up a little. The main thing I still need to explain is how the find_* arguments work. However, if we want to get the release out quickly, then all of this can be moved to the wiki instead, if you'd prefer. -Eric From biopython at maubp.freeserve.co.uk Fri May 14 09:17:06 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 May 2010 10:17:06 +0100 Subject: [Biopython-dev] Ready for Biopython 1.54 release?
In-Reply-To: References: Message-ID: On Fri, May 14, 2010 at 4:08 AM, Eric Talevich wrote: > On Thu, May 13, 2010 at 4:37 AM, Peter > wrote: >> >> Hello all, >> >> Are there any outstanding issues we should address before making >> the Biopython 1.54 release? >> >> Eric has made a good start on covering Bio.Phylo in the tutorial, >> which can be easily proof read online: >> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html >> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf >> > > I wrote some more, but something very inconvenient happened to my laptop > just after I pushed the commit to my own branch on GitHub: > http://github.com/etal/biopython/commit/960449b1ea9b713ca8111c40d930fc404340fdc2 > > This is a description of every TreeMixin method, ripped from the docstrings > / epydoc output and cleaned up a little. The main thing I still need to > explain is how the find_* arguments work. > > However, if we want to get the release out quickly, then all of this can be > moved to the wiki instead, if you'd prefer. > > -Eric Hi Eric, It sounds like giving you a little more time will make the Phylo chapter much more useful. I'm not going to have time today, and while I could do the release from home my Windows development machine is at work. Shall we aim for a release early next week then? Say Monday or Tuesday? Peter From biopython at maubp.freeserve.co.uk Fri May 14 10:23:30 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 May 2010 11:23:30 +0100 Subject: [Biopython-dev] Alignment columns as strings or Seq objects? In-Reply-To: References: <948611.90987.qm@web62403.mail.re1.yahoo.com> Message-ID: On Fri, May 14, 2010 at 3:28 AM, Eric Talevich wrote: > Here's another +1 for plain strings. I agree with Michiel, and if the user > really needs to rebuild a Seq with the original alphabet, it's not too > difficult to fetch that information from the original alignment object. > > -Eric OK, done. 
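To make the speed argument in this thread concrete, here is a minimal sketch, assuming the alignment is held as a list of equal-length row strings. The `rows` list and the `column_as_string` helper are made up for illustration and are not Biopython's MultipleSeqAlignment code. Pulling out a column is then a single join over one character per row, with no extra object built per column.

```python
def column_as_string(rows, index):
    """Return column `index` of a row-major alignment as a plain string.

    `rows` is assumed to be a list of equal-length strings (hypothetical
    storage, chosen only for this sketch).
    """
    if not rows:
        return ""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("All rows must have the same length")
    # One character per row, joined once at the end.
    return "".join(row[index] for row in rows)


rows = ["ACGT-A", "ACGTTA", "AC-TTA"]
third_column = column_as_string(rows, 2)
```

If a caller really wants a sequence object for a column, it can wrap the returned string itself, which is roughly the point Eric makes above.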
Peter From bugzilla-daemon at portal.open-bio.org Fri May 14 13:14:59 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 May 2010 09:14:59 -0400 Subject: [Biopython-dev] [Bug 3066] Iterating/looping over colums/rows of a MultipleSeqAlignment In-Reply-To: Message-ID: <201005141314.o4EDExXE021190@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3066 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-14 09:14 EST ------- (In reply to comment #0) > A related question here is should the columns be returned as strings or as Seq > objects? Possible implementation to follow as a patch... The main __getitem__ method has just been changed to return strings as of Biopython 1.54 (while the beta returned columns as Seq objects): http://github.com/biopython/biopython/commit/dbf72e19d65d1edd6777bd498306fe34eb4e371e Therefore for consistency, any column iterator method should now also return strings. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 14 13:33:48 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 May 2010 09:33:48 -0400 Subject: [Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records In-Reply-To: Message-ID: <201005141333.o4EDXm7Q021824@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #12 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-14 09:33 EST ------- (In reply to comment #11) > > I think we should probably output all IMGT records using the increased > indentation. This way there will be no ambiguity and no information loss. 
If > you want to manually "convert" to standard EMBL format, I think the truncation > makes sense as you proposed it, and we could issue a warning about lost > information. I've found a page describing the IMGT file format, and it does say their feature indent should be 26 (while EMBL files use 21): http://www.ebi.ac.uk/imgt/hla/docs/manual.html > > I have already notified IMGT regarding the ">" problem, though they seem like > they will be slow to change it. It's a very simple fix to the flatfile, and I > did it manually with regular expressions. My preference is that we do NOT > support the backwards notation, as it's clearly wrong. We'll have them fix > it. In the meanwhile, I can post my python script that corrects it somewhere > (maybe as a gist on github) and we can just point people to it in a warning if > they are using the IMGT parser. > > Regarding the 1. problem, I have not yet told the IMGT people, but I will do > so shortly. > The document I found does not discuss the details of the location, so I would expect it to follow the same rules as EMBL (and GenBank and the DDBJ), see: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html I now agree with you it makes sense to treat this as a new format in SeqIO (i.e. "imgt" rather than "embl"). The actual new code should be minimal too. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
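The indentation point in comment #12 can be sketched as follows. This is only an illustration under stated assumptions: the `format_feature_line` helper and the `FT_PREFIX` constant are hypothetical names and are not Biopython's writer code. The only numbers taken from the thread are the feature indents of 21 (EMBL) and 26 (IMGT), and the truncated key `L-V-D-J-C-SEQUE` shown in comment #10.

```python
import warnings

FT_PREFIX = "FT   "  # line tag plus three spaces, as in EMBL-style feature tables


def format_feature_line(key, location, indent=21):
    """Format the first line of a feature so the location starts after `indent` characters.

    indent=21 mimics the standard EMBL layout, indent=26 the wider IMGT one.
    Hypothetical helper for illustration only.
    """
    # Space left for the key: total indent minus the prefix and one separating space.
    key_width = indent - len(FT_PREFIX) - 1
    if len(key) > key_width:
        warnings.warn("Feature key %r truncated to %i characters" % (key, key_width))
        key = key[:key_width]
    return FT_PREFIX + key.ljust(key_width) + " " + location


with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # the truncation warning is expected here
    embl_line = format_feature_line("L-V-D-J-C-SEQUENCE", "1..1151", indent=21)

imgt_line = format_feature_line("L-V-D-J-C-SEQUENCE", "1..1151", indent=26)
```

With the narrower indent the key loses its last three characters, while the wider indent keeps it whole, which is why writing IMGT records with the increased indentation avoids information loss.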
From bugzilla-daemon at portal.open-bio.org Fri May 14 13:44:46 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 May 2010 09:44:46 -0400 Subject: [Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records In-Reply-To: Message-ID: <201005141344.o4EDikeW022139@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-14 09:44 EST ------- (In reply to comment #12) > I've found a page describing the IMGT file format, and it does say their > feature indent should be 26 (while EMBL files use 21): > http://www.ebi.ac.uk/imgt/hla/docs/manual.html See also: http://imgt.cines.fr/download/LIGM-DB/userman_doc.html and: http://imgt.cines.fr/download/LIGM-DB/ftable_doc.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Fri May 14 16:14:15 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 14 May 2010 09:14:15 -0700 Subject: [Biopython-dev] Ready for Biopython 1.54 release? In-Reply-To: References: Message-ID: On Fri, May 14, 2010 at 2:17 AM, Peter wrote: > On Fri, May 14, 2010 at 4:08 AM, Eric Talevich > wrote: > > On Thu, May 13, 2010 at 4:37 AM, Peter > > wrote: > >> > >> Hello all, > >> > >> Are there any outstanding issues we should address before making > >> the Biopython 1.54 release? 
> >> > >> Eric has made a good start on covering Bio.Phylo in the tutorial, > >> which can be easily proof read online: > >> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html > >> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf > >> > > > > I wrote some more, but something very inconvenient happened to my laptop > > just after I pushed the commit to my own branch on GitHub: > > > http://github.com/etal/biopython/commit/960449b1ea9b713ca8111c40d930fc404340fdc2 > > > > This is a description of every TreeMixin method, ripped from the > docstrings > > / epydoc output and cleaned up a little. The main thing I still need to > > explain is how the find_* arguments work. > > > > However, if we want to get the release out quickly, then all of this can > be > > moved to the wiki instead, if you'd prefer. > > > > -Eric > > Hi Eric, > > It sounds like giving you a little more time will make the Phylo > chapter much more useful. > > I'm not going to have time today, and while I could do the release > from home my Windows development machine is at work. > > Shall we aim for a release early next week then? Say Monday or > Tuesday? > > Peter > Sure. I'll aim for pushing my documentation to GitHub on Saturday or Sunday. 
-Eric From bugzilla-daemon at portal.open-bio.org Fri May 14 17:05:06 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 May 2010 13:05:06 -0400 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: Message-ID: <201005141705.o4EH56ok028481@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2905 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-14 13:05 EST ------- The code on my branch has been updated, and now supports SAM and BAM parsing (currently it only extracts the read name, sequence and quality scores), indexing by name with Bio.SeqIO.index(), and fast conversion to FASTA or Sanger FASTQ with Bio.SeqIO.convert() which is handy for redoing a mapping: http://github.com/peterjc/biopython/tree/seqio-sam-bam Note that suffixes of "/1" or "/2" are added to forward or reverse read names to make them unique. This matches the Illumina pipeline convention and is handled by most tools which take paired end data. I'm actually using this code at the moment: I've started with BAM files of paired end Illumina transcriptome reads mapped onto a draft assembly. I then used the convert code to convert these to FASTQ files, then split them into a pair of FASTQ files (forward and reverse) and used BWA to remap them to a different reference (giving new SAM files). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sat May 15 10:32:39 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 15 May 2010 11:32:39 +0100 Subject: [Biopython-dev] Simpler SeqRecord creation Message-ID: Hi all, Since several of the changes coming in Biopython 1.54 are "syntactic sugar" like accepting filenames or handles in SeqIO, I was wondering about other ways to make life a little easier. 
One is creation of a SeqRecord with the default argument: from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord rec = SeqRecord(Seq("ACGT"), id="Test") If the SeqRecord __init__ checked for a plain string as the sequence, it could automatically upgrade it into a Seq object with the default argument, thus: from Bio.SeqRecord import SeqRecord rec = SeqRecord("ACGT", id="Test") I'm a little concerned that this will impose a small but noticeable overhead when working on very large files though... What are people's thoughts on this idea? Peter From mjldehoon at yahoo.com Sat May 15 16:19:05 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 15 May 2010 09:19:05 -0700 (PDT) Subject: [Biopython-dev] Simpler SeqRecord creation In-Reply-To: Message-ID: <140115.14021.qm@web62402.mail.re1.yahoo.com> Simpler SeqRecord creation is good in itself, but I wouldn't spend too much time on it. If hopefully we some day deprecate alphabets, then a Seq object reduces to a string anyway. --Michiel. From chapmanb at 50mail.com Sat May 15 17:53:20 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 15 May 2010 13:53:20 -0400 Subject: [Biopython-dev] Simpler SeqRecord creation In-Reply-To: <140115.14021.qm@web62402.mail.re1.yahoo.com> References: <140115.14021.qm@web62402.mail.re1.yahoo.com> Message-ID: <20100515175320.GA2432@kunkel> Peter and Michiel; > > If the SeqRecord __init__ checked for a plain string as > > the sequence, it could automatically upgrade it into a > > Seq object with the default argument, thus: > > > > from Bio.SeqRecord import SeqRecord > > rec = SeqRecord("ACGT", id="Test") > Simpler SeqRecord creation is good in itself, but I wouldn't spend too > much time on it. If hopefully we some day deprecate alphabets, then a > Seq object reduces to a string anyway. Accepting strings seems like a good way to start a transition from Seq objects to standard strings. +1 for this. It would also be useful if the defaults for id, name and description were empty strings instead of "." These don't seem especially useful, and when generating SeqRecords and writing them to FASTA, this helps avoid having to explicitly set descriptions to an empty string.
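The proposal in this thread could look roughly like the sketch below. The `ToySeq` and `ToySeqRecord` classes are stand-ins invented for this illustration and are NOT Biopython's Seq or SeqRecord. Making `id` a required argument and defaulting `name` and `description` to empty strings are assumptions of the sketch, not a description of what Biopython does.

```python
class ToySeq(object):
    """Stand-in sequence wrapper (hypothetical, not Bio.Seq.Seq)."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return self.data


class ToySeqRecord(object):
    """Stand-in record (hypothetical, not Bio.SeqRecord.SeqRecord)."""

    def __init__(self, seq, id, name="", description=""):
        # The "syntactic sugar": upgrade a plain string to a sequence object.
        # This costs one isinstance() check per record created.
        if isinstance(seq, str):
            seq = ToySeq(seq)
        self.seq = seq
        self.id = id
        self.name = name
        self.description = description


from_string = ToySeqRecord("ACGT", id="Test")
from_object = ToySeqRecord(ToySeq("ACGT"), id="Test")
```

The extra work is a single type check per record, which gives a feel for the size of the overhead Peter is worried about on very large files.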
Brad From biopython at maubp.freeserve.co.uk Sun May 16 11:48:49 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 16 May 2010 12:48:49 +0100 Subject: [Biopython-dev] Simpler SeqRecord creation In-Reply-To: <20100515175320.GA2432@kunkel> References: <140115.14021.qm@web62402.mail.re1.yahoo.com> <20100515175320.GA2432@kunkel> Message-ID: On Sat, May 15, 2010 at 6:53 PM, Brad Chapman wrote: > Peter and Michiel; > >> > If the SeqRecord __init__ checked for a plain string as >> > the sequence, it could automatically upgrade it into a >> > Seq object with the default argument, thus: >> > >> > from Bio.SeqRecord import SeqRecord >> > rec = SeqRecord("ACGT", id="Test") > >> Simpler SeqRecord creation is good in itself, but I wouldn't spend too >> much time on it. If hopefully we some day deprecate alphabets, then a >> Seq object reduces to a string anyway. > > Accepting strings seems like a good way to start a transition from > Seq objects to standard strings. +1 for this. I'm not convinced about moving from Seq objects to plain strings. I *like* having the biological methods as part of the Seq object. I can also think of several useful Seq subclass objects such as 2bit encoded unambiguous DNA or RNA (BioJava has this) or 4bit encoded ambiguous DNA or RNA (the BAM format uses this). These would be a trade off using less memory at the expense of being a bit slower for many operations - they could be very useful in dealing with next generation sequence data. > It would also be useful if the defaults for id, name and description > were empty strings instead of "." These don't seem > especially useful, and when generating SeqRecords and writing them > to Fasta, this helps avoid having to explicitly set descriptions to > an empty string. Yes, I like that idea for name and description. I'm not 100% sure about having a default ID - I'd prefer that was mandatory since so much depends on it (e.g.
SeqIO and AlignIO), and a default of the empty string may have side effects. Changing these defaults won't hurt performance which is good. Something to change after we release Biopython 1.54 this coming week? Peter From eric.talevich at gmail.com Sun May 16 16:20:15 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 16 May 2010 12:20:15 -0400 Subject: [Biopython-dev] Ready for Biopython 1.54 release? In-Reply-To: References: Message-ID: On Fri, May 14, 2010 at 12:14 PM, Eric Talevich wrote: > On Fri, May 14, 2010 at 2:17 AM, Peter wrote: > >> On Fri, May 14, 2010 at 4:08 AM, Eric Talevich >> wrote: >> > On Thu, May 13, 2010 at 4:37 AM, Peter > > >> > wrote: >> >> >> >> Hello all, >> >> >> >> Are there any outstanding issues we should address before making >> >> the Biopython 1.54 release? >> >> >> >> Eric has made a good start on covering Bio.Phylo in the tutorial, >> >> which can be easily proof read online: >> >> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html >> >> http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf >> >> >> > >> > I wrote some more, but something very inconvenient happened to my laptop >> > just after I pushed the commit to my own branch on GitHub: >> > >> http://github.com/etal/biopython/commit/960449b1ea9b713ca8111c40d930fc404340fdc2 >> > >> > This is a description of every TreeMixin method, ripped from the >> docstrings >> > / epydoc output and cleaned up a little. The main thing I still need to >> > explain is how the find_* arguments work. >> > >> > However, if we want to get the release out quickly, then all of this can >> be >> > moved to the wiki instead, if you'd prefer. >> > >> > -Eric >> >> Hi Eric, >> >> It sounds like giving you a little more time will make the Phylo >> chapter much more useful. >> >> I'm not going to have time today, and while I could do the release >> from home my Windows development machine is at work. >> >> Shall we aim for a release early next week then? Say Monday or >> Tuesday? 
>> >> Peter >> > > Sure. I'll aim for pushing my documentation to GitHub on Saturday or > Sunday. > > -Eric > I've pushed the latest docs to GitHub. Does the chapter look all right now? Feel free to modify the text as you see fit; I'm going to be traveling again today and tomorrow and won't be able to respond quickly. (My netbook's still misbehaving, so I had to do some things I'm not proud of -- hence "root" as the committer on the last merge.) -Eric From biopython at maubp.freeserve.co.uk Mon May 17 07:37:48 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 May 2010 08:37:48 +0100 Subject: [Biopython-dev] Test failures on Jython 2.5.1 Message-ID: Hi all (especially Eric), I've just run the test suite with Jython 2.5.1 and this found some new problems. Most of these are XML related, from Bio.Phylo: ERROR: Round-trip parsing and serialization of apaf.xml. ExpatError: The element type "phy:clade" must be terminated by the matching end-tag "</phy:clade>". ERROR: Round-trip parsing and serialization of bcl_2.xml. ExpatError: The element type "phy:branch_length" must be terminated by the matching end-tag "</phy:branch_length>". ERROR: Round-trip parsing and serialization of o_tol_332_d_dollo.xml. ExpatError: XML document structures must start and end within the same entity. ERROR: Round-trip parsing and serialization of made_up.xml. ExpatError: Premature end of file. ERROR: Round-trip parsing and serialization of phyloxml_examples.xml. ExpatError: XML document structures must start and end within the same entity. It would probably be instructive to look at the serialisation output in an XML validator - if there is a problem it may be the Jython parser is stricter than the C Python XML parser.
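As a first step before reaching for a full validator, a quick well-formedness check of the serialised output might look like the sketch below. It uses only the standard library's xml.etree.ElementTree. The `well_formedness_error` helper and the two sample documents are made up for illustration, and the check covers well-formedness only, not validity against the phyloXML schema.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text parses cleanly, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


# Made-up documents: the second one never closes its <clade> element.
good = "<phyloxml><phylogeny><clade/></phylogeny></phyloxml>"
bad = "<phyloxml><phylogeny><clade></phylogeny></phyloxml>"

good_result = well_formedness_error(good)
bad_result = well_formedness_error(bad)
```

Running the same check on the round-tripped test files under both interpreters would show whether the output itself is broken or whether one parser is simply stricter.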
There are also a couple of SeqIO related problems with large files: ERROR: test_SeqIO OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space ERROR: Write and read back Human_contigs.embl OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space Example Tests/EMBL/Human_contigs.embl is causing the problem. This is a sequence of length 958952, but the file doesn't actually hold the sequence so we use an UnknownSeq object. The two out of heap space errors are both from trying to create a string of 958952 "N" characters. I'll have a look at this - we can probably avoid it in the test. Peter From biopython at maubp.freeserve.co.uk Mon May 17 07:57:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 May 2010 08:57:15 +0100 Subject: [Biopython-dev] Test failures on Jython 2.5.1 In-Reply-To: References: Message-ID: On Mon, May 17, 2010 at 8:37 AM, Peter wrote: > Hi all (especially Eric), > > I've just run the test suite with Jython 2.5.1 and this found some new > problems. Most of these are XML related from Bio.Phylo > > ... > > There are also a couple of SeqIO related problems with large files: > > ERROR: test_SeqIO > OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space > ERROR: Write and read back Human_contigs.embl > OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space > > Example Tests/EMBL/Human_contigs.embl is causing the problem. > This is a sequence of length 958952, but the file doesn't actually > hold the sequence so we use an UnknownSeq object. The two > out of heap space errors are both from trying to create a string of > 958952 "N" characters. I'll have a look at this - we can probably > avoid it in the test. Fixed on the trunk. Once we have moved to using string equality for Seq objects, comparing two UnknownSeq objects can be handled much more cleanly (without the memory overhead of the naive approach of building the strings).
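As a rough illustration of the point about memory overhead, here is a minimal sketch of comparing two "unknown" sequences by length and placeholder letter rather than by building the full strings. This is not Biopython code: UnknownRun and same_content are hypothetical names standing in for an UnknownSeq-like object, and the comparison rule (equal as if the strings had been built) is an assumption about what string equality would mean here.

```python
class UnknownRun(object):
    """Hypothetical stand-in for an UnknownSeq-like object (not Biopython code)."""

    def __init__(self, length, character="N"):
        if length < 0:
            raise ValueError("length must be non-negative")
        if len(character) != 1:
            raise ValueError("character must be a single letter")
        self.length = length
        self.character = character

    def __len__(self):
        return self.length

    def __str__(self):
        # The naive route: this allocates the whole string in memory.
        return self.character * self.length

    def same_content(self, other):
        """Compare as if by string equality, without building either string."""
        if self.length != other.length:
            return False
        # Two empty runs are equal whatever their placeholder letter is.
        return self.length == 0 or self.character == other.character


# Comparing two 958952-letter runs needs no large allocation at all.
a = UnknownRun(958952)
b = UnknownRun(958952)
print(a.same_content(b))
```

The str() route is kept only to show what the naive comparison would have to allocate; the test suite could use something like same_content instead.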
Peter From bugzilla-daemon at portal.open-bio.org Mon May 17 23:44:07 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 17 May 2010 19:44:07 -0400 Subject: [Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records In-Reply-To: Message-ID: <201005172344.o4HNi7Ma008083@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #14 from laserson at mit.edu 2010-05-17 19:44 EST ------- (In reply to comment #12) > I now agree with you it makes sense to treat this as a new format in SeqIO > (i.e. "imgt" rather than "embl"). The actual new code should be minimal too. Great, so how do you want to implement this? I believe the patch I posted does define an 'imgt' format with all the necessary stuff other than writing. But if you'd like to make it more concise, let me know what to do. (The patch also doesn't incorporate the latest changes you made to the EMBL parser. Speaking of which, I was finally able to use the SeqIO.index function successfully using the parser. However, when there are feature keys flush against the location qualifiers, it still raises an exception.) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From updates at feedmyinbox.com Tue May 18 06:11:31 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Tue, 18 May 2010 02:11:31 -0400 Subject: [Biopython-dev] 5/18 BioStar - Biopython Questions Message-ID: <9255f60053f6ccfb752076b4c86c2c62@74.63.51.88> ================================================== 1. When should we develop biopython that support python 3.X'?? ================================================== May 17, 2010 at 9:17 PM As python 3.X becomming more and more popular,Can we developers take developing biopython that support python 3.X into consideration? 
I am a newer to biopython and find that biopython doesn't support python 3.X. It's really frustrated. Thank you ! http://biostar.stackexchange.com/questions/1083/when-should-we-develop-biopython-that-support-python-3-x -------------------------------------------------- =========================================================== Source: http://biostar.stackexchange.com/questions/tagged/biopython This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com. 230 Franklin Road Suite 814 Franklin, TN 37064 From eric.talevich at gmail.com Tue May 18 07:29:53 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 18 May 2010 00:29:53 -0700 Subject: [Biopython-dev] Test failures on Jython 2.5.1 In-Reply-To: References: Message-ID: On Mon, May 17, 2010 at 12:37 AM, Peter wrote: > Hi all (especially Eric), > > I've just run the test suite with Jython 2.5.1 and this found some new > problems. Most of these are XML related from Bio.Phylo > > ERROR: Round-trip parsing and serialization of apaf.xml. > ExpatError: The element type "phy:clade" must be terminated by the > matching end-tag "". > > ERROR: Round-trip parsing and serialization of bcl_2.xml. > ExpatError: The element type "phy:branch_length" must be terminated by > the matching end-tag "". > > ERROR: Round-trip parsing and serialization of o_tol_332_d_dollo.xml. > ExpatError: XML document structures must start and end within the same > entity. > > ERROR: Round-trip parsing and serialization of made_up.xml. > ExpatError: Premature end of file. > > ERROR: Round-trip parsing and serialization of phyloxml_examples.xml. 
> ExpatError: XML document structures must start and end within the same > entity. > > It would probably be instructive to look at the serialisation output in > an XML validator - if there is a problem it may be the Jython parser > is stricter than the C Python XML parser. > > (If you're gonna poke around in the bushes, be ready to stir up a few snakes...) Doing a round-trip parsing, rewriting and re-parsing of the test files manually works in Jython, and the XML output looks the same as it does from CPython. I don't immediately see why the test is failing, although I faintly recall reading that Jython's xml.etree implementation is/was a little short of fully baked -- maybe its parser is stopping early for some reason. I'm sorely tempted to just update the documentation to say Jython support is beta, since I hadn't tried it myself until you pointed this out. But now that we know about this bug, I suppose it warrants another day or so of fussing around with Jython internals. I'll report back after I've done that. -Eric From bugzilla-daemon at portal.open-bio.org Tue May 18 11:48:38 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 07:48:38 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-line files from IMGT In-Reply-To: Message-ID: <201005181148.o4IBmcof030759@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|More robust feature parser |Support for EMBL-line files |for GenBank/EMBL records |from IMGT ------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-18 07:48 EST ------- Retitling bug as "Support for EMBL-line files from IMGT in Bio.SeqIO".
Here is a start at parsing IMGT files based on subclassing the INSDC code with slightly more flexible feature handling: http://github.com/peterjc/biopython/tree/seqio-imgt I've been testing using http://imgt.cines.fr/download/LIGM-DB/imgt.dat.Z There are some interesting cases like AB114296 with: FT TRANSMEMBRANE-REGION2163..2240 We can if necessary work around some of the bad locations strings (see the above branch). Note that there are still other problems in the IMGT data like mismatched lengths. Uri - Could you explain what your code was trying to do with the record header parsing? An example or two would be great. Thanks! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue May 18 15:45:10 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 11:45:10 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-line files from IMGT In-Reply-To: Message-ID: <201005181545.o4IFjAhe007264@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #16 from laserson at mit.edu 2010-05-18 11:45 EST ------- (In reply to comment #15) > Uri - Could you explain what your code was trying to do with the record header > parsing? An example or two would be great. Thanks! So the approach I used was to keep the feature parser the exact same as it was in the EMBL parser. In the parse_header function, I would determine for each record what the indentation was, and then changed FEATURE_QUALIFIER_INDENT and FEATURE_QUALIFIER_SPACER for each record. This way, the standard EMBL parser would work fine, and there would never be any problems if the feature key was adjacent to the location qualifier. 
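To make the per-record indentation idea concrete, here is a minimal sketch, under stated assumptions, of measuring the qualifier column from an FH line and then splitting an FT line at that column. The helper names qualifier_layout_from_fh and split_feature_line are hypothetical and not part of Biopython; the column widths in the example (21 and 25) are assumed layouts chosen so that the AB114296 key ends flush against its location.

```python
def qualifier_layout_from_fh(fh_line, marker="Location/Qualifiers"):
    """Return (indent, spacer) measured from a feature-header (FH) line.

    Hypothetical helper, not part of Biopython: `indent` is the column where
    the location/qualifier text starts, and `spacer` is the "FT" prefix padded
    with spaces out to that column.
    """
    if not fh_line.startswith("FH"):
        raise ValueError("expected an FH line, got %r" % fh_line)
    indent = fh_line.find(marker)
    if indent < 0:
        raise ValueError("no %r column in %r" % (marker, fh_line))
    spacer = "FT" + " " * (indent - 2)
    return indent, spacer


def split_feature_line(ft_line, indent):
    """Split an FT line into (key, location) at a fixed column."""
    return ft_line[2:indent].strip(), ft_line[indent:].strip()


# Assumed layouts: a narrow header (column 21) and a wider one (column 25).
narrow_fh = "FH   Key" + " " * 13 + "Location/Qualifiers"
wide_fh = "FH   Key" + " " * 17 + "Location/Qualifiers (from EMBL)"

indent, spacer = qualifier_layout_from_fh(wide_fh)
# With the measured column, a key flush against its location still splits.
print(split_feature_line("FT   TRANSMEMBRANE-REGION2163..2240", indent))
```

Splitting at a measured column rather than on whitespace is what lets the feature key sit directly against the location without confusing the parser.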
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue May 18 15:53:03 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 11:53:03 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-line files from IMGT In-Reply-To: Message-ID: <201005181553.o4IFr3Rr007470@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #17 from laserson at mit.edu 2010-05-18 11:53 EST ------- Also, here is a script that will fix the location errors with the '>' symbols. Run as: python fix_ligm_locations.py imgt.dat imgt.fixed.dat http://gist.github.com/405146 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue May 18 16:10:29 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 12:10:29 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-like files from IMGT In-Reply-To: Message-ID: <201005181610.o4IGATZb008256@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Support for EMBL-line files |Support for EMBL-like files |from IMGT |from IMGT ------- Comment #18 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-18 12:10 EST ------- (In reply to comment #16) > (In reply to comment #15) > > Uri - Could you explain what your code was trying to do with the record > > header parsing? An example or two would be great. Thanks! 
> > So the approach I used was to keep the feature parser the exact same as it was > in the EMBL parser. In the parse_header function, I would determine for each > record what the indentation was, and then changed FEATURE_QUALIFIER_INDENT and > FEATURE_QUALIFIER_SPACER for each record. This way, the standard EMBL parser > would work fine, and there would never be any problems if the feature key was > adjacent to the location qualifier. > I see now. If the IMGT have consistent FH and FT lines we can trust, that would be quite elegant... on the other hand, to fix the nasty locations we are forced to subclass parse_features anyway. (In reply to comment #17) > Also, here is a script that will fix the location errors with the '>' > symbols. > > Run as: > > python fix_ligm_locations.py imgt.dat imgt.fixed.dat > > http://gist.github.com/405146 > I've used your regular expression solution in my branch now, http://github.com/biopython/biopython Remind me to add your name as a contributor once this gets merged to the trunk. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue May 18 17:36:44 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 13:36:44 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-like files from IMGT In-Reply-To: Message-ID: <201005181736.o4IHahUL011171@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #19 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-18 13:36 EST ------- (In reply to comment #18) > > I've used your regular expression solution in my branch now, > http://github.com/biopython/biopython > Sorry - I pasted the wrong URL, I mean here: http://github.com/peterjc/biopython/tree/seqio-imgt I've found an even worse example of partial location example from http://imgt.cines.fr/download/LIGM-DB/imgt.dat.Z ID M97158 IMGT/LIGM annotation : keyword level; RNA; ROD; 1093 BP. XX ... XX FH Key Location/Qualifiers (from EMBL) FH FT source 1. FT /organism="Mus musculus" FT mRNA join(523. FT intron 1. FT exon 523. FT intron 541. FT exon 638. FT intron 745. XX ... You can see the original at EMBL, http://www.ebi.ac.uk/cgi-bin/emblfetch?style=html&id=M97158&Submit=Go Or in GenBank format at the NCBI, http://www.ncbi.nlm.nih.gov/nuccore/200865 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue May 18 18:37:17 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 14:37:17 -0400 Subject: [Biopython-dev] [Bug 3074] Please support additional fields in the SeqIO embl parser In-Reply-To: Message-ID: <201005181837.o4IIbHO6013307@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3074 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-18 14:37 EST ------- The database and primary accessions from DR lines are now recorded in the SeqRecord's dbxrefs list: http://github.com/biopython/biopython/commit/d96ab570b196b1b92f65aa945ae6816a60ddb54e The best way of dealing with secondary accessions in a backwards compatible way isn't clear to me - probably as another colon separated entry. See: http://lists.open-bio.org/pipermail/biopython/2010-May/006495.html Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue May 18 18:38:31 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 14:38:31 -0400 Subject: [Biopython-dev] [Bug 3043] test_NCBI_BLAST_tools fails In-Reply-To: Message-ID: <201005181838.o4IIcVlC013353@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3043 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-18 14:38 EST ------- Marking this as fixed.
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue May 18 21:46:24 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 17:46:24 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-like files from IMGT In-Reply-To: Message-ID: <201005182146.o4ILkOWD018452@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #20 from laserson at mit.edu 2010-05-18 17:46 EST ------- (In reply to comment #18) > I see now. If the IGMT have consistent FH and FT lines we can trust, that would > be quite elegant... on the other hand to fix the nasty locations we are forced > to subclass parse_features anyway. My impression so far is that we can trust their feature indentations. At least all their FH lines are one of two indentations, and we can measure the indentation on all the FT lines. Once we get a parser that generally works, I'm going to make a list of all the accessions that have actual errors and submit to IMGT. In the meanwhile, I'll personally settle for catching those exceptions and skipping those records. > Remind me to add your name as a contributor once this gets merged to the trunk. That's very kind. I'm glad I can contribute to a worthy and incredibly useful project. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
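The "catch those exceptions and skip those records" approach mentioned above could look roughly like the sketch below. It is only an illustration: find_bad_records is a hypothetical helper, it works on any dict-like index, and it assumes that a record which fails to parse raises ValueError when it is looked up.

```python
def find_bad_records(index, errors=(ValueError,)):
    """Return the keys of records that fail to load from a dict-like index.

    Hypothetical helper: `index` only needs to support iteration over its
    keys and `index[key]` lookup. Records are loaded one at a time and
    discarded, so memory use stays small even for a large file.
    """
    bad_keys = []
    for key in index:
        try:
            index[key]
        except errors:
            bad_keys.append(key)
    return bad_keys


class FakeIndex(dict):
    """Tiny stand-in for a real index, used here only for demonstration."""

    def __getitem__(self, key):
        value = dict.__getitem__(self, key)
        if value is None:
            raise ValueError("could not parse record %s" % key)
        return value


demo = FakeIndex([("AB114296", "ok"), ("M97158", None)])
print(find_bad_records(demo))
```

With Biopython, the index argument might be the object returned by SeqIO.index() on imgt.dat, assuming the proposed "imgt" format name is available in the installed version. The resulting list of keys is the kind of thing that could be passed on to the IMGT curators.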
From bugzilla-daemon at portal.open-bio.org Tue May 18 21:53:35 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 17:53:35 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-like files from IMGT In-Reply-To: Message-ID: <201005182153.o4ILrZKm018617@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #21 from laserson at mit.edu 2010-05-18 17:53 EST ------- (In reply to comment #19) > Sorry - I pasted the wrong URL, I mean here: > http://github.com/peterjc/biopython/tree/seqio-imgt I'm still not sure where you integrated the regular expression. > I've found an even worse example of partial location example from > http://imgt.cines.fr/download/LIGM-DB/imgt.dat.Z > > ID M97158 IMGT/LIGM annotation : keyword level; RNA; ROD; 1093 BP. > XX > > ... > > You can see the original at EMBL, > http://www.ebi.ac.uk/cgi-bin/emblfetch?style=html&id=M97158&Submit=Go > > Or in GenBank format at the NCBI, > http://www.ncbi.nlm.nih.gov/nuccore/200865 That is awful! And there is no excuse for it either, as they should've just taken the coords from EMBL. I feel as though we should leave these problems as errors, and have IMGT fix them. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue May 18 22:26:25 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 18:26:25 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-like files from IMGT In-Reply-To: Message-ID: <201005182226.o4IMQPtB019429@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #22 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-18 18:26 EST ------- (In reply to comment #21) > (In reply to comment #19) > > > Sorry - I pasted the wrong URL, I mean here: > > http://github.com/peterjc/biopython/tree/seqio-imgt > > I'm still not sure where you integrated the regular expression. File Bio/GenBank/Scanner, this commit: http://github.com/peterjc/biopython/commit/a41db092a40542944158278f2cc26517cd464b60 > > I've found an even worse example of partial location example from > > http://imgt.cines.fr/download/LIGM-DB/imgt.dat.Z > > > > ID M97158 IMGT/LIGM annotation : keyword level; RNA; ROD; 1093 BP. > > XX > > > > ... > > > > You can see the original at EMBL, > > http://www.ebi.ac.uk/cgi-bin/emblfetch?style=html&id=M97158&Submit=Go > > > > Or in GenBank format at the NCBI, > > http://www.ncbi.nlm.nih.gov/nuccore/200865 > > That is awful! And there is no excuse for it either, as they should've just > taken the coords from EMBL. I feel as though we should leave these problems > as errors, and have IMGT fix them. In this case (and the other locations with missing text) there is no good work around so I would agree - get the IMGT to fix them. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue May 18 23:06:21 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 19:06:21 -0400 Subject: [Biopython-dev] [Bug 3016] Change WriterTests in test_PhyloXML.py to use StringIO or temp files In-Reply-To: Message-ID: <201005182306.o4IN6LCj020428@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3016 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from eric.talevich at gmail.com 2010-05-18 19:06 EST ------- Fixed in GitHub: http://github.com/biopython/biopython/commit/ad1a618def838d98432e9623367cffb595eadecd The patch uses one named temp file for everything, closes file handles diligently, and deletes the temp file at the end of the script. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Tue May 18 23:26:33 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 18 May 2010 16:26:33 -0700 Subject: [Biopython-dev] Test failures on Jython 2.5.1 In-Reply-To: References: Message-ID: On Mon, May 17, 2010 at 12:37 AM, Peter wrote: > Hi all (especially Eric), > > I've just run the test suite with Jython 2.5.1 and this found some new > problems. Most of these are XML related from Bio.Phylo > > ERROR: Round-trip parsing and serialization of apaf.xml. > ExpatError: The element type "phy:clade" must be terminated by the > matching end-tag "". > > ... 
> Fixed in GitHub: http://github.com/biopython/biopython/commit/ad1a618def838d98432e9623367cffb595eadecd I couldn't replicate this crash doing anything remotely normal in the Jython interpreter, but the rewriting scheme in the PhyloXML unit test suite crashed with various confusing tracebacks. The rewritten files were valid, though, and could be read by Jython outside the test suite. I know Jython doesn't clean up file handles as diligently as CPython does, so my best guess is that some file handles remained open or were reused/resurrected while parsing the rewritten files -- i.e. during the second parse, Jython's XML parser either started someplace other than the start of the file, or terminated early/late, expecting the rewritten file to have the same size as the original (which it doesn't because of collapsed whitespace). My patch reworks the file rewriting scheme and manages file handles obsessively; the PhyloXML parser itself stays the same. -Eric From bugzilla-daemon at portal.open-bio.org Wed May 19 00:30:53 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 May 2010 20:30:53 -0400 Subject: [Biopython-dev] [Bug 3069] Support for EMBL-like files from IMGT In-Reply-To: Message-ID: <201005190030.o4J0UrSW022165@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3069 ------- Comment #23 from laserson at mit.edu 2010-05-18 20:30 EST ------- (In reply to comment #22) So I tried parsing the whole imgt.dat file, and we do pretty well. The only two problems I see are the broken location qualifiers, and a few records where the lengths annotated in their ID strings don't match the actual lengths of the sequences. > In this case (and the other locations with missing text) there is no good work > around so I would agree - get the IMGT to fix them. So let's go ahead and change the warnings back to errors.
In the meanwhile, we can parse properly using the SeqIO.index function and just catch and ignore all the bad records. And I will compile a list of bad records and give them to the curators at IMGT. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From updates at feedmyinbox.com Wed May 19 06:13:07 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Wed, 19 May 2010 02:13:07 -0400 Subject: [Biopython-dev] 5/19 Stack Overflow - Biopython questions Message-ID: <2d537c12e961d821ff59bb55f3a10502@74.63.51.88> ================================================== 1. Can anyone tell me why these lines are not working? ================================================== May 18, 2010 at 7:21 AM I am trying to generate tree with fasta file input and Alignment with MuscleCommandline import sys,os, subprocess from Bio import AlignIO from Bio.Align.Applications import MuscleCommandline cline = MuscleCommandline(input="c:\Python26\opuntia.fasta") child= subprocess.Popen(str(cline), stdout = subprocess.PIPE, stderr=subprocess.PIPE, shell=(sys.platform!="win32")) align=AlignIO.read(child.stdout,"fasta") outfile=open('c:\Python26\opuntia.phy','w') AlignIO.write([align],outfile,'phylip') outfile.close() I always encounter with these problems Traceback (most recent call last): File "", line 244, in run_nodebug File "C:\Python26\muscleIO.py", line 11, in align=AlignIO.read(child.stdout,"fasta") File "C:\Python26\Lib\site-packages\Bio\AlignIO\__init__.py", line 423, in read raise ValueError("No records found in handle") ValueError: No records found in handle http://stackoverflow.com/questions/2856697/can-anyone-tell-me-why-these-lines-are-not-working -------------------------------------------------- ================================================== 2. 
Subprocess fails to catch the standard output ================================================== May 18, 2010 at 7:21 AM I am trying to generate tree with fasta file input and Alignment with MuscleCommandline import sys,os, subprocess from Bio import AlignIO from Bio.Align.Applications import MuscleCommandline cline = MuscleCommandline(input="c:\Python26\opuntia.fasta") child= subprocess.Popen(str(cline), stdout = subprocess.PIPE, stderr=subprocess.PIPE, shell=(sys.platform!="win32")) align=AlignIO.read(child.stdout,"fasta") outfile=open('c:\Python26\opuntia.phy','w') AlignIO.write([align],outfile,'phylip') outfile.close() I always encounter with these problems Traceback (most recent call last): File "", line 244, in run_nodebug File "C:\Python26\muscleIO.py", line 11, in align=AlignIO.read(child.stdout,"fasta") File "C:\Python26\Lib\site-packages\Bio\AlignIO\__init__.py", line 423, in read raise ValueError("No records found in handle") ValueError: No records found in handle http://stackoverflow.com/questions/2856697/subprocess-fails-to-catch-the-standard-output -------------------------------------------------- =========================================================== Source: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311789/3e0b2a02a42e76a71e4f14abbbfad2f294f545ce/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com. 
230 Franklin Road Suite 814 Franklin, TN 37064 From biopython at maubp.freeserve.co.uk Wed May 19 07:52:30 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 19 May 2010 08:52:30 +0100 Subject: [Biopython-dev] Test failures on Jython 2.5.1 In-Reply-To: References: Message-ID: On Wed, May 19, 2010 at 12:26 AM, Eric Talevich wrote: > On Mon, May 17, 2010 at 12:37 AM, Peter wrote: > >> Hi all (especially Eric), >> >> I've just run the test suite with Jython 2.5.1 and this found some new >> problems. Most of these are XML related from Bio.Phylo >> >> ERROR: Round-trip parsing and serialization of apaf.xml. >> ExpatError: The element type "phy:clade" must be terminated by the >> matching end-tag "". >> >> ... >> > > Fixed in GitHub: > http://github.com/biopython/biopython/commit/ad1a618def838d98432e9623367cffb595eadecd > > I couldn't replicate this crash doing anything remotely normal in the Jython > interpreter, but the rewriting scheme in the PhyloXML unit test suite > crashed with various confusing tracebacks. The rewritten files were valid, > though, and could be read by Jython outside the test suite. > > I know Jython doesn't clean up file handles as diligently as CPython does, > so my best guess is that some file handles remained open or were > reused/resurrected while parsing the rewritten files -- i.e. during the > second parse, Jython's XML parser either started somplace other than the > start of the file, or terminated early/late, expecting the rewritten file to > have the same size as the original (which it doesn't because of collapsed > whitespace). My patch reworks the file rewriting scheme and manages file > handles obsessively; the PhyloXML parser itself stays the same. 
> > -Eric Good work :) Peter From bugzilla-daemon at portal.open-bio.org Wed May 19 11:35:44 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 May 2010 07:35:44 -0400 Subject: [Biopython-dev] [Bug 3016] Change WriterTests in test_PhyloXML.py to use StringIO or temp files In-Reply-To: Message-ID: <201005191135.o4JBZilB013157@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3016 ------- Comment #2 from chapmanb at 50mail.com 2010-05-19 07:35 EST ------- Eric; Just a quick tip on mkstemp. When you do: DUMMY = tempfile.mkstemp()[1] you leave an open handle as the first argument of this tuple. It won't cause you any issues here, but is a problem if you have a long running server process. You will leak open file handles and eventually get an error about too many open files. See: http://www.logilab.org/blogentry/17873 http://vocamus.net/dave/?p=997 No problems here, but rather a heads up on a tricky bit of python I've run into too many times to count, Brad -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed May 19 15:11:31 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 May 2010 11:11:31 -0400 Subject: [Biopython-dev] [Bug 2964] placing x-axis of graph track at the bottom or top of the track in GenomeDiagram In-Reply-To: Message-ID: <201005191511.o4JFBVG2019800@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2964 ------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-19 11:11 EST ------- I've been looking at this myself recently (I'm drawing transcriptome read coverage data which means I have no negative values), and tried a few things. 
(In reply to comment #7 and #8) > > By allowing the position of the axis to take any value within the data range, > this still allows 'top', 'middle' and 'bottom' to be defined as functions of > the data with, e.g. > > x_axis_pos = min(data) # bottom > x_axis_pos = max(data) # top > x_axis_pos = median(data) # middle > > and also allows for explicit placing of the axis at specified points on the > y-axis, or as other points that depend on the data (e.g. mean, quartiles, > etc.) There is a problem with this - the x-axis and other bits like the scale are drawn by the Track object. This can contain multiple datasets, which can all be using their own coordinate systems. In specifying the x-axis position we can't therefore talk about max(data), min(data) or median(data) for the track as a whole. What we can do is talk about bottom/middle/top (or even a float between 0 and 1 to be more precise). This is quite easy but doesn't address the plotting side of things... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed May 19 15:21:48 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 May 2010 11:21:48 -0400 Subject: [Biopython-dev] [Bug 2964] placing x-axis of graph track at the bottom or top of the track in GenomeDiagram In-Reply-To: Message-ID: <201005191521.o4JFLmZQ020098@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2964 ------- Comment #14 from lpritc at scri.sari.ac.uk 2010-05-19 11:21 EST ------- (In reply to comment #13) > [...] the x-axis and other bits like the scale are > drawn by the Track object. This can contain multiple datasets, which can all > be using their own coordinate systems. 
In specifying the x-axis position we > can't therefore talk about max(data), min(data) or median(data) for the track > as a whole. What we can do is talk about bottom/middle/top (or even a float > between 0 and 1 to be more precise). This is quite easy but doesn't address > the plotting side of things... Fair point - something to look at in GD2. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu May 20 02:21:06 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 May 2010 22:21:06 -0400 Subject: [Biopython-dev] [Bug 2489] KDTree NN search without specifying radius In-Reply-To: Message-ID: <201005200221.o4K2L6gq006217@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2489 ------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2010-05-19 22:21 EST ------- I have trouble understanding the submitted code. Could you provide a patch instead of an updated complete file? Also, don't combine multiple issues in a patch. Your patch should only be to do KDTree NN searches without specifying radius. As far as I can tell from the description, this is not a major change to the code, so if you provide a patch to the C++ version we make the corresponding changes to the current C code. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu May 20 07:54:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 May 2010 08:54:38 +0100 Subject: [Biopython-dev] Ready for Biopython 1.54 release? 
In-Reply-To: References: Message-ID: Hi all, I've got an urgent bit of work to finish this week (a poster using GenomeDiagram - hence the minor bug fix recently committed), but hope to be able to do the release tomorrow after it has been printed. If I run out of time, I'm away all next week at a conference. I don't mind doing the release when I get back, but if anyone else wanted to volunteer that would be great. Thanks, Peter From biopython at maubp.freeserve.co.uk Thu May 20 16:28:25 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 May 2010 17:28:25 +0100 Subject: [Biopython-dev] Git "freeze" on master branch during 1.54 release Message-ID: Hi all, I'm going to start doing the Biopython 1.54 release now, so please don't check anything onto the trunk until further notice. [Working on other branches should be fine of course ;)] Thanks, Peter From bugzilla-daemon at portal.open-bio.org Thu May 20 16:39:23 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 20 May 2010 12:39:23 -0400 Subject: [Biopython-dev] [Bug 3016] Change WriterTests in test_PhyloXML.py to use StringIO or temp files In-Reply-To: Message-ID: <201005201639.o4KGdNeN002328@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3016 ------- Comment #3 from eric.talevich at gmail.com 2010-05-20 12:39 EST ------- (In reply to comment #2) > Eric; > Just a quick tip on mkstemp. When you do: > > DUMMY = tempfile.mkstemp()[1] > > you leave an open handle as the first argument of this tuple. It won't cause > you any issues here, but is a problem if you have a long running server > process. You will leak open file handles and eventually get an error about too > many open files. See: > > http://www.logilab.org/blogentry/17873 > http://vocamus.net/dave/?p=997 > > No problems here, but rather a heads up on a tricky bit of python I've run into > too many times to count, > Brad > Thanks! 
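For reference, the handle leak Brad describes can be avoided while still using mkstemp by closing the descriptor it returns. The following is only a minimal sketch, not the code in test_PhyloXML.py; the helper name make_temp_path and the ".xml" suffix are assumptions made for illustration.

```python
import os
import tempfile


def make_temp_path(suffix=".xml"):
    """Return the path of a new, empty temp file with no descriptor left open.

    Hypothetical helper for illustration only.
    """
    # mkstemp returns (fd, path); fd is an OPEN OS-level file descriptor.
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # closing it here is what prevents the leak
    return path


path = make_temp_path()
try:
    # The file already exists (created securely by mkstemp); reopen it by name.
    with open(path, "w") as handle:
        handle.write("<phyloxml/>\n")
    with open(path) as handle:
        content = handle.read()
finally:
    os.remove(path)  # mkstemp never deletes the file for you
```

Unlike mktemp, this keeps the file creation atomic, so no other process can claim the name between choosing it and opening it.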
Instead of closing the stray file handle mkstemp generates, I used mktemp. As I understand it, the security issue mentioned in mktemp's docstring is if an attacker creates a symlink to an important, protected file using the same name mktemp chose for this test script. Then if this script is run as root, it would clobber that file even if the attacker didn't originally have permissions to modify that file. http://mail.python.org/pipermail/python-dev/2001-March/013507.html But the Biopython test suite isn't normally run as root, and in any case all of the test scripts reuse file names that aren't protected, which means everything has the same vulnerability. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu May 20 17:05:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 May 2010 18:05:53 +0100 Subject: [Biopython-dev] Git "freeze" on master branch during 1.54 release In-Reply-To: References: Message-ID: On Thu, May 20, 2010 at 5:28 PM, Peter wrote: > Hi all, > > I'm going to start doing the Biopython 1.54 release now, so please > don't check anything onto the trunk until further notice. > > [Working on other branches should be fine of course ;)] The archive and windows installers are done and are online. Please feel free to have a quick sanity test now - I'll be back in an hour or so to update the downloads page, send the release announcement etc. 
[Please consider the trunk still frozen for now] Thanks, Peter From biopython at maubp.freeserve.co.uk Thu May 20 19:07:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 May 2010 20:07:17 +0100 Subject: [Biopython-dev] Git "freeze" on master branch during 1.54 release In-Reply-To: References: Message-ID: On Thu, May 20, 2010 at 6:05 PM, Peter wrote: > On Thu, May 20, 2010 at 5:28 PM, Peter wrote: >> Hi all, >> >> I'm going to start doing the Biopython 1.54 release now, so please >> don't check anything onto the trunk until further notice. >> >> [Working on other branches should be fine of course ;)] > > The archive and windows installers are done and are online. > Please feel free to have a quick sanity test now - I'll be back > in an hour or so to update the downloads page, send the > release announcement etc. > > [Please consider the trunk still frozen for now] OK, news article from David updated and posted, downloads page updated. The email and API docs update are still to be done, and Brad - could you do the python package index update please? Ta, Peter From chapmanb at 50mail.com Thu May 20 19:33:13 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 20 May 2010 15:33:13 -0400 Subject: [Biopython-dev] Git "freeze" on master branch during 1.54 release In-Reply-To: References: Message-ID: <20100520193313.GF1054@sobchak.mgh.harvard.edu> Peter; > OK, news article from David updated and posted, downloads page updated. > The email and API docs update are still to be done, and Brad - could you do > the python package index update please? Done. Congrats and thanks for getting this done so quickly. 
Have fun at your meeting next week, Brad From biopython at maubp.freeserve.co.uk Thu May 20 21:59:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 May 2010 22:59:43 +0100 Subject: [Biopython-dev] Biopython 1.54 Message-ID: Dear Biopythoneers, Earlier today we released Biopython 1.54 (a little later than originally planned) which addresses a few bugs found in the beta release, has some changes to the new Bio.Phylo module, adds a whole chapter to the tutorial. Thank you to everyone who contributed code, reported bugs, etc. For more details please see this announcement (kindly drafted by David Winter): http://news.open-bio.org/news/2010/05/biopython-release-154/ Regards, Peter From updates at feedmyinbox.com Fri May 21 06:13:22 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Fri, 21 May 2010 02:13:22 -0400 Subject: [Biopython-dev] 5/21 BioStar - Biopython Questions Message-ID: ================================================== 1. PHYLIP (->prodist) Command line wrapper in biopython ================================================== May 20, 2010 at 5:57 AM Phylip has different applications for different phylogency purposes. Can anyone suggest me how to operate PHYLIP suppose(consense, dnaml,protdist) through commandline in biopython. Each applications has got its own different parameters, how can i handle them? http://biostar.stackexchange.com/questions/1123/phylip-prodist-command-line-wrapper-in-biopython -------------------------------------------------- =========================================================== Source: http://biostar.stackexchange.com/questions/tagged/biopython This email was sent to biopython-dev at lists.open-bio.org. Account Login: https://www.feedmyinbox.com/members/login/ Don't want to receive this feed any longer? 
Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/ ----------------------------------------------------------- This email was carefully delivered by FeedMyInbox.com. 230 Franklin Road Suite 814 Franklin, TN 37064 From biopython at maubp.freeserve.co.uk Fri May 21 11:44:58 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 21 May 2010 12:44:58 +0100 Subject: [Biopython-dev] Git "freeze" on master branch during 1.54 release In-Reply-To: <20100520193313.GF1054@sobchak.mgh.harvard.edu> References: <20100520193313.GF1054@sobchak.mgh.harvard.edu> Message-ID: On Thu, May 20, 2010 at 8:33 PM, Brad Chapman wrote: > > Peter; > >> OK, news article from David updated and posted, downloads page updated. >> The email and API docs update are still to be done, and Brad - could you do >> the python package index update please? > > Done. Congrats and thanks for getting this done so quickly. Have fun > at your meeting next week, > > Brad Cheers Brad - we've been running the tests and just making small tweaks recently, so the release went very smoothly (the beta helped in that sense). Having the release notice already done by David was also a big help, so thanks David :) I sent the email announcement out last night, and I've just now updated the API docs with epydoc. And I bumped the version number to a plus on the trunk. I think that's it now - all done :) Everyone with commit rights - consider the trunk re-open for small changes. Ideally new features should be implemented on a branch before merging. Regards, Peter From updates at feedmyinbox.com Sun May 23 06:10:29 2010 From: updates at feedmyinbox.com (Feed My Inbox) Date: Sun, 23 May 2010 02:10:29 -0400 Subject: [Biopython-dev] 5/23 Stack Overflow - Biopython questions Message-ID: <802d82a45525bfbe5a8834d227bd623c@74.63.51.88> ================================================== 1. 
50 sequences in one line
==================================================

May 22, 2010 at 9:27 AM

I have a multiple sequence alignment (Clustal) file, and I want to read it and arrange the sequences so that the output is clearer and better ordered. I am doing this from Biopython using the AlignIO module. My code goes like this:

    from Bio import AlignIO

    alignment = AlignIO.read("opuntia.aln", "clustal")
    print("Number of rows: %i" % len(alignment))
    for record in alignment:
        print("%s - %s" % (record.id, record.seq))

My output (http://i48.tinypic.com/ae48ew.jpg) looks messy and needs a lot of scrolling. What I want to do is print only 50 characters of each sequence per line and continue until the end of the alignment file. I wish to have output like this http://i45.tinypic.com/4vh5rc.jpg from http://www.ebi.ac.uk/Tools/clustalw2/.

Any suggestions, algorithms or sample code are appreciated.

Thanks in advance

Br,

http://stackoverflow.com/questions/2888257/50-sequences-in-one-line

--------------------------------------------------

===========================================================
Source: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active
This email was sent to biopython-dev at lists.open-bio.org.
Account Login: https://www.feedmyinbox.com/members/login/
Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311789/3e0b2a02a42e76a71e4f14abbbfad2f294f545ce/
-----------------------------------------------------------
This email was carefully delivered by FeedMyInbox.com.
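For the Stack Overflow question in this digest, one possible approach is to slice every aligned row into fixed-width column blocks, Clustal style. This is a rough sketch under stated assumptions: it works on plain (identifier, sequence string) pairs so that it runs without an alignment file, and the function name format_alignment_blocks is made up for illustration. With Biopython, the pairs could presumably be built from record.id and str(record.seq) as in the question's loop.

```python
def format_alignment_blocks(rows, width=50):
    """Format aligned rows in blocks of `width` columns.

    rows: list of (identifier, aligned_sequence) pairs, all sequences
    having the same length. Hypothetical helper, not part of Biopython.
    """
    if not rows:
        return ""
    name_width = max(len(name) for name, _ in rows)
    length = len(rows[0][1])
    blocks = []
    for start in range(0, length, width):
        # One line per sequence, showing columns start .. start + width - 1.
        lines = ["%s  %s" % (name.ljust(name_width), seq[start:start + width])
                 for name, seq in rows]
        blocks.append("\n".join(lines))
    # A blank line separates consecutive blocks.
    return "\n\n".join(blocks)


rows = [("seq1", "ACGT" * 30), ("seq_two", "AC-T" * 30)]
text = format_alignment_blocks(rows, width=50)
print(text)
```

Padding the identifiers to a common width keeps the sequence columns vertically aligned across rows.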
230 Franklin Road Suite 814 Franklin, TN 37064 From jblanca at btc.upv.es Tue May 25 05:53:14 2010 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 25 May 2010 07:53:14 +0200 Subject: [Biopython-dev] [Biopython] SciPy paper: documenting statistical data structure design issues In-Reply-To: References: <809294.48600.qm@web62407.mail.re1.yahoo.com> Message-ID: <201005250753.14186.jblanca@btc.upv.es> Hi: On Tuesday 25 May 2010 07:03:19 Vincent Davis wrote: > Discussions on the pystatsmodels mailing list are I am sure relevant but it > might be more beneficial to discuss first on the biopython list as > sometimes to discussions get long and tend to be about economic type data. > The google group/mailing list is > http://groups.google.ca/group/pystatsmodels > I think a few good examples of a "typical" biopy data set and or some of > the typical difficulties would be good to have on the wiki. This might help > start collaboration between statsmodels and biopython on this subject. I > think there are few people that cross over between economics and > bioinformatics. My main concern with the current tools is the memory issue. For instance when I try to create a distribution of sequence lengths or qualities using NGS data I end up with millions of numbers. That is too much for any reasonable computer. I've solved the problem by using disk caches that work as iterators. I'm sure that this is not the most performant solucion. It's just a hack and I would like to use better tools for sure. If you want to take a look at my current solution go to: http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/itertools_.py http://github.com/JoseBlanca/franklin/blob/master/franklin/statistics.py Best regards, Jose Blanca > Also If you know of other groups that would be interested please share this > link/information. > > > Thanks, > > --Michiel. 
> > > > --- On Mon, 5/24/10, Vincent Davis wrote: > > > From: Vincent Davis > > > Subject: [Biopython] SciPy paper: documenting statistical data > > > structure > > > > design issues > > > > > To: "biopython" > > > Date: Monday, May 24, 2010, 3:45 PM > > > "see the message below, cross posted > > > from pystatsmodels" > > > > > > We have ben having some discussion on the pystatsmodels > > > maling list about > > > data objects, numpy arrays... I think it would be valuable > > > for some > > > biopython users to contribute some comments, examples or > > > ideas to the scipy > > > wiki that has been setup for this. I think at the heart of > > > this is that > > > although almost anything can be done with a numpy array we > > > run into many > > > problems that are difficult to solve with the current tools > > > for numpy > > > arrays. Because of this I think some nice examples of the > > > data design > > > problems that you have faced in the biopython and how they > > > have been solved > > > would be valuable. > > > > > > Thanks > > > Vincent > > > > > > On Sat, May 22, 2010 at 7:22 PM, Wes McKinney > > > > > > wrote: > > > > For my SciPy talk and paper in a little over a month, > > > > > > I was hoping to > > > > > > > render a somewhat coherent discussion of the design > > > > > > needs of > > > > > > > statistical data structures, based on my experience > > > > > > developing pandas > > > > > > > for quant finance research. I think these broadly fall > > > > > > into a few > > > > > > > categories: implementation ease, usability (for the > > > > > > non-developer > > > > > > > IPython-based console user), performance, and > > > > > > flexibility. Hopefully > > > > > > > this will be useful information that will help guide > > > > > > future > > > > > > > development efforts. What do you folks think? 
> > > > > > > > As part of this, I was thinking maybe we should start > > > > > > a wiki page (or > > > > > > > pages) somewhere to start listing out the various > > > > > > design issues (big > > > > > > > and small) where people can write their opinions and > > > > > > we can have a > > > > > > > structured discussion (e-mail is a bit hard for this > > > > > > sort of thing). > > > > > > > I'd also like to spend some time reading through other > > > > > > people's code > > > > > > > (e.g. all of the larry code) and writing down what I > > > > > > think about their > > > > > > > design choices in a constructive way. > > > > > > > > Part of what prompted my idea for a wiki was reading > > > > > > some of the larry > > > > > > > code and wanting to share my thoughts on various parts > > > > > > of it. Of > > > > > > > course I'm also prepared for other people to attack > > > > > > (and for me to > > > > > > > have to defend) my own code. For most of these things > > > > > > there isn't a > > > > > > > "right" and "wrong" and I am only interested in having > > > > > > constructive > > > > > > > discussions and hearing people's perspectives. Here's > > > > > > an example: in > > > > > > > pandas when adding two different-labeled 2d arrays, > > > > > > the result has the > > > > > > > *union* of all the labels. In la you get the > > > > > > intersection. Certainly > > > > > > > are pros and cons for either approach (in my case I > > > > > > don't want to lose > > > > > > > information, even if it's nulled out). > > > > > > > > We should also have a place where we document > > > > > > differences in > > > > > > > performance for various operations. I spent a lot of > > > > > > time even before > > > > > > > pandas was open-source obsessing over speed-- I'd like > > > > > > to think I > > > > > > > learned a few things but I was operating in a bubble > > > > > > so I might have > > > > > > > missed really obvious speedups. 
I also learned lots of > > > > > > odd things > > > > > > > about NumPy (did you know fancy indexing is a LOT > > > > > > slower than > > > > > > > ndarray.take?). We should probably establish some > > > > > > apples-to-apples > > > > > > > performance benchmarks to help people decide what to > > > > > > use for their > > > > > > > applications if speed matters. > > > > > > > > Best, > > > > Wes > > > > > > *Vincent Davis > > > 720-301-3003 * > > > vincent at vincentdavis.net > > > my blog | > > > LinkedIn > > > _______________________________________________ > > > Biopython mailing list - Biopython at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython > > *Vincent Davis > 720-301-3003 * > vincent at vincentdavis.net > my blog | > LinkedIn > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From mailinglist.honeypot at gmail.com Tue May 25 13:52:26 2010 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Tue, 25 May 2010 09:52:26 -0400 Subject: [Biopython-dev] [Biopython] SciPy paper: documenting statistical data structure design issues In-Reply-To: <201005250753.14186.jblanca@btc.upv.es> References: <809294.48600.qm@web62407.mail.re1.yahoo.com> <201005250753.14186.jblanca@btc.upv.es> Message-ID: Hi, > My main concern with the current tools is the memory issue. For instance when > I try to create a distribution of sequence lengths or qualities using NGS > data I end up with millions of numbers. That is too much for any reasonable > computer. Several million numbers aren't all that much, though, right? 
To simulate your example, I created a 100,000,000-element vector (which, depending on what type of NGS data you have, should be considered a large number of reads) representing faux read lengths, and it's only taking up ~382 MB [1]; gathering basic statistics on it (variance, mean, histograms, etc.) isn't painful at all.

Once you start adding more metadata to the 100,000,000 elements, I can see where you start running into problems, though.

> I've solved the problem by using disk caches that work as
> iterators. I'm sure that this is not the most performant solution. It's just
> a hack and I would like to use better tools for sure.

Have you tried looking at something like PyTables? Might be something to consider ...

Just a thought,
-steve

[1] I'm using R, which only uses 32-bit integers, but the language itself isn't really the point since we're all going to be running into a wall with respect to NGS-sized datasets.

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

From vincent at vincentdavis.net Tue May 25 19:19:35 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Tue, 25 May 2010 13:19:35 -0600
Subject: [Biopython-dev] [Biopython] SciPy paper: documenting statistical data structure design issues
In-Reply-To: <201005250753.14186.jblanca@btc.upv.es>
References: <809294.48600.qm@web62407.mail.re1.yahoo.com> <201005250753.14186.jblanca@btc.upv.es>
Message-ID:

On Mon, May 24, 2010 at 11:53 PM, Jose Blanca wrote:
> Hi:
>
> My main concern with the current tools is the memory issue. For instance when
> I try to create a distribution of sequence lengths or qualities using NGS
> data I end up with millions of numbers. That is too much for any reasonable
> computer. I've solved the problem by using disk caches that work as
> iterators. I'm sure that this is not the most performant solution.
It's > just > a hack and I would like to use better tools for sure. > If you want to take a look at my current solution go to: > > > http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/itertools_.py > http://github.com/JoseBlanca/franklin/blob/master/franklin/statistics.py Please feel free to add some of the comments to the wiki. I also cross posted this to the StatsModels list as I thought it might be of interest to the list. Although I believe Steve Lianoglou comments are correct, data set size is a issue in bio and only getting bigger. > > Best regards, > > Jose Blanca > > > Also If you know of other groups that would be interested please share > this > > link/information. > > > > > Thanks, > > > --Michiel. > > > > > > --- On Mon, 5/24/10, Vincent Davis wrote: > > > > From: Vincent Davis > > > > Subject: [Biopython] SciPy paper: documenting statistical data > > > > structure > > > > > > design issues > > > > > > > To: "biopython" > > > > Date: Monday, May 24, 2010, 3:45 PM > > > > "see the message below, cross posted > > > > from pystatsmodels" > > > > > > > > We have ben having some discussion on the pystatsmodels > > > > maling list about > > > > data objects, numpy arrays... I think it would be valuable > > > > for some > > > > biopython users to contribute some comments, examples or > > > > ideas to the scipy > > > > wiki that has been setup for this. I think at the heart of > > > > this is that > > > > although almost anything can be done with a numpy array we > > > > run into many > > > > problems that are difficult to solve with the current tools > > > > for numpy > > > > arrays. Because of this I think some nice examples of the > > > > data design > > > > problems that you have faced in the biopython and how they > > > > have been solved > > > > would be valuable. 
> > > > > > > > Thanks > > > > Vincent > > > > > > > > On Sat, May 22, 2010 at 7:22 PM, Wes McKinney > > > > > > > > wrote: > > > > > For my SciPy talk and paper in a little over a month, > > > > > > > > I was hoping to > > > > > > > > > render a somewhat coherent discussion of the design > > > > > > > > needs of > > > > > > > > > statistical data structures, based on my experience > > > > > > > > developing pandas > > > > > > > > > for quant finance research. I think these broadly fall > > > > > > > > into a few > > > > > > > > > categories: implementation ease, usability (for the > > > > > > > > non-developer > > > > > > > > > IPython-based console user), performance, and > > > > > > > > flexibility. Hopefully > > > > > > > > > this will be useful information that will help guide > > > > > > > > future > > > > > > > > > development efforts. What do you folks think? > > > > > > > > > > As part of this, I was thinking maybe we should start > > > > > > > > a wiki page (or > > > > > > > > > pages) somewhere to start listing out the various > > > > > > > > design issues (big > > > > > > > > > and small) where people can write their opinions and > > > > > > > > we can have a > > > > > > > > > structured discussion (e-mail is a bit hard for this > > > > > > > > sort of thing). > > > > > > > > > I'd also like to spend some time reading through other > > > > > > > > people's code > > > > > > > > > (e.g. all of the larry code) and writing down what I > > > > > > > > think about their > > > > > > > > > design choices in a constructive way. > > > > > > > > > > Part of what prompted my idea for a wiki was reading > > > > > > > > some of the larry > > > > > > > > > code and wanting to share my thoughts on various parts > > > > > > > > of it. Of > > > > > > > > > course I'm also prepared for other people to attack > > > > > > > > (and for me to > > > > > > > > > have to defend) my own code. 
For most of these things > > > > > > > > there isn't a > > > > > > > > > "right" and "wrong" and I am only interested in having > > > > > > > > constructive > > > > > > > > > discussions and hearing people's perspectives. Here's > > > > > > > > an example: in > > > > > > > > > pandas when adding two different-labeled 2d arrays, > > > > > > > > the result has the > > > > > > > > > *union* of all the labels. In la you get the > > > > > > > > intersection. Certainly > > > > > > > > > are pros and cons for either approach (in my case I > > > > > > > > don't want to lose > > > > > > > > > information, even if it's nulled out). > > > > > > > > > > We should also have a place where we document > > > > > > > > differences in > > > > > > > > > performance for various operations. I spent a lot of > > > > > > > > time even before > > > > > > > > > pandas was open-source obsessing over speed-- I'd like > > > > > > > > to think I > > > > > > > > > learned a few things but I was operating in a bubble > > > > > > > > so I might have > > > > > > > > > missed really obvious speedups. I also learned lots of > > > > > > > > odd things > > > > > > > > > about NumPy (did you know fancy indexing is a LOT > > > > > > > > slower than > > > > > > > > > ndarray.take?). We should probably establish some > > > > > > > > apples-to-apples > > > > > > > > > performance benchmarks to help people decide what to > > > > > > > > use for their > > > > > > > > > applications if speed matters. 
> > > > > > > > > > Best, > > > > > Wes > > > > > > > > *Vincent Davis > > > > 720-301-3003 * > > > > vincent at vincentdavis.net > > > > my blog | > > > > LinkedIn > > > > _______________________________________________ > > > > Biopython mailing list - Biopython at lists.open-bio.org > > > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > *Vincent Davis > > 720-301-3003 * > > vincent at vincentdavis.net > > my blog | > > LinkedIn > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > -- > Jose M. Blanca Postigo > Instituto Universitario de Conservacion y > Mejora de la Agrodiversidad Valenciana (COMAV) > Universidad Politecnica de Valencia (UPV) > Edificio CPI (Ciudad Politecnica de la Innovacion), 8E > 46022 Valencia (SPAIN) > Tlf.:+34-96-3877000 (ext 88473) > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn From mjldehoon at yahoo.com Sat May 29 03:23:21 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 28 May 2010 20:23:21 -0700 (PDT) Subject: [Biopython-dev] Blast parsers and records Message-ID: <901919.44402.qm@web62402.mail.re1.yahoo.com> Hi everybody, With Biopython 1.54 out (thanks Peter!), and NCBI encouraging to use its new Blast+ suite of Blast programs, maybe this is a good time to tackle some older bugs related to Blast output parsing in Biopython: http://bugzilla.open-bio.org/show_bug.cgi?id=2176 (inconsistencies in the output of different Blast parsers) http://bugzilla.open-bio.org/show_bug.cgi?id=2929 (inconsistencies between Psi-blast parsers) http://bugzilla.open-bio.org/show_bug.cgi?id=2319 (parsing Blast table output) and more generally think about the design of the Blast record class and Blast 
parsing. In my opinion, these are the major issues:

1) Blast parsers are located in several modules (Bio.Blast.NCBIXML, Bio.Blast.NCBIStandalone, Bio.Blast.ParseBlastTable). I think we should have one read() function and one parse() function under Bio.Blast, with arguments specifying which format the Blast output is in.

2) Blast records produced by any of the parsers should be consistent with each other. As the XML output by blast and psi-blast follows the same DTD, we should be able to represent both by a single Record class.

3) Different parsers should store information in this Record class in the same way.

4) The current Blast record stores its information in attributes. If you use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the necessary DTDs to do so), the information is stored in dictionaries. This has some advantages. For example, it allows you to use record.keys() to find out what the record contains. Ideally, I think that a Blast Record class should inherit from a dictionary.

5) We should be able to print a Blast record object to generate output that is close to the plain-text output generated by blast. This would allow us to generate and store Blast output as XML, and to convert the output to plain-text to make it more human-readable.

6) The current Blast record inherits from Bio.Blast.Record.Header, Bio.Blast.Record.DatabaseReport, and Bio.Blast.Record.Parameters. I don't see the rationale for this inheritance, and I think we should remove it.

Any comments, suggestions (in particular about my proposal to have a Blast Record class that inherits from a dictionary)? Btw, to avoid breaking scripts, I propose that any changes to the Blast record and parser are implemented separately from the existing parsers and record, and to leave those untouched.

--Michiel.
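Points 1) to 4) above could look roughly like the sketch below: a single parse()/read() pair that dispatches on a format name, returning records that are dictionaries. This is not existing Biopython code. The names Record, parse, read, the fmt argument and the "tabular" key are all assumptions, and the column list is the set of twelve default columns of BLAST+ tabular output as far as I know them.

```python
import io

# Assumed default columns of BLAST+ tabular output.
TABULAR_FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch",
                  "gapopen", "qstart", "qend", "sstart", "send",
                  "evalue", "bitscore")


class Record(dict):
    """Hypothetical dictionary-based Blast record (not Biopython's class)."""


def _parse_tabular(handle):
    """Yield one Record per query from tab-separated hit lines."""
    current = None
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Values are kept as strings; no type conversion is attempted here.
        hit = dict(zip(TABULAR_FIELDS, line.split("\t")))
        if current is None or current["query"] != hit["qseqid"]:
            if current is not None:
                yield current
            current = Record(query=hit["qseqid"], hits=[])
        current["hits"].append(hit)
    if current is not None:
        yield current


# Parsers for other formats (XML, plain text) would be registered here too.
_PARSERS = {"tabular": _parse_tabular}


def parse(handle, fmt):
    """Return an iterator over Records for output with several queries."""
    try:
        parser = _PARSERS[fmt]
    except KeyError:
        raise ValueError("Unknown Blast output format: %r" % fmt)
    return parser(handle)


def read(handle, fmt):
    """Return the single Record in the handle, or raise ValueError."""
    records = list(parse(handle, fmt))
    if len(records) != 1:
        raise ValueError("Expected one record, found %i" % len(records))
    return records[0]


sample = ("q1\ts1\t98.5\t100\t1\t0\t1\t100\t5\t104\t1e-50\t180\n"
          "q1\ts2\t90.0\t80\t8\t0\t1\t80\t3\t82\t1e-20\t95\n"
          "q2\ts3\t75.0\t60\t15\t0\t1\t60\t9\t68\t1e-05\t40\n")
records = list(parse(io.StringIO(sample), "tabular"))
```

Because every parser yields the same Record type, record.keys() works the same way whichever input format was used.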
From bugzilla-daemon at portal.open-bio.org Sat May 29 17:53:35 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 29 May 2010 13:53:35 -0400 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <201005291753.o4THrY4v013325@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #10 from eric.talevich at gmail.com 2010-05-29 13:53 EST ------- I've applied Konstanin's patch to a branch on GitHub: http://github.com/etal/biopython/tree/pdbfixes I'm going to apply some more small patches for the various PDB bugs here, so testers are welcome to try out/monitor this branch. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From lgautier at gmail.com Sat May 29 18:29:00 2010 From: lgautier at gmail.com (Laurent Gautier) Date: Sat, 29 May 2010 20:29:00 +0200 Subject: [Biopython-dev] Blast parsers and records In-Reply-To: References: Message-ID: <4C015CEC.30908@gmail.com> Hi, Few thoughts below: On 5/29/10 6:00 PM, biopython-dev-request at lists.open-bio.org wrote: > Hi everybody, > > With Biopython 1.54 out (thanks Peter!), and NCBI encouraging to use > its new Blast+ suite of Blast programs, maybe this is a good time to > tackle some older bugs related to Blast output parsing in Biopython: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2176 (inconsistencies in > the output of different Blast parsers) > > http://bugzilla.open-bio.org/show_bug.cgi?id=2929 (inconsistencies > between Psi-blast parsers) > > http://bugzilla.open-bio.org/show_bug.cgi?id=2319 (parsing Blast > table output) > > and more generally think about the design of the Blast record class > and Blast parsing. 
> In my opinion, these are the major issues:
>
> 1) Blast parsers are located in several modules (Bio.Blast.NCBIXML,
> Bio.Blast.NCBIStandalone, Bio.Blast.ParseBlastTable). I think we
> should have one read() function and one parse() function under
> Bio.Blast, with arguments specifying which format the Blast output is
> in.

Having a factory function would be handy, but since the file formats differ, having different classes to model them can be nice. Modularity is good, and what is known as duck-typing makes for an intuitive API.

What would you think of a design such as:

- module/package 'Blast'
- an abstract class 'Output' is defined in that module/package.
- classes '; each one of those classes defines a method 'read()' and 'parse()' (read() and parse() would formally be declared by an interface, and 'Output' would require their implementation).

> 2) Blast records produced by any of the parsers should be consistent
> with each other. As XML output by blast and psi-blast follow the same
> DTD, we should be able to represent both by a single Record class.

Definitely the case for XML - blast/psi-blast... however, the various formats (XML, others) may contain different levels of detail (I do not know for sure, just considering the possibility here).

> 3) Different parsers should store information in this Record class in
> the same way.

I'd see two options:
- either the same Record class is returned by all parsers
or
- a hierarchy of classes with common accessors and methods whenever possible (e.g., an abstract parent class (or interface) 'Blast.Record' with child classes 'Blast.XMLRecord', blahblahblah...)

> 4) The current Blast record stores its information in attributes. If
> you use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains
> the necessary DTDs to do so), the information is stored in
> dictionaries. This has some advantages. For example, it allows you to
> use record.keys() to find out what the record contains.
> Ideally, I think that a Blast Record class should inherit from a dictionary.

Indeed. Attributes also have constraints regarding valid names that dictionaries do not have.

Still, there is no need to require strict inheritance from Python's dict; requiring the implementation of the interface (methods such as __getitem__(), __iter__(), iteritems(), keys(), etc...) might as well do it. I am thinking of the cost of conversion here: there might be times when the only purpose is to loop through records and access only limited information (and in that case a custom class performing lazy access to information would be neat). Keeping it as an interface rather than expecting direct inheritance will give more freedom to implement it, while keeping compatibility with the rest of the code base.

> 5) We should be able to print a Blast record object to generate
> output that is close to the plain-text output generated by blast.
> This would allow us to generate and store Blast output as XML, and to
> convert the output to plain-text to make it more human-readable.
>
> 6) The current Blast record inherits from Bio.Blast.Record.Header,
> Bio.Blast.Record.DatabaseReport, and Bio.Blast.Record.Parameters. I
> don't see the rationale for this inheritance, and I think we should
> remove it.
>
> Any comments, suggestions (in particular about by proposal to have a
> Blast Record class that inherits from a dictionary? Btw, to avoid
> breaking scripts, I propose that any changes to the Blast record and
> parser are implemented separately from the existing parsers and
> record, and to leave those untouched.
>
> --Michiel.
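[Editor's sketch] The "interface rather than inheritance" idea in the reply to point 4, with lazy access, might look something like the following. It is an illustration only: LazyRecord, its raw-text input, and the parse_count counter are hypothetical, and it uses Python 3's collections.abc.Mapping, where items() plays the role of the iteritems() mentioned above.

```python
from collections.abc import Mapping


class LazyRecord(Mapping):
    """Hypothetical dict-like record that converts a field on first access."""

    def __init__(self, raw_fields):
        self._raw = dict(raw_fields)  # unparsed text, keyed by field name
        self._cache = {}              # converted values, filled on demand
        self.parse_count = 0          # only here to make the laziness visible

    def __getitem__(self, key):
        if key not in self._cache:
            # Raises KeyError for unknown keys, as a dict would.
            self._cache[key] = self._convert(self._raw[key])
            self.parse_count += 1
        return self._cache[key]

    def __iter__(self):
        return iter(self._raw)

    def __len__(self):
        return len(self._raw)

    @staticmethod
    def _convert(text):
        """Turn numeric-looking text into a float; leave other text alone."""
        try:
            return float(text)
        except ValueError:
            return text


record = LazyRecord({"expect": "1e-5", "title": "hit one"})
print(sorted(record.keys()))  # listing keys converts nothing
print(record["expect"])       # this field is converted now, then cached
```

Because Mapping supplies keys(), items(), get() and friends from the three methods defined here, code written against a plain dict should mostly keep working without the class being a dict subclass.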
From bugzilla-daemon at portal.open-bio.org Sat May 29 19:31:56 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 29 May 2010 15:31:56 -0400
Subject: [Biopython-dev] [Bug 2879] missing __delitem__ in Bio.PDB.Entity.Entity
In-Reply-To:
Message-ID: <201005291931.o4TJVuIi015483@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2879

------- Comment #4 from eric.talevich at gmail.com 2010-05-29 15:31 EST -------
grep tells me that the only place "__delitem__" appears in Bio/PDB/ is in Chain.py: the definition of Chain.__delitem__, and the call to the nonexistent Entity.__delitem__.

The method would be required for this to work:

    struct = PDBParser().get_structure('asdf', 'ASDF.pdb')
    del struct[0]

This seems useful, since __getitem__ is already implemented, but I can't imagine it functioning any differently than detach_child.

Solution 1: Comment out Chain.__delitem__, since there's no way this ever worked for anybody.
( http://github.com/etal/biopython/commit/835b444df6b7f2c63b427535bc1c796c26ccce60 )

Solution 2: Implement Entity.__delitem__, essentially identical to Entity.detach_child. Look at the implementations of __getitem__ in the other subclasses of Entity to see if anything fancy needs to be done to support __delitem__ in each of them. (to do)

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
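[Editor's sketch] Solution 2 could be pictured with a toy container like the one below. This is not the real Bio.PDB.Entity code; the class and its attributes are simplified stand-ins, assumed only for illustration, showing __delitem__ delegating to detach_child.

```python
class Entity:
    """Toy stand-in for a parent/child container (not Bio.PDB.Entity)."""

    def __init__(self, id):
        self.id = id
        self.parent = None
        self.child_list = []
        self.child_dict = {}

    def add(self, child):
        """Attach a child and index it by its id."""
        child.parent = self
        self.child_list.append(child)
        self.child_dict[child.id] = child

    def __getitem__(self, id):
        return self.child_dict[id]

    def detach_child(self, id):
        """Remove the child with this id and clear its parent link."""
        child = self.child_dict.pop(id)
        child.parent = None
        self.child_list.remove(child)

    def __delitem__(self, id):
        """Support 'del entity[id]' by delegating to detach_child."""
        self.detach_child(id)


struct = Entity("asdf")
struct.add(Entity(0))
struct.add(Entity(1))
del struct[0]
print([child.id for child in struct.child_list])
```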
From bugzilla-daemon at portal.open-bio.org Sat May 29 19:52:13 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 29 May 2010 15:52:13 -0400
Subject: [Biopython-dev] [Bug 2879] missing __delitem__ in Bio.PDB.Entity.Entity
In-Reply-To:
Message-ID: <201005291952.o4TJqDHP015930@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2879

------- Comment #5 from eric.talevich at gmail.com 2010-05-29 15:52 EST -------
(In reply to comment #4)
> Solution 2: Implement Entity.__delitem__, essentially identical to
> Entity.detach_child. Look at the implementations of __getitem__ in the other
> subclasses of Entity to see if anything fancy needs to be done to support
> __delitem__ in each of them.

Done on the same branch. Nothing fancy was needed in the other Entity subclasses.

NB: It looks like Entity supports some methods that are handled just as well by the appropriate magic methods and Python syntax:
get_list => __iter__,
detach_child => __delitem__,
has_id => __contains__.

We should eventually deprecate those methods, make some others non-public, and promote the use of properties and magic syntax instead, I think.
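[Editor's sketch] The mapping from named methods to magic methods, together with a deprecation path, might be sketched as below. The Container class is a hypothetical stand-in, not the Bio.PDB implementation, and the warning text is invented for the example.

```python
import warnings


class Container:
    """Toy container showing magic methods next to deprecated named ones."""

    def __init__(self, id):
        self.id = id
        self.child_list = []
        self.child_dict = {}

    def add(self, child):
        self.child_list.append(child)
        self.child_dict[child.id] = child

    def __iter__(self):
        """Preferred spelling: 'for child in container'."""
        return iter(self.child_list)

    def __contains__(self, id):
        """Preferred spelling: 'id in container'."""
        return id in self.child_dict

    def get_list(self):
        """Old spelling, kept working but flagged as deprecated."""
        warnings.warn("get_list() is deprecated; use list(container)",
                      DeprecationWarning, stacklevel=2)
        return list(self)

    def has_id(self, id):
        """Old spelling, kept working but flagged as deprecated."""
        warnings.warn("has_id() is deprecated; use 'id in container'",
                      DeprecationWarning, stacklevel=2)
        return id in self


chain = Container("A")
chain.add(Container(1))
chain.add(Container(2))
print([residue.id for residue in chain], 1 in chain, 3 in chain)
```

Keeping the old methods as thin wrappers that warn is one common way to let existing scripts run while nudging users toward the newer syntax.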
From bugzilla-daemon at portal.open-bio.org Sat May 29 21:03:31 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 29 May 2010 17:03:31 -0400
Subject: [Biopython-dev] [Bug 2948] _parse_pdb_header_list: bug in TITLE handling
In-Reply-To:
Message-ID: <201005292103.o4TL3Vij017627@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2948

------- Comment #7 from eric.talevich at gmail.com 2010-05-29 17:03 EST -------
I've applied the patch to my pdbfixes branch on GitHub:
http://github.com/etal/biopython/commit/cc9da03002ae90a3b8eedae69a8adae7216506b8

And added a couple unit tests so we know when further modifications change the way header info is parsed (which will be desirable at some point).

From sandford at ufl.edu Sun May 30 02:35:18 2010
From: sandford at ufl.edu (Michael Sandford)
Date: Sat, 29 May 2010 22:35:18 -0400
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <901919.44402.qm@web62402.mail.re1.yahoo.com>
References: <901919.44402.qm@web62402.mail.re1.yahoo.com>
Message-ID: <4C01CEE6.8030603@ufl.edu>

I've got a few comments as well:

> 4) The current Blast record stores its information in attributes. If you use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the necessary DTDs to do so), the information is stored in dictionaries. This has some advantages. For example, it allows you to use record.keys() to find out what the record contains. Ideally, I think that a Blast Record class should inherit from a dictionary.
>

The disadvantage that I can immediately think of using this methodology is that you lose the ability to have a heavyweight IDE give you intellisense on what fields are available. Many may say that intellisense is evil and/or a crutch and I won't really argue that.
But Eclipse is pretty good at giving you options if you type in "variablename." and then it'll bring up a whole list of attributes and functions, and I find that handy. Moving to a dictionary based approach will stop that.

Calling dir(variablename) will enable you to see not only the attributes available, but the functions as well. That may not be as elegant as iterating over keys in a dictionary but it is some measure of an alternative.

It seems to me that there is a fair amount of xml parsing that gets done in bioinformatics these days. I know that one of the goals of the project is minimal dependence on external libraries, however, I think that lxml ( http://codespeak.net/lxml/) might provide some rather substantial gains in terms of parsing code complexity reduction. I also think that the lxml/etree representation of parsed data is fairly reasonable.

Mike

From biopython at maubp.freeserve.co.uk Mon May 31 09:10:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 May 2010 10:10:43 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <901919.44402.qm@web62402.mail.re1.yahoo.com>
References: <901919.44402.qm@web62402.mail.re1.yahoo.com>
Message-ID:

On Sat, May 29, 2010 at 4:23 AM, Michiel de Hoon wrote:
> Hi everybody,
>
> With Biopython 1.54 out (thanks Peter!), and NCBI encouraging to use its new Blast+
> suite of Blast programs, maybe this is a good time to tackle some older bugs related
> to Blast output parsing in Biopython:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2176
> (inconsistencies in the output of different Blast parsers)
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2929
> (inconsistencies between Psi-blast parsers)
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2319
> (parsing Blast table output)
>
> and more generally think
> about the design of the Blast record class and Blast
> parsing. In my opinion, these are the major issues:
>
> 1) Blast parsers are located in several modules (Bio.Blast.NCBIXML,
> Bio.Blast.NCBIStandalone, Bio.Blast.ParseBlastTable). I think we should
> have one read() function and one parse() function under Bio.Blast, with
> arguments specifying which format the Blast output is in.

I see the point, but some of these parsers give very different output (your points 2 and 3).

> 2) Blast records produced by any of the parsers should be consistent
> with each other.

See also (3) below.

> As XML output by blast and psi-blast follow the same
> DTD, we should be able to represent both by a single Record class.

I think this was a short term hack by the NCBI - and rules out having a single XML file hold multiple PSI queries and their iterations.

> 3) Different parsers should store information in this Record class in
> the same way.

Where possible, yes, but different BLAST output formats contain different information - e.g. some contain the hit sequences while others do not.

> 4) The current Blast record stores its information in attributes. If you
> use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains
> the necessary DTDs to do so), the information is stored in dictionaries.
> This has some advantages. For example, it allows you to use
> record.keys() to find out what the record contains. Ideally, I think
> that a Blast Record class should inherit from a dictionary.

As already pointed out, it has disadvantages too. With traditional attributes or properties you can use dir(record) and also set up docstrings for properties etc. I think they are clearer than dictionary keys.

I would look at a base BLAST record (covering the core information found in all formats including tabular) with subclasses for the richer output formats (default plain text and XML).
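[Editor's sketch] A base record with richer subclasses, using properties with docstrings so that dir(record) and help() stay useful, might look roughly like this. The class names and fields (BaseRecord, XMLRecord, hit_ids, hit_sequences) are hypothetical and do not describe the existing Bio.Blast.Record classes.

```python
class BaseRecord:
    """Core information assumed to be present in every output format."""

    def __init__(self, query, hit_ids):
        self.query = query
        self._hit_ids = list(hit_ids)

    @property
    def hit_ids(self):
        """List of hit identifiers, in the order reported."""
        return list(self._hit_ids)


class XMLRecord(BaseRecord):
    """Richer record for a format that also carries the hit sequences."""

    def __init__(self, query, hit_ids, hit_sequences):
        super().__init__(query, hit_ids)
        self._hit_sequences = dict(hit_sequences)

    @property
    def hit_sequences(self):
        """Mapping from hit identifier to the aligned hit sequence."""
        return dict(self._hit_sequences)


record = XMLRecord("q1", ["s1"], {"s1": "ACGT"})
public = [name for name in dir(record) if not name.startswith("_")]
print(public)
print(XMLRecord.hit_sequences.__doc__)
```

Code that only needs the core fields can accept any BaseRecord, while format-specific extras are discoverable on the subclass.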
> 5) We should be able to print a Blast record object to generate
> output that is close to the plain-text output generated by blast.
> This would allow us to generate and store Blast output as XML,
> and to convert the output to plain-text to make it more human-
> readable.

Nice - but that could make the str(record) output very long.

> 6) The current Blast record inherits from Bio.Blast.Record.Header,
> Bio.Blast.Record.DatabaseReport, and Bio.Blast.Record.Parameters.
> I don't see the rationale for this inheritance, and I think we should
> remove it.

I agree this is a rather odd design choice (even if the three sections did map onto three parts of the plain text output). We can probably do this without changing the exposed Blast record behaviour.

> Any comments, suggestions (in particular about by proposal to
> have a Blast Record class that inherits from a dictionary? Btw, to
> avoid breaking scripts, I propose that any changes to the Blast
> record and parser are implemented separately from the existing
> parsers and record, and to leave those untouched.

Some of these suggestions like (5) and (6) could be done to the existing BLAST parsers and objects, and would seem a good idea.

Regarding the main proposal (1), I would be more interested in a more ambitious proposal along the lines of BioPerl's SearchIO covering not just BLAST but also FASTA, BLAT, HMMER and any other "pairwise searches" (and potentially we could share code for this with AlignIO for pairwise alignment formats). This is more work of course, and could come later.
http://www.bioperl.org/wiki/HOWTO:SearchIO

Peter

From sbassi at clubdelarazon.org Mon May 31 13:52:45 2010
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Mon, 31 May 2010 10:52:45 -0300
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <901919.44402.qm@web62402.mail.re1.yahoo.com>
References: <901919.44402.qm@web62402.mail.re1.yahoo.com>
Message-ID:

On Sat, May 29, 2010 at 12:23 AM, Michiel de Hoon wrote:
> 1) Blast parsers are located in several modules (Bio.Blast.NCBIXML, Bio.Blast.NCBIStandalone, Bio.Blast.ParseBlastTable). I think we should have one read() function and one parse() function under Bio.Blast, with arguments specifying which format the Blast output is in.
> ....

I would add another issue (7). The interface to run the BLAST search is different. The "classic" one will execute the search, while the blast+ one "just" generates the command line, and it is up to the programmer to actually run it (and get the result back to the program). From my POV, it is adding 3 lines in my code, but you didn't have to use subprocess before blast+, so I find this a little inconsistent.

From biopython at maubp.freeserve.co.uk Mon May 31 14:08:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 May 2010 15:08:01 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To:
References: <901919.44402.qm@web62402.mail.re1.yahoo.com>
Message-ID:

On Mon, May 31, 2010 at 2:52 PM, Sebastian Bassi wrote:
>
> I would add another issue (7). The interface to run the BLAST search
> is different. The "clasic" will execute the search while the blast+
> one "just" generate the command line and it is up to the programmer to
> actually run it (and get the result back to the program).
> From my POV, it is adding 3 lines in my code, but you didn't have to
> use subprocess before blast+, so I find this a little inconsistent.
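[Editor's sketch] The extra lines referred to above, for running a generated command line and collecting its results, might look roughly like this with the standard subprocess module. The command here is a stand-in (the current Python interpreter printing a line), not a real Blast+ invocation, and run() is a hypothetical helper name.

```python
import subprocess
import sys


def run(cmd_args):
    """Run a command and return (return_code, stdout, stderr) as text."""
    child = subprocess.Popen(
        cmd_args,                  # a list of arguments, so no shell is needed
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,   # decode the output as text
    )
    # communicate() reads both pipes to the end and waits for the child,
    # which avoids the deadlock risk of reading one pipe at a time.
    stdout, stderr = child.communicate()
    return child.returncode, stdout, stderr


# Stand-in for the command line a wrapper object would generate.
cmd = [sys.executable, "-c", "print('hit found')"]
return_code, out, err = run(cmd)
print(return_code, out.strip())
```

Passing the arguments as a list sidesteps the platform-dependent behaviour of the shell option.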

This is a separate issue to BLAST parsing, but yes, I've been meaning to post a new thread regarding making this easier for the typical use cases. I have a cunning plan...

Peter

From biopython at maubp.freeserve.co.uk Mon May 31 14:53:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 May 2010 15:53:45 +0100
Subject: [Biopython-dev] More SeqRecord methods
Message-ID:

Hi all,

What do people think of adding upper and lower methods to the SeqRecord?
http://bugzilla.open-bio.org/show_bug.cgi?id=3054

If that is well received, how about adding another Seq method to the SeqRecord, the newish ungap method?
http://bugzilla.open-bio.org/show_bug.cgi?id=3060

Peter

From biopython at maubp.freeserve.co.uk Mon May 31 14:50:36 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 May 2010 15:50:36 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
Message-ID:

Hi all,

With the new command line wrappers and the tutorial pushing users towards using subprocess we've had more queries about how to use it. The subprocess module itself is rather scary I guess, and things could be made a lot easier.

I think the most typical use cases are:
(1) Run the command, return the error code (integer)
(2) Run the command, return stdout, stderr and error code

In theory the function subprocess.call() would take care of the first example, but there is a cross platform annoyance here with the shell parameter. Also, if you want the output too things get even more tricky. It hasn't helped that there are a few platform specific quirks/bugs in subprocess itself (the different behaviour of the shell option on Windows, bug http://bugs.python.org/issue1124861 in old Pythons, the risk of deadlocks with large output files, etc).

A while ago while doing the Bio.Motif application wrappers Bartek suggested adding a run or execute method to the application wrapper. I wasn't so receptive at the time, but the utility of this has grown on me.
However, adding methods could potentially clash with arbitrary parameter names. We could instead make the wrapper objects callable (define the magic method __call__) to offer this kind of functionality. This seems quite elegant to me.

I've just posted a possible such implementation for comment on this new branch,
http://github.com/peterjc/biopython/tree/app-exec

Thus far there is just one commit,
http://github.com/peterjc/biopython/commit/b53fb443e2153576509e159a1eac9da55124e41b

Is this a nice approach?

Assuming this is a well received idea, there are several details to discuss. First of all, if we are going to return the stdout and stderr would it be best as strings or wrapped as handles (using StringIO) which can be passed to a parser?

Peter

From eric.talevich at gmail.com Mon May 31 15:38:51 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 31 May 2010 11:38:51 -0400
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements
Message-ID:

Hi all,

This summer our GSoC student João Rodrigues will be implementing a number of enhancements to Biopython's structural biology modules. Since Bio.PDB is one of the most widely used parts of Biopython, I'd like to find a way to let João add major new features without breaking existing code and documentation.

There are a few issues I'd like to address:

1. The I/O conventions of parse/read/write/convert seem to work very well in SeqIO, AlignIO, Phylo, and other Biopython sub-packages. Bio.PDB supports I/O in several formats, but the API is lower-level and isn't unified in the same way (yet).

2. PDB headers seem to have become better structured in recent years, in both the wwPDB spec and submitted files. But header info isn't well integrated with PDB Structure object, and parse_pdb_header needs some attention as well.

3. Kristian asked on this list awhile ago about the proper location for his new code that works with RNA structures.
While RCSB's PDB contains some RNA structures, the RNA world doesn't revolve around it. Similarly, João needs a place to put code for structure prediction/validation servers, command-line wrappers, secondary structures, etc.

I propose a new sub-package called Bio.Struct for these enhancements:

    from Bio import Struct
    mystruct = Struct.read("1MOT.pdb", "pdb")
    # Or, letting the format argument default to "pdb":
    mystruct = Struct.read("1MOT.pdb")
    # Eventually this will work too:
    Struct.convert("1MOT.pdb", "pdb", "1MOT.xml", "pdbxml")

    from Bio.Struct.Applications import DSSP
    # Like the other command-line wrappers
    # (I'm curious about Peter's cunning new scheme...)

    from Bio.Struct import WHATIF, Jpred
    # Servers each get their own module

    from Bio.Struct import RNA
    # Would this work for you, Kristian?

Alternatively, we could do all of this within the PDB module -- so picture the above examples with "PDB" in place of "Struct". This raises the chance of naming collisions, though, and doesn't solve issue #3 above.

We'll leave the existing PDB module layout alone, in general. I think it will be necessary to add a few more attributes to the Bio.PDB.Structure.Structure class, but we can do this without breaking compatibility. Since fewer people depend on the exact formatting of the Structure.header data (we believe), it's safer to change this dictionary, moving the more essential entries to a separate attribute, or whatever seems reasonable when we dig into it.

Comments?

Thanks,
Eric

From biopython at maubp.freeserve.co.uk Mon May 31 15:53:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 May 2010 16:53:31 +0100
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements
In-Reply-To:
References:
Message-ID:

On Mon, May 31, 2010 at 4:38 PM, Eric Talevich wrote:
> Hi all,
>
> This summer our GSoC student João Rodrigues will be implementing a number of
> enhancements to Biopython's structural biology modules.
> Since Bio.PDB is one
> of the most widely used parts of Biopython, I'd like to find a way to
> let João add major new features without breaking existing code and
> documentation.
>
> There are a few issues I'd like to address:
>
> 1. The I/O conventions of parse/read/write/convert seem to work very well in
> SeqIO, AlignIO, Phylo, and other Biopython sub-packages. Bio.PDB supports
> I/O in several formats, but the API is lower-level and isn't unified in the
> same way (yet).

Currently Bio.PDB supports the plain text PDB format, and has partial support for mmCIF. It lacks support for the XML PDB format, PDBML - Protein Data Bank Markup Language.

Under this proposed scheme, what would you see as the basic record type (analogous to a SeqRecord, alignment or tree in Bio.SeqIO, Bio.AlignIO and Bio.Phylo)? It would be nice to say a protein chain, but there is the issue of multiple models (e.g. from NMR). I presume you'd go with the model as the basic unit (where each model may contain multiple chains).

> 2. PDB headers seem to have become better structured in recent years, in
> both the wwPDB spec and submitted files. But header info isn't well
> integrated with PDB Structure object, and parse_pdb_header needs some
> attention as well.

Agreed.

> 3. Kristian asked on this list awhile ago about the proper location for his
> new code that works with RNA structures. While RCSB's PDB contains some RNA
> structures, the RNA world doesn't revolve around it. Similarly, João needs a
> place to put code for structure prediction/validation servers, command-line
> wrappers, secondary structures, etc.
>
>
> I propose a new sub-package called Bio.Struct for these enhancements:
>
> from Bio import Struct
> mystruct = Struct.read("1MOT.pdb", "pdb")
> # Or, letting the format argument default to "pdb":
> mystruct = Struct.read("1MOT.pdb")
> # Eventually this will work too:
> Struct.convert("1MOT.pdb", "pdb", "1MOT.xml", "pdbxml")

I'd probably go with "pdbml" rather than "pdbxml" since that seems to be what the PDB themselves call it:
http://www.pdb.org/pdb/static.do?p=file_formats/index.jsp

> from Bio.Struct.Applications import DSSP
> # Like the other command-line wrappers
> # (I'm curious about Peter's cunning new scheme...)

See:
http://lists.open-bio.org/pipermail/biopython-dev/2010-May/007773.html

> from Bio.Struct import WHATIF, Jpred
> # Servers each get their own module

Hmm - perhaps we may need to have another level here, Bio.Struct.Servers or Bio.Struct.WWW or something. How many of these do you expect?

> from Bio.Struct import RNA
> # Would this work for you, Kristian?
>
>
> Alternatively, we could do all of this within the PDB module -- so picture
> the above examples with "PDB" in place of "Struct". This raises the chance
> of naming collisions, though, and doesn't solve issue #3 above.
>
>
> We'll leave the existing PDB module layout alone, in general. I think it
> will be necessary to add a few more attributes to the
> Bio.PDB.Structure.Structure class, but we can do this without breaking
> compatibility. Since fewer people depend on the exact formatting of the
> Structure.header data (we believe), it's safer to change this dictionary,
> moving the more essential entries to a separate attribute, or whatever seems
> reasonable when we dig into it.
>
> Comments?

I don't want us to break backwards compatibility in Bio.PDB (given how widely used it seems to be based on citations at least), but would like us to continue making small fixes or enhancements to it. Therefore a new Bio.Struct module may be the safer option.

Peter

From rodrigo_faccioli at uol.com.br Mon May 31 17:51:10 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Mon, 31 May 2010 14:51:10 -0300
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB enhancements
In-Reply-To:
References:
Message-ID:

Hi,

I would like to comment on some ideas.

Firstly, I suggest keeping the getStructure command. Its goal is to load the whole structure (models, chains, ATOM, HETATM, etc.). So the getStructure command is executed:

    structure = getStructure(id)

Afterwards, users can work with the result as they need. Below I try to show some specific examples.

The structure object contains the whole loaded structure, including its errors. The command could look like:

    structure.get_StructureErrors().getStructureErrors()

This command returns a dictionary containing all errors of the structure. A complete example is in [1]. One idea: this dictionary could be created by the WHATIF module.

Another example concerns the convert command. It could take more options, such as model and chain, so it could be called as:

    convert(structure, SelectedModels, SelectedChains, "1MOT.xml", "pdbxml")

When the SelectedModels and SelectedChains options are None, all models and all chains of the protein, respectively, are considered.

Along these lines we have developed a new Bio.PDB.Parser methodology. Please read the loadStructureFromFile function in [2]. This new methodology is an alternative developed by my research group. With it we have worked with PDB files and our database using a single parser. That example shows it working with a PDB file only.

I hope this mail contributes something. Sorry for my English mistakes.

[1] http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/scripts/check_structure.py
[2] http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/fcfrp/PDBParser.py

Thanks in advance,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5

From jblanca at btc.upv.es Mon May 31 18:56:46 2010
From: jblanca at btc.upv.es (Blanca Postigo Jose Miguel)
Date: Mon, 31 May 2010 20:56:46 +0200
Subject: [Biopython-dev] Blast parsers and records
Message-ID: <1275332206.4c04066ed4ec5@webmail.upv.es>

Quoting Michael Sandford:

> I've got a few comments as well:
> > 4) The current Blast record stores its information in attributes. If you
> use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the
> necessary DTDs to do so), the information is stored in dictionaries. This has
> some advantages. For example, it allows you to use record.keys() to find out
> what the record contains. Ideally, I think that a Blast Record class should
> inherit from a dictionary.

I've developed for my own use a dict structure that represents a blast result. This structure can also represent many other results, like those of exonerate, SSAHA or any number of other aligners. Having a common representation for all of them allows you to create common filters that work with the same interface.
I don't know if it is very efficient, but it has proven to be very convenient for us. You can take a look at:
http://github.com/JoseBlanca/franklin/blob/master/franklin/alignment_search_result.py

Best regards,

Jose Blanca

----- End of forwarded message -----

--