From p.j.a.cock at googlemail.com Tue Mar 4 13:39:10 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Mar 2014 18:39:10 +0000
Subject: [Biopython] Fwd: [OBF Members] BOSC 2014 Call for Abstracts
In-Reply-To: <5077D423-549B-4E80-B70A-D005F731E51D@gmail.com>
References: <5077D423-549B-4E80-B70A-D005F731E51D@gmail.com>
Message-ID:

Dear Biopythoneers,

I hope to see some of you in Boston this summer for BOSC and the Codefest :)

Peter

---------- Forwarded message ----------
From: Nomi Harris
Date: Tue, Mar 4, 2014 at 5:40 PM
Subject: [OBF Members] BOSC 2014 Call for Abstracts
To: bosc-announce at lists.open-bio.org, members at open-bio.org, GMOD Announcements List
Cc: BOSC 2014

Call for Abstracts for the 15th Annual Bioinformatics Open Source Conference (BOSC 2014)
A Special Interest Group (SIG) of ISMB 2014

Dates: July 11-12, 2014
Location: Boston, MA, USA
Web site: http://www.open-bio.org/wiki/BOSC_2014
Email: bosc at open-bio.org
BOSC announcements mailing list: http://lists.open-bio.org/mailman/listinfo/bosc-announce

Important Dates:

March 24, 2014: Registration opens for ISMB and BOSC (https://www.iscb.org/ismb2014-registration)
April 4, 2014: Deadline for submitting BOSC abstracts (http://www.open-bio.org/wiki/BOSC_Abstract_Submission)
May 1, 2014: Notification of accepted talk abstracts emailed to authors
July 9-10, 2014: Codefest 2014, Boston (http://www.open-bio.org/wiki/Codefest_2014)
July 11-12, 2014: BOSC 2014, Boston (http://www.open-bio.org/wiki/BOSC_2014)
July 11-15, 2014: ISMB 2014, Boston

The Bioinformatics Open Source Conference (BOSC) covers the wide range of open source bioinformatics software being developed, and encompasses the growing movement of Open Science, with its focus on transparency, reproducibility, and data provenance.
We welcome submissions relating to all aspects of bioinformatics and open science software, including new computational methods, reusable software components, visualization, interoperability, and other approaches that help to advance research in the biomolecular sciences.

Two full days of talks, posters, panel discussions, and informal discussion groups will enable BOSC attendees to interact with other developers and share ideas and code, as well as learning about some of the latest developments in the field of open source bioinformatics.

BOSC is sponsored by the Open Bioinformatics Foundation, a non-profit, volunteer-run group dedicated to promoting the practice and philosophy of Open Source software development and Open Science within the biological research community.

We invite you to submit one-page abstracts for talks and posters. This year's session topics are:

Open Science and Reproducible Research
Software Interoperability
Genome-scale Data and Beyond
Visualization
Translational Bioinformatics
Bioinformatics Open Source Libraries and Projects

Once again we thank Eagle Genomics for sponsoring the BOSC Student Travel Awards, and welcome the open access journal GigaScience as a new sponsor for BOSC 2014.
BOSC 2014 Organizing Committee:
Nomi Harris and Peter Cock (co-chairs), Raoul Jean Pierre Bonnal, Brad Chapman, Robert Davey, Christopher Fields, Hans-Rudolf Hotz, Hilmar Lapp
_______________________________________________
Members mailing list
Members at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/members

From mmokrejs at fold.natur.cuni.cz Thu Mar 6 14:34:48 2014
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Thu, 06 Mar 2014 20:34:48 +0100
Subject: [Biopython] Converting from NCBIXML to SearchIO
In-Reply-To: <52FF95A2.7070102@fold.natur.cuni.cz>
References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FF95A2.7070102@fold.natur.cuni.cz>
Message-ID: <5318CDD8.8050602@fold.natur.cuni.cz>

Hi list mates,
I am rather happy with SearchIO; I haven't found more issues while converting my code. Let's see what the devs do with the proposed sanitization of objects lacking certain attributes. More impressions at the very end of the email.

Martin Mokrejs wrote:
> Martin Mokrejs wrote:
>> Hi,
>> I am in the process of conversion to the new XML parsing code written by Bow.
>> So far, I have deciphered the following replacement strings (somewhat written in sed(1) format):
>>
>> /hsp.identities/hsp.ident_num/
>> /hsp.score/hsp.bitscore/
>> /hsp.expect/hsp.evalue/
>> /hsp.bits/hsp.bitscore/
>> /hsp.gaps/hsp.gap_num/
>> /hsp.bits/hsp.bitscore_raw/
>
> Aside from the fact that I pasted the _hsp.bits line twice, my guess was wrong. The code works now but needed the following changes from NCBIXML to SearchIO names:
>
> /_hsp.score/_hsp.bitscore_raw/
> /_hsp.bits/_hsp.bitscore/
>
>> /hsp.positives/hsp.pos_num/
>> /hsp.sbjct_start/hsp.hit_start/
>> /hsp.sbjct_end/hsp.hit_end/
>> # hsp.query_start # no change from NCBIXML
>> # hsp.query_end # no change from NCBIXML
>> /record.query.split()[0]/record.id/
>> /alignment.hit_def.split(' ')[0]/alignment.hit_id/
>> /record.alignments/record.hits/
>>
>> /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML (don't remember whether the counts include minus signs of the alignment or not)
>>
>> Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. I think the former length was including the minus sign for gaps while the latter is just the real length of the query sequence.
>>
>> Nevertheless, what did alignment.length transform into? Into len(hsp.query_all)? I don't think hsp.query_span but who knows. ;)
>
> Answering myself:
>
> /alignment.hit_id/alignment.id/
> /alignment.length/_record.hits[0].seq_len/
>
> Other changes:
>
> _hsp.sbjct/_hsp.hit.seq.tostring() # aligned sequence including dashes [ATGCNatgcn-]
> _hsp.query/_hsp.query.seq.tostring() # aligned sequence including dashes [ATGCNatgcn-]
> _hsp.match/_hsp.aln_annotation['homology']/ # e.g. '||||||||||||||||||||||||||||||||||| |||||||||| | ||| || ||||||| |||||'
>
> I think the dictionary key should have been better named "similarity".
>
> The strand does not translate simply to SearchIO, one needs to do:
> /_hsp.strand/(_hsp.query_strand, _hsp.hit_strand)/ # the tuple will be e.g. (1, 1) while I think it used to be under NCBIXML as either ('Plus', 'Plus'), ('Plus', 'Minus'), (None, None), etc.
>
>> Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that has been added to SearchIO in 1.63. so, that's all from me now until I upgrade. ;)
>
> I got around with try/except although it is more expensive than previously sufficient if/else tests:
>
> # undo the off-by-one change in SearchIO and transform back to real-life numbers
> _hit_start = _hsp.hit_start + 1
> _query_start = _hsp.query_start + 1
>
> try:
>     _ident_num = _hsp.ident_num
> except AttributeError:
>     _ident_num = 0
>
> try:
>     _pos_num = _hsp.pos_num
> except AttributeError:
>     _pos_num = 0
>
> try:
>     _gap_num = _hsp.gap_num
> except AttributeError:
>     # calculate gaps count missing sometimes in legacy blast XML output
>     # see also https://redmine.open-bio.org/issues/3363 saying that also _multimer_hsp_identities and _multimer_hsp_positives are affected
>     _gap_num = _hsp.aln_span - _ident_num
>
> So far I can conclude, that by transition from NCBIXML to SearchIO I got 30% wallclock speedup, but the most important will be for me whether it will save me memory used for parsing of huge XML files (>100GB uncompressed). That I don't know yet, am still testing.

After a while of using it I can say that with SearchIO I now get at least 2x, mostly 4x, faster XML parsing (wallclock) and, notably, 256GB XML files from blastn now take only 200-300MB of RAM (unlike 25GB of RAM using NCBIXML before). Congratulations, Bow; it seemed silly that I couldn't use my laptop to parse such huge files, which are generated in a few hours while parsing takes days!

[ Why I need old blastn and XML is out of question here. ;) -- I just need them because blastn+ with tabular plaintext output does not give me the required data. ]

Martin

From richard.squires at nih.gov Sat Mar 8 20:05:33 2014
From: richard.squires at nih.gov (Squires, Richard (NIH/NIAID) [C])
Date: Sun, 9 Mar 2014 01:05:33 +0000
Subject: [Biopython] Biopython tutorial at SciPy 2014
Message-ID:

Hello,

I thought I would check to see if anyone was already planning to offer a Biopython tutorial at the upcoming SciPy 2014 meeting in Austin, Texas USA in July? I am considering doing it; I just did not want to step on any toes. :-)

Burke Squires

-- R.
Burke Squires
Computational Genomics Specialist
Contractor - Medical Sciences & Computing, Inc.
Computational Biology Section
Bioinformatics and Computational Biosciences Branch (BCBB)
OCICB/OSMO/OD/NIAID/NIH
31 Center Drive, Room 3B62E.2
Bethesda, MD 20892
Office: 301-402-9408
Mobile: 240-454-4515
http://bioinformatics.niaid.nih.gov (Within NIH)
http://exon.niaid.nih.gov (Public)
NIAID Bioinfo Twitter: @niaidbioit

From dmccully at mail.nih.gov Mon Mar 10 15:14:59 2014
From: dmccully at mail.nih.gov (McCully, Dwayne (NIH/NIAMS) [C])
Date: Mon, 10 Mar 2014 19:14:59 +0000
Subject: [Biopython] Biopython: python setup.py test
In-Reply-To: <432A8E6B26DC62439F0201C069BE2B671B367FD8@MLBXV01.nih.gov>
References: <432A8E6B26DC62439F0201C069BE2B671B367FD8@MLBXV01.nih.gov>
Message-ID: <432A8E6B26DC62439F0201C069BE2B671B3680CE@MLBXV01.nih.gov>

During the testing of Biopython I get the following error. Should I be concerned?

Linux 6.5
Python 2.7.6

Dwayne

Bio.Statistics.lowess docstring test ... ok
Bio.PDB.Polypeptide docstring test ... ok
Bio.PDB.Selection docstring test ... ok
======================================================================
ERROR: test_read_from_url (test_Entrez_online.EntrezOnlineCase)
Test Entrez.read from URL
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez_online.py", line 44, in test_read_from_url
    rec = Entrez.read(einfo)
  File "/home/dmccully/biopython-1.63/build/lib.linux-x86_64-2.7/Bio/Entrez/__init__.py", line 372, in read
    record = handler.read(handle)
  File "/home/dmccully/biopython-1.63/build/lib.linux-x86_64-2.7/Bio/Entrez/Parser.py", line 187, in read
    self.parser.ParseFile(handle)
  File "/home/dmccully/biopython-1.63/build/lib.linux-x86_64-2.7/Bio/Entrez/Parser.py", line 529, in externalEntityRefHandler
    raise RuntimeException("Failed to access %s at %s" % (filename, url))
NameError: global name 'RuntimeException' is not defined

----------------------------------------------------------------------
Ran 223 tests in 513.473 seconds

FAILED (failures = 1)

From dmccully at mail.nih.gov Mon Mar 10 16:25:20 2014
From: dmccully at mail.nih.gov (McCully, Dwayne (NIH/NIAMS) [C])
Date: Mon, 10 Mar 2014 20:25:20 +0000
Subject: [Biopython] Biopython: python setup.py test
In-Reply-To: <29944B51-B3EE-4A73-AE33-6FD590A938A2@vanderbilt.edu>
References: <432A8E6B26DC62439F0201C069BE2B671B367FD8@MLBXV01.nih.gov> <432A8E6B26DC62439F0201C069BE2B671B3680CE@MLBXV01.nih.gov> <29944B51-B3EE-4A73-AE33-6FD590A938A2@vanderbilt.edu>
Message-ID: <432A8E6B26DC62439F0201C069BE2B671B368188@MLBXV01.nih.gov>

Thanks for the info. Not sure if I should deploy it!

Dwayne

From jordan.r.willis at Vanderbilt.Edu Mon Mar 10 15:46:24 2014
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Mon, 10 Mar 2014 19:46:24 +0000
Subject: [Biopython] Biopython: python setup.py test
In-Reply-To: <432A8E6B26DC62439F0201C069BE2B671B3680CE@MLBXV01.nih.gov>
References: <432A8E6B26DC62439F0201C069BE2B671B367FD8@MLBXV01.nih.gov> <432A8E6B26DC62439F0201C069BE2B671B3680CE@MLBXV01.nih.gov>
Message-ID: <29944B51-B3EE-4A73-AE33-6FD590A938A2@vanderbilt.edu>

I think this is a bug in the code. I think it should be RuntimeError, from the built-in exceptions, which is used in most of that file.

Jordan

On Mar 10, 2014, at 2:14 PM, McCully, Dwayne (NIH/NIAMS) [C] wrote:
> RuntimeException

From mjldehoon at yahoo.com Mon Mar 10 21:43:16 2014
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 10 Mar 2014 18:43:16 -0700 (PDT)
Subject: [Biopython] Biopython: python setup.py test
In-Reply-To: <432A8E6B26DC62439F0201C069BE2B671B3680CE@MLBXV01.nih.gov>
Message-ID: <1394502196.27158.YahooMailBasic@web164006.mail.gq1.yahoo.com>

I have fixed the typo in github.
Best,
-Michiel.

From p.j.a.cock at googlemail.com Tue Mar 11 06:49:49 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 11 Mar 2014 10:49:49 +0000
Subject: [Biopython] Biopython: python setup.py test
In-Reply-To: <1394502196.27158.YahooMailBasic@web164006.mail.gq1.yahoo.com>
References: <432A8E6B26DC62439F0201C069BE2B671B3680CE@MLBXV01.nih.gov> <1394502196.27158.YahooMailBasic@web164006.mail.gq1.yahoo.com>
Message-ID:

On Tue, Mar 11, 2014 at 1:43 AM, Michiel de Hoon wrote:
> I have fixed the typo in github.
> Best,
> -Michiel.

Thanks Michiel & Dwayne,

The typo was in an error message raised when there was a problem accessing a DTD file (describing the XML structure). In this case I would guess it was missing this file:

http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd

The NCBI has added several new DTD files which will be bundled with the next Biopython release (which will also cache missing DTD files automatically).

Dwayne: You could fix this typo manually; download the missing DTD file; or install Biopython from GitHub.
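
For anyone patching by hand, here is a minimal sketch of why the reported line fails and what the one-word fix looks like. The function names and arguments below are purely illustrative and are not the actual Parser.py code; only the two exception names and the message format are taken from the traceback in this thread.

```python
def report_missing_dtd_buggy(filename, url):
    # RuntimeException is not a Python built-in, so looking up this name
    # raises NameError before the intended error can be created.
    raise RuntimeException("Failed to access %s at %s" % (filename, url))


def report_missing_dtd_fixed(filename, url):
    # RuntimeError is the built-in exception suggested in the thread.
    raise RuntimeError("Failed to access %s at %s" % (filename, url))


def exception_name(func, *args):
    """Call func(*args) and return the class name of the exception it raises."""
    try:
        func(*args)
    except Exception as err:
        return type(err).__name__
    return None
```

Calling exception_name() on the first function reports a NameError, which is what the test run showed; on the second it reports the RuntimeError that was meant to be raised.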
Regards,

Peter

From mike.thon at gmail.com Tue Mar 11 08:43:23 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Tue, 11 Mar 2014 13:43:23 +0100
Subject: [Biopython] parsing hmmer results
Message-ID: <4E2F14C8-CE16-4F36-91A2-7D171C96947E@gmail.com>

I'm trying to parse a batch hmmer v3.1b1 report but I keep getting this error. I think it's happening when the parser hits a hmmer report with no hits, but I'm not sure.

Here's the error:

Traceback (most recent call last):
  File "parse-dbcan.py", line 6, in <module>
    for qresult in SearchIO.parse(argv[1], 'hmmer3-text'):
  File "/Library/Python/2.7/site-packages/Bio/SearchIO/__init__.py", line 316, in parse
    for qresult in generator:
  File "/Library/Python/2.7/site-packages/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 47, in __iter__
    for qresult in self._parse_qresult():
  File "/Library/Python/2.7/site-packages/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 109, in _parse_qresult
    qid = regx.group(1).strip()
AttributeError: 'NoneType' object has no attribute 'group'

Here's my script:

#!/usr/bin/python
from Bio import SearchIO
from sys import argv
import pdb

for qresult in SearchIO.parse(argv[1], 'hmmer3-text'):
    hits = qresult.hits
    if len(hits) > 0:
        beste = hits[0].hsps[0].evalue
        query = hits[0].query_id
        hit = hits[0].id.replace('.hmm', '')
        print query + ',' + hit + ',' + str(beste)
        #pdb.set_trace()

I ran hmmscan like this:

hmmscan --cpu 2 -E 1e-3 HMMs.txt prots.fasta >oute-3.txt

From w.arindrarto at gmail.com Tue Mar 11 09:18:25 2014
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 11 Mar 2014 14:18:25 +0100
Subject: [Biopython] parsing hmmer results
In-Reply-To: <4E2F14C8-CE16-4F36-91A2-7D171C96947E@gmail.com>
References: <4E2F14C8-CE16-4F36-91A2-7D171C96947E@gmail.com>
Message-ID:

Hi Michael,

Do you have an example file you can send over? (Sending it to me privately also works.) The parser has not been tested with HMMER version 3.1, and I suppose they introduced some changes which break the parser.
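
The AttributeError in the traceback is the usual symptom of a regular expression that did not match: the search returns None, and calling .group() on None fails. The sketch below shows that failure mode and a guard that turns it into a readable error. The pattern and the sample lines are hypothetical and are not taken from hmmer3_text.py; the real parser's expression may well differ.

```python
import re

# Hypothetical pattern for a "Query:" header line (an assumption made for
# illustration only, not the expression used by Biopython).
QUERY_RE = re.compile(r"^Query:\s+(\S+)")


def parse_query_id(line):
    """Return the query ID from a header line, or raise a descriptive error."""
    match = QUERY_RE.search(line)
    if match is None:
        # Without this check, match.group(1) would raise
        # AttributeError: 'NoneType' object has no attribute 'group'.
        raise ValueError("Unexpected query line: %r" % line)
    return match.group(1).strip()
```

A check like this does not make a parser understand a new report layout, but it does show which input line it could not handle, which helps when comparing output from different HMMER versions.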
Best,
Bow

From philipp.schiffer at gmail.com Wed Mar 12 03:09:01 2014
From: philipp.schiffer at gmail.com (Philipp Schiffer)
Date: Wed, 12 Mar 2014 08:09:01 +0100
Subject: [Biopython] Handle for OrthoXML
Message-ID:

Hi all!

Is there a handle/module for OrthoXML parsing?
http://orthoxml.org/xml/Documentation.html

Cheers

Philipp

--
Sent from Gmail Mobile

From p.j.a.cock at googlemail.com Wed Mar 12 05:26:20 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 12 Mar 2014 09:26:20 +0000
Subject: [Biopython] Handle for OrthoXML
In-Reply-To:
References:
Message-ID:

Hi Philipp,

No, Biopython does not have a parser for OrthoXML, but Bio.SeqIO can read/write the related SeqXML format.

You could try using the ElementTree XML parser from the Python standard library?

Peter

From philipp.schiffer at gmail.com Wed Mar 12 06:21:59 2014
From: philipp.schiffer at gmail.com (Philipp Schiffer)
Date: Wed, 12 Mar 2014 11:21:59 +0100
Subject: [Biopython] Handle for OrthoXML
In-Reply-To:
References:
Message-ID: <92559B2799CB4D1CBD790AEA1DCBA136@googlemail.com>

Hi Peter,

thanks for the info. I'll check out the ElementTree parser and also see if some of the SeqXML functionality fulfils my purposes.

Best

Philipp

--
Philipp Schiffer
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

From kevin.rue at ucdconnect.ie Wed Mar 12 07:32:11 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Wed, 12 Mar 2014 11:32:11 +0000
Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
Message-ID:

Hi all,

Some may consider this a repeat of my StackOverflow post (http://stackoverflow.com/questions/22328884/python-equivalent-of-the-perl-stringapprox-amatch-function) but over there I didn't mention the possibility of implementing the feature in Biopython.

I am looking for a function which, given sequence1 and sequence2, would return whether sequence1 matches a subsequence of sequence2 allowing up to I insertions, D deletions, and S substitutions.

So far, all I could find in Python were fuzzy matching functions using edit distances (Levenshtein and others), but none of those distances distinguish between insertions, deletions and substitutions (http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison).

There is a Perl module called String::Approx (http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm), where the function amatch() does exactly what I want.. except in Perl. A quick-and-dirty fix could be to make an external call to that Perl function from my Python script, but it would be so much cleaner (and probably faster) if I could avoid external calls and being dependent on multiple interpreters.
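
To make the request concrete, here is a small pure-Python sketch of such a check with separate budgets for substitutions, insertions and deletions. It is not a port of String::Approx and not the API of any existing package; the function name, the parameter names and the convention (an insertion is an extra character in the searched text, a deletion is a pattern character missing from it) are all assumptions made for illustration. It is a memoised recursion meant for short patterns, not an optimised implementation.

```python
from functools import lru_cache


def approx_find(pattern, text, max_sub=0, max_ins=0, max_del=0):
    """Return True if pattern matches some substring of text within the budgets."""

    @lru_cache(maxsize=None)
    def match(i, j, subs, ins, dels):
        # i indexes pattern, j indexes text; subs/ins/dels are remaining budgets.
        if i == len(pattern):
            return True  # the whole pattern has been accounted for
        # Deletion: pattern[i] is missing from the text.
        if dels > 0 and match(i + 1, j, subs, ins, dels - 1):
            return True
        if j == len(text):
            return False  # text exhausted with pattern characters left over
        if pattern[i] == text[j]:
            # Exact character match costs nothing.
            if match(i + 1, j + 1, subs, ins, dels):
                return True
        elif subs > 0 and match(i + 1, j + 1, subs - 1, ins, dels):
            # Substitution: text[j] stands in for pattern[i].
            return True
        # Insertion: text[j] is an extra character inside the match.
        # A leading insertion is never needed; a later start covers that case.
        if ins > 0 and i > 0 and match(i, j + 1, subs, ins - 1, dels):
            return True
        return False

    # Try every possible start position in the text.
    return any(match(0, start, max_sub, max_ins, max_del)
               for start in range(len(text) + 1))
```

For example, approx_find("ACGT", "TTAGTTT", max_del=1) succeeds because the C may be dropped, while the same search with only max_sub=1 fails, which is the distinction a single edit-distance threshold cannot express.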
I believe that a feature such as the one I described could rapidly become popular if implemented in Biopython, but after reading the Perl module code and not understanding most of it, I think any Python module I could write to do the job wouldn't be nearly as optimised and fast. (An external call to the Perl module would surely be faster than my Python implementation.)

So....
- What are your thoughts?
- Did I miss the magic Python package that does what I want?
- Does anyone else think such a package would be useful to the bioinformatics community?
- Did anyone solve the same issue I'm having in a different way? (I haven't found a "think out of the box" idea yet)
- Does anyone feel like implementing this feature?

Many thanks for your advice!

--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en

From ivangreg at gmail.com Wed Mar 12 09:38:31 2014
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Wed, 12 Mar 2014 09:38:31 -0400
Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
In-Reply-To:
References:
Message-ID:

If that Perl function existed in Biopython, I would use it every day, night and day. I sense that I would not be the only one.

Ivan

Ivan Gregoretti, PhD
Bioinformatics

From saketkc at gmail.com Wed Mar 12 09:46:41 2014
From: saketkc at gmail.com (Saket Choudhary)
Date: Wed, 12 Mar 2014 13:46:41 +0000
Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
In-Reply-To:
References:
Message-ID:

Hi Kevin,

There is a package which does something similar.

https://github.com/taleinat/fuzzysearch

Saket

From kevin.rue at ucdconnect.ie Wed Mar 12 11:16:03 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Wed, 12 Mar 2014 15:16:03 +0000
Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
In-Reply-To:
References:
Message-ID:

Hi,

@Ivan: Glad to hear you confirm my thought!

@Saket: You're right.. I have already been in touch for the past two days with "taleinat" the person who developed that code :) You will see in his github that in agreement with him, I suggested my feature as a possible enhancement of his package (issue #2 https://github.com/taleinat/fuzzysearch/issues), and he agreed to consider it for future development. No promised release date, but:
1) I wouldn't dare to ask for one as I am already asking for a huge favor for someone else to program that "for me" and the community
2) I am not particularly rushed, his Levenshtein distance does an acceptable job for the time being.

I would love to be able to write the code myself, but my PhD thesis is more about using scripts to gain biology knowledge, while my issue would be better dealt with by someone with a much stronger low-level programming skillset using abstract mathematical notions to optimise the code beyond anything I could do with my scripting skills.

Cheers
Kevin
PhD candidate :)

--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en

From p.j.a.cock at googlemail.com Wed Mar 12 11:48:21 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 12 Mar 2014 15:48:21 +0000
Subject: [Biopython] SciPy 2014 (July 6-12, Austin, Texas, USA)
Message-ID:

Hi all,

It is a bit short notice, but some of you may be interested in attending SciPy 2014, which will again have a bioinformatics session.
There is still time to submit an abstract (deadline 14 March): https://conference.scipy.org/scipy2014/participate/presentations/ "SciPy 2014, the thirteenth annual Scientific Computing with Python conference, will be held this July 6th-12th in Austin, Texas. SciPy is a community dedicated to the advancement of scientific computing through open source Python software for mathematics, science, and engineering. The annual SciPy Conference allows participants from academic, commercial, and governmental organizations to showcase their latest projects, learn from skilled users and developers, and collaborate on code development." Unfortunately SciPy 2014 clashes with BOSC 2014 in Boston, which you may prefer to attend, which is also currently accepting abstracts: http://www.open-bio.org/wiki/BOSC_2014 http://www.open-bio.org/wiki/Codefest_2014 *Disclaimer*: I am co-chairing BOSC this year. Regards, Peter From taleinat at gmail.com Wed Mar 12 11:55:33 2014 From: taleinat at gmail.com (Tal Einat) Date: Wed, 12 Mar 2014 17:55:33 +0200 Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching? Message-ID: Kevin wrote: > @Saket: You're right.. I have already been in touch for the past two days > with "taleinat" the person who developped that code :) You will see in his > github that in agreement with him, I suggested my feature as a possible > enhancement of his package (issue #2 > https://github.com/taleinat/fuzzysearch/issues), and he agreed to consider > it for future development. No promised release date, but: > 1) I wouldn't dare to ask for one as I am already asking for a huge favor > for someone else to program that "for me" and the community > 2) I am not particularly rushed, his Levenshtein distance does an > acceptable job for the time being. 
I would love to be able to write the > code myself, but my PhD thesis is more about using scripts to gain biology > knowledge, while my issue would be better dealt with by someone with a much > stronger low-level programming skillset using abstract mathematical notions > to optimise the code beyond anything I could do with my scripting skills.

Hi again guys,

I'm the author of the fuzzysearch Python library. I mentioned it on this list a few months ago thinking it might be useful.

The fuzzysearch library is meant to be used for searching, which isn't really what you're doing. As far as I can tell it isn't really good enough for your purpose. I'll be happy to help if I can, however, especially given the additional interest expressed here!

The python-Levenshtein library supports generating a sequence of operations transforming one string into another. For example (from the docs):

>>> editops('spam', 'park')
[('delete', 0, 0), ('insert', 3, 2), ('replace', 3, 3)]

However, the requirement you described is significantly different: telling whether a string can be transformed into another using a maximum allowed number of replacements and insertions, but no deletions. For the above example, it could also be transformed without deletions using 4 substitutions!

I'd be happy to collaborate on this, including writing code, if you like. I believe that what you need can be implemented relatively easily.

- Tal Einat

From taleinat at gmail.com Wed Mar 12 14:00:16 2014
From: taleinat at gmail.com (Tal Einat)
Date: Wed, 12 Mar 2014 20:00:16 +0200
Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
In-Reply-To:
References:
Message-ID:

> Kevin wrote: > > I am looking for a function which, given sequence1 and sequence2, would > return whether sequence1 matches a subsequence of sequence2 allowing up to > I insertions, D deletions, and S substitutions.
Kevin, if you don't want to allow deletions at all (I got the impression this is what you're looking for) then the following code will do the trick (and should be quite fast).

Is this generally useful, or would it usually be more useful to also allow a (possibly limited) number of deletions?


from array import array

def kevin(str1, str2, max_substitutions, max_insertions):
    """check if it is possible to transform str1 into str2 given limitations

    The limitations are the maximum allowed number of new characters inserted
    and the maximum allowed number of character substitutions.
    """
    # check simple cases which are obviously impossible
    if not len(str1) <= len(str2) <= len(str1) + max_insertions:
        return False

    scores = array('L', [0] * (len(str2) - len(str1) + 1))
    new_scores = scores[:]

    for (str1_idx, char1) in enumerate(str1):
        # make min() always take the other value in the first iteration of the
        # inner loop
        prev_score = len(str2)
        for (n_insertions, char2) in enumerate(
                str2[str1_idx:len(str2)-len(str1)+str1_idx+1]
        ):
            new_scores[n_insertions] = prev_score = min(
                scores[n_insertions] + (0 if char1 == char2 else 1),
                prev_score
            )

        # swap scores <-> new_scores
        scores, new_scores = new_scores, scores

    return min(scores) <= max_substitutions


- Tal

From kevin.rue at ucdconnect.ie Wed Mar 12 14:28:53 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Wed, 12 Mar 2014 18:28:53 +0000
Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
In-Reply-To:
References:
Message-ID:

Hi Tal,

To answer your question: indeed, I currently don't expect any deletions. It is possible that other users would like that feature, maybe me in the future. Therefore, this code could be a very convenient temporary fix that I could include directly in the code as an annex function. Thank you very much.
I'll have a closer look at how it works to teach myself :) You actually (positively) surprised me by taking this approach: I didn't even think of taking the advantage of 0 deletions in my particular use case ! I can imagine that it does reduce the possible combination of edits. I guess I was too impressed by the Perl function that seemed to be doing all at once, but indeed code can be much simpler and faster when one takes advantage of the known pre-conditions and specific use cases. I can imagine that a master function could handle Sub, Del and Ins together, but if Del=0 then it could call this one if it solves the specific problem faster. Now, in a moment of inspiration, am I wrong in saying that similarly, a case where no insertion is allowed would be the same problem while switching sequence1 and sequence2 in the function call? PS: your updated fuzzysearch 0.2.0 package works great for its job. I just haven't had time to check the documentation updates. Cheers! On 12 March 2014 18:00, Tal Einat wrote: > > Kevin wrote: > > > > I am looking for a function which, given sequence1 and sequence2, would > > return whether sequence1 matches a subsequence of sequence2 allowing up > to > > I insertions, D deletions, and S substitutions. > > Kevin, if you don't want to allow deletions at all (I got the > impression this is what you're looking for) then the following code > will do the trick (and should be quite fast). > > Is this generally useful, or would it usually be more useful to also > allow a (possibly limited) number of deletions? > > > from array import array > > def kevin(str1, str2, max_substitutions, max_insertions): > """check if it is possible to transform str1 into str2 given > limitations > > The limiations are the maximum allowed number of new characters > inserted > and the maximum allowed number of character substitutions. 
> """ > # check simple cases which are obviously impossible > if not len(str1) <= len(str2) <= len(str1) + max_insertions: > return False > > scores = array('L', [0] * (len(str2) - len(str1) + 1)) > new_scores = scores[:] > > for (str1_idx, char1) in enumerate(str1): > # make min() always take the other value in the first iteration of > the > # inner loop > prev_score = len(str2) > for (n_insertions, char2) in enumerate( > str2[str1_idx:len(str2)-len(str1)+str1_idx+1] > ): > new_scores[n_insertions] = prev_score = min( > scores[n_insertions] + (0 if char1 == char2 else 1), > prev_score > ) > > # swap scores <-> new_scores > scores, new_scores = new_scores, scores > > return min(scores) <= max_substitutions > > > - Tal > -- K?vin RUE-ALBRECHT Wellcome Trust Computational Infection Biology PhD Programme University College Dublin Ireland http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en From taleinat at gmail.com Wed Mar 12 14:43:10 2014 From: taleinat at gmail.com (Tal Einat) Date: Wed, 12 Mar 2014 20:43:10 +0200 Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching? In-Reply-To: References: Message-ID: On Wed, Mar 12, 2014 at 8:28 PM, Kevin Rue wrote: > HI Tal, > > To answer your question, I currently don't expect any deletion indeed. It is > possible that other users would like that feature, maybe me in the future. > Therefore, this code could be a very convenient temporary fix that I could > include directly in the code as an annex function. Thank you very much. I'll > have a closer look at how it works to teach myself :) If anyone else on this list thinks this would be useful, I'd be happy to publish it as a publicly available library. For now, consider the code I posted freely available in the public domain (i.e. use it at will but don't take credit for it in my stead and don't sell it without my consent). 
> Now, in a moment of inspiration, am I wrong in saying that similarly, a case
> where no insertion is allowed would be the same problem while switching
> sequence1 and sequence2 in the function call?

Indeed, you are correct :)

- Tal Einat

From kevin.rue at ucdconnect.ie Wed Mar 12 19:56:10 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Wed, 12 Mar 2014 23:56:10 +0000
Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
In-Reply-To:
References:
Message-ID:

Hi Tal,

I just tested your function and it is doing something slightly different from what I had in mind.

I need a few simple examples to illustrate my point:

The string "TEST" is present in "TESTER" with 0 substitutions/0 insertions. Therefore I expect the call below to return TRUE.
>>> kevin(str1="TEST", str2="TESTER", max_substitutions=0, max_insertions=0)
but instead it returns FALSE.

Meanwhile,
>>> kevin(str1="TEST", str2="TESTER", max_substitutions=0, max_insertions=2)
returns TRUE

and
>>> kevin("TEST", "TESTER", 0, max_insertions=1)
returns FALSE

Now, I haven't deciphered your code yet, but my guess is that your function answers the question "Is str1 approximately EQUAL to str2 while allowing a maximum of S substitutions and I insertions".
My problem (and, I believe, that of most bioinformaticians, like Ivan who answered earlier) is formulated "Is str1 present somewhere in str2 while allowing S sub and I ins?"

In fact, it's your previous answer that made me realise that the function solving my problem "str1 in str2 with max of i insertions and s substitutions" should not be able to solve the problem "str1 in str2 with max of d deletions and s substitutions".
Meanwhile, your answer seems right for the function you sent us: "str1 == str2 with max of i insertions and s substitutions" should be solved by the same function called with the strings switched, "str2 == str1 with max of d deletions and s substitutions".
(Just a guess, but it makes sense to me) If I am right, one solution to solve my problem (with only substitutions and insertions) using your function "kevin" is: - set str1 as the string I am trying to match - call your function for each substring of str2 of length [len(str1) : len(str1)+max_insertions+1] and set each of those as str2 - i can save time by returning TRUE the first time I find a match, because I don't care if there are more This would compare str1 to all possible str2 substrings that could be "approximately EQUAL to str1 allowing up to I insertions and S substitutions in str1". Obviously, another option is to design another function (say.. "kevin2" ^_^) which addresses directly my problem. I don't mind using the solution above (I appreciate your help and time), but I believe an implementation dealing directly with my problem would be faster to solve it, right? Looking forward to your answer! Cheers Kevin On 12 March 2014 18:43, Tal Einat wrote: > On Wed, Mar 12, 2014 at 8:28 PM, Kevin Rue > wrote: > > HI Tal, > > > > To answer your question, I currently don't expect any deletion indeed. > It is > > possible that other users would like that feature, maybe me in the > future. > > Therefore, this code could be a very convenient temporary fix that I > could > > include directly in the code as an annex function. Thank you very much. > I'll > > have a closer look at how it works to teach myself :) > > If anyone else on this list thinks this would be useful, I'd be happy > to publish it as a publicly available library. For now, consider the > code I posted freely available in the public domain (i.e. use it at > will but don't take credit for it in my stead and don't sell it > without my consent). > > > Now, in a moment of inspiration, am I wrong in saying that similarly, a > case > > where no insertion is allowed would be the same problem while switching > > sequence1 and sequence2 in the function call? 
> > Indeed, you are correct :) > > - Tal Einat > -- K?vin RUE-ALBRECHT Wellcome Trust Computational Infection Biology PhD Programme University College Dublin Ireland http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en From hlapp at drycafe.net Thu Mar 13 08:50:09 2014 From: hlapp at drycafe.net (Hilmar Lapp) Date: Thu, 13 Mar 2014 08:50:09 -0400 Subject: [Biopython] Fwd: [numfocus] 2014 John Hunter Fellowship - Call for Applications In-Reply-To: References: Message-ID: Some of you folks have probably already seen this. If someone at postdoc or senior PhD student level would love to focus a couple months on furthering Biopython development, this would seem like an excellent opportunity to get some support for that. And I'm sure mentors wouldn't be hard to come by :-) And I suppose I don't need to tell this community who John Hunter is (or, unfortunately, was). -hilmar ---------- Forwarded message ---------- From: Ralf Gommers Date: Tue, Mar 11, 2014 at 3:22 PM Subject: [numfocus] 2014 John Hunter Fellowship - Call for Applications To: numfocus at googlegroups.com, Discussion of Numerical Python < numpy-discussion at scipy.org>, SciPy Users List , matplotlib-users , ipython-dev at scipy.org, sympy at googlegroups.com Hi all, I'm excited to announce, on behalf of the Numfocus board, that applications for the 2014 John Hunter Technology Fellowship are now being accepted. This is the first fellowship Numfocus is able to offer, which we see as a significant milestone. The John Hunter Technology Fellowship aims to bridge the gap between academia and real-world, open-source scientific computing projects by providing a capstone experience for individuals coming from a scientific, engineering or mathematics background. The program consists of a 6 month project-based training program for postdoctoral scientists or senior graduate students. 
Fellows work on scientific computing open source projects under the guidance of mentors who are leading scientists and software engineers. The aim of the Fellowship is to enable Fellows to develop the skills needed to contribute to cutting-edge open source software projects while at the same time advancing or supporting the research program they and their mentor are involved in. While proposals in any area of science and engineering are welcome, the following areas are encouraged in particular: - Accessible and reproducible computing - Enabling technology for open access publishing - Infrastructural technology supporting open-source scientific software stacks - Core open-source projects promoted by NumFOCUS Eligible applicants are postdoctoral scientists or senior PhD students, or have equivalent experience in physics, mathematics, engineering, statistics, or a related science. The program is open to applicants from any nationality and can be performed at any university or institute world-wide (US export laws permitting). All applications are due May 15, 2014 by 11:59 p.m. Central Standard Time. For more details on the program see: http://numfocus.org/john_hunter_fellowship_2014.html (this call) http://numfocus.org/fellowships.html (program) And for some background see this blog post: http://numfocus.org/announcing-the-numfocus-technology-fellowship-program.html We're looking forward to receiving your applications! Ralf From kevin.rue at ucdconnect.ie Thu Mar 13 14:08:18 2014 From: kevin.rue at ucdconnect.ie (Kevin Rue) Date: Thu, 13 Mar 2014 18:08:18 +0000 Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching? In-Reply-To: References: Message-ID: Hi Tal, I finally had the time to look at your function in details, and understood how it works and why each line is there. 
Thanks, even though it doesn't do exactly what I had in mind, it does what the docstring says :) I'd call your "kevin" function "approxEqual(str1, str2, max_ins, max_sub)".

When I understood your code, I thought of a way to increase the speed of your function, but ended up with a (fast) function actually doing something slightly different. This one would only look at the str2 substrings from str1_idx which are no longer than len(str1)+max_insertions. Longer str2 substrings are pointless as they imply that str1 requires more insertions at the start than allowed to match str2. I would call that function "approxStartsWith(str1, str2, max_sub, max_ins)"


from array import array

def approxStartsWith(str1, str2, max_substitutions, max_insertions):
    """check if it is possible to map str1 to the start of str2 given limitations

    The limitations are the maximum allowed number of new characters inserted
    and the maximum allowed number of character substitutions.
    """
    # check simple cases which are obviously impossible
    if not len(str1) <= len(str2):
        return False

    # str1 plus its insertions cannot be longer than str2; capping here also
    # keeps every slice below at full length, so no stale score is ever read
    max_insertions = min(max_insertions, len(str2) - len(str1))

    scores = array('L', [0] * (max_insertions + 1))
    new_scores = scores[:]

    for (str1_idx, char1) in enumerate(str1):
        # make min() always take the other value in the first iteration of the
        # inner loop
        prev_score = len(str2)
        for (n_insertions, char2) in enumerate(
                str2[str1_idx:max_insertions+str1_idx+1]
        ):
            new_scores[n_insertions] = prev_score = min(
                scores[n_insertions] + (0 if char1 == char2 else 1),
                prev_score
            )

        # swap scores <-> new_scores
        scores, new_scores = new_scores, scores

    return min(scores) <= max_substitutions


Now, an approxWithin(str1, str2, max_ins, max_sub) could be simply implemented as:


def approxWithin(str1, str2, max_ins, max_sub):
    """check if it is possible to find str1 within str2 given limitations

    The limitations are the maximum allowed number of new characters inserted
    and the maximum allowed number of character substitutions.
    """
    for str2_idx in range(len(str2)-len(str1)+1):
        print (str2_idx)
        # pass the limits by keyword: approxStartsWith takes them in the
        # opposite order (substitutions first)
        result = approxStartsWith(str1, str2[str2_idx:],
                                  max_substitutions=max_sub,
                                  max_insertions=max_ins)
        if result:
            return result
    return False


Comments?

Cheers,
kevin

On 12 March 2014 23:56, Kevin Rue wrote: > HI Tal, > > I just tested your function and it is doing something slightly different > than what I had in mind. > > I need a few simple examples to illustrate my point: > > The string "TEST" is present in "TESTER" with 0 substitutions/0 > insertions. Therefore I expect the call below to return TRUE. > >>> kevin(str1="TEST", str2="TESTER", max_substitutions=0, > max_insertions=0) > but instead it returns FALSE. > > Meanwhile, > >>> kevin(str1="TEST", str2="TESTER", max_substitutions=0, > max_insertions=2) > returns TRUE > > and > >>> kevin("TEST", "TESTER", 0, max_insertions=1) > returns FALSE > > Now, I haven't decrypted your code yet, but my guess is that what your > function does is answer the question "Is str1 approximately EQUAL to str2 > while allowing a maximum of S substitutions and I insertions". > My problem (and believe most bioinformaticians like Ivan who answered > earlier) is formulated "Is str1 present somewhere in str2 while allowing S > sub and I ins?" > > > In fact, it's your previous answer that made me realise that the function > solving my problem "str1 in str2 with max of i insertions and s > substitutions" should not be able to solve the problem "str1 in str2 with > max of d deletions and s substitutions". > Meanwhile, you're answer seems right for the function you sent us "str1 > == str2 with max of i insertions and s substitutions" should be solved by > the same function called by switching the strings "str2 == str1 with max of > d deletions and s substitutions".
(Just a guess, but it makes sense to me) > > If I am right, one solution to solve my problem (with only substitutions > and insertions) using your function "kevin" is: > - set str1 as the string I am trying to match > - call your function for each substring of str2 of length [len(str1) : > len(str1)+max_insertions+1] and set each of those as str2 > - i can save time by returning TRUE the first time I find a match, because > I don't care if there are more > This would compare str1 to all possible str2 substrings that could be > "approximately EQUAL to str1 allowing up to I insertions and S > substitutions in str1". > > Obviously, another option is to design another function (say.. "kevin2" > ^_^) which addresses directly my problem. I don't mind using the solution > above (I appreciate your help and time), but I believe an implementation > dealing directly with my problem would be faster to solve it, right? > > Looking forward to your answer! > Cheers > Kevin > > > > > > > > > On 12 March 2014 18:43, Tal Einat wrote: > >> On Wed, Mar 12, 2014 at 8:28 PM, Kevin Rue >> wrote: >> > HI Tal, >> > >> > To answer your question, I currently don't expect any deletion indeed. >> It is >> > possible that other users would like that feature, maybe me in the >> future. >> > Therefore, this code could be a very convenient temporary fix that I >> could >> > include directly in the code as an annex function. Thank you very much. >> I'll >> > have a closer look at how it works to teach myself :) >> >> If anyone else on this list thinks this would be useful, I'd be happy >> to publish it as a publicly available library. For now, consider the >> code I posted freely available in the public domain (i.e. use it at >> will but don't take credit for it in my stead and don't sell it >> without my consent). 
>> >> > Now, in a moment of inspiration, am I wrong in saying that similarly, a >> case >> > where no insertion is allowed would be the same problem while switching >> > sequence1 and sequence2 in the function call? >> >> Indeed, you are correct :) >> >> - Tal Einat >> > > > > -- > K?vin RUE-ALBRECHT > Wellcome Trust Computational Infection Biology PhD Programme > University College Dublin > Ireland > http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en > -- K?vin RUE-ALBRECHT Wellcome Trust Computational Infection Biology PhD Programme University College Dublin Ireland http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en From mary.kindall at gmail.com Thu Mar 13 15:57:19 2014 From: mary.kindall at gmail.com (Mary Kindall) Date: Thu, 13 Mar 2014 15:57:19 -0400 Subject: [Biopython] Get all alignments of a sequence against another Message-ID: This is a primitive question but somehow I could not find a solution to it. I have two sequences 'large' and 'small' as given below. >large XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS >small GGGTTVTTSS I need to align the 'small' sequence to the 'large' sequence. Clearly there are two places where it can be aligned. I need to get indices of both the locations. I was trying BioPython's "pairwise2.align.globalms" function but it is only able to align to the second position. pairwise2.align.globalms(largeStr, smallStr, 2, -1, -1, 0, penalize_end_gaps=False) Ans: [('XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS', '-----------------------------------------------------------------------------------------GGGTTLTTSS', 20.0, 0, 99)] Which parameter can I change here or which other pachage/lightweight free software can compute this? 
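
If only substitutions need to be tolerated (no insertions or deletions), one option that needs no extra package is to slide a window the length of the short sequence along the long one and count mismatches at each offset. The following is a minimal sketch under that assumption, not a Biopython or fuzzysearch API: the function name `find_mismatch_hits` and the parameter `max_mismatches` are invented here for illustration, and the test string is a shortened stand-in for the 'large' sequence.

```python
def find_mismatch_hits(pattern, text, max_mismatches):
    """Return (start, mismatches) for every window of text that matches
    pattern with at most max_mismatches substitutions.

    Hypothetical helper: substitutions only, no insertions or deletions.
    """
    hits = []
    width = len(pattern)
    # every offset at which a full-length window still fits in text
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        # Hamming distance between the pattern and this window
        mismatches = sum(1 for a, b in zip(pattern, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# shortened stand-in for the 'large' sequence: one exact copy of the
# pattern and one copy carrying a single V -> L substitution
large = "XXGGGTTVTTSSAAGGGTTLTTSS"
small = "GGGTTVTTSS"
print(find_mismatch_hits(small, large, 1))  # prints [(2, 0), (14, 1)]
```

Each hit is a 0-based start index plus the number of substitutions at that position, so the end index is start + len(pattern). The scan costs len(text) * len(pattern) character comparisons, which is fine for sequences of this size but would be slow on genome-scale input.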
-- Mary From kevin.rue at ucdconnect.ie Fri Mar 14 05:16:45 2014 From: kevin.rue at ucdconnect.ie (Kevin Rue) Date: Fri, 14 Mar 2014 09:16:45 +0000 Subject: [Biopython] Get all alignments of a sequence against another In-Reply-To: References: Message-ID: Hi Mary, There is one blurry area in your question: how exactly do you define "a location where your small_sequence aligns" ? >From your example, it seems you're not looking for exact matches, but you allow in this case 1 mismatch. Is it a maximal number of mismatches? Do you also want to allow indels? Do you want to control the number of insertions, deletions, substitutions separately? Is a match a local alignment above a score threshold? I would suggest that you have a look at the definition of the Levenshtein distance.( see the example: http://en.wikipedia.org/wiki/Levenshtein_distance#Example). If this metric suits you, for instance to find all the matches of the small_sequences in the large_sequence with a maximal edit distance of 1, you can use one of the Python packages implementing the Levenshtein distance, like "fuzzysearch" (https://pypi.python.org/pypi/fuzzysearch/0.2.0) this way: >>> import fuzzysearch >>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1) The output will find two matches. 
Out[7]: [Match(start=89, end=99, dist=0), Match(start=89, end=99, dist=0)] BUG: I did notice that the second match is reported twice instead and I assume this is a bug where the first match was somehow replaced by the second, which is why I copied Tal (the developer of this package) to this email Another example where I added you sequence (with a mismatch) a third time: >>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1) returns Out[9]: [Match(start=42, end=52, dist=1), Match(start=99, end=109, dist=0), Match(start=99, end=109, dist=0)] You can see three matches, one of the mismatched sequence was detected correctly (edit distance of 1), but the bug seems to duplicate the last match and replace the one before the last match with it. Tal, can you fix that? I will add the issue to your repository :) Cheers Kevin On 13 March 2014 19:57, Mary Kindall wrote: > This is a primitive question but somehow I could not find a solution to it. > I have two sequences 'large' and 'small' as given below. > > >large > > XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS > > > >small > GGGTTVTTSS > > > I need to align the 'small' sequence to the 'large' sequence. Clearly there > are two places where it can be aligned. I need to get indices of both the > locations. I was trying BioPython's "pairwise2.align.globalms" function but > it is only able to align to the second position. 
> > > > pairwise2.align.globalms(largeStr, smallStr, 2, -1, -1, 0, > penalize_end_gaps=False) > Ans: > > [('XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS', > > '-----------------------------------------------------------------------------------------GGGTTLTTSS', > 20.0, > 0, > 99)] > > > > Which parameter can I change here or which other pachage/lightweight free > software can compute this? > > -- > Mary > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- K?vin RUE-ALBRECHT Wellcome Trust Computational Infection Biology PhD Programme University College Dublin Ireland http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en From kevin.rue at ucdconnect.ie Fri Mar 14 05:29:26 2014 From: kevin.rue at ucdconnect.ie (Kevin Rue) Date: Fri, 14 Mar 2014 09:29:26 +0000 Subject: [Biopython] Get all alignments of a sequence against another In-Reply-To: References: Message-ID: Sorry for multiple emails: My mistake, the duplication of the last one does not replace the one before the last, but instead the first match is simply not returned in the output list (even though the right NUMBER of matches is returned). On 14 March 2014 09:16, Kevin Rue wrote: > Hi Mary, > > There is one blurry area in your question: how exactly do you define "a > location where your small_sequence aligns" ? > From your example, it seems you're not looking for exact matches, but you > allow in this case 1 mismatch. Is it a maximal number of mismatches? Do you > also want to allow indels? Do you want to control the number of insertions, > deletions, substitutions separately? Is a match a local alignment above a > score threshold? > > I would suggest that you have a look at the definition of the Levenshtein > distance.( see the example: > http://en.wikipedia.org/wiki/Levenshtein_distance#Example). 
> If this metric suits you, for instance to find all the matches of the > small_sequences in the large_sequence with a maximal edit distance of 1, > you can use one of the Python packages implementing the Levenshtein > distance, like "fuzzysearch" ( > https://pypi.python.org/pypi/fuzzysearch/0.2.0) this way: > > >>> import fuzzysearch > >>> > fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", > 1) > > The output will find two matches. > Out[7]: [Match(start=89, end=99, dist=0), Match(start=89, end=99, dist=0)] > > BUG: > I did notice that the second match is reported twice instead and I assume > this is a bug where the first match was somehow replaced by the second, > which is why I copied Tal (the developer of this package) to this email > > Another example where I added you sequence (with a mismatch) a third time: > > >>> > fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", > 1) > > returns > Out[9]: > [Match(start=42, end=52, dist=1), > Match(start=99, end=109, dist=0), > Match(start=99, end=109, dist=0)] > > You can see three matches, one of the mismatched sequence was detected > correctly (edit distance of 1), but the bug seems to duplicate the last > match and replace the one before the last match with it. > > Tal, can you fix that? I will add the issue to your repository :) > > Cheers > Kevin > > > > > On 13 March 2014 19:57, Mary Kindall wrote: > >> This is a primitive question but somehow I could not find a solution to >> it. >> I have two sequences 'large' and 'small' as given below. >> >> >large >> >> XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS >> >> >> >small >> GGGTTVTTSS >> >> >> I need to align the 'small' sequence to the 'large' sequence. 
Clearly >> there >> are two places where it can be aligned. I need to get indices of both the >> locations. I was trying BioPython's "pairwise2.align.globalms" function >> but >> it is only able to align to the second position. >> >> >> >> pairwise2.align.globalms(largeStr, smallStr, 2, -1, -1, 0, >> penalize_end_gaps=False) >> Ans: >> >> [('XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS', >> >> '-----------------------------------------------------------------------------------------GGGTTLTTSS', >> 20.0, >> 0, >> 99)] >> >> >> >> Which parameter can I change here or which other pachage/lightweight free >> software can compute this? >> >> -- >> Mary >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > -- > K?vin RUE-ALBRECHT > Wellcome Trust Computational Infection Biology PhD Programme > University College Dublin > Ireland > http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en > -- K?vin RUE-ALBRECHT Wellcome Trust Computational Infection Biology PhD Programme University College Dublin Ireland http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en From taleinat at gmail.com Fri Mar 14 06:53:18 2014 From: taleinat at gmail.com (Tal Einat) Date: Fri, 14 Mar 2014 12:53:18 +0200 Subject: [Biopython] Get all alignments of a sequence against another In-Reply-To: References: Message-ID: On Fri, Mar 14, 2014 at 11:16 AM, Kevin Rue wrote: > >>> import fuzzysearch > >>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", > >>> 1) > > The output will find two matches. 
> Out[7]: [Match(start=89, end=99, dist=0), Match(start=89, end=99, dist=0)] > > BUG: > I did notice that the second match is reported twice instead and I assume > this is a bug where the first match was somehow replaced by the second, > which is why I copied Tal (the developer of this package) to this email > > Another example where I added you sequence (with a mismatch) a third time: > >>>> >>>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", >>>> 1) > > returns > Out[9]: > [Match(start=42, end=52, dist=1), > Match(start=99, end=109, dist=0), > Match(start=99, end=109, dist=0)] > > You can see three matches, one of the mismatched sequence was detected > correctly (edit distance of 1), but the bug seems to duplicate the last > match and replace the one before the last match with it. > > Tal, can you fix that? I will add the issue to your repository :) Thanks for bringing this to my attention! Fixed. Upgrade to version 0.2.1 and your example will work as expected. (To upgrade, run: pip install --upgrade fuzzysearch) - Tal Einat From kevin.rue at ucdconnect.ie Fri Mar 14 06:57:36 2014 From: kevin.rue at ucdconnect.ie (Kevin Rue) Date: Fri, 14 Mar 2014 10:57:36 +0000 Subject: [Biopython] Get all alignments of a sequence against another In-Reply-To: References: Message-ID: Cheers! (Man, we're a team ;-) ) Kevin On 14 March 2014 10:53, Tal Einat wrote: > On Fri, Mar 14, 2014 at 11:16 AM, Kevin Rue > wrote: > > >>> import fuzzysearch > > >>> > fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", > > >>> 1) > > > > The output will find two matches. 
> > Out[7]: [Match(start=89, end=99, dist=0), Match(start=89, end=99, > dist=0)] > > > > BUG: > > I did notice that the second match is reported twice instead and I assume > > this is a bug where the first match was somehow replaced by the second, > > which is why I copied Tal (the developer of this package) to this email > > > > Another example where I added you sequence (with a mismatch) a third > time: > > > >>>> > >>>> > fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", > >>>> 1) > > > > returns > > Out[9]: > > [Match(start=42, end=52, dist=1), > > Match(start=99, end=109, dist=0), > > Match(start=99, end=109, dist=0)] > > > > You can see three matches, one of the mismatched sequence was detected > > correctly (edit distance of 1), but the bug seems to duplicate the last > > match and replace the one before the last match with it. > > > > Tal, can you fix that? I will add the issue to your repository :) > > Thanks for bringing this to my attention! Fixed. > > Upgrade to version 0.2.1 and your example will work as expected. > > (To upgrade, run: pip install --upgrade fuzzysearch) > > - Tal Einat > -- K?vin RUE-ALBRECHT Wellcome Trust Computational Infection Biology PhD Programme University College Dublin Ireland http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en From kevin.rue at ucdconnect.ie Fri Mar 14 07:07:46 2014 From: kevin.rue at ucdconnect.ie (Kevin Rue) Date: Fri, 14 Mar 2014 11:07:46 +0000 Subject: [Biopython] Get all alignments of a sequence against another In-Reply-To: References: Message-ID: Hi Mary, Please do let us know if that solution suits you or if the Levenshtein distance metric does not fit your needs. 
The approach below gives you the number of matches (length of the output list), the start and stop positions of the match (be careful about Python 0-based indexing), and the edit distance between each match and the sequence you search for. It's already a good place to start from. Best Kevin On 14 March 2014 10:53, Tal Einat wrote: > On Fri, Mar 14, 2014 at 11:16 AM, Kevin Rue > wrote: > > >>> import fuzzysearch > > >>> > fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", > > >>> 1) > > > > The output will find two matches. > > Out[7]: [Match(start=89, end=99, dist=0), Match(start=89, end=99, > dist=0)] > > > > BUG: > > I did notice that the second match is reported twice instead and I assume > > this is a bug where the first match was somehow replaced by the second, > > which is why I copied Tal (the developer of this package) to this email > > > > Another example where I added you sequence (with a mismatch) a third > time: > > > >>>> > >>>> > fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", > >>>> 1) > > > > returns > > Out[9]: > > [Match(start=42, end=52, dist=1), > > Match(start=99, end=109, dist=0), > > Match(start=99, end=109, dist=0)] > > > > You can see three matches, one of the mismatched sequence was detected > > correctly (edit distance of 1), but the bug seems to duplicate the last > > match and replace the one before the last match with it. > > > > Tal, can you fix that? I will add the issue to your repository :) > > Thanks for bringing this to my attention! Fixed. > > Upgrade to version 0.2.1 and your example will work as expected. 
> > (To upgrade, run: pip install --upgrade fuzzysearch) > > - Tal Einat > -- K?vin RUE-ALBRECHT Wellcome Trust Computational Infection Biology PhD Programme University College Dublin Ireland http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en From taleinat at gmail.com Fri Mar 14 07:11:42 2014 From: taleinat at gmail.com (Tal Einat) Date: Fri, 14 Mar 2014 13:11:42 +0200 Subject: [Biopython] Get all alignments of a sequence against another In-Reply-To: References: Message-ID: On Fri, Mar 14, 2014 at 11:16 AM, Kevin Rue wrote: > >>> import fuzzysearch > >>> fuzzysearch.find_near_matches_with_ngram("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", >>>> 1) By the way, you should usually just use fuzzysearch.find_near_matches(...), which will choose an appropriate search method for you depending on the given parameters. - Tal Einat From mary.kindall at gmail.com Fri Mar 14 11:14:01 2014 From: mary.kindall at gmail.com (Mary Kindall) Date: Fri, 14 Mar 2014 11:14:01 -0400 Subject: [Biopython] Get all alignments of a sequence against another In-Reply-To: References: Message-ID: Hi Tal and Kevin, Thanks for mail and the user friendly package. I was not aware of the existence of "fuzzysearch" package. Kevin, I am allowing the mismatches (to a defined maximum which is a function of the length of the pattern) but there is a strict 'no' for insertions and deletions. I do see that the functions 'fuzzysearch.find_near_matches' and 'fuzzysearch.find_near_matches_with_ngrams' works perfect for mismatches. However, I could not find a way to avoid alignment when there is an insertion or deletion. Is there a way to restrict the maximum distance to mismatches only? Levenshtein distance seems to have the same issue. 
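Restricting matches to substitutions only, as asked above, corresponds to the Hamming distance between the pattern and each equal-length window of the longer sequence. A minimal sketch in plain Python, assuming both inputs are ordinary strings; the function name find_hamming_matches and the toy sequences are hypothetical and are not part of Biopython or fuzzysearch:

```python
def find_hamming_matches(pattern, text, max_mismatches):
    """Return (start, mismatches) for every window of text that is within
    max_mismatches substitutions of pattern (no insertions or deletions)."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    width = len(pattern)
    hits = []
    # Slide a window of the pattern's length over the text.
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        # Hamming distance: count positions where the characters differ.
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Toy sequences invented for illustration; start positions are 0-based.
print(find_hamming_matches("GATTACA", "XXGATTACAXXGATTTCAXX", 1))
```

Because the start positions are 0-based, text[start:start + len(pattern)] recovers each matched window. The scan compares every window in full, so it is simple rather than fast.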
Thanks and regards
Mary

On Fri, Mar 14, 2014 at 7:11 AM, Tal Einat wrote:

> On Fri, Mar 14, 2014 at 11:16 AM, Kevin Rue wrote:
> > >>> import fuzzysearch
> > >>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)
>
> By the way, you should usually just use
> fuzzysearch.find_near_matches(...), which will choose an appropriate
> search method for you depending on the given parameters.
>
> - Tal Einat
>

--
-------------
Mary Kindall
Yorktown Heights, NY
USA

From kevin.rue at ucdconnect.ie Fri Mar 14 11:52:12 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Fri, 14 Mar 2014 15:52:12 +0000
Subject: [Biopython] Get all alignments of a sequence against another
In-Reply-To:
References:
Message-ID:

Hi Mary, Tal,

In the case you describe, the solution to your problem is rather
straightforward to implement. I split it into two functions, pasted below.

The first one, "approx_substitute(str1, str2, max_substitutions)", I
actually implemented yesterday. It expects two strings of the same length
and returns True if there are at most max_substitutions mismatches between
them, comparing the characters at the same position in the two strings.

The function that answers your question is something I just implemented for
you: "list_start_approx_matches_substitutions(str1, str2, max_mismatches)".
It uses the previous function to compare str1 (your small sequence) to each
substring of str2 that has the same length as str1, and keeps the start
position of every positive match. It returns an "array" object, which you
can easily turn into a regular list using array.tolist().
See http://docs.python.org/2/library/array.html

In short, run the two function definitions below, and then use:

>>> list_start_approx_matches_substitutions("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)

It will return:
Out[20]: array('L', [19, 42, 99])

If that's scary, just run:

>>> list_start_approx_matches_substitutions("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1).tolist()
Out[21]: [19, 42, 99]

Kevin

def approx_substitute(str1, str2, max_substitutions):
    """Checks that str1 is at most max_substitutions away from str2.

    Note: No insertions or deletions are allowed. Sequences of different
    length will automatically return False.

    Args:
        str1, str2, max_substitutions

    Returns:
        Boolean. True if str1 is at most max_substitutions away from str2,
        False otherwise.
    """
    # Handle the simple case that does not require parsing the sequences.
    if len(str1) != len(str2):
        return False
    # If we reach here, we know that the two strings are the same length,
    # therefore len(str1) is synonymous with len(str2)
    # Initialise a counter of substitutions between the two strings
    substitutions = 0
    # For each (index, character) value pair in str1
    for (index, str1_char) in enumerate(str1):
        # add 1 if the characters from str1 and str2 are different,
        # add 0 otherwise
        substitutions += (0 if str1_char == str2[index] else 1)
        # time saver: if the counter exceeds max_substitutions at some
        # stage, don't bother checking the rest...
        if substitutions > max_substitutions:
            # ... just return False
            return False
    # If max_substitutions is never exceeded, the function will eventually
    # leave the loop above. The simple fact of arriving here proves that
    # str1 is at most max_substitutions away from str2, therefore return True
    return True


from array import array

# define a function that returns the start position of all matches of a
# given str1 in a given larger str2, given a maximum number of mismatches
# allowed
def list_start_approx_matches_substitutions(str1, str2, max_mismatches):
    # Initialise an empty array to save the start positions of the matches
    starts = array('L')
    # for each substring of str2 which is the same length as str1
    for i in range(len(str2)-len(str1)+1):
        # if there are at most max_mismatches mismatches between str1 and
        # the substring of str2
        if approx_substitute(str1, str2[i:i+len(str1)], max_mismatches):
            # save the start position of the match (the end position can be
            # deduced from the length of str1; the only information lost is
            # the number of mismatches between str1 and the substring)
            starts.append(i)
    return starts

On 14 March 2014 15:14, Mary Kindall wrote:

> Hi Tal and Kevin,
> Thanks for mail and the user friendly package. I was not aware of the
> existence of "fuzzysearch" package.
>
> Kevin, I am allowing the mismatches (to a defined maximum which is a
> function of the length of the pattern) but there is a strict 'no' for
> insertions and deletions.
>
> I do see that the functions 'fuzzysearch.find_near_matches' and
> 'fuzzysearch.find_near_matches_with_ngrams' works perfect for mismatches.
> However, I could not find a way to avoid alignment when there is an
> insertion or deletion.
>
> Is there a way to restrict the maximum distance to mismatches only?
> Levenshtein distance seems to have the same issue.
>
> Thanks and regards
> Mary
>
> On Fri, Mar 14, 2014 at 7:11 AM, Tal Einat wrote:
>
>> On Fri, Mar 14, 2014 at 11:16 AM, Kevin Rue wrote:
>> > >>> import fuzzysearch
>> > >>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)
>>
>> By the way, you should usually just use
>> fuzzysearch.find_near_matches(...), which will choose an appropriate
>> search method for you depending on the given parameters.
>>
>> - Tal Einat
>>
>
> --
> -------------
> Mary Kindall
> Yorktown Heights, NY
> USA
>

--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en

From taleinat at gmail.com Fri Mar 14 14:01:51 2014
From: taleinat at gmail.com (Tal Einat)
Date: Fri, 14 Mar 2014 20:01:51 +0200
Subject: [Biopython] Get all alignments of a sequence against another
In-Reply-To:
References:
Message-ID:

On Fri, Mar 14, 2014 at 5:52 PM, Kevin Rue wrote:
> Hi Mary, Tal,
>
> In that case you describe, the solution to your problem is rather
> straightforward to implement.. I split it in two functions pasted below.

Kevin, that's a nice solution!

Here's a somewhat more efficient solution, based on the same basic
principles but implemented with some optimizations and non-trivial index
juggling. This will be included in future versions of fuzzysearch.

Raw code is below; for the next month or so a highlighted version can be
found at: dpaste.com/1728155/

I still feel we're reinventing the wheel here. Surely it is possible to do
this with BioPython. Unfortunately, I too couldn't easily figure out how to
do so from reading the documentation and a bit of trial and error.

- Tal Einat

from collections import deque, defaultdict, namedtuple
from itertools import islice

Match = namedtuple('Match', ['start', 'end', 'dist'])


def find_near_matches_only_substitutions(subsequence, sequence,
                                         max_substitutions):
    """search for near-matches of subsequence in sequence

    This searches for near-matches, where the nearly-matching parts of the
    sequence must meet the following limitations (relative to the
    subsequence):

    * the number of character substitutions must be at most
      max_substitutions
    * no deletions or insertions are allowed
    """
    if not subsequence:
        raise ValueError('Given subsequence is empty!')

    # simple optimization: prepare some often used things in advance
    _SUBSEQ_LEN = len(subsequence)
    _SUBSEQ_LEN_MINUS_ONE = _SUBSEQ_LEN - 1

    # prepare quick lookup of where a character appears in the subsequence
    char_indexes_in_subsequence = defaultdict(list)
    for (index, char) in enumerate(subsequence):
        char_indexes_in_subsequence[char].append(index)

    # we'll iterate over the sequence once, but the iteration is split into
    # two for loops; therefore we prepare an iterator in advance which will
    # be used in both of the loops
    sequence_enum_iter = enumerate(sequence)

    # We'll count the number of matching characters assuming various
    # attempted alignments of the subsequence to the sequence. At any point
    # in the sequence there will be N such alignments to update. We'll keep
    # these in a "circular array" (a.k.a. a ring) which we'll rotate after
    # each iteration to re-align the indexing.

    # Initialize the candidate counts by iterating over the first N-1 items
    # in the sequence. No possible matches in this step!
    candidates = deque([0], maxlen=_SUBSEQ_LEN)
    for (index, char) in islice(sequence_enum_iter, _SUBSEQ_LEN_MINUS_ONE):
        for subseq_index in [idx for idx in char_indexes_in_subsequence[char]
                             if idx <= index]:
            candidates[subseq_index] += 1
        candidates.appendleft(0)

    matches = []

    # From the N-th item onwards, we'll update the candidate counts exactly
    # as above, and additionally check if the part of the sequence which
    # began N-1 items before the current index was a near enough match to
    # the given sub-sequence.
    for (index, char) in sequence_enum_iter:
        for subseq_index in char_indexes_in_subsequence[char]:
            candidates[subseq_index] += 1

        # rotate the ring of candidate counts
        candidates.rotate(1)
        # fetch the count for the candidate which started N-1 items ago
        n_substitutions = _SUBSEQ_LEN - candidates[0]
        # set the count for the next index to zero
        candidates[0] = 0

        # if the candidate had few enough mismatches, record a match
        if n_substitutions <= max_substitutions:
            matches.append(Match(
                start=index - _SUBSEQ_LEN_MINUS_ONE,
                end=index + 1,
                dist=n_substitutions,
            ))

    return matches

From eric.talevich at gmail.com Sat Mar 15 01:29:21 2014
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 14 Mar 2014 22:29:21 -0700
Subject: [Biopython] Google Summer of Code 2014: Call for student applications
Message-ID:

Hi everyone,

Google Summer of Code is an annual program that funds students all over the
world to work with open-source software projects to develop new code.

This summer, the Open Bioinformatics Foundation (OBF) is taking on students
through the Google Summer of Code program to work with mentors on
established bioinformatics software projects including BioPython.

We invite students to submit applications by Friday, March 21.
Full details are here: http://news.open-bio.org/news/2014/03/obf-gsoc-2014-call-for-student-applications/ All the best, Eric & Raoul OBF GSoC organization admins From taleinat at gmail.com Sat Mar 15 14:59:07 2014 From: taleinat at gmail.com (Tal Einat) Date: Sat, 15 Mar 2014 20:59:07 +0200 Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching? In-Reply-To: References: Message-ID: On Thu, Mar 13, 2014 at 8:08 PM, Kevin Rue wrote: > Hi Tal, > > I finally had the time to look at your function in details, and understood > how it works and why each line is there. Thanks, even though it doesn't do > exactly what I had in mind, it does what the docstring says :) > I'd call your "kevin" function "approxEqual(str1, str2, max_ins, max_sub)". > > > When I understood your code, I thought of a way to increase the speed of > your function, but ended with a (fast) function actually doing something > slightly different. This one would actually only look at the str2 substrings > from str1_idx but which are no longer than len(str1)+max_insertions. Longer > str2 are pointless as they imply that str1 requires more insertions at the > start than allowed to match str2. > I would call that function "approxStartsWith(str1, str2, max_ins, max_sub)" > > def approxStartsWith(str1, str2, max_substitutions, max_insertions): > """check if it is possible to map str1 to the start of str2 given > limitations > > The limitations are the maximum allowed number of new characters > inserted > and the maximum allowed number of character substitutions. 
> """ > # check simple cases which are obviously impossible > if not len(str1) <= len(str2): > return False > > scores = array('L', [0] * (max_insertions + 1)) > new_scores = scores[:] > > for (str1_idx, char1) in enumerate(str1): > # make min() always take the other value in the first iteration of > the > # inner loop > prev_score = len(str2) > for (n_insertions, char2) in enumerate( > str2[str1_idx:max_insertions+str1_idx+1] > ): > new_scores[n_insertions] = prev_score = min(scores[n_insertions] > + (0 if char1 == char2 else 1), prev_score ) > > # swap scores <-> new_scores > scores, new_scores = new_scores, scores > > return min(scores) <= max_substitutions > > > Now, an approxWithin(str1, str2, max_ins, max_sub) could be simply > implemented as: > > def approxWithin(str1, str2, max_ins, max_sub): > """check if it is possible to find str1 within str2 given limitations > > The limitations are the maximum allowed number of new characters inserted > and the maximum allowed number of character substitutions. > """ > for str2_idx in range(len(str2)-len(str1)+1): > print (str2_idx) > result = approxStartsWith(str1, str2[str2_idx:], max_ins, max_sub) > if result: > return result > return False > > Comments? Aha! So you are, in fact, searching for a short sequence in a long sequence (or many such long sequences). This -is- exactly what fuzzysearch is meant for! Regarding the code you posted, it looks like it would work (though I haven't tested it). It certainly isn't very efficient, however, since it makes many copies of long parts of str2 (see "str2[str2_idx:]"). It is also fairly straightforward, leaving some room for further optimization. I do like that it is very readable and easy to understand! Well, except for the loop in approxStartsWith(), but that's based on my own code... Inspired by your use-case, I've added highly generic fuzzy searching functionality to fuzzysearch. 
You can now limit the number of substitutions, insertions and deletions as well as their total (i.e. limit the Levenshtein distance). You can also limit only some of these as you like. Specifically, this supports your use-case of searching for fuzzy matches allowing only a limited number of substitutions and insertions, but no deletions. The user-friendly utility function fuzzysearch.find_near_matches() now accepts parameters for limiting the substitutions etc., and chooses a suitable implementation based on the given parameters. I haven't yet implemented an optimized search function allowing only substitutions and insertions. If the current version not fast enough for your needs, there are plenty of optimizations still to be done. I'd be happy if you could give it a whirl and tell me what happens! I haven't released it yet, but you can install the latest development version using pip (you'll need to have git installed for this): pip install --upgrade git+git://github.com/taleinat/fuzzysearch.git#egg=fuzzysearch - Tal From lluis.revilla at gmail.com Mon Mar 17 15:09:31 2014 From: lluis.revilla at gmail.com (=?ISO-8859-1?Q?Llu=EDs_Revilla?=) Date: Mon, 17 Mar 2014 20:09:31 +0100 Subject: [Biopython] Google Summer of Code 2014: Student application Message-ID: Hi everyone, I am a Biotechnology student and I want to contribute to Biopython. I have read the wiki GSoC page and I found two ideas. But I think I don't have the desired skills, I am not much familiarized with the Biopython's existing sequence parsing yet ("Indexing & Lazy-loading Sequence Parsers"), or with javascript ("Interactive GenomeDiagram Module"). So I am thinking to make a proposal for the Google Summer of Code about a comparing tool. My idea comes from the following: I have been several time in charge of selecting a tool to do a certain process e.g.: A list of predicted genes, a list of possible structures, a list of alignments... 
But usually in bioinformatics there are many programs to do the same thing, usually they use a different algorithm a different training set data (prokaryote, eukaryote ), or have different specifications. And they return a more or less sophisticated list, in some standard format, FASTA, GFF, Genebank... The problem when starting a project is to select from this different programs which one use for the task, e.g.: Which gene predictor is better for prokaryote: Glimmer, EasyGene, GeneMarker, Prodigal, AUGUSTUS...? The answer will be specific to the project but sometimes its difficult to ensure that it is a good selection. (Other times it is good enough to do what the majority do.) But does not solve the problem when new algorithms appears, or even to compare between different program versions. To cover this problem I would like to develop for Biopython a module to compare between the different programs output to asses which one is better for the task. Currently I developed a parser for the afford mentioned programs and it compares them in a (very) rude way. I would like to develop further and release it to the Biopython community. What are your thoughts about this idea? Thanks, Llu?s From kevin.rue at ucdconnect.ie Tue Mar 18 06:34:21 2014 From: kevin.rue at ucdconnect.ie (Kevin Rue) Date: Tue, 18 Mar 2014 10:34:21 +0000 Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching? In-Reply-To: References: Message-ID: Hi Tal, In my particular case, I am searching for a 36-characters sequence in a 90-characters one. If you're really curious, you can have a look at RNA-sequencing. It's technique in biology where we obtain the sequence of RNA molecules expressed from the genome. But before the analysis of those sequences (we usually call them "reads"), we have to identify and filter out reads that contain an "adapter" sequence used by our machines. 
The problem is that the "adapter" sequence can have mistakes in the read, hence the fuzzy matching. Haven't tried your code yet, I'll try when I get a chance, but I've got a few other tasks to deal with for my own work first . Cheers, Kevin On 15 March 2014 18:59, Tal Einat wrote: > On Thu, Mar 13, 2014 at 8:08 PM, Kevin Rue > wrote: > > Hi Tal, > > > > I finally had the time to look at your function in details, and > understood > > how it works and why each line is there. Thanks, even though it doesn't > do > > exactly what I had in mind, it does what the docstring says :) > > I'd call your "kevin" function "approxEqual(str1, str2, max_ins, > max_sub)". > > > > > > When I understood your code, I thought of a way to increase the speed of > > your function, but ended with a (fast) function actually doing something > > slightly different. This one would actually only look at the str2 > substrings > > from str1_idx but which are no longer than len(str1)+max_insertions. > Longer > > str2 are pointless as they imply that str1 requires more insertions at > the > > start than allowed to match str2. > > I would call that function "approxStartsWith(str1, str2, max_ins, > max_sub)" > > > > def approxStartsWith(str1, str2, max_substitutions, max_insertions): > > """check if it is possible to map str1 to the start of str2 given > > limitations > > > > The limitations are the maximum allowed number of new characters > > inserted > > and the maximum allowed number of character substitutions. 
> > """ > > # check simple cases which are obviously impossible > > if not len(str1) <= len(str2): > > return False > > > > scores = array('L', [0] * (max_insertions + 1)) > > new_scores = scores[:] > > > > for (str1_idx, char1) in enumerate(str1): > > # make min() always take the other value in the first iteration > of > > the > > # inner loop > > prev_score = len(str2) > > for (n_insertions, char2) in enumerate( > > str2[str1_idx:max_insertions+str1_idx+1] > > ): > > new_scores[n_insertions] = prev_score = > min(scores[n_insertions] > > + (0 if char1 == char2 else 1), prev_score ) > > > > # swap scores <-> new_scores > > scores, new_scores = new_scores, scores > > > > return min(scores) <= max_substitutions > > > > > > Now, an approxWithin(str1, str2, max_ins, max_sub) could be simply > > implemented as: > > > > def approxWithin(str1, str2, max_ins, max_sub): > > """check if it is possible to find str1 within str2 given limitations > > > > The limitations are the maximum allowed number of new characters inserted > > and the maximum allowed number of character substitutions. > > """ > > for str2_idx in range(len(str2)-len(str1)+1): > > print (str2_idx) > > result = approxStartsWith(str1, str2[str2_idx:], max_ins, max_sub) > > if result: > > return result > > return False > > > > Comments? > > Aha! So you are, in fact, searching for a short sequence in a long > sequence (or many such long sequences). This -is- exactly what > fuzzysearch is meant for! > > > Regarding the code you posted, it looks like it would work (though I > haven't tested it). It certainly isn't very efficient, however, since > it makes many copies of long parts of str2 (see "str2[str2_idx:]"). It > is also fairly straightforward, leaving some room for further > optimization. I do like that it is very readable and easy to > understand! Well, except for the loop in approxStartsWith(), but > that's based on my own code... 
> > > Inspired by your use-case, I've added highly generic fuzzy searching > functionality to fuzzysearch. You can now limit the number of > substitutions, insertions and deletions as well as their total (i.e. > limit the Levenshtein distance). You can also limit only some of these > as you like. > > Specifically, this supports your use-case of searching for fuzzy > matches allowing only a limited number of substitutions and > insertions, but no deletions. > > The user-friendly utility function fuzzysearch.find_near_matches() now > accepts parameters for limiting the substitutions etc., and chooses a > suitable implementation based on the given parameters. > > I haven't yet implemented an optimized search function allowing only > substitutions and insertions. If the current version not fast enough > for your needs, there are plenty of optimizations still to be done. > > I'd be happy if you could give it a whirl and tell me what happens! I > haven't released it yet, but you can install the latest development > version using pip (you'll need to have git installed for this): > > pip install --upgrade > git+git://github.com/taleinat/fuzzysearch.git#egg=fuzzysearch > > - Tal > -- K?vin RUE-ALBRECHT Wellcome Trust Computational Infection Biology PhD Programme University College Dublin Ireland http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en From eric.talevich at gmail.com Tue Mar 18 19:30:42 2014 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 18 Mar 2014 16:30:42 -0700 Subject: [Biopython] Google Summer of Code 2014: Student application In-Reply-To: References: Message-ID: On Mon, Mar 17, 2014 at 12:09 PM, Llu?s Revilla wrote: > Hi everyone, > > I am a Biotechnology student and I want to contribute to Biopython. I have > read the wiki GSoC page and I found two ideas. 
But I think I don't have the > desired skills, I am not much familiarized with the Biopython's existing > sequence parsing yet ("Indexing & Lazy-loading Sequence Parsers"), or with > javascript ("Interactive GenomeDiagram Module"). So I am thinking to make > a proposal for the Google Summer of Code about a comparing tool. > > My idea comes from the following: I have been several time in charge of > selecting a tool to do a certain process e.g.: A list of predicted genes, a > list of possible structures, a list of alignments... > > But usually in bioinformatics there are many programs to do the same thing, > usually they use a different algorithm a different training set data > (prokaryote, eukaryote ), or have different specifications. And they return > a more or less sophisticated list, in some standard format, FASTA, GFF, > Genebank... > > The problem when starting a project is to select from this different > programs which one use for the task, e.g.: Which gene predictor is better > for prokaryote: Glimmer, EasyGene, GeneMarker, Prodigal, AUGUSTUS...? The > answer will be specific to the project but sometimes its difficult to > ensure that it is a good selection. (Other times it is good enough to do > what the majority do.) But does not solve the problem when new algorithms > appears, or even to compare between different program versions. > > To cover this problem I would like to develop for Biopython a module to > compare between the different programs output to asses which one is better > for the task. > Currently I developed a parser for the afford mentioned programs and it > compares them in a (very) rude way. I would like to develop further and > release it to the Biopython community. > > What are your thoughts about this idea? > Thanks, > > Llu?s > Hi Llu?s, This is an interesting idea, though a bit broad. 
You could maybe find some inspiration or focus by looking at the
Critical Assessment of Functional Annotation (CAFA):
http://biofunctionprediction.org/

Perhaps Iddo Friedberg or another AFP enthusiast could comment on how this
project could support benchmarking of automated annotations.

On the technical side, I also recommend looking at nestly, a program that
will execute another specific command-line program with a variety of
different parameters and automatically organize, summarize and compare the
outputs.
http://fhcrc.github.io/nestly/

All the best,
Eric

From lluis.revilla at gmail.com Wed Mar 19 06:12:26 2014
From: lluis.revilla at gmail.com (Lluís Revilla)
Date: Wed, 19 Mar 2014 11:12:26 +0100
Subject: [Biopython] Google Summer of Code 2014: Student application
In-Reply-To:
References:
Message-ID:

Dear Eric and all,

I summarize here some of the comments you made on the proposal:

1. It is a bit broad (Eric)
2. Does it provide a common visual representation of the different
   inputs? (Christian)
3. Is it supposed to actually rank different tools / outputs? If so, that
   is a surprisingly hard problem (Christian and bow)
4. Difficult, and difficult to fit in Biopython (bow)
5. Useful just once for each task (bow)
6. More useful to write parsers using a common object model, but
   generalizing their outputs is also not a trivial task (bow)

And here are my comments:

1. It is intended to be broad, to be applied not just to gene predictors
   but also to RNA secondary structure predictors, ncRNA predictors,
   functional site predictors, or protein secondary or even tertiary
   structure predictors.
2. Well, my initial thought was to compare their results, but to do so
   they need to be in the same format, so a common visual representation
   could be added.
3. If there is a reference to compare the programs against, it is not so
   hard, but then comparing the programs with each other loses its point.
   What I actually did was rank them according to how much they share
   with each other and how much they differ.
   If they are supposed to do the same thing, their results should tend
   to be the same; at least this can set apart some very deviant
   programs, although it doesn't ensure that the other ones are the
   wrong ones.
4. I agree; that is why I mailed the list, to find out whether it would
   fit or not, and how useful it would be.
5. Even if it is useful only once, program versions can change and then
   they will need to be evaluated again (if they keep the output format
   it would still work), and not all projects look for the same type of
   result even for the same task. Some would like to test against a
   reference what happens with false-positive gene predictions, or want
   the minimum false rate even if they get just 40% of the annotated
   genes. But in the main it is true that it would be used just once.
6. As it would be part of my idea, I could write the parsers. The common
   object could hold the essential information, and each parser could
   then add the output information particular to each program.

In short: it seems either too difficult or beyond my skills to complete my
idea, and there are doubts about whether it fits in the Biopython library.
If it is more useful, I can change my proposal to coding parsers for gene
predictors or any other program not already parsed in Biopython.

Thanks all for your comments and feedback; I will be glad to read more
comments and improve or change my proposal.

Best,
Lluís

From ericmajinglong at gmail.com Thu Mar 20 11:15:46 2014
From: ericmajinglong at gmail.com (Eric Ma)
Date: Thu, 20 Mar 2014 11:15:46 -0400
Subject: [Biopython] Are there tools for automatically parsing glycan names into tree structures?
Message-ID:

Hi everybody,

Many apologies if you have seen this post cross-posted elsewhere. I have
tried digging around but could not find an answer to my question.

My colleague and I are working on a project involving data produced at a
glycan microarray facility. The array data that came back to us were a list
of glycan names (in the format (random example from the top of my head):
GlcNAc...). We would like to parse the list of 610 names into the graphical
representation of the glycan.

Is this possible? If so, what tools are available to get this done?

Thank you!

Cheers,
Eric
----------
w: http://about.me/ericmjl
L: http://www.linkedin.com/in/ericmjl
#: (857) 209-1375

From zruan1991 at gmail.com Thu Mar 20 16:21:23 2014
From: zruan1991 at gmail.com (Zheng Ruan)
Date: Thu, 20 Mar 2014 16:21:23 -0400
Subject: [Biopython] Exonerate Parser Error
Message-ID:

Hi,

I'm trying to use Bio.SearchIO to parse a file generated by exonerate. It
is a pairwise alignment between a protein sequence and a nucleotide
sequence. I notice that if I put the protein sequence first, SearchIO can
happily parse the file, but if the nucleotide sequence comes first, it will
raise an error.
Here is an example exonerate output that failed the parser:

Command line: [exonerate --showvulgar no --showalignment yes nuc.fa pro.fa]
Hostname: [localhost.localdomain]

C4 Alignment:
------------
         Query: dna
        Target: protein
         Model: ungapped:dna2protein
     Raw score: 214
   Query range: 2 -> 116
  Target range: 314 -> 352

    3 : CAGTCCGTTCCNAAAAGGCCCGCTGGCTCTGTGCAGAATCCTGTCTATCACAATCAGCCTCTGA :  66
        GlnSerValProLysArgProAlaGlySerValGlnAsnProValTyrHisAsnGlnProLeuA
        ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
  315 : GlnSerValProLysArgProAlaGlySerValGlnAsnProValTyrHisAsnGlnProLeuA : 336

   67 : ACCCCGCGCCCAGCAGAGACCCACACTACCAGGACCCCCACAGCACTGCA : 116
        snProAlaProSerArgAspProHisTyrGlnAspProHisSerThrAla
        ||||||||||||||||||||||||||||||||||||||||||||||||||
  337 : snProAlaProSerArgAspProHisTyrGlnAspProHisSerThrAla : 352

-- completed exonerate analysis

Here is the error I get:

>>> from Bio.SearchIO import read
/home/rz/code/biopython/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release.
  BiopythonExperimentalWarning)
>>> p = read('nuc_pro.exn', 'exonerate-text')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rz/code/biopython/Bio/SearchIO/__init__.py", line 359, in read
    first = next(generator)
  File "/home/rz/code/biopython/Bio/SearchIO/__init__.py", line 316, in parse
    for qresult in generator:
  File "/home/rz/code/biopython/Bio/SearchIO/ExonerateIO/_base.py", line 235, in __iter__
    for qresult in self._parse_qresult():
  File "/home/rz/code/biopython/Bio/SearchIO/ExonerateIO/_base.py", line 361, in _parse_qresult
    hsp = _create_hsp(prev_hid, prev_qid, prev['hsp'])
  File "/home/rz/code/biopython/Bio/SearchIO/ExonerateIO/_base.py", line 187, in _create_hsp
    frags = _adjust_aa_seq(frags)
  File "/home/rz/code/biopython/Bio/SearchIO/ExonerateIO/_base.py", line 41, in _adjust_aa_seq
    assert frag.query_strand == 0
AssertionError

Thanks,
Zheng Ruan

From w.arindrarto at gmail.com Thu Mar 20 16:31:05 2014
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 20 Mar 2014 21:31:05 +0100
Subject: [Biopython] Exonerate Parser Error
In-Reply-To:
References:
Message-ID:

Hi Zheng,

> I'm trying to use Bio.SearchIO to parse a file generated by exonerate. It
> is a pairwise alignment between a protein sequence and nucleotide sequence.
> I notice that if I put the protein sequence first, SearchIO can happily
> parse the file, but if the nucleotide sequence comes first, it will raise
> an error.

Formatting on the plaintext mail seems inadequate for me at the moment.
Would you mind sending me the file that contains the alignment? If
it's too big, partial files are ok, too.

Looking at our test cases, this particular case may have slipped
testing. We do test for several cases of dna2protein (which could
explain why it works when the nucleotide sequence comes first), but
not protein2dna. Please let me know if I can also use your example as
a test in our test corpus :).

Cheers,
Bow

From w.arindrarto at gmail.com Thu Mar 20 16:33:38 2014
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 20 Mar 2014 21:33:38 +0100
Subject: [Biopython] Exonerate Parser Error
In-Reply-To:
References:
Message-ID:

> Looking at our test cases, this particular case may have slipped
> testing. We do test for several cases of dna2protein (which could
> explain why it works when the nucleotide sequence comes first), but
> not protein2dna. Please let me know if I can also use your example as
> a test in our test corpus :).

Oops, I meant the reverse ~ we have several test cases for protein2dna
which may explain why it works when the protein sequence comes first
;).

From w.arindrarto at gmail.com Thu Mar 20 19:30:40 2014
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Fri, 21 Mar 2014 00:30:40 +0100
Subject: [Biopython] Exonerate Parser Error
In-Reply-To:
References:
Message-ID:

Hi Zheng,

Thank you for the files :). I found out what was causing the error and
have pushed a patch along with some tests to our codebase
(https://github.com/biopython/biopython/commit/377889b05235c2e6f192916fb610d0da01b45c6d).
You should be able to parse your file using the latest `master`
branch.

Hope this helps,
Bow

On Thu, Mar 20, 2014 at 9:42 PM, Zheng Ruan wrote:
> Hi Bow,
>
> I'm happy to provide the example for testing. See attachment.
>
> The command to generate the output above.
> exonerate --showvulgar no --showalignment yes nuc.fa pro.fa
>
> I'll check the test suite to see if I can find why.
>
> Best,
> Zheng

From zruan1991 at gmail.com Fri Mar 21 10:39:29 2014
From: zruan1991 at gmail.com (Zheng Ruan)
Date: Fri, 21 Mar 2014 10:39:29 -0400
Subject: [Biopython] Exonerate Parser Error
In-Reply-To:
References:
Message-ID:

Thanks Bow,

That works for me. But it seems the parser doesn't take the nucleotide
information into the hsps. All I get is a pairwise alignment between two
proteins. Nucleotide information is useful because I want to know the codon
-- amino acid correspondence. In the case of frameshift the situation may
not be that straightforward. Maybe you have other reasons for not doing
this.

Best,
Zheng

From w.arindrarto at gmail.com Fri Mar 21 10:59:40 2014
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Fri, 21 Mar 2014 15:59:40 +0100
Subject: [Biopython] Exonerate Parser Error
In-Reply-To:
References:
Message-ID:

Hi Zheng,

The nucleotide information is stored as the alignment annotation. You
can access it using hsp.aln_annotation['query_annotation']. There,
they are stored as triplets, representing the codons.

This is indeed a tradeoff that I had to make because there is no
proper model yet to represent alignment objects containing sequences
with different length in our master branch. In this case, the length
of the DNA is most of the time 3x the length of the protein. And yes,
this is not ideal since the actual query is now stored as an
annotation ~ trading places with the translated query. HSPs themselves
are basically modelled based on our MultipleSeqAlignment objects (you
can get such objects when accessing the `aln` attribute from an HSP
object). I think in order to properly model these types of alignment,
we need to have a proper model of three-letter protein Seq objects as
well.

Your CodonSeqAlignment object may help here :), but I have not looked
into it that much to be honest. How does it work with Seq objects with
ProteinAlphabet? Is it possible to align protein and codon sequences?

I tried storing as much information as possible using the current
approach (e.g. notice the start and end coordinates of each hit and
query, they are parsed from the file and the difference is not the
same as the value you get when doing a `len` on hsp.query and/or
hsp.hit).
Note also that when dealing with frameshifts, you may want
to access the hsp.fragments attribute, since frameshifts mean that you
can further break your HSP alignment into multiple subalignments
(fragments, as they are called in SearchIO).

Hope this helps :),
Bow

P.S. Also CC-ing the Development list ~ this looks like something
interesting for dev in general.

From zruan1991 at gmail.com Fri Mar 21 15:32:33 2014
From: zruan1991 at gmail.com (Zheng Ruan)
Date: Fri, 21 Mar 2014 15:32:33 -0400
Subject: [Biopython] Exonerate Parser Error
In-Reply-To:
References:
Message-ID:

Hi Bow,

I have the same problem when trying to model codon alignments with
frameshifts being considered. Basically, I have a CodonSeq object to store
a coding sequence. The only difference between a CodonSeq and a Seq object
is that CodonSeq has an attribute -- `rf_table` (reading frame table). It's
actually a list of the positions at which each codon starts, so that the
translate() method will go through the list to translate each codon into an
amino acid. In this case, it is easy to store a coding sequence with
frameshift events. And it's not necessary to split the protein to dna
alignment into multiple parts when a frameshift occurs. However, the
problem now becomes how to obtain such information (`rf_table`). I find
exonerate is quite capable of handling this task, especially with introns
in the dna. I do think an object to store protein to dna alignments is
necessary in this scenario.

Best,
Zheng

From ferreirafm at usp.br Tue Mar 25 09:48:35 2014
From: ferreirafm at usp.br (Frederico Moraes Ferreira)
Date: Tue, 25 Mar 2014 10:48:35 -0300
Subject: [Biopython] Levenshtein vs. blast sequence similarity
Message-ID: <53318933.90203@usp.br>

Biopython list,

Sorry about this perhaps off-topic question, which concerns the use more
than the algorithmic implementation of sequence similarity tools. Feel free
to send answers directly to my e-mail if you judge it inappropriate for the
list.

I would like to compare the sequence similarity (Blast "Positive" output)
and/or the Levenshtein score of four groups of sequences (variable region!)
against a given peptide and use a multiple comparison test to support the
hypothesis that such peptide is more closely related to one group than
another. My original implementation used the ratio between the Blast
positive score and the peptide length. Well, I've read that the Levenshtein
distance is generally considered to be more suitable for distance measures
of biological sequences. On the other hand, similarity includes additional
information like conservative and semi-conservative replacements. So, I'm
writing to ask your opinion about this topic and perhaps get another score
function to tackle this problem.

Any comments are appreciated.

Best,
Fred

P.S.: at the moment I'm ignoring the multiple BLAST HSP matches and
considering only the highest positives per comparison mate.

From asmariyaz23 at gmail.com Wed Mar 26 13:39:17 2014
From: asmariyaz23 at gmail.com (Asma Riyaz)
Date: Wed, 26 Mar 2014 13:39:17 -0400
Subject: [Biopython] GenomeDiagram: scale down the the track
Message-ID:

Hi,

I am using GenomeDiagram to show some gene IDs I am interested in. However,
the sequences I am using are huge, and since a feature may span only a
couple of thousand base pairs, its sigil comes out very small.
Here is a snippet of my code:

    gd_track_for_features = gd_diagram.new_track(1, name=seq_record.name, greytrack=True,
                                                 start=0, end=len(seq_record))
    gd_feature_set = gd_track_for_features.new_set()
    if rec in oo_list:
        max_len = max(max_len, len(seq_record))
        for feature in seq_record.features:
            if feature.type == "gene":
                try:
                    name = feature.qualifiers['gene']
                    if name[0] in ids:
                        gd_feature_set.add_feature(feature, sigil="ARROW",
                                                   arrowshaft_height=1.0, arrowhead_length=1.0,
                                                   color=idToColorDict[name[0]],
                                                   label=True, name=name[0], label_position="start",
                                                   label_color=idToColorDict[name[0]],
                                                   label_size=10, label_angle=0)
                except KeyError:
                    pass
    gd_diagram.draw(format="linear", pagesize='A4', fragments=1, start=0, end=7000000)

Here is one thing I tried to do:

When adding a new track, I specified start=0 and end=100 or 1000 or
10,000. However, this doesn't seem to scale down the track in any way;
instead what I see are empty tracks, as the features I am looking at do not
exist in 1 to 1000 or 1 to 10,000.
I attempted the same with "draw" and specified a different end, but neither
of them worked.

How could I display all the features that I need to?

Thanks,
Asma

From p.j.a.cock at googlemail.com Wed Mar 26 13:47:56 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 26 Mar 2014 17:47:56 +0000
Subject: [Biopython] GenomeDiagram: scale down the the track
In-Reply-To:
References:
Message-ID:

On Wed, Mar 26, 2014 at 5:39 PM, Asma Riyaz wrote:
> Hi,
>
> I am using GenomeDiagram to show some gene Id I was interested in. However
> the length of the sequences I am using are huge and hence when I add a
> feature, and in my case the feature may run only a couple of thousand base
> pairs due to which the sigil is smaller.
> Here is a snippet of my code:
>
> ...
> gd_feature_set.add_feature(feature,sigil="ARROW",arrowshaft_height=1.0,arrowhead_length=1.0,color=idToColorDict[name[0]],
> ...
> gd_diagram.draw(format="linear",pagesize='A4',fragments=1,start=0,end=7000000)
>
> Here is one thing I tried to do:
>
> when adding a new track, i specified start=0 and end=100 or 1000 or
> 10,0000-----> however this doesn't seem to scale down the track in anyway,
> instead what I see are empty tracks as the feature I am looking at do not
> exist in 1 to 1000 or 1 to 10,000.
> I attempted the same with "draw" and specified a different end BUT neither
> of it worked.
>
> How could I display all the features that I need to ?
>
> Thanks,
> Asma

See the "Multiple tracks" example in the tutorial - you can show
just sub-regions of different tracks (giving white space to the
left and/or right). This is useful when the tracks do not show
the same sequence exactly (e.g. with cross-links).

What you want to change is the start/end in this line:

gd_diagram.draw(...)

Peter

From asmariyaz23 at gmail.com Wed Mar 26 14:26:55 2014
From: asmariyaz23 at gmail.com (Asma Riyaz)
Date: Wed, 26 Mar 2014 14:26:55 -0400
Subject: [Biopython] Graphics
Message-ID:

GenomeDiagram

I am using the multiple tracks example in the tutorial as my base,
selecting only "gene" features whose ID exists in my list, and hence I can
see the white space to the left and right of the feature.

I specified a lower "end" in gd_diagram.draw(), but then everything after
the end position is not displayed, even though there are more features.
I have attached my figure below.

My requirement: I want to show all the IDs with an arrow sigil wherever
they occur on a genome (which I accomplished), BUT the arrows turn out to
be too small to make sense of.

Any ideas to make it look better?

Asma
-------------- next part --------------
A non-text attachment was scrubbed...
Name: wrong.pdf
Type: application/pdf
Size: 89796 bytes
Desc: not available
URL:

From p.j.a.cock at googlemail.com Wed Mar 26 19:05:19 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 26 Mar 2014 23:05:19 +0000
Subject: [Biopython] GenomeDiagram: scale down the the track
In-Reply-To:
References:
Message-ID:

On Wed, Mar 26, 2014 at 6:15 PM, Asma Riyaz wrote:
> I am using the multiple tracks example as my base, selecting only "gene"
> whose id exist in my list and hence I can see the white space to the left
> and right of the feature.
>
> I specified a lower "end" in gd_diagram.draw() but this shows up in such a
> way that everything after the end position is not displayed.

Yes, the start & end arguments are about which sub-region of the
linear sequence to draw.

> I have attached my figure below.
>
> My requirement, I want to show all the ids with an arrow sigil wherever it
> occurs on a genome(which I accomplished) BUT the arrows turn out to be too
> small to make sense of

The length of the sigils (here arrows) is determined by the length
of the feature (usually base pairs as we're normally drawing DNA),
relative to the length of the region shown.

If you want to make the arrows look longer, define a larger feature
location (e.g. if the feature is from 1000 to 1010, exaggerate and
use 900 to 1020 - perhaps not a good idea?), or draw a smaller
region of interest, or make the whole diagram bigger etc.

Or are you asking about the vertical height?

Peter

P.S. You seem to have sent this email multiple times, probably
confused by the automatic moderation of the message because
of the attachment. The delay is because a human (often me)
has to manually approve any suspicious emails (which are
usually spam).

From asmariyaz23 at gmail.com Wed Mar 26 19:27:38 2014
From: asmariyaz23 at gmail.com (Asma Riyaz)
Date: Wed, 26 Mar 2014 19:27:38 -0400
Subject: [Biopython] GenomeDiagram: scale down the the track
Message-ID:

Hi,

Thank you for replying.
I was asking how to make the arrows appear longer even though the feature
location is small compared to the length of the genome. I will try
exaggerating the feature location, but I don't know if it would turn out
right scientifically. Drawing a smaller region of interest is not really
what I am hoping for, as the gene IDs I am looking for are located all over
the genome; moreover, since these are tracks across multiple organisms, it
is difficult to focus on a particular region.

Also, sorry for sending multiple mails; for some reason the mailing list
rejected each of my mails, saying it had an inappropriate subject line:
"[Biopython] GenomeDiagram: scale down the the track".

Thanks,
Asma

From nje5 at georgetown.edu Fri Mar 28 15:28:14 2014
From: nje5 at georgetown.edu (Nathan Edwards)
Date: Fri, 28 Mar 2014 15:28:14 -0400
Subject: [Biopython] Are there tools for automatically parsing glycan names into tree structures?
In-Reply-To:
References:
Message-ID: <5335CD4E.5060200@georgetown.edu>

> Many apologies if you have seen this post cross-posted elsewhere. I have
> tried digging around but could not find an answer to my question.
>
> My colleague and I are working on a project involving data produced at a
> glycan microarray facility. The array data that came back to us were a list
> of glycan names (in the format (random example from the top of my head):
> GlcNAc...). We would like to parse the list of 610 names into the graphical
> representation of the glycan.
>
> Is this possible? If so, what tools are available to get this done?

My now graduated student (Kevin Brown-Chandler) and I have been
developing python tools for the interpretation of CID tandem
mass-spectra of N-glycopeptides for a while now, and have a reasonably
mature tool for working with these datasets.

As part of this infrastructure are python modules for parsing a variety
of (N- and O-) glycan structure description formats; glycan structure
manipulation, fragmentation, and naming (oxford notation abbreviations);
and glycan structure image generation (using the java libraries from
GlycoWorkbench).

The tools for indexing glycan structure databases and generating images
from the indexed databases are distributed with the search software, and
we currently distribute a pre-indexed glycan database of (most of) the
glycans on the Consortium for Functional Glycomics Mammalian array (v5.1).
Download GlycoPeptideSearch (GPS) here: http://grg.tn/GPS Since it is unlikely the current tools do exactly what you need, feel free to ping me back with more specifics, and I'll see what I can do to help. Cheers! - n -- Dr. Nathan Edwards nje5 at georgetown.edu Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center Room 1215, Harris Building Room 347, Basic Science 3300 Whitehaven St, NW 3900 Reservoir Road, NW Washington DC 20007 Washington DC 20007 Phone: 202-687-7042 Phone: 202-687-1618 Fax: 202-687-0057 Fax: 202-687-7186 From rpathmanaban1 at gmail.com Mon Mar 31 09:28:59 2014 From: rpathmanaban1 at gmail.com (Pathmanaban Ramasamy) Date: Mon, 31 Mar 2014 15:28:59 +0200 Subject: [Biopython] filtering by query coverage Message-ID: Hi i am new to biopython and i would like to filter my xml outfile based on Query coverage percentage. Can someone help me with this? thanks in advance -- Pathmanaban From p.j.a.cock at googlemail.com Mon Mar 31 09:33:26 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 31 Mar 2014 14:33:26 +0100 Subject: [Biopython] filtering by query coverage In-Reply-To: References: Message-ID: On Mon, Mar 31, 2014 at 2:28 PM, Pathmanaban Ramasamy wrote: > Hi i am new to biopython and i would like to filter my xml outfile based on > Query coverage percentage. Can someone help me with this? thanks in advance How are you defining query coverage percentage? It might be simpler to use BLAST+ 2.2.28 or later with the tabular output and specifically include one of these columns: qcovs means Query Coverage Per Subject qcovhsp means Query Coverage Per HSP Filtering tabular BLAST output on query coverage would then be easy. 
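As a rough illustration of the tabular approach suggested above, here is a minimal Python sketch. It assumes BLAST+ was run with a custom tabular layout such as -outfmt "6 qseqid sseqid pident evalue qcovs"; that column order, the 80% threshold, the function name and the example rows are all assumptions made for illustration, not something taken from this thread.

```python
import csv
import io

# Assumed column order, matching a hypothetical BLAST+ run with
#   -outfmt "6 qseqid sseqid pident evalue qcovs"
COLUMNS = ["qseqid", "sseqid", "pident", "evalue", "qcovs"]


def filter_by_query_coverage(handle, min_qcovs=80.0):
    """Yield rows (as dicts) whose qcovs value is at least min_qcovs."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if float(row["qcovs"]) >= min_qcovs:
            yield row


# Made-up example rows, not real BLAST results.
example = io.StringIO(
    "query1\tsubjA\t98.5\t1e-50\t95\n"
    "query1\tsubjB\t80.0\t1e-10\t40\n"
    "query2\tsubjC\t91.2\t3e-30\t80\n"
)
kept = [(row["qseqid"], row["sseqid"]) for row in filter_by_query_coverage(example)]
print(kept)
```

For a real file, the io.StringIO object would be replaced by an open file handle; using qcovhsp instead of qcovs only requires changing the column name, provided that column was requested from BLAST+.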
Peter From p.j.a.cock at googlemail.com Mon Mar 31 10:50:44 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 31 Mar 2014 15:50:44 +0100 Subject: [Biopython] filtering by query coverage In-Reply-To: References: Message-ID: Hi Pathmanaban, No, I meant instead of using the BLAST XML output, you could run BLAST requesting tabular output. If you want to use the BLAST XML output, I think you will need to loop over each hit's HSPs and calculate the query coverage (since I don't think this information is provided precalculated) Regards, Peter P.S. Please CC the mailing list in your reply. On Mon, Mar 31, 2014 at 3:44 PM, Pathmanaban Ramasamy wrote: > Hi Peter , > Thanks for your mail. Yes query coverage i mean by the regions with hits > (no gaps). So u say that i can blast using standalone blast version and then > parse them as usual xml parse in biopython? > > > On Mon, Mar 31, 2014 at 3:33 PM, Peter Cock > wrote: >> >> On Mon, Mar 31, 2014 at 2:28 PM, Pathmanaban Ramasamy >> wrote: >> > Hi i am new to biopython and i would like to filter my xml outfile based >> > on >> > Query coverage percentage. Can someone help me with this? thanks in >> > advance >> >> How are you defining query coverage percentage? >> >> It might be simpler to use BLAST+ 2.2.28 or later with the tabular output >> and specifically include one of these columns: >> >> qcovs means Query Coverage Per Subject >> qcovhsp means Query Coverage Per HSP >> >> Filtering tabular BLAST output on query coverage would then be easy. 
>> >> Peter > > > > > -- > Pathmanaban From p.j.a.cock at googlemail.com Tue Mar 4 18:39:10 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Mar 2014 18:39:10 +0000 Subject: [Biopython] Fwd: [OBF Members] BOSC 2014 Call for Abstracts In-Reply-To: <5077D423-549B-4E80-B70A-D005F731E51D@gmail.com> References: <5077D423-549B-4E80-B70A-D005F731E51D@gmail.com> Message-ID: Dear Biopythoneers, I hope to see some of you in Boston this summer for BOSC and the Codefest :) Peter ---------- Forwarded message ---------- From: Nomi Harris Date: Tue, Mar 4, 2014 at 5:40 PM Subject: [OBF Members] BOSC 2014 Call for Abstracts To: bosc-announce at lists.open-bio.org, members at open-bio.org, GMOD Announcements List Cc: BOSC 2014 Call for Abstracts for the 15th Annual Bioinformatics Open Source Conference (BOSC 2014) A Special Interest Group (SIG) of ISMB 2014 Dates: July 11-12, 2014 Location: Boston, MA, USA Web site: http://www.open-bio.org/wiki/BOSC_2014 Email: bosc at open-bio.org BOSC announcements mailing list: http://lists.open-bio.org/mailman/listinfo/bosc-announce Important Dates: March 24, 2014: Registration opens for ISMB and BOSC (https://www.iscb.org/ismb2014-registration) April 4, 2014: Deadline for submitting BOSC abstracts (http://www.open-bio.org/wiki/BOSC_Abstract_Submission) May 1, 204: Notification of accepted talk abstracts emailed to authors July 9-10, 2014: Codefest 2014, Boston (http://www.open-bio.org/wiki/Codefest_2014) July 11-12, 2014: BOSC 2014, Boston (http://www.open-bio.org/wiki/BOSC_2014) July 11-15, 2014: ISMB 2014, Boston The Bioinformatics Open Source Conference (BOSC) covers the wide range of open source bioinformatics software being developed, and encompasses the growing movement of Open Science, with its focus on transparency, reproducibility, and data provenance. 
We welcome submissions relating to all aspects of bioinformatics and
open science software, including new computational methods, reusable
software components, visualization, interoperability, and other
approaches that help to advance research in the biomolecular sciences.
Two full days of talks, posters, panel discussions, and informal
discussion groups will enable BOSC attendees to interact with other
developers and share ideas and code, as well as learning about some of
the latest developments in the field of open source bioinformatics.

BOSC is sponsored by the Open Bioinformatics Foundation, a non-profit,
volunteer-run group dedicated to promoting the practice and philosophy
of Open Source software development and Open Science within the
biological research community.

We invite you to submit one-page abstracts for talks and posters.
This year's session topics are:

  Open Science and Reproducible Research
  Software Interoperability
  Genome-scale Data and Beyond
  Visualization
  Translational Bioinformatics
  Bioinformatics Open Source Libraries and Projects

Once again we thank Eagle Genomics for sponsoring the BOSC Student
Travel Awards, and welcome the open access journal GigaScience as a
new sponsor for BOSC 2014.
BOSC 2014 Organizing Committee: Nomi Harris and Peter Cock (co-chairs), Raoul Jean Pierre Bonnal, Brad Chapman, Robert Davey, Christopher Fields, Hans-Rudolf Hotz, Hilmar Lapp _______________________________________________ Members mailing list Members at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/members From mmokrejs at fold.natur.cuni.cz Thu Mar 6 19:34:48 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 06 Mar 2014 20:34:48 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FF95A2.7070102@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FF95A2.7070102@fold.natur.cuni.cz> Message-ID: <5318CDD8.8050602@fold.natur.cuni.cz> Hi list mates, I am rather happy with SearchIO, I haven't found more issues while converting my code. Let's see what devs do with proposed sanitization of objects lacking certain attributes. More impressions at the very end of the email. Martin Mokrejs wrote: > Martin Mokrejs wrote: >> Hi, >> I am in the process of conversion to the new XML parsing code written by Bow. >> So far, I have deciphered the following replacement strings (somewhat written in sed(1) format): >> >> >> /hsp.identities/hsp.ident_num/ >> /hsp.score/hsp.bitscore/ >> /hsp.expect/hsp.evalue/ >> /hsp.bits/hsp.bitscore/ >> /hsp.gaps/hsp.gap_num/ >> /hsp.bits/hsp.bitscore_raw/ > > Aside from the fact I pasted twice the _hsp.bits line, my guess was wrong. 
The code works now but needed the following changes from NCBIXML to SearchIO names: > > /_hsp.score/_hsp.bitscore_raw/ > /_hsp.bits/_hsp.bitscore/ > > >> /hsp.positives/hsp.pos_num/ >> /hsp.sbjct_start/hsp.hit_start/ >> /hsp.sbjct_end/hsp.hit_end/ >> # hsp.query_start # no change from NCBIXML >> # hsp.query_end # no change from NCBIXML >> /record.query.split()[0]/record.id/ >> /alignment.hit_def.split(' ')[0]/alignment.hit_id/ >> /record.alignments/record.hits/ >> >> /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML (don't remember whether the counts include minus signs of the alignment or not) >> >> >> >> >> Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. I think the former length was including the minus sign for gaps while the latter is just the real length of the query sequence. >> >> Nevertheless, what did alignment.length transform into? Into len(hsp.query_all)? I don't think hsp.query_span but who knows. ;) > > > Answering myself: > > /alignment.hit_id/alignment.id/ > /alignment.length/_record.hits[0].seq_len/ > > > Other changes: > > _hsp.sbjct/_hsp.hit.seq.tostring() # aligned sequence including dashes [ATGCNatgcn-] > _hsp.query/_hsp.query.seq.tostring() # aligned sequence including dashes [ATGCNatgcn-] > _hsp.match/_hsp.aln_annotation['homology']/ # e.g. '||||||||||||||||||||||||||||||||||| |||||||||| | ||| || ||||||| |||||' > > I think the dictionary key should have been better named "similarity". > > > > The strand does not translate simply to SearchIO, one needs to do: > /_hsp.strand/(_hsp.query_strand, _hsp.hit_strand)/ # the tuple will be e.g. (1, 1) while I think it used to be under NCBIXML as either ('Plus', 'Plus'), ('Plus, 'Minus'), (None, None), etc. > > > >> >> >> >> Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that has been added to SearchIO in 1.63. so, that's all from me now until I upgrade. 
;)
>
> I got around with try/except although it is more expensive than previously sufficient if/else tests:
>
> # undo the off-by-one change in SearchIO and transform back to real-life numbers
> _hit_start = _hsp.hit_start + 1
> _query_start = _hsp.query_start + 1
>
> try:
>     _ident_num = _hsp.ident_num
> except AttributeError:
>     _ident_num = 0
>
> try:
>     _pos_num = _hsp.pos_num
> except AttributeError:
>     _pos_num = 0
>
> try:
>     _gap_num = _hsp.gap_num
> except AttributeError:
>     # calculate gaps count missing sometimes in legacy blast XML output
>     # see also https://redmine.open-bio.org/issues/3363 saying that also _multimer_hsp_identities and _multimer_hsp_positives are affected
>     _gap_num = _hsp.aln_span - _ident_num
>
>
> So far I can conclude, that by transition from NCBIXML to SearchIO I got 30% wallclock speedup, but the most important will be for me whether it will save me memory used for parsing of huge XML files (>100GB uncompressed) . That I don't know yet, am still testing.

After using it for a while I can say that with SearchIO I now get at least 2x, mostly 4x faster XML parsing speed (wallclock) and, notably, 256GB large XML files from blastn now take only 200-300MB of RAM (unlike 25GB of RAM using NCBIXML before). Congratulations, Bow, it seemed silly that I couldn't use my laptop to parse such huge files, which are generated in a few hours but take days to parse!

[ Why I need old blastn and XML is out of question here. ;) -- I just need them because blastn+ with tabular plaintext output does not give me required data. ]

Martin

From richard.squires at nih.gov Sun Mar 9 01:05:33 2014
From: richard.squires at nih.gov (Squires, Richard (NIH/NIAID) [C])
Date: Sun, 9 Mar 2014 01:05:33 +0000
Subject: [Biopython] Biopython tutorial at SciPy 2014
Message-ID:

Hello,

I thought I would check to see if anyone was already planning to offer a Biopython tutorial at the upcoming SciPy 2014 meeting in Austin, Texas USA in July? I am considering doing it; I just did not want to step on any toes. :-)

Burke Squires

--
R.
Burke Squires Computational Genomics Specialist Contractor ? Medical Sciences & Computing, Inc. Computational Biology Section Bioinformatics and Computational Biosciences Branch (BCBB) OCICB/OSMO/OD/NIAID/NIH 31 Center Drive, Room 3B62E.2 Bethesda, MD 20892 Office: 301-402-9408 Mobile: 240-454-4515 http://bioinformatics.niaid.nih.gov (Within NIH) http://exon.niaid.nih.gov (Public) NIAID Bioinfo Twitter: @niaidbioit Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives. From dmccully at mail.nih.gov Mon Mar 10 19:14:59 2014 From: dmccully at mail.nih.gov (McCully, Dwayne (NIH/NIAMS) [C]) Date: Mon, 10 Mar 2014 19:14:59 +0000 Subject: [Biopython] Biopython: python setup.py test In-Reply-To: <432A8E6B26DC62439F0201C069BE2B671B367FD8@MLBXV01.nih.gov> References: <432A8E6B26DC62439F0201C069BE2B671B367FD8@MLBXV01.nih.gov> Message-ID: <432A8E6B26DC62439F0201C069BE2B671B3680CE@MLBXV01.nih.gov> During the testing of Biopython I get the following error. Should I be concerned? Linux 6.5 Python 2.7.6 Dwayne Bio.Statistics.lowess docstring test ... ok Bio.PDB.Polypeptide docstring test ... ok Bio.PDB.Selection docstring test ... 
ok ====================================================================== ERROR: test_read_from_url (test_Entrez_online.EntrezOnlineCase) Test Entrez.read from URL ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez_online.py", line 44, in test_read_from_url rec = Entrez.read(einfo) File "/home/dmccully/biopython-1.63/build/lib.linux-x86_64-2.7/Bio/Entrez/__init__.py", line 372, in read record = handler.read(handle) File "/home/dmccully/biopython-1.63/build/lib.linux-x86_64-2.7/Bio/Entrez/Parser.py", line 187, in read self.parser.ParseFile(handle) File "/home/dmccully/biopython-1.63/build/lib.linux-x86_64-2.7/Bio/Entrez/Parser.py", line 529, in externalEntityRefHandler raise RuntimeException("Failed to access %s at %s" % (filename, url)) NameError: global name 'RuntimeException' is not defined ---------------------------------------------------------------------- Ran 223 tests in 513.473 seconds FAILED (failures = 1) From dmccully at mail.nih.gov Mon Mar 10 20:25:20 2014 From: dmccully at mail.nih.gov (McCully, Dwayne (NIH/NIAMS) [C]) Date: Mon, 10 Mar 2014 20:25:20 +0000 Subject: [Biopython] Biopython: python setup.py test In-Reply-To: <29944B51-B3EE-4A73-AE33-6FD590A938A2@vanderbilt.edu> References: <432A8E6B26DC62439F0201C069BE2B671B367FD8@MLBXV01.nih.gov> <432A8E6B26DC62439F0201C069BE2B671B3680CE@MLBXV01.nih.gov> <29944B51-B3EE-4A73-AE33-6FD590A938A2@vanderbilt.edu> Message-ID: <432A8E6B26DC62439F0201C069BE2B671B368188@MLBXV01.nih.gov> Thanks for the info. Not sure If I should deploy it! Dwayne From: Willis, Jordan R [mailto:jordan.r.willis at Vanderbilt.Edu] Sent: Monday, March 10, 2014 3:46 PM To: McCully, Dwayne (NIH/NIAMS) [C] Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] Biopython: python setup.py test I think this is a bug in the code. I think it should be RuntimeError from built in exceptions that is used in most of that file. 
Jordan

On Mar 10, 2014, at 2:14 PM, McCully, Dwayne (NIH/NIAMS) [C] wrote:

RuntimeException

From jordan.r.willis at Vanderbilt.Edu Mon Mar 10 19:46:24 2014
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Mon, 10 Mar 2014 19:46:24 +0000
Subject: [Biopython] Biopython: python setup.py test
In-Reply-To: <432A8E6B26DC62439F0201C069BE2B671B3680CE@MLBXV01.nih.gov>
References: <432A8E6B26DC62439F0201C069BE2B671B367FD8@MLBXV01.nih.gov> <432A8E6B26DC62439F0201C069BE2B671B3680CE@MLBXV01.nih.gov>
Message-ID: <29944B51-B3EE-4A73-AE33-6FD590A938A2@vanderbilt.edu>

I think this is a bug in the code. I think it should be RuntimeError, the built-in exception that is used in most of that file.

Jordan

On Mar 10, 2014, at 2:14 PM, McCully, Dwayne (NIH/NIAMS) [C] wrote:

RuntimeException

From mjldehoon at yahoo.com Tue Mar 11 01:43:16 2014
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 10 Mar 2014 18:43:16 -0700 (PDT)
Subject: [Biopython] Biopython: python setup.py test
In-Reply-To: <432A8E6B26DC62439F0201C069BE2B671B3680CE@MLBXV01.nih.gov>
Message-ID: <1394502196.27158.YahooMailBasic@web164006.mail.gq1.yahoo.com>

I have fixed the typo in github.
Best,
-Michiel.

--------------------------------------------
On Mon, 3/10/14, McCully, Dwayne (NIH/NIAMS) [C] wrote:

Subject: [Biopython] Biopython: python setup.py test
To: "'biopython at lists.open-bio.org'"
Date: Monday, March 10, 2014, 3:14 PM

During the testing of Biopython I get the following error. Should I be concerned?

Linux 6.5
Python 2.7.6

Dwayne

Bio.Statistics.lowess docstring test ... ok
Bio.PDB.Polypeptide docstring test ... ok
Bio.PDB.Selection docstring test ... ok
======================================================================
ERROR: test_read_from_url (test_Entrez_online.EntrezOnlineCase)
Test Entrez.read from URL
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez_online.py", line 44, in test_read_from_url
    rec = Entrez.read(einfo)
  File "/home/dmccully/biopython-1.63/build/lib.linux-x86_64-2.7/Bio/Entrez/__init__.py", line 372, in read
    record = handler.read(handle)
  File "/home/dmccully/biopython-1.63/build/lib.linux-x86_64-2.7/Bio/Entrez/Parser.py", line 187, in read
    self.parser.ParseFile(handle)
  File "/home/dmccully/biopython-1.63/build/lib.linux-x86_64-2.7/Bio/Entrez/Parser.py", line 529, in externalEntityRefHandler
    raise RuntimeException("Failed to access %s at %s" % (filename, url))
NameError: global name 'RuntimeException' is not defined

----------------------------------------------------------------------
Ran 223 tests in 513.473 seconds

FAILED (failures = 1)
_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com Tue Mar 11 10:49:49 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 11 Mar 2014 10:49:49 +0000
Subject: [Biopython] Biopython: python setup.py test
In-Reply-To: <1394502196.27158.YahooMailBasic@web164006.mail.gq1.yahoo.com>
References: <432A8E6B26DC62439F0201C069BE2B671B3680CE@MLBXV01.nih.gov> <1394502196.27158.YahooMailBasic@web164006.mail.gq1.yahoo.com>
Message-ID:

On Tue, Mar 11, 2014 at 1:43 AM, Michiel de Hoon wrote:
> I have fixed the typo in github.
> Best,
> -Michiel.

Thanks Michiel & Dwayne,

The typo was in an error message when there was a problem
accessing a DTD file (describing the XML structure). In this
case I would guess it was missing this file:
http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd

The NCBI has added several new DTD files which will be bundled
with the next Biopython release (which will also cache missing
DTD files automatically).

Dwayne: You could fix this typo manually; download the
missing DTD file; or install Biopython from GitHub.
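To illustrate why the test reported a NameError rather than the intended error message: Python has no built-in called RuntimeException, so the raise statement fails while looking up the name. The sketch below is a stand-alone illustration only; the function names and the example.org URL are made up, and this is not the actual Bio.Entrez code.

```python
def report_missing_dtd_buggy(filename, url):
    # Mirrors the typo from the traceback: RuntimeException is not a Python
    # built-in, so looking up the name raises NameError before any exception
    # object is created.
    raise RuntimeException("Failed to access %s at %s" % (filename, url))  # noqa: F821


def report_missing_dtd_fixed(filename, url):
    # RuntimeError is the built-in exception suggested earlier in the thread.
    raise RuntimeError("Failed to access %s at %s" % (filename, url))


def raised_type(func):
    """Call func with placeholder arguments and return the raised exception's type name."""
    try:
        func("einfo.dtd", "http://example.org/einfo.dtd")
    except Exception as err:
        return type(err).__name__
    return None


print(raised_type(report_missing_dtd_buggy))
print(raised_type(report_missing_dtd_fixed))
```

Either way the underlying problem (the DTD file could not be accessed) remains; the corrected name only makes the message readable.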
Regards,

Peter

From mike.thon at gmail.com Tue Mar 11 12:43:23 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Tue, 11 Mar 2014 13:43:23 +0100
Subject: [Biopython] parsing hmmer results
Message-ID: <4E2F14C8-CE16-4F36-91A2-7D171C96947E@gmail.com>

I'm trying to parse a batch hmmer v3.1b1 report but I keep getting this error. I think it's happening when the parser hits a hmmer report with no hits, but I'm not sure.

Here's the error:

Traceback (most recent call last):
  File "parse-dbcan.py", line 6, in <module>
    for qresult in SearchIO.parse(argv[1], 'hmmer3-text'):
  File "/Library/Python/2.7/site-packages/Bio/SearchIO/__init__.py", line 316, in parse
    for qresult in generator:
  File "/Library/Python/2.7/site-packages/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 47, in __iter__
    for qresult in self._parse_qresult():
  File "/Library/Python/2.7/site-packages/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 109, in _parse_qresult
    qid = regx.group(1).strip()
AttributeError: 'NoneType' object has no attribute 'group'

Here's my script:

#!/usr/bin/python
from Bio import SearchIO
from sys import argv
import pdb

for qresult in SearchIO.parse(argv[1], 'hmmer3-text'):
    hits = qresult.hits
    if len(hits) > 0:
        beste = hits[0].hsps[0].evalue
        query = hits[0].query_id
        hit = hits[0].id.replace('.hmm', '')
        print query + ',' + hit + ',' + str(beste)
    #pdb.set_trace()

I ran hmmscan like this:

hmmscan --cpu 2 -E 1e-3 HMMs.txt prots.fasta >oute-3.txt

From w.arindrarto at gmail.com Tue Mar 11 13:18:25 2014
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 11 Mar 2014 14:18:25 +0100
Subject: [Biopython] parsing hmmer results
In-Reply-To: <4E2F14C8-CE16-4F36-91A2-7D171C96947E@gmail.com>
References: <4E2F14C8-CE16-4F36-91A2-7D171C96947E@gmail.com>
Message-ID:

Hi Michael,

Do you have an example file you can send over? (Sending it to me privately also works.) The parser has not been tested with HMMER version 3.1, and I suppose they introduced some changes which break the parser.
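The AttributeError in the traceback is the usual symptom of calling .group() on the result of a regular expression search that did not match (the result is None). The following is a generic sketch of guarding against that, not the actual Biopython parser code; the pattern, the function name and the example header lines are assumptions for illustration.

```python
import re

# Hypothetical pattern for a "Query:" header line; the real parser's regex may differ.
QUERY_HEADER = re.compile(r"^Query:\s+(\S+)")


def extract_query_id(line):
    """Return the query ID from a header line, or raise a descriptive ValueError."""
    match = QUERY_HEADER.search(line)
    if match is None:
        # Without this check, match.group(1) would raise
        # AttributeError: 'NoneType' object has no attribute 'group'
        raise ValueError("Unexpected header line: %r" % line)
    return match.group(1).strip()


print(extract_query_id("Query:       protein_1  [L=250]"))
try:
    extract_query_id("# some line in a layout the pattern does not cover")
except ValueError as err:
    print("Could not parse:", err)
```

An explicit check like this turns a format change in the input into an error message that names the offending line, which makes it easier to see what a newer program version changed.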
Best,
Bow

On Tue, Mar 11, 2014 at 1:43 PM, Michael Thon wrote:
> I'm trying to parse a batch hmmer v3.1b1 report but I keep getting this error. I think it's happening when the parser hits a hmmer report with no hits, but I'm not sure.
>
> Here's the error:
>
> Traceback (most recent call last):
>   File "parse-dbcan.py", line 6, in <module>
>     for qresult in SearchIO.parse(argv[1], 'hmmer3-text'):
>   File "/Library/Python/2.7/site-packages/Bio/SearchIO/__init__.py", line 316, in parse
>     for qresult in generator:
>   File "/Library/Python/2.7/site-packages/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 47, in __iter__
>     for qresult in self._parse_qresult():
>   File "/Library/Python/2.7/site-packages/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 109, in _parse_qresult
>     qid = regx.group(1).strip()
> AttributeError: 'NoneType' object has no attribute 'group'
>
>
> Here's my script:
>
> #!/usr/bin/python
> from Bio import SearchIO
> from sys import argv
> import pdb
>
> for qresult in SearchIO.parse(argv[1], 'hmmer3-text'):
>     hits = qresult.hits
>     if len(hits) > 0:
>         beste = hits[0].hsps[0].evalue
>         query = hits[0].query_id
>         hit = hits[0].id.replace('.hmm', '')
>         print query + ',' + hit + ',' + str(beste)
>     #pdb.set_trace()
>
> I ran hmmscan like this:
>
> hmmscan --cpu 2 -E 1e-3 HMMs.txt prots.fasta >oute-3.txt
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From philipp.schiffer at gmail.com Wed Mar 12 07:09:01 2014
From: philipp.schiffer at gmail.com (Philipp Schiffer)
Date: Wed, 12 Mar 2014 08:09:01 +0100
Subject: [Biopython] Handle for OrthoXML
Message-ID:

Hi all!

Is there a handle/module for OrthoXML parsing?
http://orthoxml.org/xml/Documentation.html

Cheers

Philipp

--
Sent from Gmail Mobile

From p.j.a.cock at googlemail.com Wed Mar 12 09:26:20 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 12 Mar 2014 09:26:20 +0000
Subject: [Biopython] Handle for OrthoXML
In-Reply-To:
References:
Message-ID:

Hi Philipp,

No, Biopython does not have a parser for OrthoXML, but
Bio.SeqIO can read/write the related SeqXML format.

You could try using the ElementTree XML parser from the
Python standard library?

Peter

On Wed, Mar 12, 2014 at 7:09 AM, Philipp Schiffer wrote:
> Hi all!
>
> Is there a handle/module for OrthoXML parsing?
> http://orthoxml.org/xml/Documentation.html
>
> Cheers
>
> Philipp
>
>
> --
> Sent from Gmail Mobile
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From philipp.schiffer at gmail.com Wed Mar 12 10:21:59 2014
From: philipp.schiffer at gmail.com (Philipp Schiffer)
Date: Wed, 12 Mar 2014 11:21:59 +0100
Subject: [Biopython] Handle for OrthoXML
In-Reply-To:
References:
Message-ID: <92559B2799CB4D1CBD790AEA1DCBA136@googlemail.com>

Hi Peter,

thanks for the info. I'll check out the ElementTree parser and also see if some of the SeqXML functionality fulfils my purposes.

Best

Philipp

--
Philipp Schiffer
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Wednesday, 12 March 2014 at 10:26, Peter Cock wrote:
> Hi Philipp,
>
> No, Biopython does not have a parser for OrthoXML, but
> Bio.SeqIO can read/write the related SeqXML format.
>
> You could try using the ElementTree XML parser from the
> Python standard library?
>
> Peter
>
> On Wed, Mar 12, 2014 at 7:09 AM, Philipp Schiffer
> wrote:
> > Hi all!
> >
> > Is there a handle/module for OrthoXML parsing?
> > http://orthoxml.org/xml/Documentation.html > > > > Cheers > > > > Philipp > > > > > > -- > > Sent from Gmail Mobile > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org (mailto:Biopython at lists.open-bio.org) > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > From kevin.rue at ucdconnect.ie Wed Mar 12 11:32:11 2014 From: kevin.rue at ucdconnect.ie (Kevin Rue) Date: Wed, 12 Mar 2014 11:32:11 +0000 Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching? Message-ID: Hi all, Some may consider this a repeat of my StackOverflow post ( http://stackoverflow.com/questions/22328884/python-equivalent-of-the-perl-stringapprox-amatch-function) but over there I didn't mention the possibility of implementing the feature in Biopython. I am looking for a function which, given sequence1 and sequence2, would return whether sequence1 matches a subsequence of sequence2 allowing up to I insertions, D deletions, and S substitutions. So far, all I could find in Python were fuzzy matching functions using edit distances (Levenshtein and others), but none of those distances distinguish between insertions, deletions and substitution ( http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison ). There is a Perl module called String::Approx ( http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm), where the function amatch() does exactly what I want.. except in Perl. A quick-and-dirty fix could be to make an external call to that Perl function from my Python script, but it would be so much cleaner (and probably faster) if I could avoid external calls and being dependent on multiple interpreters. 
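The kind of matching described above (does sequence1 match some stretch of sequence2 with at most I insertions, D deletions and S substitutions, each counted separately?) can be sketched in pure Python as below. This is only an illustration under stated assumptions: the function name approx_match is made up, "insertion" is taken to mean an extra character in sequence2 and "deletion" a character of sequence1 missing from sequence2, and the code is neither the String::Approx algorithm nor an existing Biopython function. It uses a memoised recursive search, so it is only suitable for short patterns.

```python
from functools import lru_cache


def approx_match(pattern, text, max_ins=0, max_del=0, max_sub=0):
    """Return True if pattern matches a substring of text within the edit budgets.

    Edits are counted relative to pattern: an insertion is an extra character
    in text, a deletion is a pattern character absent from text, and a
    substitution is a mismatched character.
    """

    @lru_cache(maxsize=None)
    def match_from(i, j, ins, dels, subs):
        # i: position in pattern, j: position in text, rest: remaining budgets
        if i == len(pattern):
            return True  # whole pattern consumed, match ends at text[j]
        if j < len(text):
            if pattern[i] == text[j] and match_from(i + 1, j + 1, ins, dels, subs):
                return True  # exact character match
            if subs > 0 and pattern[i] != text[j] and match_from(i + 1, j + 1, ins, dels, subs - 1):
                return True  # substitution
            if ins > 0 and match_from(i, j + 1, ins - 1, dels, subs):
                return True  # insertion: skip an extra character in text
        if dels > 0 and match_from(i + 1, j, ins, dels - 1, subs):
            return True  # deletion: skip a pattern character
        return False

    # Try every possible start position in text.
    return any(
        match_from(0, start, max_ins, max_del, max_sub)
        for start in range(len(text) + 1)
    )


print(approx_match("ACGT", "TTACGTTT"))              # exact occurrence
print(approx_match("ACGT", "TTAGGTTT", max_sub=1))   # one substitution
print(approx_match("ACGT", "TTACCGTTT", max_ins=1))  # one insertion
print(approx_match("ACGT", "TTACTTT", max_del=1))    # one deletion
```

Keeping three separate budgets in the search state is what distinguishes this from a plain Levenshtein threshold, at the cost of a larger state space; a compiled or bit-parallel implementation would be needed for genome-scale input.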
I believe that the feature I described could rapidly become popular if implemented in Biopython, but after reading the Perl module code and not understanding most of it, I think any Python module I could write to do the job wouldn't be nearly as optimised and fast. (An external call to the Perl module would surely be faster than my Python implementation.)

So....
- What are your thoughts?
- Did I miss the magic Python package that does what I want?
- Does anyone else think such a package would be useful to the bioinformatics community?
- Did anyone solve the same issue I'm having in a different way? (I haven't found an "out of the box" idea yet)
- Does anyone feel like implementing this feature?

Many thanks for your advice!

--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en

From ivangreg at gmail.com Wed Mar 12 13:38:31 2014
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Wed, 12 Mar 2014 09:38:31 -0400
Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
In-Reply-To:
References:
Message-ID:

If that Perl function existed in Biopython, I would use it every day, night and day.

I sense that I would not be the only one.

Ivan

Ivan Gregoretti, PhD
Bioinformatics

On Wed, Mar 12, 2014 at 7:32 AM, Kevin Rue wrote:
> Hi all,
>
> Some may consider this a repeat of my StackOverflow post (
>
> http://stackoverflow.com/questions/22328884/python-equivalent-of-the-perl-stringapprox-amatch-function
> )
> but over there I didn't mention the possibility of implementing the feature
> in Biopython.
>
> I am looking for a function which, given sequence1 and sequence2, would
> return whether sequence1 matches a subsequence of sequence2 allowing up to
> I insertions, D deletions, and S substitutions.
> > So far, all I could find in Python were fuzzy matching functions using edit > distances (Levenshtein and others), but none of those distances distinguish > between insertions, deletions and substitution ( > > http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison > ). > > There is a Perl module called String::Approx ( > http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm), where the > function amatch() does exactly what I want.. except in Perl. A > quick-and-dirty fix could be to make an external call to that Perl function > from my Python script, but it would be so much cleaner (and probably > faster) if I could avoid external calls and being dependent on multiple > interpreters. > > I believe that such the feature I described could rapidly become popular if > implemented in Biopython, but after reading the Perl module code and not > understanding most of it, I think any Python module I could write to do the > job wouldn't be nearly as optimised and fast. (an external call to the Perl > module would surely be faster than my Python implementation) > > So.... > - What are your thoughts? > - Did I miss the magic Python package that does what I want? > - Does anyone else think such a package would be useful to the > bioinformatics community? > - Did anyone solve the same issue I'm having in a different way? (I haven't > found an "think out of the box" idea yet) > - Does anyone feel like implementing this feature? > > Many thanks for your advice! 
> > > -- > K?vin RUE-ALBRECHT > Wellcome Trust Computational Infection Biology PhD Programme > University College Dublin > Ireland > http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From saketkc at gmail.com Wed Mar 12 13:46:41 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Wed, 12 Mar 2014 13:46:41 +0000 Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching? In-Reply-To: References: Message-ID: Hi Kevin, There is a package which does something similar. https://github.com/taleinat/fuzzysearch Saket On 12 March 2014 11:32, Kevin Rue wrote: > Hi all, > > Some may consider this a repeat of my StackOverflow post ( > http://stackoverflow.com/questions/22328884/python-equivalent-of-the-perl-stringapprox-amatch-function) > but over there I didn't mention the possibility of implementing the feature > in Biopython. > > I am looking for a function which, given sequence1 and sequence2, would > return whether sequence1 matches a subsequence of sequence2 allowing up to > I insertions, D deletions, and S substitutions. > > So far, all I could find in Python were fuzzy matching functions using edit > distances (Levenshtein and others), but none of those distances distinguish > between insertions, deletions and substitution ( > http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison > ). > > There is a Perl module called String::Approx ( > http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm), where the > function amatch() does exactly what I want.. except in Perl. A > quick-and-dirty fix could be to make an external call to that Perl function > from my Python script, but it would be so much cleaner (and probably > faster) if I could avoid external calls and being dependent on multiple > interpreters. 
> > I believe that such the feature I described could rapidly become popular if > implemented in Biopython, but after reading the Perl module code and not > understanding most of it, I think any Python module I could write to do the > job wouldn't be nearly as optimised and fast. (an external call to the Perl > module would surely be faster than my Python implementation) > > So.... > - What are your thoughts? > - Did I miss the magic Python package that does what I want? > - Does anyone else think such a package would be useful to the > bioinformatics community? > - Did anyone solve the same issue I'm having in a different way? (I haven't > found an "think out of the box" idea yet) > - Does anyone feel like implementing this feature? > > Many thanks for your advice! > > > -- > K?vin RUE-ALBRECHT > Wellcome Trust Computational Infection Biology PhD Programme > University College Dublin > Ireland > http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From kevin.rue at ucdconnect.ie Wed Mar 12 15:16:03 2014 From: kevin.rue at ucdconnect.ie (Kevin Rue) Date: Wed, 12 Mar 2014 15:16:03 +0000 Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching? In-Reply-To: References: Message-ID: Hi, @Ivan: Glad to hear you confirm my thought! @Saket: You're right.. I have already been in touch for the past two days with "taleinat" the person who developped that code :) You will see in his github that in agreement with him, I suggested my feature as a possible enhancement of his package (issue #2 https://github.com/taleinat/fuzzysearch/issues), and he agreed to consider it for future development. 
No promised release date, but: 1) I wouldn't dare to ask for one as I am already asking for a huge favor for someone else to program that "for me" and the community 2) I am not particularly rushed, his Levenshtein distance does an acceptable job for the time being. I would love to be able to write the code myself, but my PhD thesis is more about using scripts to gain biology knowledge, while my issue would be better dealt with by someone with a much stronger low-level programming skillset using abstract mathematical notions to optimise the code beyond anything I could do with my scripting skills. Cheers Kevin PhD candidate :) On 12 March 2014 13:46, Saket Choudhary wrote: > Hi Kevin, > > There is a package which does something similar. > > https://github.com/taleinat/fuzzysearch > > > Saket > > On 12 March 2014 11:32, Kevin Rue wrote: > > Hi all, > > > > Some may consider this a repeat of my StackOverflow post ( > > > http://stackoverflow.com/questions/22328884/python-equivalent-of-the-perl-stringapprox-amatch-function > ) > > but over there I didn't mention the possibility of implementing the > feature > > in Biopython. > > > > I am looking for a function which, given sequence1 and sequence2, would > > return whether sequence1 matches a subsequence of sequence2 allowing up > to > > I insertions, D deletions, and S substitutions. > > > > So far, all I could find in Python were fuzzy matching functions using > edit > > distances (Levenshtein and others), but none of those distances > distinguish > > between insertions, deletions and substitution ( > > > http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison > > ). > > > > There is a Perl module called String::Approx ( > > http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm), where the > > function amatch() does exactly what I want.. except in Perl. 
A > > quick-and-dirty fix could be to make an external call to that Perl > function > > from my Python script, but it would be so much cleaner (and probably > > faster) if I could avoid external calls and being dependent on multiple > > interpreters. > > > > I believe that such the feature I described could rapidly become popular > if > > implemented in Biopython, but after reading the Perl module code and not > > understanding most of it, I think any Python module I could write to do > the > > job wouldn't be nearly as optimised and fast. (an external call to the > Perl > > module would surely be faster than my Python implementation) > > > > So.... > > - What are your thoughts? > > - Did I miss the magic Python package that does what I want? > > - Does anyone else think such a package would be useful to the > > bioinformatics community? > > - Did anyone solve the same issue I'm having in a different way? (I > haven't > > found an "think out of the box" idea yet) > > - Does anyone feel like implementing this feature? > > > > Many thanks for your advice! > > > > > > -- > > K?vin RUE-ALBRECHT > > Wellcome Trust Computational Infection Biology PhD Programme > > University College Dublin > > Ireland > > http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > -- K?vin RUE-ALBRECHT Wellcome Trust Computational Infection Biology PhD Programme University College Dublin Ireland http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en From p.j.a.cock at googlemail.com Wed Mar 12 15:48:21 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 12 Mar 2014 15:48:21 +0000 Subject: [Biopython] SciPy 2014 (July 6-12, Austin, Texas, USA) Message-ID: Hi all, It is a bit short notice, but some of you may be interested in attending SciPy 2014, which will again have a bioinformatics session. 
There is still time to submit an abstract (deadline 14 March): https://conference.scipy.org/scipy2014/participate/presentations/ "SciPy 2014, the thirteenth annual Scientific Computing with Python conference, will be held this July 6th-12th in Austin, Texas. SciPy is a community dedicated to the advancement of scientific computing through open source Python software for mathematics, science, and engineering. The annual SciPy Conference allows participants from academic, commercial, and governmental organizations to showcase their latest projects, learn from skilled users and developers, and collaborate on code development." Unfortunately SciPy 2014 clashes with BOSC 2014 in Boston, which you may prefer to attend, which is also currently accepting abstracts: http://www.open-bio.org/wiki/BOSC_2014 http://www.open-bio.org/wiki/Codefest_2014 *Disclaimer*: I am co-chairing BOSC this year. Regards, Peter From taleinat at gmail.com Wed Mar 12 15:55:33 2014 From: taleinat at gmail.com (Tal Einat) Date: Wed, 12 Mar 2014 17:55:33 +0200 Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching? Message-ID: Kevin wrote: > @Saket: You're right.. I have already been in touch for the past two days > with "taleinat" the person who developped that code :) You will see in his > github that in agreement with him, I suggested my feature as a possible > enhancement of his package (issue #2 > https://github.com/taleinat/fuzzysearch/issues), and he agreed to consider > it for future development. No promised release date, but: > 1) I wouldn't dare to ask for one as I am already asking for a huge favor > for someone else to program that "for me" and the community > 2) I am not particularly rushed, his Levenshtein distance does an > acceptable job for the time being. 
I would love to be able to write the > code myself, but my PhD thesis is more about using scripts to gain biology > knowledge, while my issue would be better dealt with by someone with a much > stronger low-level programming skillset using abstract mathematical notions > to optimise the code beyond anything I could do with my scripting skills. Hi again guys, I'm the author of the fuzzysearch Python library. I mentioned it on this list a few months ago thinking it might be useful. The fuzzysearch library is meant to be used for searching, which isn't really what you're doing. As far as I can tell it isn't really good enough for your purpose. I'll be happy to help if I can, however, especially given the additional interest expressed here! The python-Levenshtein library supports generating a sequence of operations transforming one string into another. For example (from the docs): >>> editops('spam', 'park') [('delete', 0, 0), ('insert', 3, 2), ('replace', 3, 3)] However, the requirement you described is significantly different: telling whether a string can be transformed into another using a maximum allowed number of replacements and insertions, but no deletions. For the above example, it could also be transformed without deletions using 4 substitutions! I'd be happy to collaborate on this, including writing code, if you like. I believe that what you need can be implemented relatively easily. - Tal Einat From taleinat at gmail.com Wed Mar 12 18:00:16 2014 From: taleinat at gmail.com (Tal Einat) Date: Wed, 12 Mar 2014 20:00:16 +0200 Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching? In-Reply-To: References: Message-ID: > Kevin wrote: > > I am looking for a function which, given sequence1 and sequence2, would > return whether sequence1 matches a subsequence of sequence2 allowing up to > I insertions, D deletions, and S substitutions. 
Kevin, if you don't want to allow deletions at all (I got the
impression this is what you're looking for) then the following code
will do the trick (and should be quite fast).

Is this generally useful, or would it usually be more useful to also
allow a (possibly limited) number of deletions?


from array import array

def kevin(str1, str2, max_substitutions, max_insertions):
    """check if it is possible to transform str1 into str2 given limitations

    The limitations are the maximum allowed number of new characters inserted
    and the maximum allowed number of character substitutions.
    """
    # check simple cases which are obviously impossible
    if not len(str1) <= len(str2) <= len(str1) + max_insertions:
        return False

    scores = array('L', [0] * (len(str2) - len(str1) + 1))
    new_scores = scores[:]

    for (str1_idx, char1) in enumerate(str1):
        # make min() always take the other value in the first iteration of the
        # inner loop
        prev_score = len(str2)
        for (n_insertions, char2) in enumerate(
                str2[str1_idx:len(str2)-len(str1)+str1_idx+1]
        ):
            new_scores[n_insertions] = prev_score = min(
                scores[n_insertions] + (0 if char1 == char2 else 1),
                prev_score
            )

        # swap scores <-> new_scores
        scores, new_scores = new_scores, scores

    return min(scores) <= max_substitutions


- Tal

From kevin.rue at ucdconnect.ie Wed Mar 12 18:28:53 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Wed, 12 Mar 2014 18:28:53 +0000
Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
In-Reply-To:
References:
Message-ID:

Hi Tal,

To answer your question: indeed, I currently don't expect any deletions. It is
possible that other users would like that feature, and maybe I will in the
future. Therefore, this code could be a very convenient temporary fix that I
could include directly in my code as a helper function. Thank you very much.
I'll have a closer look at how it works to teach myself :) You actually (positively) surprised me by taking this approach: I didn't even think of taking the advantage of 0 deletions in my particular use case ! I can imagine that it does reduce the possible combination of edits. I guess I was too impressed by the Perl function that seemed to be doing all at once, but indeed code can be much simpler and faster when one takes advantage of the known pre-conditions and specific use cases. I can imagine that a master function could handle Sub, Del and Ins together, but if Del=0 then it could call this one if it solves the specific problem faster. Now, in a moment of inspiration, am I wrong in saying that similarly, a case where no insertion is allowed would be the same problem while switching sequence1 and sequence2 in the function call? PS: your updated fuzzysearch 0.2.0 package works great for its job. I just haven't had time to check the documentation updates. Cheers! On 12 March 2014 18:00, Tal Einat wrote: > > Kevin wrote: > > > > I am looking for a function which, given sequence1 and sequence2, would > > return whether sequence1 matches a subsequence of sequence2 allowing up > to > > I insertions, D deletions, and S substitutions. > > Kevin, if you don't want to allow deletions at all (I got the > impression this is what you're looking for) then the following code > will do the trick (and should be quite fast). > > Is this generally useful, or would it usually be more useful to also > allow a (possibly limited) number of deletions? > > > from array import array > > def kevin(str1, str2, max_substitutions, max_insertions): > """check if it is possible to transform str1 into str2 given > limitations > > The limiations are the maximum allowed number of new characters > inserted > and the maximum allowed number of character substitutions. 
> """ > # check simple cases which are obviously impossible > if not len(str1) <= len(str2) <= len(str1) + max_insertions: > return False > > scores = array('L', [0] * (len(str2) - len(str1) + 1)) > new_scores = scores[:] > > for (str1_idx, char1) in enumerate(str1): > # make min() always take the other value in the first iteration of > the > # inner loop > prev_score = len(str2) > for (n_insertions, char2) in enumerate( > str2[str1_idx:len(str2)-len(str1)+str1_idx+1] > ): > new_scores[n_insertions] = prev_score = min( > scores[n_insertions] + (0 if char1 == char2 else 1), > prev_score > ) > > # swap scores <-> new_scores > scores, new_scores = new_scores, scores > > return min(scores) <= max_substitutions > > > - Tal > -- K?vin RUE-ALBRECHT Wellcome Trust Computational Infection Biology PhD Programme University College Dublin Ireland http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en From taleinat at gmail.com Wed Mar 12 18:43:10 2014 From: taleinat at gmail.com (Tal Einat) Date: Wed, 12 Mar 2014 20:43:10 +0200 Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching? In-Reply-To: References: Message-ID: On Wed, Mar 12, 2014 at 8:28 PM, Kevin Rue wrote: > HI Tal, > > To answer your question, I currently don't expect any deletion indeed. It is > possible that other users would like that feature, maybe me in the future. > Therefore, this code could be a very convenient temporary fix that I could > include directly in the code as an annex function. Thank you very much. I'll > have a closer look at how it works to teach myself :) If anyone else on this list thinks this would be useful, I'd be happy to publish it as a publicly available library. For now, consider the code I posted freely available in the public domain (i.e. use it at will but don't take credit for it in my stead and don't sell it without my consent). 
> Now, in a moment of inspiration, am I wrong in saying that similarly, a case
> where no insertion is allowed would be the same problem while switching
> sequence1 and sequence2 in the function call?

Indeed, you are correct :)

- Tal Einat

From kevin.rue at ucdconnect.ie Wed Mar 12 23:56:10 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Wed, 12 Mar 2014 23:56:10 +0000
Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
In-Reply-To:
References:
Message-ID:

Hi Tal,

I just tested your function and it does something slightly different
from what I had in mind.

I need a few simple examples to illustrate my point:

The string "TEST" is present in "TESTER" with 0 substitutions/0
insertions. Therefore I expect the call below to return TRUE.
>>> kevin(str1="TEST", str2="TESTER", max_substitutions=0, max_insertions=0)
but instead it returns FALSE.

Meanwhile,
>>> kevin(str1="TEST", str2="TESTER", max_substitutions=0, max_insertions=2)
returns TRUE

and
>>> kevin("TEST", "TESTER", 0, max_insertions=1)
returns FALSE

Now, I haven't deciphered your code yet, but my guess is that what your
function does is answer the question "Is str1 approximately EQUAL to str2
while allowing a maximum of S substitutions and I insertions".
My problem (and, I believe, that of most bioinformaticians, like Ivan who
answered earlier) is formulated "Is str1 present somewhere in str2 while
allowing S sub and I ins?"

In fact, it's your previous answer that made me realise that the function
solving my problem "str1 in str2 with max of i insertions and s
substitutions" should not be able to solve the problem "str1 in str2 with
max of d deletions and s substitutions".
Meanwhile, your answer seems right for the function you sent us: "str1
== str2 with max of i insertions and s substitutions" should be solved by
the same function called with the strings switched, "str2 == str1 with max of
d deletions and s substitutions".
(Just a guess, but it makes sense to me) If I am right, one solution to solve my problem (with only substitutions and insertions) using your function "kevin" is: - set str1 as the string I am trying to match - call your function for each substring of str2 of length [len(str1) : len(str1)+max_insertions+1] and set each of those as str2 - i can save time by returning TRUE the first time I find a match, because I don't care if there are more This would compare str1 to all possible str2 substrings that could be "approximately EQUAL to str1 allowing up to I insertions and S substitutions in str1". Obviously, another option is to design another function (say.. "kevin2" ^_^) which addresses directly my problem. I don't mind using the solution above (I appreciate your help and time), but I believe an implementation dealing directly with my problem would be faster to solve it, right? Looking forward to your answer! Cheers Kevin On 12 March 2014 18:43, Tal Einat wrote: > On Wed, Mar 12, 2014 at 8:28 PM, Kevin Rue > wrote: > > HI Tal, > > > > To answer your question, I currently don't expect any deletion indeed. > It is > > possible that other users would like that feature, maybe me in the > future. > > Therefore, this code could be a very convenient temporary fix that I > could > > include directly in the code as an annex function. Thank you very much. > I'll > > have a closer look at how it works to teach myself :) > > If anyone else on this list thinks this would be useful, I'd be happy > to publish it as a publicly available library. For now, consider the > code I posted freely available in the public domain (i.e. use it at > will but don't take credit for it in my stead and don't sell it > without my consent). > > > Now, in a moment of inspiration, am I wrong in saying that similarly, a > case > > where no insertion is allowed would be the same problem while switching > > sequence1 and sequence2 in the function call? 
> > Indeed, you are correct :) > > - Tal Einat > -- K?vin RUE-ALBRECHT Wellcome Trust Computational Infection Biology PhD Programme University College Dublin Ireland http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en From hlapp at drycafe.net Thu Mar 13 12:50:09 2014 From: hlapp at drycafe.net (Hilmar Lapp) Date: Thu, 13 Mar 2014 08:50:09 -0400 Subject: [Biopython] Fwd: [numfocus] 2014 John Hunter Fellowship - Call for Applications In-Reply-To: References: Message-ID: Some of you folks have probably already seen this. If someone at postdoc or senior PhD student level would love to focus a couple months on furthering Biopython development, this would seem like an excellent opportunity to get some support for that. And I'm sure mentors wouldn't be hard to come by :-) And I suppose I don't need to tell this community who John Hunter is (or, unfortunately, was). -hilmar ---------- Forwarded message ---------- From: Ralf Gommers Date: Tue, Mar 11, 2014 at 3:22 PM Subject: [numfocus] 2014 John Hunter Fellowship - Call for Applications To: numfocus at googlegroups.com, Discussion of Numerical Python < numpy-discussion at scipy.org>, SciPy Users List , matplotlib-users , ipython-dev at scipy.org, sympy at googlegroups.com Hi all, I'm excited to announce, on behalf of the Numfocus board, that applications for the 2014 John Hunter Technology Fellowship are now being accepted. This is the first fellowship Numfocus is able to offer, which we see as a significant milestone. The John Hunter Technology Fellowship aims to bridge the gap between academia and real-world, open-source scientific computing projects by providing a capstone experience for individuals coming from a scientific, engineering or mathematics background. The program consists of a 6 month project-based training program for postdoctoral scientists or senior graduate students. 
Fellows work on scientific computing open source projects under the guidance of mentors who are leading scientists and software engineers. The aim of the Fellowship is to enable Fellows to develop the skills needed to contribute to cutting-edge open source software projects while at the same time advancing or supporting the research program they and their mentor are involved in. While proposals in any area of science and engineering are welcome, the following areas are encouraged in particular: - Accessible and reproducible computing - Enabling technology for open access publishing - Infrastructural technology supporting open-source scientific software stacks - Core open-source projects promoted by NumFOCUS Eligible applicants are postdoctoral scientists or senior PhD students, or have equivalent experience in physics, mathematics, engineering, statistics, or a related science. The program is open to applicants from any nationality and can be performed at any university or institute world-wide (US export laws permitting). All applications are due May 15, 2014 by 11:59 p.m. Central Standard Time. For more details on the program see: http://numfocus.org/john_hunter_fellowship_2014.html (this call) http://numfocus.org/fellowships.html (program) And for some background see this blog post: http://numfocus.org/announcing-the-numfocus-technology-fellowship-program.html We're looking forward to receiving your applications! Ralf From kevin.rue at ucdconnect.ie Thu Mar 13 18:08:18 2014 From: kevin.rue at ucdconnect.ie (Kevin Rue) Date: Thu, 13 Mar 2014 18:08:18 +0000 Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching? In-Reply-To: References: Message-ID: Hi Tal, I finally had the time to look at your function in details, and understood how it works and why each line is there. 
Thanks, even though it doesn't do exactly what I had in mind, it does what
the docstring says :)
I'd call your "kevin" function "approxEqual(str1, str2, max_ins, max_sub)".

When I understood your code, I thought of a way to increase the speed of
your function, but ended up with a (fast) function actually doing something
slightly different.
This one only looks at the str2 substrings starting from str1_idx which are
no longer than len(str1)+max_insertions. Longer str2 substrings are pointless,
as they imply that str1 requires more insertions at the start than allowed to
match str2.
I would call that function "approxStartsWith(str1, str2, max_ins, max_sub)"

from array import array

def approxStartsWith(str1, str2, max_substitutions, max_insertions):
    """check if it is possible to map str1 to the start of str2 given
    limitations

    The limitations are the maximum allowed number of new characters inserted
    and the maximum allowed number of character substitutions.
    """
    # check simple cases which are obviously impossible
    if not len(str1) <= len(str2):
        return False

    scores = array('L', [0] * (max_insertions + 1))
    new_scores = scores[:]

    for (str1_idx, char1) in enumerate(str1):
        # make min() always take the other value in the first iteration of the
        # inner loop
        prev_score = len(str2)
        for (n_insertions, char2) in enumerate(
                str2[str1_idx:max_insertions+str1_idx+1]
        ):
            new_scores[n_insertions] = prev_score = min(
                scores[n_insertions] + (0 if char1 == char2 else 1),
                prev_score
            )

        # swap scores <-> new_scores
        scores, new_scores = new_scores, scores

    return min(scores) <= max_substitutions

Now, an approxWithin(str1, str2, max_ins, max_sub) could be simply
implemented as:

def approxWithin(str1, str2, max_ins, max_sub):
    """check if it is possible to find str1 within str2 given limitations

    The limitations are the maximum allowed number of new characters inserted
    and the maximum allowed number of character substitutions.
    """
    for str2_idx in range(len(str2)-len(str1)+1):
        print (str2_idx)
        result = approxStartsWith(str1, str2[str2_idx:], max_ins, max_sub)
        if result:
            return result
    return False

Comments?

Cheers,
kevin

On 12 March 2014 23:56, Kevin Rue wrote:

> Hi Tal,
>
> I just tested your function and it is doing something slightly different
> than what I had in mind.
>
> I need a few simple examples to illustrate my point:
>
> The string "TEST" is present in "TESTER" with 0 substitutions/0
> insertions. Therefore I expect the call below to return TRUE.
> >>> kevin(str1="TEST", str2="TESTER", max_substitutions=0,
> max_insertions=0)
> but instead it returns FALSE.
>
> Meanwhile,
> >>> kevin(str1="TEST", str2="TESTER", max_substitutions=0,
> max_insertions=2)
> returns TRUE
>
> and
> >>> kevin("TEST", "TESTER", 0, max_insertions=1)
> returns FALSE
>
> Now, I haven't decrypted your code yet, but my guess is that what your
> function does is answer the question "Is str1 approximately EQUAL to str2
> while allowing a maximum of S substitutions and I insertions".
> My problem (and believe most bioinformaticians like Ivan who answered
> earlier) is formulated "Is str1 present somewhere in str2 while allowing S
> sub and I ins?"
>
>
> In fact, it's your previous answer that made me realise that the function
> solving my problem "str1 in str2 with max of i insertions and s
> substitutions" should not be able to solve the problem "str1 in str2 with
> max of d deletions and s substitutions".
> Meanwhile, your answer seems right for the function you sent us "str1
> == str2 with max of i insertions and s substitutions" should be solved by
> the same function called by switching the strings "str2 == str1 with max of
> d deletions and s substitutions".
(Just a guess, but it makes sense to me) > > If I am right, one solution to solve my problem (with only substitutions > and insertions) using your function "kevin" is: > - set str1 as the string I am trying to match > - call your function for each substring of str2 of length [len(str1) : > len(str1)+max_insertions+1] and set each of those as str2 > - i can save time by returning TRUE the first time I find a match, because > I don't care if there are more > This would compare str1 to all possible str2 substrings that could be > "approximately EQUAL to str1 allowing up to I insertions and S > substitutions in str1". > > Obviously, another option is to design another function (say.. "kevin2" > ^_^) which addresses directly my problem. I don't mind using the solution > above (I appreciate your help and time), but I believe an implementation > dealing directly with my problem would be faster to solve it, right? > > Looking forward to your answer! > Cheers > Kevin > > > > > > > > > On 12 March 2014 18:43, Tal Einat wrote: > >> On Wed, Mar 12, 2014 at 8:28 PM, Kevin Rue >> wrote: >> > HI Tal, >> > >> > To answer your question, I currently don't expect any deletion indeed. >> It is >> > possible that other users would like that feature, maybe me in the >> future. >> > Therefore, this code could be a very convenient temporary fix that I >> could >> > include directly in the code as an annex function. Thank you very much. >> I'll >> > have a closer look at how it works to teach myself :) >> >> If anyone else on this list thinks this would be useful, I'd be happy >> to publish it as a publicly available library. For now, consider the >> code I posted freely available in the public domain (i.e. use it at >> will but don't take credit for it in my stead and don't sell it >> without my consent). 
>>
>> > Now, in a moment of inspiration, am I wrong in saying that similarly, a case
>> > where no insertion is allowed would be the same problem while switching
>> > sequence1 and sequence2 in the function call?
>>
>> Indeed, you are correct :)
>>
>> - Tal Einat
>>
>
>
>
> --
> Kévin RUE-ALBRECHT
> Wellcome Trust Computational Infection Biology PhD Programme
> University College Dublin
> Ireland
> http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en
>

--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en

From mary.kindall at gmail.com Thu Mar 13 19:57:19 2014
From: mary.kindall at gmail.com (Mary Kindall)
Date: Thu, 13 Mar 2014 15:57:19 -0400
Subject: [Biopython] Get all alignments of a sequence against another
Message-ID:

This is a primitive question but somehow I could not find a solution to it.
I have two sequences 'large' and 'small' as given below.

>large
XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS

>small
GGGTTVTTSS

I need to align the 'small' sequence to the 'large' sequence. Clearly there
are two places where it can be aligned. I need to get the indices of both
locations. I was trying Biopython's "pairwise2.align.globalms" function but
it is only able to align to the second position.

pairwise2.align.globalms(largeStr, smallStr, 2, -1, -1, 0,
penalize_end_gaps=False)
Ans:
[('XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS',
  '-----------------------------------------------------------------------------------------GGGTTLTTSS',
  20.0,
  0,
  99)]

Which parameter can I change here, or which other package/lightweight free
software can compute this?
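
For reference, one way to list every such location could be sketched as
below. This is only a minimal sketch under an assumption the question does
not state: a "match" means a window of the same length as 'small' with at
most a fixed number of mismatching positions (Hamming distance, no indels).
It is not a Biopython or fuzzysearch API; the name find_near_matches_hamming
and the short example strings are made up for illustration.

```python
def find_near_matches_hamming(pattern, text, max_mismatches):
    """Return (start, end, mismatches) for every window of text that has
    the same length as pattern and differs from it in at most
    max_mismatches positions (substitutions only, no indels)."""
    hits = []
    width = len(pattern)
    # Slide a window of len(pattern) over text, one position at a time.
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        # Count the positions where the two equal-length strings differ.
        mismatches = sum(1 for a, b in zip(pattern, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, start + width, mismatches))
    return hits


# Hypothetical shortened sequences, not the ones from the question.
small = "GGGTTVTTSS"
large = "XXGGGTTVTTSSAAAGGGTTLTTSS"
for hit in find_near_matches_hamming(small, large, 1):
    print(hit)
# prints (2, 12, 0) and then (15, 25, 1)
```

The scan costs len(large) * len(small) character comparisons, which is fine
for short queries. If insertions or deletions must also be tolerated, a
Levenshtein-based search is needed instead.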
--
Mary

From kevin.rue at ucdconnect.ie Fri Mar 14 09:16:45 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Fri, 14 Mar 2014 09:16:45 +0000
Subject: [Biopython] Get all alignments of a sequence against another
In-Reply-To:
References:
Message-ID:

Hi Mary,

There is one blurry area in your question: how exactly do you define "a
location where your small_sequence aligns"?
From your example, it seems you're not looking for exact matches, but you
allow in this case 1 mismatch. Is it a maximal number of mismatches? Do you
also want to allow indels? Do you want to control the number of insertions,
deletions, substitutions separately? Is a match a local alignment above a
score threshold?

I would suggest that you have a look at the definition of the Levenshtein
distance (see the example:
http://en.wikipedia.org/wiki/Levenshtein_distance#Example).
If this metric suits you, for instance to find all the matches of the
small_sequence in the large_sequence with a maximal edit distance of 1,
you can use one of the Python packages implementing the Levenshtein
distance, like "fuzzysearch"
(https://pypi.python.org/pypi/fuzzysearch/0.2.0) this way:

>>> import fuzzysearch
>>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)

The call returns two matches:
Out[7]: [Match(start=89, end=99, dist=0), Match(start=89, end=99, dist=0)]

BUG:
I did notice that the second match is reported twice, and I assume this is a bug where the first match was somehow replaced by the second, which is why I copied Tal (the developer of this package) on this email.

Another example where I added your sequence (with a mismatch) a third time:

>>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)

returns
Out[9]:
[Match(start=42, end=52, dist=1),
 Match(start=99, end=109, dist=0),
 Match(start=99, end=109, dist=0)]

You can see three matches. One of the mismatched sequences was detected correctly (edit distance of 1), but the bug seems to duplicate the last match and replace the one before the last match with it.

Tal, can you fix that? I will add the issue to your repository :)

Cheers
Kevin

On 13 March 2014 19:57, Mary Kindall wrote:

> This is a primitive question but somehow I could not find a solution to it.
> I have two sequences 'large' and 'small' as given below.
>
> >large
> XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS
>
> >small
> GGGTTVTTSS
>
> I need to align the 'small' sequence to the 'large' sequence. Clearly there
> are two places where it can be aligned. I need to get indices of both the
> locations. I was trying BioPython's "pairwise2.align.globalms" function but
> it is only able to align to the second position.
>
> pairwise2.align.globalms(largeStr, smallStr, 2, -1, -1, 0,
> penalize_end_gaps=False)
> Ans:
> [('XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS',
> '-----------------------------------------------------------------------------------------GGGTTLTTSS',
> 20.0,
> 0,
> 99)]
>
> Which parameter can I change here or which other package/lightweight free
> software can compute this?
>
> --
> Mary

--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en

From kevin.rue at ucdconnect.ie Fri Mar 14 09:29:26 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Fri, 14 Mar 2014 09:29:26 +0000
Subject: [Biopython] Get all alignments of a sequence against another
In-Reply-To:
References:
Message-ID:

Sorry for multiple emails:

My mistake, the duplication of the last one does not replace the one before the last, but instead the first match is simply not returned in the output list (even though the right NUMBER of matches is returned).

--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en

From taleinat at gmail.com Fri Mar 14 10:53:18 2014
From: taleinat at gmail.com (Tal Einat)
Date: Fri, 14 Mar 2014 12:53:18 +0200
Subject: [Biopython] Get all alignments of a sequence against another
In-Reply-To:
References:
Message-ID:

On Fri, Mar 14, 2014 at 11:16 AM, Kevin Rue wrote:
> >>> import fuzzysearch
> >>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)
>
> The output will find two matches.
> Out[7]: [Match(start=89, end=99, dist=0), Match(start=89, end=99, dist=0)]
>
> BUG:
> I did notice that the second match is reported twice instead and I assume
> this is a bug where the first match was somehow replaced by the second,
> which is why I copied Tal (the developer of this package) to this email
>
> Another example where I added you sequence (with a mismatch) a third time:
>
> >>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)
>
> returns
> Out[9]:
> [Match(start=42, end=52, dist=1),
> Match(start=99, end=109, dist=0),
> Match(start=99, end=109, dist=0)]
>
> You can see three matches, one of the mismatched sequence was detected
> correctly (edit distance of 1), but the bug seems to duplicate the last
> match and replace the one before the last match with it.
>
> Tal, can you fix that? I will add the issue to your repository :)

Thanks for bringing this to my attention! Fixed.

Upgrade to version 0.2.1 and your example will work as expected.

(To upgrade, run: pip install --upgrade fuzzysearch)

- Tal Einat

From kevin.rue at ucdconnect.ie Fri Mar 14 10:57:36 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Fri, 14 Mar 2014 10:57:36 +0000
Subject: [Biopython] Get all alignments of a sequence against another
In-Reply-To:
References:
Message-ID:

Cheers!
(Man, we're a team ;-) )

Kevin

--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en

From kevin.rue at ucdconnect.ie Fri Mar 14 11:07:46 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Fri, 14 Mar 2014 11:07:46 +0000
Subject: [Biopython] Get all alignments of a sequence against another
In-Reply-To:
References:
Message-ID:

Hi Mary,

Please do let us know if that solution suits you or if the Levenshtein distance metric does not fit your needs.
The approach below gives you the number of matches (length of the output list), the start and stop positions of each match (be careful about Python 0-based indexing), and the edit distance between each match and the sequence you search for. It's already a good place to start from.

Best
Kevin

On 14 March 2014 10:53, Tal Einat wrote:

> On Fri, Mar 14, 2014 at 11:16 AM, Kevin Rue wrote:
> > >>> import fuzzysearch
> > >>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)
> >
> > The output will find two matches.
> > Out[7]: [Match(start=89, end=99, dist=0), Match(start=89, end=99, dist=0)]
> >
> > BUG:
> > I did notice that the second match is reported twice instead and I assume
> > this is a bug where the first match was somehow replaced by the second,
> > which is why I copied Tal (the developer of this package) to this email
> >
> > Another example where I added you sequence (with a mismatch) a third time:
> >
> > >>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)
> >
> > returns
> > Out[9]:
> > [Match(start=42, end=52, dist=1),
> > Match(start=99, end=109, dist=0),
> > Match(start=99, end=109, dist=0)]
> >
> > You can see three matches, one of the mismatched sequence was detected
> > correctly (edit distance of 1), but the bug seems to duplicate the last
> > match and replace the one before the last match with it.
> >
> > Tal, can you fix that? I will add the issue to your repository :)
>
> Thanks for bringing this to my attention! Fixed.
>
> Upgrade to version 0.2.1 and your example will work as expected.
>
> (To upgrade, run: pip install --upgrade fuzzysearch)
>
> - Tal Einat

--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en

From taleinat at gmail.com Fri Mar 14 11:11:42 2014
From: taleinat at gmail.com (Tal Einat)
Date: Fri, 14 Mar 2014 13:11:42 +0200
Subject: [Biopython] Get all alignments of a sequence against another
In-Reply-To:
References:
Message-ID:

On Fri, Mar 14, 2014 at 11:16 AM, Kevin Rue wrote:
> >>> import fuzzysearch
> >>> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)

By the way, you should usually just use fuzzysearch.find_near_matches(...), which will choose an appropriate search method for you depending on the given parameters.

- Tal Einat

From mary.kindall at gmail.com Fri Mar 14 15:14:01 2014
From: mary.kindall at gmail.com (Mary Kindall)
Date: Fri, 14 Mar 2014 11:14:01 -0400
Subject: [Biopython] Get all alignments of a sequence against another
In-Reply-To:
References:
Message-ID:

Hi Tal and Kevin,

Thanks for the mail and the user-friendly package. I was not aware of the existence of the "fuzzysearch" package.

Kevin, I am allowing mismatches (up to a defined maximum which is a function of the length of the pattern) but there is a strict 'no' for insertions and deletions.

I do see that the functions 'fuzzysearch.find_near_matches' and 'fuzzysearch.find_near_matches_with_ngrams' work perfectly for mismatches. However, I could not find a way to avoid alignment when there is an insertion or deletion.

Is there a way to restrict the maximum distance to mismatches only? Levenshtein distance seems to have the same issue.
Thanks and regards
Mary

--
-------------
Mary Kindall
Yorktown Heights, NY
USA

From kevin.rue at ucdconnect.ie Fri Mar 14 15:52:12 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Fri, 14 Mar 2014 15:52:12 +0000
Subject: [Biopython] Get all alignments of a sequence against another
In-Reply-To:
References:
Message-ID:

Hi Mary, Tal,

In the case you describe, the solution to your problem is rather straightforward to implement. I split it in two functions pasted below.

I actually implemented yesterday the function "approx_substitute(str1, str2, max_substitutions)", see below. This function requires two strings of the same length, and returns True if there are at most max_substitutions mismatches between them, comparing the characters at the same position in the two strings.

Now the function that answers your question is something I just implemented for you: "list_start_approx_matches_substitutions(str1, str2, max_mismatches)", see below. This function uses the previous one to compare your small string (str1) to each substring of str2 that has the same length as str1, and keeps the start position of all positive matches. It returns an "array" object, which you can easily turn into a regular list using array.tolist()
See http://docs.python.org/2/library/array.html

In short, run the two function definitions below.
And then use:

>>> list_start_approx_matches_substitutions("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1)

it will return
Out[20]: array('L', [19, 42, 99])

If that's scary, just run:

>>> list_start_approx_matches_substitutions("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS", 1).tolist()
Out[21]: [19, 42, 99]

Kevin

def approx_substitute(str1, str2, max_substitutions):
    """Checks that str1 is at most max_substitutions substitutions away from str2.

    Note: No insertions or deletions are allowed. Sequences of different
    length will automatically return False.

    Args:
        str1, str2, max_substitutions
    Returns:
        Boolean. True if str1 is at most max_substitutions substitutions
        away from str2, False otherwise.
    """
    # Solves a simple scenario which does not require parsing the sequences.
    if len(str1) != len(str2):
        return False
    # If we reach here, we know that the two strings are the same length,
    # therefore len(str1) is synonymous with len(str2)
    # Initialise a counter of substitutions between the two strings
    substitutions = 0
    # For each (index, character) value pair in str1
    for (index, str1_char) in enumerate(str1):
        # add 1 if the characters from str1 and str2 are different, add 0 otherwise
        substitutions += (0 if str1_char == str2[index] else 1)
        # time saver: if the counter exceeds max_substitutions at some stage,
        # don't bother checking the rest...
        if substitutions > max_substitutions:
            # ... just return False
            return False
    # If max_substitutions is never exceeded, the function will eventually
    # leave the loop above. The simple fact of arriving here proves that str1
    # is at most max_substitutions substitutions away from str2,
    # therefore return True
    return True

from array import array

# define a function that returns the start position of all matches of a given str1
# in a given larger str2, given a maximum number of mismatches allowed
def list_start_approx_matches_substitutions(str1, str2, max_mismatches):
    # Initialise an empty array to save the start positions of the matches
    starts = array('L')
    # for each substring of str2 which is the same length as str1
    for i in range(len(str2)-len(str1)+1):
        # if there are at most max_mismatches mismatches between str1 and substr2
        if approx_substitute(str1, str2[i:i+len(str1)], max_mismatches):
            # save the start position of the match (the end position can be guessed
            # from the length of str1, the only information lost is the number of
            # mismatches between str1 and substr2)
            starts.append(i)
    return starts

--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en

From taleinat at gmail.com Fri Mar 14 18:01:51 2014
From: taleinat at gmail.com (Tal Einat)
Date: Fri, 14 Mar 2014 20:01:51 +0200
Subject: [Biopython] Get all alignments of a sequence against another
In-Reply-To:
References:
Message-ID:

On Fri, Mar 14, 2014 at 5:52 PM, Kevin Rue wrote:
> Hi Mary, Tal,
>
> In that case you describe, the solution to your problem is rather
> straightforward to implement.. I split it in two functions pasted below.

Kevin, that's a nice solution!

Here's a somewhat more efficient solution, based on the same basic principles but implemented with some optimizations and non-trivial index juggling. This will be included in future versions of fuzzysearch.

Raw code is below; for the next month or so a highlighted version can be found at: dpaste.com/1728155/

I still feel we're reinventing the wheel here. Surely it is possible to do this with BioPython. Unfortunately, I too couldn't easily figure out how to do so from reading the documentation and a bit of trial and error.
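
For readers skimming the thread, the substitution-only search being discussed (mismatches allowed, no insertions or deletions) can be sketched as a plain sliding-window Hamming-distance check. This is only an illustrative sketch: the name `hamming_matches` is hypothetical and is not part of fuzzysearch or Biopython, and the test string is an assumption, built so that the three occurrences sit at the start positions reported earlier in the thread (19, 42, 99).

```python
def hamming_matches(pattern, text, max_substitutions):
    """Return (start, end, n_mismatches) for every window of `text` that is
    within `max_substitutions` substitutions of `pattern`.

    Hypothetical helper for illustration only; insertions and deletions
    are never considered, so every window has exactly len(pattern) characters.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    n = len(pattern)
    hits = []
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        # Count the positions where the two equal-length strings differ.
        mismatches = sum(1 for a, b in zip(pattern, window) if a != b)
        if mismatches <= max_substitutions:
            hits.append((start, start + n, mismatches))
    return hits


# Assumed layout: the pattern with one mismatch at 19 and 42, exact at 99.
large = ("X" * 19 + "GGGTTVTTSS" + "A" * 13 + "GGGTTVTTSS"
         + "A" * 22 + "B" * 25 + "GGGTTLTTSS")
print(hamming_matches("GGGTTLTTSS", large, 1))
# -> [(19, 29, 1), (42, 52, 1), (99, 109, 0)]
```

This naive version does O(len(text) * len(pattern)) character comparisons, which is usually fine for short patterns; the ring-buffer approach trades that simplicity for fewer comparisons per position.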
- Tal Einat

from collections import deque, defaultdict, namedtuple
from itertools import islice

Match = namedtuple('Match', ['start', 'end', 'dist'])

def find_near_matches_only_substitutions(subsequence, sequence, max_substitutions):
    """search for near-matches of subsequence in sequence

    This searches for near-matches, where the nearly-matching parts of the
    sequence must meet the following limitations (relative to the subsequence):

    * the number of character substitutions must be at most max_substitutions
    * no deletions or insertions are allowed
    """
    if not subsequence:
        raise ValueError('Given subsequence is empty!')

    # simple optimization: prepare some often used things in advance
    _SUBSEQ_LEN = len(subsequence)
    _SUBSEQ_LEN_MINUS_ONE = _SUBSEQ_LEN - 1

    # prepare quick lookup of where a character appears in the subsequence
    char_indexes_in_subsequence = defaultdict(list)
    for (index, char) in enumerate(subsequence):
        char_indexes_in_subsequence[char].append(index)

    # we'll iterate over the sequence once, but the iteration is split into two
    # for loops; therefore we prepare an iterator in advance which will be used
    # in both of the loops
    sequence_enum_iter = enumerate(sequence)

    # We'll count the number of matching characters assuming various attempted
    # alignments of the subsequence to the sequence. At any point in the
    # sequence there will be N such alignments to update. We'll keep
    # these in a "circular array" (a.k.a. a ring) which we'll rotate after each
    # iteration to re-align the indexing.

    # Initialize the candidate counts by iterating over the first N-1 items in
    # the sequence. No possible matches in this step!
    candidates = deque([0], maxlen=_SUBSEQ_LEN)
    for (index, char) in islice(sequence_enum_iter, _SUBSEQ_LEN_MINUS_ONE):
        for subseq_index in [idx for idx in char_indexes_in_subsequence[char] if idx <= index]:
            candidates[subseq_index] += 1
        candidates.appendleft(0)

    matches = []

    # From the N-th item onwards, we'll update the candidate counts exactly as
    # above, and additionally check if the part of the sequence which began N-1
    # items before the current index was a near enough match to the given
    # sub-sequence.
    for (index, char) in sequence_enum_iter:
        for subseq_index in char_indexes_in_subsequence[char]:
            candidates[subseq_index] += 1

        # rotate the ring of candidate counts
        candidates.rotate(1)
        # fetch the count for the candidate which started N-1 items ago
        n_substitutions = _SUBSEQ_LEN - candidates[0]
        # set the count for the next index to zero
        candidates[0] = 0

        # if the candidate had few enough mismatches, record a match
        if n_substitutions <= max_substitutions:
            matches.append(Match(
                start=index - _SUBSEQ_LEN_MINUS_ONE,
                end=index + 1,
                dist=n_substitutions,
            ))

    return matches

From eric.talevich at gmail.com Sat Mar 15 05:29:21 2014
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 14 Mar 2014 22:29:21 -0700
Subject: [Biopython] Google Summer of Code 2014: Call for student applications
Message-ID:

Hi everyone,

Google Summer of Code is an annual program that funds students all over the world to work with open-source software projects to develop new code. This summer, the Open Bioinformatics Foundation (OBF) is taking on students through the Google Summer of Code program to work with mentors on established bioinformatics software projects including BioPython.

We invite students to submit applications by Friday, March 21.
Full details are here:
http://news.open-bio.org/news/2014/03/obf-gsoc-2014-call-for-student-applications/

All the best,
Eric & Raoul
OBF GSoC organization admins

From taleinat at gmail.com Sat Mar 15 18:59:07 2014
From: taleinat at gmail.com (Tal Einat)
Date: Sat, 15 Mar 2014 20:59:07 +0200
Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
In-Reply-To:
References:
Message-ID:

On Thu, Mar 13, 2014 at 8:08 PM, Kevin Rue wrote:
> Hi Tal,
>
> I finally had the time to look at your function in details, and understood
> how it works and why each line is there. Thanks, even though it doesn't do
> exactly what I had in mind, it does what the docstring says :)
> I'd call your "kevin" function "approxEqual(str1, str2, max_ins, max_sub)".
>
> When I understood your code, I thought of a way to increase the speed of
> your function, but ended with a (fast) function actually doing something
> slightly different. This one would actually only look at the str2 substrings
> from str1_idx but which are no longer than len(str1)+max_insertions. Longer
> str2 are pointless as they imply that str1 requires more insertions at the
> start than allowed to match str2.
> I would call that function "approxStartsWith(str1, str2, max_ins, max_sub)"
>
> def approxStartsWith(str1, str2, max_substitutions, max_insertions):
>     """check if it is possible to map str1 to the start of str2 given
>     limitations
>
>     The limitations are the maximum allowed number of new characters
>     inserted and the maximum allowed number of character substitutions.
>     """
>     # check simple cases which are obviously impossible
>     if not len(str1) <= len(str2):
>         return False
>
>     scores = array('L', [0] * (max_insertions + 1))
>     new_scores = scores[:]
>
>     for (str1_idx, char1) in enumerate(str1):
>         # make min() always take the other value in the first iteration of
>         # the inner loop
>         prev_score = len(str2)
>         for (n_insertions, char2) in enumerate(
>             str2[str1_idx:max_insertions+str1_idx+1]
>         ):
>             new_scores[n_insertions] = prev_score = min(
>                 scores[n_insertions] + (0 if char1 == char2 else 1),
>                 prev_score)
>
>         # swap scores <-> new_scores
>         scores, new_scores = new_scores, scores
>
>     return min(scores) <= max_substitutions
>
> Now, an approxWithin(str1, str2, max_ins, max_sub) could be simply
> implemented as:
>
> def approxWithin(str1, str2, max_ins, max_sub):
>     """check if it is possible to find str1 within str2 given limitations
>
>     The limitations are the maximum allowed number of new characters inserted
>     and the maximum allowed number of character substitutions.
>     """
>     for str2_idx in range(len(str2)-len(str1)+1):
>         print (str2_idx)
>         result = approxStartsWith(str1, str2[str2_idx:], max_ins, max_sub)
>         if result:
>             return result
>     return False
>
> Comments?

Aha! So you are, in fact, searching for a short sequence in a long sequence (or many such long sequences). This -is- exactly what fuzzysearch is meant for!

Regarding the code you posted, it looks like it would work (though I haven't tested it). It certainly isn't very efficient, however, since it makes many copies of long parts of str2 (see "str2[str2_idx:]"). It is also fairly straightforward, leaving some room for further optimization. I do like that it is very readable and easy to understand! Well, except for the loop in approxStartsWith(), but that's based on my own code...

Inspired by your use-case, I've added highly generic fuzzy searching functionality to fuzzysearch.
You can now limit the number of substitutions, insertions and deletions as well as their total (i.e. limit the Levenshtein distance). You can also limit only some of these as you like.

Specifically, this supports your use-case of searching for fuzzy matches allowing only a limited number of substitutions and insertions, but no deletions.

The user-friendly utility function fuzzysearch.find_near_matches() now accepts parameters for limiting the substitutions etc., and chooses a suitable implementation based on the given parameters.

I haven't yet implemented an optimized search function allowing only substitutions and insertions. If the current version is not fast enough for your needs, there are plenty of optimizations still to be done.

I'd be happy if you could give it a whirl and tell me what happens! I haven't released it yet, but you can install the latest development version using pip (you'll need to have git installed for this):

pip install --upgrade git+git://github.com/taleinat/fuzzysearch.git#egg=fuzzysearch

- Tal

From lluis.revilla at gmail.com Mon Mar 17 19:09:31 2014
From: lluis.revilla at gmail.com (Lluís Revilla)
Date: Mon, 17 Mar 2014 20:09:31 +0100
Subject: [Biopython] Google Summer of Code 2014: Student application
Message-ID:

Hi everyone,

I am a Biotechnology student and I want to contribute to Biopython. I have read the wiki GSoC page and I found two ideas. But I think I don't have the desired skills: I am not yet very familiar with Biopython's existing sequence parsing ("Indexing & Lazy-loading Sequence Parsers"), or with JavaScript ("Interactive GenomeDiagram Module"). So I am thinking of making a proposal for the Google Summer of Code about a comparison tool.

My idea comes from the following: I have several times been in charge of selecting a tool to do a certain process, e.g.: a list of predicted genes, a list of possible structures, a list of alignments...
But usually in bioinformatics there are many programs to do the same thing; they usually use a different algorithm or a different training data set (prokaryote, eukaryote), or have different specifications. And they return a more or less sophisticated list in some standard format: FASTA, GFF, GenBank...

The problem when starting a project is to select, from these different programs, which one to use for the task, e.g.: Which gene predictor is better for prokaryotes: Glimmer, EasyGene, GeneMarker, Prodigal, AUGUSTUS...? The answer will be specific to the project, but sometimes it's difficult to ensure that it is a good selection. (Other times it is good enough to do what the majority do.) But that does not solve the problem when new algorithms appear, or even when comparing different versions of a program.

To cover this problem I would like to develop for Biopython a module to compare the output of the different programs, to assess which one is better for the task. Currently I have developed a parser for the aforementioned programs, and it compares them in a (very) crude way. I would like to develop it further and release it to the Biopython community.

What are your thoughts about this idea?

Thanks,
Lluís

From kevin.rue at ucdconnect.ie Tue Mar 18 10:34:21 2014
From: kevin.rue at ucdconnect.ie (Kevin Rue)
Date: Tue, 18 Mar 2014 10:34:21 +0000
Subject: [Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
In-Reply-To:
References:
Message-ID:

Hi Tal,

In my particular case, I am searching for a 36-character sequence in a 90-character one. If you're really curious, you can have a look at RNA-sequencing. It's a technique in biology where we obtain the sequence of RNA molecules expressed from the genome. But before the analysis of those sequences (we usually call them "reads"), we have to identify and filter out reads that contain an "adapter" sequence used by our machines.
The problem is that the "adapter" sequence can have mistakes in the read, hence the fuzzy matching.

Haven't tried your code yet, I'll try when I get a chance, but I've got a few other tasks to deal with for my own work first.

Cheers,
Kevin


On 15 March 2014 18:59, Tal Einat wrote:

> On Thu, Mar 13, 2014 at 8:08 PM, Kevin Rue wrote:
> > Hi Tal,
> >
> > I finally had the time to look at your function in detail, and understood
> > how it works and why each line is there. Thanks, even though it doesn't do
> > exactly what I had in mind, it does what the docstring says :)
> > I'd call your "kevin" function "approxEqual(str1, str2, max_ins, max_sub)".
> >
> >
> > When I understood your code, I thought of a way to increase the speed of
> > your function, but ended with a (fast) function actually doing something
> > slightly different. This one would actually only look at the str2 substrings
> > from str1_idx but which are no longer than len(str1)+max_insertions. Longer
> > str2 are pointless as they imply that str1 requires more insertions at the
> > start than allowed to match str2.
> > I would call that function "approxStartsWith(str1, str2, max_ins, max_sub)"
> >
> > from array import array
> >
> > def approxStartsWith(str1, str2, max_substitutions, max_insertions):
> >     """check if it is possible to map str1 to the start of str2 given
> >     limitations
> >
> >     The limitations are the maximum allowed number of new characters inserted
> >     and the maximum allowed number of character substitutions.
> >     """
> >     # check simple cases which are obviously impossible
> >     if not len(str1) <= len(str2):
> >         return False
> >
> >     scores = array('L', [0] * (max_insertions + 1))
> >     new_scores = scores[:]
> >
> >     for (str1_idx, char1) in enumerate(str1):
> >         # make min() always take the other value in the first iteration of
> >         # the inner loop
> >         prev_score = len(str2)
> >         for (n_insertions, char2) in enumerate(
> >             str2[str1_idx:max_insertions+str1_idx+1]
> >         ):
> >             new_scores[n_insertions] = prev_score = min(
> >                 scores[n_insertions] + (0 if char1 == char2 else 1),
> >                 prev_score
> >             )
> >
> >         # swap scores <-> new_scores
> >         scores, new_scores = new_scores, scores
> >
> >     return min(scores) <= max_substitutions
> >
> >
> > Now, an approxWithin(str1, str2, max_ins, max_sub) could be simply
> > implemented as:
> >
> > def approxWithin(str1, str2, max_ins, max_sub):
> >     """check if it is possible to find str1 within str2 given limitations
> >
> >     The limitations are the maximum allowed number of new characters inserted
> >     and the maximum allowed number of character substitutions.
> >     """
> >     for str2_idx in range(len(str2)-len(str1)+1):
> >         print(str2_idx)
> >         # pass the limits by keyword: approxStartsWith() takes the
> >         # substitution limit before the insertion limit
> >         result = approxStartsWith(str1, str2[str2_idx:],
> >                                   max_substitutions=max_sub,
> >                                   max_insertions=max_ins)
> >         if result:
> >             return result
> >     return False
> >
> > Comments?
>
> Aha! So you are, in fact, searching for a short sequence in a long
> sequence (or many such long sequences). This -is- exactly what
> fuzzysearch is meant for!
>
>
> Regarding the code you posted, it looks like it would work (though I
> haven't tested it). It certainly isn't very efficient, however, since
> it makes many copies of long parts of str2 (see "str2[str2_idx:]"). It
> is also fairly straightforward, leaving some room for further
> optimization. I do like that it is very readable and easy to
> understand! Well, except for the loop in approxStartsWith(), but
> that's based on my own code...
>
>
> Inspired by your use-case, I've added highly generic fuzzy searching
> functionality to fuzzysearch. You can now limit the number of
> substitutions, insertions and deletions as well as their total (i.e.
> limit the Levenshtein distance). You can also limit only some of these
> as you like.
>
> Specifically, this supports your use-case of searching for fuzzy
> matches allowing only a limited number of substitutions and
> insertions, but no deletions.
>
> The user-friendly utility function fuzzysearch.find_near_matches() now
> accepts parameters for limiting the substitutions etc., and chooses a
> suitable implementation based on the given parameters.
>
> I haven't yet implemented an optimized search function allowing only
> substitutions and insertions. If the current version is not fast enough
> for your needs, there are plenty of optimizations still to be done.
>
> I'd be happy if you could give it a whirl and tell me what happens! I
> haven't released it yet, but you can install the latest development
> version using pip (you'll need to have git installed for this):
>
> pip install --upgrade git+git://github.com/taleinat/fuzzysearch.git#egg=fuzzysearch
>
> - Tal
>

--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en

From eric.talevich at gmail.com Tue Mar 18 23:30:42 2014
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 18 Mar 2014 16:30:42 -0700
Subject: [Biopython] Google Summer of Code 2014: Student application
In-Reply-To:
References:
Message-ID:

On Mon, Mar 17, 2014 at 12:09 PM, Lluís Revilla wrote:

> Hi everyone,
>
> I am a Biotechnology student and I want to contribute to Biopython. I have
> read the wiki GSoC page and I found two ideas.
But I think I don't have the > desired skills, I am not much familiarized with the Biopython's existing > sequence parsing yet ("Indexing & Lazy-loading Sequence Parsers"), or with > javascript ("Interactive GenomeDiagram Module"). So I am thinking to make > a proposal for the Google Summer of Code about a comparing tool. > > My idea comes from the following: I have been several time in charge of > selecting a tool to do a certain process e.g.: A list of predicted genes, a > list of possible structures, a list of alignments... > > But usually in bioinformatics there are many programs to do the same thing, > usually they use a different algorithm a different training set data > (prokaryote, eukaryote ), or have different specifications. And they return > a more or less sophisticated list, in some standard format, FASTA, GFF, > Genebank... > > The problem when starting a project is to select from this different > programs which one use for the task, e.g.: Which gene predictor is better > for prokaryote: Glimmer, EasyGene, GeneMarker, Prodigal, AUGUSTUS...? The > answer will be specific to the project but sometimes its difficult to > ensure that it is a good selection. (Other times it is good enough to do > what the majority do.) But does not solve the problem when new algorithms > appears, or even to compare between different program versions. > > To cover this problem I would like to develop for Biopython a module to > compare between the different programs output to asses which one is better > for the task. > Currently I developed a parser for the afford mentioned programs and it > compares them in a (very) rude way. I would like to develop further and > release it to the Biopython community. > > What are your thoughts about this idea? > Thanks, > > Llu?s > Hi Llu?s, This is an interesting idea, though a bit broad. 
You could maybe find some inspiration or focus by looking at Critical Assessment of Function Prediction (CAFA):
http://biofunctionprediction.org/

Perhaps Iddo Friedberg or another AFP enthusiast could comment on how this project could support benchmarking of automated annotations.

On the technical side, I also recommend looking at nestly, a program that will execute another specific command-line program with a variety of different parameters and automatically organize, summarize and compare the outputs.
http://fhcrc.github.io/nestly/

All the best,
Eric

From lluis.revilla at gmail.com Wed Mar 19 10:12:26 2014
From: lluis.revilla at gmail.com (Lluís Revilla)
Date: Wed, 19 Mar 2014 11:12:26 +0100
Subject: [Biopython] Google Summer of Code 2014: Student application
In-Reply-To:
References:
Message-ID:

Dear Eric and all,

I summarize here some of the comments you made on the proposal:

1. It is a bit broad (Eric)
2. Does it provide a common visual representation of the different inputs? (Christian)
3. Is it supposed to actually rank different tools / outputs? If so, that is a surprisingly hard problem (Christian and bow)
4. Difficult, and difficult to fit in Biopython (bow)
5. Useful just once for each task (bow)
6. More useful to write parsers using a common object model, but generalizing their outputs is also not a trivial task (bow)

And here are my comments:

1. It is intended to be broad, to be applied not just to gene predictors but also to RNA secondary structure predictors, ncRNA predictors, functional site predictors, or even protein secondary or tertiary structure predictors.
2. Well, my initial thought was to compare their results, but to do so they need to be in the same format, so a common visual representation could be added.
3. If there is a reference against which to compare the programs it is not so hard, but then it loses the point of comparing the programs. What I actually did was rank them according to how much they share with each other and how much they differ. If they are supposed to do the same thing their results should tend to be the same; at least this can set apart some very deviant programs, although it doesn't ensure that the other ones are the wrong ones.
4. I agree, that is why I mailed it: to know whether it would fit or not, and how useful it would be.
5. Even if it is useful only once, program versions can change and then they will need to be evaluated again (if they keep the output format it would work), and not all projects look for the same type of result even with the same task to do. Some would like to test against a reference what happens with the false positive predicted genes, or want the minimum false rate even if they get just 40% of the annotated genes. But mainly it is true that it is to be used just once.
6. As it would be part of my idea I could write the parsers. The common object could include the essential information, and each parser could then add the particular output information of each program.

In short: it seems either too difficult or beyond my skills to complete my idea, and there are doubts about whether it fits in the Biopython library. If it is more useful I can change my proposal to coding parsers for gene predictors or any other program not already parsed in Biopython.

Thanks all for your comments and feedback; I will be glad to read more comments and improve or change my proposal.

Best,

Lluís

2014-03-19 0:30 GMT+01:00 Eric Talevich :

> On Mon, Mar 17, 2014 at 12:09 PM, Lluís Revilla wrote:
>
>> Hi everyone,
>>
>> I am a Biotechnology student and I want to contribute to Biopython. I have
>> read the wiki GSoC page and I found two ideas. But I think I don't have the
>> desired skills, I am not much familiarized with the Biopython's existing
>> sequence parsing yet ("Indexing & Lazy-loading Sequence Parsers"), or with
>> javascript ("Interactive GenomeDiagram Module"). So I am thinking to make
>> a proposal for the Google Summer of Code about a comparing tool.
>> >> My idea comes from the following: I have been several time in charge of >> selecting a tool to do a certain process e.g.: A list of predicted genes, >> a >> list of possible structures, a list of alignments... >> >> But usually in bioinformatics there are many programs to do the same >> thing, >> usually they use a different algorithm a different training set data >> (prokaryote, eukaryote ), or have different specifications. And they >> return >> a more or less sophisticated list, in some standard format, FASTA, GFF, >> Genebank... >> >> The problem when starting a project is to select from this different >> programs which one use for the task, e.g.: Which gene predictor is better >> for prokaryote: Glimmer, EasyGene, GeneMarker, Prodigal, AUGUSTUS...? The >> answer will be specific to the project but sometimes its difficult to >> ensure that it is a good selection. (Other times it is good enough to do >> what the majority do.) But does not solve the problem when new algorithms >> appears, or even to compare between different program versions. >> >> To cover this problem I would like to develop for Biopython a module to >> compare between the different programs output to asses which one is better >> for the task. >> Currently I developed a parser for the afford mentioned programs and it >> compares them in a (very) rude way. I would like to develop further and >> release it to the Biopython community. >> >> What are your thoughts about this idea? >> Thanks, >> >> Llu?s >> > > Hi Llu?s, > > This is an interesting idea, though a bit broad. You could maybe find some > inspiration or focus by looking at Critical Assessment of Function > Prediction (CAFA): > http://biofunctionprediction.org/ > > Perhaps Iddo Friedberg or another AFP enthusiast could comment on how this > project could support benchmarking of automated annotations. 
> > On the technical side, I also recommend looking at nestly, a program that > will execute another specific command-line program with a variety of > different parameters and automatically organize, summarize and compare the > outputs. > http://fhcrc.github.io/nestly/ > > All the best, > Eric > From ericmajinglong at gmail.com Thu Mar 20 15:15:46 2014 From: ericmajinglong at gmail.com (Eric Ma) Date: Thu, 20 Mar 2014 11:15:46 -0400 Subject: [Biopython] Are there tools for automatically parsing glycan names into tree structures? Message-ID: Hi everybody, Many apologies if you have seen this post cross-posted elsewhere. I have tried digging around but could not find an answer to my question. My colleague and I are working on a project involving data produced at a glycan microarray facility. The array data that came back to us were a list of glycan names (in the format (random example from the top of my head): GlcNAc...). We would like to parse the list of 610 names into the graphical representation of the glycan. Is this possible? If so, what tools are available to get this done? Thank you! Cheers, Eric ---------- w: http://about.me/ericmjl L: http://www.linkedin.com/in/ericmjl #: (857) 209-1375 From zruan1991 at gmail.com Thu Mar 20 20:21:23 2014 From: zruan1991 at gmail.com (Zheng Ruan) Date: Thu, 20 Mar 2014 16:21:23 -0400 Subject: [Biopython] Exonerate Parser Error Message-ID: Hi, I'm trying to use Bio.SearchIO to parse a file generated by exonerate. It is a pairwise alignment between a protein sequence and nucleotide sequence. I notice that if I put the protein sequence first, SearchIO can happily parse the file, but if the nucleotide sequence comes first, it will raise an error. 
Here is an example exonerate output that failed the parser:

Command line: [exonerate --showvulgar no --showalignment yes nuc.fa pro.fa]
Hostname: [localhost.localdomain]

C4 Alignment:
------------
         Query: dna
        Target: protein
         Model: ungapped:dna2protein
     Raw score: 214
   Query range: 2 -> 116
  Target range: 314 -> 352

    3 : CAGTCCGTTCCNAAAAGGCCCGCTGGCTCTGTGCAGAATCCTGTCTATCACAATCAGCCTCTGA :   66
        GlnSerValProLysArgProAlaGlySerValGlnAsnProValTyrHisAsnGlnProLeuA
        ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
  315 : GlnSerValProLysArgProAlaGlySerValGlnAsnProValTyrHisAsnGlnProLeuA :  336

   67 : ACCCCGCGCCCAGCAGAGACCCACACTACCAGGACCCCCACAGCACTGCA :  116
        snProAlaProSerArgAspProHisTyrGlnAspProHisSerThrAla
        ||||||||||||||||||||||||||||||||||||||||||||||||||
  337 : snProAlaProSerArgAspProHisTyrGlnAspProHisSerThrAla :  352

-- completed exonerate analysis

Here is the error I get:

>>> from Bio.SearchIO import read
/home/rz/code/biopython/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release.
  BiopythonExperimentalWarning)
>>> p = read('nuc_pro.exn', 'exonerate-text')
Traceback (most recent call last):
  File "", line 1, in
  File "/home/rz/code/biopython/Bio/SearchIO/__init__.py", line 359, in read
    first = next(generator)
  File "/home/rz/code/biopython/Bio/SearchIO/__init__.py", line 316, in parse
    for qresult in generator:
  File "/home/rz/code/biopython/Bio/SearchIO/ExonerateIO/_base.py", line 235, in __iter__
    for qresult in self._parse_qresult():
  File "/home/rz/code/biopython/Bio/SearchIO/ExonerateIO/_base.py", line 361, in _parse_qresult
    hsp = _create_hsp(prev_hid, prev_qid, prev['hsp'])
  File "/home/rz/code/biopython/Bio/SearchIO/ExonerateIO/_base.py", line 187, in _create_hsp
    frags = _adjust_aa_seq(frags)
  File "/home/rz/code/biopython/Bio/SearchIO/ExonerateIO/_base.py", line 41, in _adjust_aa_seq
    assert frag.query_strand == 0
AssertionError

Thanks,
Zheng Ruan

From w.arindrarto at gmail.com Thu Mar 20 20:31:05 2014
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 20 Mar 2014 21:31:05 +0100
Subject: [Biopython] Exonerate Parser Error
In-Reply-To:
References:
Message-ID:

Hi Zheng,

> I'm trying to use Bio.SearchIO to parse a file generated by exonerate. It
> is a pairwise alignment between a protein sequence and nucleotide sequence.
> I notice that if I put the protein sequence first, SearchIO can happily
> parse the file, but if the nucleotide sequence comes first, it will raise
> an error.

Formatting on the plaintext mail seems inadequate for me at the moment. Would you mind sending me the file that contains the alignment? If it's too big, partial files are ok, too.

Looking at our test cases, this particular case may have slipped testing. We do test for several cases of dna2protein (which could explain why it works when the nucleotide sequence comes first), but not protein2dna. Please let me know if I can also use your example as a test in our test corpus :).

Cheers, Bow From w.arindrarto at gmail.com Thu Mar 20 20:33:38 2014 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 20 Mar 2014 21:33:38 +0100 Subject: [Biopython] Exonerate Parser Error In-Reply-To: References: Message-ID: > Looking at our test cases, this particular case may have slipped > testing. We do test for several cases of dna2protein (which could > explain why it works when the nucleotide sequence comes first), but > not protein2dna. Please let me know if I can also use your example as > a test in our test corpus :). Oops, I meant the reverse ~ we have several test cases for protein2dna which may explain why it works when the protein sequence comes first ;). From w.arindrarto at gmail.com Thu Mar 20 23:30:40 2014 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 21 Mar 2014 00:30:40 +0100 Subject: [Biopython] Exonerate Parser Error In-Reply-To: References: Message-ID: Hi Zheng, Thank you for the files :). I found out what was causing the error and have pushed a patch along with some tests to our codebase (https://github.com/biopython/biopython/commit/377889b05235c2e6f192916fb610d0da01b45c6d). You should be able to parse your file using the latest `master` branch. Hope this helps, Bow On Thu, Mar 20, 2014 at 9:42 PM, Zheng Ruan wrote: > Hi Bow, > > I'm happy to provide the example for testing. See attachment. > > The command to generate the output above. > exonerate --showvulgar no --showalignment yes nuc.fa pro.fa > > I'll check the test suite to see if I can find why. > > Best, > Zheng > > > On Thu, Mar 20, 2014 at 4:33 PM, Wibowo Arindrarto > wrote: >> >> > Looking at our test cases, this particular case may have slipped >> > testing. We do test for several cases of dna2protein (which could >> > explain why it works when the nucleotide sequence comes first), but >> > not protein2dna. Please let me know if I can also use your example as >> > a test in our test corpus :). 
>> >> Oops, I meant the reverse ~ we have several test cases for protein2dna >> which may explain why it works when the protein sequence comes first >> ;). > > From zruan1991 at gmail.com Fri Mar 21 14:39:29 2014 From: zruan1991 at gmail.com (Zheng Ruan) Date: Fri, 21 Mar 2014 10:39:29 -0400 Subject: [Biopython] Exonerate Parser Error In-Reply-To: References: Message-ID: Thanks Bow, That works for me. But it seems the parser doesn't take the nucleotide information into the hsps. All I get is a pairwise alignment between two proteins. Nucleotide information is useful because I want to know the codon -- amino acid correspondence. In the case of frameshift the situation may not be that straightforward. Maybe you have other concern of not doing this. Best, Zheng On Thu, Mar 20, 2014 at 7:30 PM, Wibowo Arindrarto wrote: > Hi Zheng, > > Thank you for the files :). I found out what was causing the error and > have pushed a patch along with some tests to our codebase > ( > https://github.com/biopython/biopython/commit/377889b05235c2e6f192916fb610d0da01b45c6d > ). > You should be able to parse your file using the latest `master` > branch. > > Hope this helps, > Bow > > On Thu, Mar 20, 2014 at 9:42 PM, Zheng Ruan wrote: > > Hi Bow, > > > > I'm happy to provide the example for testing. See attachment. > > > > The command to generate the output above. > > exonerate --showvulgar no --showalignment yes nuc.fa pro.fa > > > > I'll check the test suite to see if I can find why. > > > > Best, > > Zheng > > > > > > On Thu, Mar 20, 2014 at 4:33 PM, Wibowo Arindrarto < > w.arindrarto at gmail.com> > > wrote: > >> > >> > Looking at our test cases, this particular case may have slipped > >> > testing. We do test for several cases of dna2protein (which could > >> > explain why it works when the nucleotide sequence comes first), but > >> > not protein2dna. Please let me know if I can also use your example as > >> > a test in our test corpus :). 
> >> > >> Oops, I meant the reverse ~ we have several test cases for protein2dna > >> which may explain why it works when the protein sequence comes first > >> ;). > > > > > From w.arindrarto at gmail.com Fri Mar 21 14:59:40 2014 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 21 Mar 2014 15:59:40 +0100 Subject: [Biopython] Exonerate Parser Error In-Reply-To: References: Message-ID: Hi Zheng, The nucleotide information is stored as the alignment annotation. You can access it using hsp.aln_annotation['query_annotation']. There, they are stored as triplets, reprensenting the codons. This is indeed a tradeoff that I had to make because there is no proper model yet to represent alignment objects containing sequences with different length in our master branch. In this case, the length of the DNA is most of the time 3x the length of the protein. And yes, this is not ideal since the actual query are now stored as an annotation ~ trading places with the translated query. HSPs themselves are basically modelled based on our MultipleSeqAlignment objects (you can get such objects when accessing the `aln` attribute from an HSP object). I think in order to properly model these types of alignment, we need to have a proper model of three-letter protein Seq objects as well. Your CodonSeqAlignment object may help here :), but I have not looked into it that much to be honest. How does it work with Seq objects with ProteinAlphabet? Is it possible to align protein and codon sequences? I tried storing as much information as possible using the current approach (e.g. notice the start and end coordinates of each hit and query, they are parsed from the file and the difference is not the same as the value you get when doing a `len` on hsp.query and/or hsp.hit). 
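
To make the triplet layout concrete, here is a minimal sketch in plain Python. It is not SearchIO code, and the function names are made up for illustration. It splits a nucleotide string and a three-letter amino acid string into triplets and pairs them up, using the first six codons of the alignment posted earlier in this thread as sample data. It assumes an ungapped stretch with no frameshifts, so both strings divide evenly into triplets.

```python
def split_triplets(text):
    """Split a string into consecutive 3-character chunks."""
    if len(text) % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    return [text[i:i + 3] for i in range(0, len(text), 3)]


def pair_codons(dna, protein3):
    """Pair each codon with its three-letter residue (hypothetical helper)."""
    codons = split_triplets(dna)
    residues = split_triplets(protein3)
    if len(codons) != len(residues):
        raise ValueError("codon and residue counts differ")
    return list(zip(codons, residues))


# Sample data: start of the query and translation lines from the posted alignment.
pairs = pair_codons("CAGTCCGTTCCNAAAAGG", "GlnSerValProLysArg")
for codon, residue in pairs:
    print(codon, residue)
```

With real parser output the triplets would come from the alignment annotation rather than from a hand-typed string, and a frameshift would break the even division that this sketch relies on.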
Note also that when dealing with frameshifts, you may want to access the hsp.fragments attribute, since frameshifts mean that you can break further your HSP alignment into multiple subalignments (fragments as it is called in SearchIO). Hope this helps :), Bow P.S. Also CC-ing the Development list ~ this looks like something interesting for dev in general. On Fri, Mar 21, 2014 at 3:39 PM, Zheng Ruan wrote: > Thanks Bow, > > That works for me. But it seems the parser doesn't take the nucleotide > information into the hsps. All I get is a pairwise alignment between two > proteins. Nucleotide information is useful because I want to know the codon > -- amino acid correspondence. In the case of frameshift the situation may > not be that straightforward. Maybe you have other concern of not doing this. > > Best, > Zheng > > > On Thu, Mar 20, 2014 at 7:30 PM, Wibowo Arindrarto > wrote: >> >> Hi Zheng, >> >> Thank you for the files :). I found out what was causing the error and >> have pushed a patch along with some tests to our codebase >> >> (https://github.com/biopython/biopython/commit/377889b05235c2e6f192916fb610d0da01b45c6d). >> You should be able to parse your file using the latest `master` >> branch. >> >> Hope this helps, >> Bow >> >> On Thu, Mar 20, 2014 at 9:42 PM, Zheng Ruan wrote: >> > Hi Bow, >> > >> > I'm happy to provide the example for testing. See attachment. >> > >> > The command to generate the output above. >> > exonerate --showvulgar no --showalignment yes nuc.fa pro.fa >> > >> > I'll check the test suite to see if I can find why. >> > >> > Best, >> > Zheng >> > >> > >> > On Thu, Mar 20, 2014 at 4:33 PM, Wibowo Arindrarto >> > >> > wrote: >> >> >> >> > Looking at our test cases, this particular case may have slipped >> >> > testing. We do test for several cases of dna2protein (which could >> >> > explain why it works when the nucleotide sequence comes first), but >> >> > not protein2dna. 
Please let me know if I can also use your example as >> >> > a test in our test corpus :). >> >> >> >> Oops, I meant the reverse ~ we have several test cases for protein2dna >> >> which may explain why it works when the protein sequence comes first >> >> ;). >> > >> > > > From zruan1991 at gmail.com Fri Mar 21 19:32:33 2014 From: zruan1991 at gmail.com (Zheng Ruan) Date: Fri, 21 Mar 2014 15:32:33 -0400 Subject: [Biopython] Exonerate Parser Error In-Reply-To: References: Message-ID: Hi Bow, I have the same problem when trying to model codon alignment with frameshift being considered. Basically, I have a CodonSeq object to store a coding sequence. The only difference between CodonSeq and Seq object is that CodonSeq has an attribute -- `rf_table` (reading frame table). It's actually a list of positions each codon starts with, so that translate() method will go through the list to translate codon into amino acid. In this case, it is easy to store a coding sequence with frameshift events. And it's not necessary to split the protein to dna alignment into multiple part when frameshift occurs. However, the problem now becomes how to obtain such information (`rf_table`). I find exonerate is quite capable of handling this task, especially with introns in the dna. I do think an object to store protein to dna alignment is necessary in this scenario. Best, Zheng On Fri, Mar 21, 2014 at 10:59 AM, Wibowo Arindrarto wrote: > Hi Zheng, > > The nucleotide information is stored as the alignment annotation. You > can access it using hsp.aln_annotation['query_annotation']. There, > they are stored as triplets, reprensenting the codons. > > This is indeed a tradeoff that I had to make because there is no > proper model yet to represent alignment objects containing sequences > with different length in our master branch. In this case, the length > of the DNA is most of > the time 3x the length of the protein. 
And yes, this is not ideal > since the actual query are now stored as an annotation ~ trading > places with the translated query. HSPs themselves are basically > modelled based on our MultipleSeqAlignment objects (you can get such > objects when accessing the `aln` attribute from an HSP object). I > think in order to properly model these types of alignment, we need to > have a proper model of three-letter protein Seq objects as well. > > Your CodonSeqAlignment object may help here :), but I have not looked > into it that much to be honest. How does it work with Seq objects with > ProteinAlphabet? Is it possible to align protein and codon sequences? > > I tried storing as much information as possible using the current > approach (e.g. notice the start and end coordinates of each hit and > query, they are parsed from the file and the difference is not the > same as the value you get when doing a `len` on hsp.query and/or > hsp.hit). Note also that when dealing with frameshifts, you may want > to access the hsp.fragments attribute, since frameshifts mean that you > can break further your HSP alignment into multiple subalignments > (fragments as it is called in SearchIO). > > Hope this helps :), > Bow > > P.S. Also CC-ing the Development list ~ this looks like something > interesting for dev in general. > > On Fri, Mar 21, 2014 at 3:39 PM, Zheng Ruan wrote: > > Thanks Bow, > > > > That works for me. But it seems the parser doesn't take the nucleotide > > information into the hsps. All I get is a pairwise alignment between two > > proteins. Nucleotide information is useful because I want to know the > codon > > -- amino acid correspondence. In the case of frameshift the situation may > > not be that straightforward. Maybe you have other concern of not doing > this. > > > > Best, > > Zheng > > > > > > On Thu, Mar 20, 2014 at 7:30 PM, Wibowo Arindrarto < > w.arindrarto at gmail.com> > > wrote: > >> > >> Hi Zheng, > >> > >> Thank you for the files :). 
I found out what was causing the error and > >> have pushed a patch along with some tests to our codebase > >> > >> ( > https://github.com/biopython/biopython/commit/377889b05235c2e6f192916fb610d0da01b45c6d > ). > >> You should be able to parse your file using the latest `master` > >> branch. > >> > >> Hope this helps, > >> Bow > >> > >> On Thu, Mar 20, 2014 at 9:42 PM, Zheng Ruan > wrote: > >> > Hi Bow, > >> > > >> > I'm happy to provide the example for testing. See attachment. > >> > > >> > The command to generate the output above. > >> > exonerate --showvulgar no --showalignment yes nuc.fa pro.fa > >> > > >> > I'll check the test suite to see if I can find why. > >> > > >> > Best, > >> > Zheng > >> > > >> > > >> > On Thu, Mar 20, 2014 at 4:33 PM, Wibowo Arindrarto > >> > > >> > wrote: > >> >> > >> >> > Looking at our test cases, this particular case may have slipped > >> >> > testing. We do test for several cases of dna2protein (which could > >> >> > explain why it works when the nucleotide sequence comes first), but > >> >> > not protein2dna. Please let me know if I can also use your example > as > >> >> > a test in our test corpus :). > >> >> > >> >> Oops, I meant the reverse ~ we have several test cases for > protein2dna > >> >> which may explain why it works when the protein sequence comes first > >> >> ;). > >> > > >> > > > > > > From ferreirafm at usp.br Tue Mar 25 13:48:35 2014 From: ferreirafm at usp.br (Frederico Moraes Ferreira) Date: Tue, 25 Mar 2014 10:48:35 -0300 Subject: [Biopython] Levenshtein vs. blast sequence similarity Message-ID: <53318933.90203@usp.br> Biopython list, Sorry about this perhaps off-topic question concerning more to the use than the algorithm implementation of sequence similarity tools. Feel free to send answers directly to my e-mail if you judge it's inappropriate to the list contends. 

I would like to compare the sequence similarity (Blast "Positive" output) and/or the Levenshtein score of four groups of sequences (variable region!) against a given peptide, and use a multiple comparison test to support the hypothesis that the peptide is more closely related to one group than to another. My original implementation used the ratio between the Blast positive score and the peptide length. Well, I've read that the Levenshtein distance is generally considered to be more suitable as a distance measure for biological sequences. On the other hand, similarity includes additional information such as conservative and semi-conservative replacements.
So, I'm writing to ask your opinion about this topic and perhaps get another score function to tackle this problem.
Any comments are appreciated.
Best,
Fred

P.S.: at the moment I'm ignoring multiple Blast HSP matches and considering only the highest positives score per comparison pair.

From asmariyaz23 at gmail.com Wed Mar 26 17:39:17 2014
From: asmariyaz23 at gmail.com (Asma Riyaz)
Date: Wed, 26 Mar 2014 13:39:17 -0400
Subject: [Biopython] GenomeDiagram: scale down the the track
Message-ID:

Hi,

I am using GenomeDiagram to show some gene IDs I am interested in. However, the sequences I am using are huge, and in my case a feature may span only a couple of thousand base pairs, so when I add a feature its sigil comes out very small.
Here is a snippet of my code:

    gd_track_for_features = gd_diagram.new_track(1, name=seq_record.name, greytrack=True,
                                                 start=0, end=len(seq_record))
    gd_feature_set = gd_track_for_features.new_set()
    if rec in oo_list:
        max_len = max(max_len, len(seq_record))
        for feature in seq_record.features:
            if feature.type == "gene":
                try:
                    name = feature.qualifiers['gene']
                    if name[0] in ids:
                        gd_feature_set.add_feature(feature, sigil="ARROW",
                                                   arrowshaft_height=1.0, arrowhead_length=1.0,
                                                   color=idToColorDict[name[0]],
                                                   label=True, name=name[0], label_position="start",
                                                   label_color=idToColorDict[name[0]], label_size=10,
                                                   label_angle=0)
                except KeyError:
                    pass
    gd_diagram.draw(format="linear", pagesize='A4', fragments=1, start=0, end=7000000)

Here is one thing I tried to do:

When adding a new track, I specified start=0 and end=100 or 1000 or 10,000. However, this doesn't seem to scale down the track in any way; instead what I see are empty tracks, as the features I am looking at do not exist in 1 to 1000 or 1 to 10,000.
I attempted the same with "draw" and specified a different end, BUT neither worked.

How could I display all the features that I need to?

Thanks,
Asma

From p.j.a.cock at googlemail.com Wed Mar 26 17:47:56 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 26 Mar 2014 17:47:56 +0000
Subject: [Biopython] GenomeDiagram: scale down the the track
In-Reply-To:
References:
Message-ID:

On Wed, Mar 26, 2014 at 5:39 PM, Asma Riyaz wrote:
> Hi,
>
> I am using GenomeDiagram to show some gene Id I was interested in. However
> the length of the sequences I am using are huge and hence when I add a
> feature, and in my case the feature may run only a couple of thousand base
> pairs due to which the sigil is smaller.
> Here is a snippet of my code:
>
> ...
> gd_feature_set.add_feature(feature,sigil="ARROW",arrowshaft_height=1.0,arrowhead_length=1.0,color=idToColorDict[name[0]],
> ...
> gd_diagram.draw(format="linear",pagesize='A4',fragments=1,start=0,end=7000000)* > > Here is one thing I tried to do: > > when adding a new track, i specified start=0 and end=100 or 1000 or > 10,0000-----> however this doesn't seem to scale down the track in anyway, > instead what I see are empty tracks as the feature I am looking at do not > exist in 1 to 1000 or 1 to 10,000. > I attempted the same with "draw" and specified a different end BUT neither > of it worked. > > How could I display all the features that I need to ? > > Thanks, > Asma See the "Multiple tracks" example in the tutorial - you can show just sub-regions of different tracks (giving white space to the left and/or right). This is useful when the tracks do not show the same sequence exactly (e.g. with cross-links). What you want to change is the start/end in this line: gd_diagram.draw(...) Peter From asmariyaz23 at gmail.com Wed Mar 26 18:26:55 2014 From: asmariyaz23 at gmail.com (Asma Riyaz) Date: Wed, 26 Mar 2014 14:26:55 -0400 Subject: [Biopython] Graphics Message-ID: GenomeDiagram I am using the multiple tracks example in the tutorial as my base, selecting only "gene" whose id exist in my list and hence I can see the white space to the left and right of the feature. I specified a lower "end" in gd_diagram.draw() but this shows up in such a way that everything after the end position is not displayed even though there a more features. I have attached my figure below. My requirement: I want to show all the ids with an arrow sigil wherever it occurs on a genome(which I accomplished) BUT the arrows turn out to be too small to make sense of Any ideas to make it look better? Asma -------------- next part -------------- A non-text attachment was scrubbed... 
Name: wrong.pdf
Type: application/pdf
Size: 89796 bytes
Desc: not available
URL:

From p.j.a.cock at googlemail.com Wed Mar 26 23:05:19 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 26 Mar 2014 23:05:19 +0000
Subject: [Biopython] GenomeDiagram: scale down the the track
In-Reply-To:
References:
Message-ID:

On Wed, Mar 26, 2014 at 6:15 PM, Asma Riyaz wrote:
> I am using the multiple tracks example as my base, selecting only "gene"
> whose id exist in my list and hence I can see the white space to the left
> and right of the feature.
>
> I specified a lower "end" in gd_diagram.draw() but this shows up in such a
> way that everything after the end position is not displayed.

Yes, the start & end arguments control which sub-region of the linear sequence is drawn.

> I have attached my figure below.
>
> My requirement, I want to show all the ids with an arrow sigil wherever it
> occurs on a genome(which I accomplished) BUT the arrows turn out to be too
> small to make sense of

The length of the sigils (here arrows) is determined by the length of the feature (usually in base pairs, as we're normally drawing DNA), relative to the length of the region shown.

If you want to make the arrows look longer, define a larger feature location (e.g. if the feature is from 1000 to 1010, exaggerate and use 900 to 1020 - perhaps not a good idea?), or draw a smaller region of interest, or make the whole diagram bigger, etc.

Or are you asking about the vertical height?

Peter

P.S. You seem to have sent this email multiple times, probably confused by the automatic moderation of the message because of the attachment. The delay is because a human (often me) has to manually approve any suspicious emails (which are usually spam).

From asmariyaz23 at gmail.com Wed Mar 26 23:27:38 2014
From: asmariyaz23 at gmail.com (Asma Riyaz)
Date: Wed, 26 Mar 2014 19:27:38 -0400
Subject: [Biopython] GenomeDiagram: scale down the the track
Message-ID:

Hi,

Thank you for replying.
I was asking how to make the arrows appear longer even though the feature is small compared to the length of the genome. I will try exaggerating the feature location, but I don't know whether that would be scientifically sound. Drawing a smaller region of interest is not really what I am hoping for, as the gene IDs I am looking for are located all over the genome. Moreover, since these are tracks across multiple organisms, it's difficult to focus on a particular region.

Also, sorry for sending multiple mails; for some reason the mailing list rejected each of my mails, saying it had an inappropriate subject line: "[Biopython] GenomeDiagram: scale down the the track".

Thanks
Asma

On Wed, Mar 26, 2014 at 7:05 PM, Peter Cock wrote:
> On Wed, Mar 26, 2014 at 6:15 PM, Asma Riyaz wrote:
> > I am using the multiple tracks example as my base, selecting only "gene"
> > whose id exist in my list and hence I can see the white space to the left
> > and right of the feature.
> >
> > I specified a lower "end" in gd_diagram.draw() but this shows up in such a
> > way that everything after the end position is not displayed.
>
> Yes, the start & end arguments are about which sub-region of the
> linear sequence to draw.
>
> > I have attached my figure below.
> >
> > My requirement, I want to show all the ids with an arrow sigil wherever it
> > occurs on a genome(which I accomplished) BUT the arrows turn out to be too
> > small to make sense of
>
> The length of the sigils (here arrows) is determined by the length
> of the feature (usually base pairs as we're normally drawing DNA),
> relative to the length of the region shown.
>
> If you want to make the arrows look longer, define a larger feature
> location (e.g. if the feature is from 1000 to 1010, exaggerate and
> use 900 to 1020 - perhaps not a good idea?), or draw a smaller
> region of interest, or make the whole diagram bigger etc.
>
> Or are you asking about the vertical height?
>
> Peter
>
> P.S. You seem to have sent this email multiple times, probably
> confused by the automatic moderation of the message because
> of the attachment. The delay is because a human (often me)
> has to manually approve any suspicious emails (which are
> usually spam).
>

From nje5 at georgetown.edu Fri Mar 28 19:28:14 2014
From: nje5 at georgetown.edu (Nathan Edwards)
Date: Fri, 28 Mar 2014 15:28:14 -0400
Subject: [Biopython] Are there tools for automatically parsing glycan names into tree structures?
In-Reply-To:
References:
Message-ID: <5335CD4E.5060200@georgetown.edu>

> Many apologies if you have seen this post cross-posted elsewhere. I have
> tried digging around but could not find an answer to my question.
>
> My colleague and I are working on a project involving data produced at a
> glycan microarray facility. The array data that came back to us were a list
> of glycan names (in the format (random example from the top of my head):
> GlcNAc...). We would like to parse the list of 610 names into the graphical
> representation of the glycan.
>
> Is this possible? If so, what tools are available to get this done?

My now graduated student (Kevin Brown-Chandler) and I have been developing Python tools for the interpretation of CID tandem mass spectra of N-glycopeptides for a while now, and have a reasonably mature tool for working with these datasets.

Part of this infrastructure is a set of Python modules for parsing a variety of (N- and O-) glycan structure description formats; for glycan structure manipulation, fragmentation, and naming (Oxford notation abbreviations); and for glycan structure image generation (using the Java libraries from GlycoWorkbench).

The tools for indexing glycan structure databases and generating images from the indexed databases are distributed with the search software, and we currently distribute a pre-indexed glycan database of (most of) the glycans on the Consortium for Functional Glycomics Mammalian array (v5.1).
Download GlycoPeptideSearch (GPS) here: http://grg.tn/GPS

Since it is unlikely that the current tools do exactly what you need, feel free to ping me back with more specifics, and I'll see what I can do to help.

Cheers!

- n

-- 
Dr. Nathan Edwards                      nje5 at georgetown.edu
Department of Biochemistry and Molecular & Cellular Biology
            Georgetown University Medical Center
Room 1215, Harris Building          Room 347, Basic Science
3300 Whitehaven St, NW              3900 Reservoir Road, NW
Washington DC 20007                 Washington DC 20007
Phone: 202-687-7042                 Phone: 202-687-1618
Fax: 202-687-0057                   Fax: 202-687-7186

From rpathmanaban1 at gmail.com Mon Mar 31 13:28:59 2014
From: rpathmanaban1 at gmail.com (Pathmanaban Ramasamy)
Date: Mon, 31 Mar 2014 15:28:59 +0200
Subject: [Biopython] filtering by query coverage
Message-ID:

Hi, I am new to Biopython and I would like to filter my XML output file based on query coverage percentage. Can someone help me with this? Thanks in advance.

-- 
Pathmanaban

From p.j.a.cock at googlemail.com Mon Mar 31 13:33:26 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 31 Mar 2014 14:33:26 +0100
Subject: [Biopython] filtering by query coverage
In-Reply-To:
References:
Message-ID:

On Mon, Mar 31, 2014 at 2:28 PM, Pathmanaban Ramasamy wrote:
> Hi i am new to biopython and i would like to filter my xml outfile based on
> Query coverage percentage. Can someone help me with this? thanks in advance

How are you defining query coverage percentage?

It might be simpler to use BLAST+ 2.2.28 or later with the tabular output and specifically include one of these columns:

    qcovs means Query Coverage Per Subject
    qcovhsp means Query Coverage Per HSP

Filtering tabular BLAST output on query coverage would then be easy.
Peter

From p.j.a.cock at googlemail.com Mon Mar 31 14:50:44 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 31 Mar 2014 15:50:44 +0100
Subject: [Biopython] filtering by query coverage
In-Reply-To:
References:
Message-ID:

Hi Pathmanaban,

No, I meant that instead of using the BLAST XML output, you could run BLAST requesting tabular output.

If you want to use the BLAST XML output, I think you will need to loop over each hit's HSPs and calculate the query coverage yourself (since I don't think this information is provided precalculated).

Regards,

Peter

P.S. Please CC the mailing list in your reply.

On Mon, Mar 31, 2014 at 3:44 PM, Pathmanaban Ramasamy wrote:
> Hi Peter ,
> Thanks for your mail. Yes query coverage i mean by the regions with hits
> (no gaps). So u say that i can blast using standalone blast version and then
> parse them as usual xml parse in biopython?
>
>
> On Mon, Mar 31, 2014 at 3:33 PM, Peter Cock
> wrote:
>>
>> On Mon, Mar 31, 2014 at 2:28 PM, Pathmanaban Ramasamy
>> wrote:
>> > Hi i am new to biopython and i would like to filter my xml outfile based
>> > on
>> > Query coverage percentage. Can someone help me with this? thanks in
>> > advance
>>
>> How are you defining query coverage percentage?
>>
>> It might be simpler to use BLAST+ 2.2.28 or later with the tabular output
>> and specifically include one of these columns:
>>
>> qcovs means Query Coverage Per Subject
>> qcovhsp means Query Coverage Per HSP
>>
>> Filtering tabular BLAST output on query coverage would then be easy.
>>
>> Peter
>
>
>
>
> --
> Pathmanaban
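
Editorial sketch: the "loop over each hit's HSPs and calculate the query coverage" step described in this thread could look roughly like the following. This is only an illustration under stated assumptions, not code from any of the posters. It assumes that, for each hit, you have already collected the HSP query coordinates as 1-based inclusive (start, end) pairs, for example from the query_start and query_end attributes of the HSP objects returned by Bio.Blast.NCBIXML, and that you know the query length. The function names, the example hit names, and the 80% cut-off are hypothetical. Overlapping HSPs are merged so that no query position is counted twice; gaps inside an HSP are not subtracted here.

```python
def query_coverage(hsp_ranges, query_length):
    """Return the percentage of query positions covered by any HSP.

    hsp_ranges: iterable of (start, end) pairs in 1-based, inclusive
    query coordinates (assumed to have been extracted beforehand).
    """
    covered = 0
    cur_start = cur_end = None
    # Sort the ranges, normalising any pair given as (end, start).
    for start, end in sorted((min(s, e), max(s, e)) for s, e in hsp_ranges):
        if cur_end is None or start > cur_end:
            # No overlap with the current block: close it, start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current block if needed.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return 100.0 * covered / query_length


def filter_hits(hits, query_length, min_coverage=80.0):
    """Keep the names of hits whose merged HSPs cover enough of the query.

    hits: dict mapping a hit name to its list of (start, end) pairs.
    min_coverage: hypothetical threshold in percent.
    """
    return [name for name, ranges in hits.items()
            if query_coverage(ranges, query_length) >= min_coverage]


# Made-up example: a 100-residue query with two hits.
example_hits = {
    "hitA": [(1, 60), (41, 100)],  # two overlapping HSPs
    "hitB": [(10, 30)],            # one short HSP
}
kept = filter_hits(example_hits, query_length=100)
```

With real data, the (start, end) pairs would be gathered while iterating over the parsed BLAST records, their alignments, and each alignment's HSPs. If the tabular output with the qcovs or qcovhsp column is available, comparing that column against the threshold directly avoids this calculation.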